Comment Analysis using NLP


Have you seen Jarvis in Iron Man and wanted your own AI? Then start exploring the world of ML and AI. One project will not take you all the way there, but when your own computer can tell whether a statement is positive or negative, you are a step closer. Maybe someday this will help bring Jarvis to you.


This is a good project for practising Natural Language Processing and moving beyond basic ML libraries like pandas and NumPy on the way to becoming a Data Scientist.


OUTLINE:

In machine learning the problem is usually not the algorithm so much as the data. We will build our own dataset using web scraping, use it for training and testing, and then use the trained model for review/comment analysis. We will use the nltk library for natural language processing, along with other libraries such as bs4 and sklearn. The only tedious part is the time taken for data extraction and processing; otherwise the code is very simple to understand.

PREREQUISITE:

The following libraries should be installed in your Python environment:

  • requests
  • time
  • string
  • bs4 (for BeautifulSoup)
  • csv
  • pandas
  • nltk
  • sklearn

REQUIREMENTS:

Hardware:

  • One laptop or desktop with any OS (Linux/Windows/macOS)
Software/Technology:

  • A browser (Chrome/Firefox/Opera/Safari/IE) to open and inspect the review pages. I prefer Chrome or Firefox, as they provide a developer console for looking behind the page, much like a Python IDE (right-click and select 'Inspect', or press Ctrl+Shift+I, then switch to the Console tab in the window that opens).
  • For this project I prefer an IDE (Integrated Development Environment), specifically Jupyter Notebook. It has many benefits; one is that while programming you can run and see the results of parts of the whole program.
  • If you want, though, you can code in a text editor (Visual Studio (VS) Code/Atom/Sublime/Notepad). I prefer VS Code, as it autocompletes code from its libraries and its themes are attractive.
  • Internet Connection
IMPLEMENTATION:

We will complete it in three parts:

1) Data extraction using web scraping 

2) Training and Testing of our model

3) Practically checking a comment

DATA EXTRACTION

1. We need a large dataset, so we go where we can get one. I took the data from the review section of the iPhone 6 on the Flipkart site (direct link below):

'https://www.flipkart.com/apple-iphone-6-gold-32-gb/product-reviews/itmewxhuufbzchrn?pid=MOBEWXHUSBXVJ7NZ&lid=LSTMOBEWXHUSBXVJ7NZPXN7ZL&marketplace=FLIPKART' 

2. This opens the first page of iPhone 6 reviews; we will first extract the data we need from here.

3. We need to understand what we are extracting: two things, the rating and the review. We will treat a rating of 3 as neutral, less than 3 as negative and greater than 3 as positive.

4. While we could extract the rating and the review individually, I am extracting them together through their common parent.

5. We use the div class under which both of these appear; here it is 'col _390CkK _1gY8H-'.

6. Import BeautifulSoup from bs4. Then, using the get method from requests, download the whole page and save it in a variable:

response = requests.get('URL')

This requests the website and downloads its content using the get method.
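Putting the imports and the request together, here is a minimal sketch of this step (using the review URL from above, split across lines only for readability):

import requests
from bs4 import BeautifulSoup

url = ('https://www.flipkart.com/apple-iphone-6-gold-32-gb/product-reviews/'
       'itmewxhuufbzchrn?pid=MOBEWXHUSBXVJ7NZ&lid=LSTMOBEWXHUSBXVJ7NZPXN7ZL'
       '&marketplace=FLIPKART')
response = requests.get(url)  # response.text now holds the page HTML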

7. After this we instantiate a soup object, which takes the downloaded content and the parser to use:

soup = BeautifulSoup(response.text, 'lxml')

NOTE: if lxml produces an error, use 'html.parser' instead.

8. Use the find_all method to find all occurrences of the class 'col _390CkK _1gY8H-' and put the results in a variable called content or reviews (whatever you want).
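In code, this step could look like the following (the class string is copied from the page source and may change whenever Flipkart updates its markup):

reviews = soup.find_all('div', {'class': 'col _390CkK _1gY8H-'})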

9. We add each item's text to a dummy list a in the following way:

for i in reviews:
    a.append(i.get_text())

10. Now, with a loop and some conditions, we add the reviews and ratings to separate, predeclared lists (view and rate) like this:

for i in a:
    if int(i[0]) < 3:
        view.append(i[1:])
        rate.append("negative")
    elif int(i[0]) == 3:
        view.append(i[1:])
        rate.append("neutral")
    else:
        view.append(i[1:])
        rate.append("positive")

11. Now we need to extract the data from all the other pages too.

12. Put all the code in one place and amend the URL in the following way:

'https://www.flipkart.com/apple-iphone-6-gold-32-gb/product-reviews/itmewxhuufbzchrn?pid=MOBEWXHUSBXVJ7NZ&lid=LSTMOBEWXHUSBXVJ7NZPXN7ZL&marketplace=FLIPKART&page=' + str(i)

13. Put this inside a for loop where i runs over range(2, 1649) (because we have 1,648 pages of data to add); a sketch of the full loop follows the note below.

Note: this is a large dataset, so scraping may take hours, about 2.5 in my case.
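Putting steps 11-13 together, here is a rough sketch of the full scraping loop (assuming the list a already holds the page-1 text from the steps above; the time library from the prerequisites is used to pause briefly between requests):

import time

for i in range(2, 1649):  # pages 2 through 1648
    url = ('https://www.flipkart.com/apple-iphone-6-gold-32-gb/product-reviews/'
           'itmewxhuufbzchrn?pid=MOBEWXHUSBXVJ7NZ&lid=LSTMOBEWXHUSBXVJ7NZPXN7ZL'
           '&marketplace=FLIPKART&page=' + str(i))
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for block in soup.find_all('div', {'class': 'col _390CkK _1gY8H-'}):
        a.append(block.get_text())
    time.sleep(1)  # be polite to the server between requests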

14. Now we need to save the whole data in CSV form.

15. Using file handling, open a file in write mode with 'with open' and declare a writer on it in the following way:

hello = csv.writer(file)

16. Declare the first row (the header) as 'reviews', 'ratings'.

17. Now, with a for loop, add all the data from the lists into the file.

Note: remember to declare your file with a .csv extension.
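A minimal sketch of steps 15-17 together (the file name reviews.csv is just an example):

import csv

with open('reviews.csv', 'w', newline='') as file:
    hello = csv.writer(file)
    hello.writerow(['reviews', 'ratings'])  # header row
    for review, rating in zip(view, rate):  # one row per comment
        hello.writerow([review, rating])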

Your CSV data is ready. It is a large dataset (over 1,600 pages of reviews), and rendering it takes time, so you can reduce the page count if you want.

If you want, you can get a better dataset from Kaggle and other sites, since reviews of a single product can be a biased collection; just modify the code accordingly.

TRAINING AND TESTING MODEL

1. We will be using the nltk and sklearn libraries, along with string and other packages within them.

2. First we read the CSV file using the pandas library, as sketched below.
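A one-line sketch (the file name matches the one assumed above, and the dataframe is named csv to match the variable used in the train/test split later):

import pandas as pd

csv = pd.read_csv('reviews.csv')  # columns: 'reviews' and 'ratings'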

3. We import nltk, the string library and stopwords from nltk.corpus, and then declare a function to remove punctuation and stop words from our data.
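One possible definition of this cleaning function (named text_process here, since that is the name the pipeline below expects; run nltk.download('stopwords') once beforehand):

import string
from nltk.corpus import stopwords

def text_process(text):
    # strip punctuation characters from the raw comment
    no_punc = ''.join(ch for ch in text if ch not in string.punctuation)
    # drop English stop words and return the remaining tokens
    stop = set(stopwords.words('english'))
    return [word for word in no_punc.split() if word.lower() not in stop]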

4. Now we import CountVectorizer and TfidfTransformer from sklearn.feature_extraction.text, and we also import MultinomialNB from sklearn.naive_bayes.

5. We need to divide our data for training and testing, so we import train_test_split from sklearn.model_selection and split the data in the following way:

msg_train,msg_test,label_train,label_test = train_test_split(csv['reviews'],csv['ratings'],test_size=0.2,random_state=101)

6. Now we import Pipeline from sklearn.pipeline.

7. Now we build the pipeline for our model in the following way:

model = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

8. Now we train the model with the following step (note that our pipeline object is named model):

model.fit(msg_train,label_train)

9. This may take hours depending on the dataset size and the system's processor.

10. After training, we cross-check with the test set:

predictions = model.predict(msg_test)

11. Now print the report (classification_report comes from sklearn.metrics):

from sklearn.metrics import classification_report
print(classification_report(label_test,predictions))

12. Now, to check a comment of our own, we do the following:

model.predict(['comment'])

The work is done and the program is ready, but only in RAM. If you want to fix this, you can save the model using the pickle library or joblib (a short sketch follows the link below). For more, visit my GitHub:

https://github.com/vaibhavkumar-779/
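As a minimal persistence sketch, the standalone joblib package can dump the trained pipeline to disk and reload it later (the file name is just an example; recent sklearn versions no longer bundle joblib under sklearn.externals, so import it directly):

from joblib import dump, load

dump(model, 'comment_model.joblib')   # save the trained pipeline
model = load('comment_model.joblib')  # reload it in a later session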

