With the rise of e-commerce, online retailers such as Amazon have become increasingly popular.
Customers often express their opinions or sentiments by leaving feedback or reviews.
These reviews take the form of free text.
Sentiment analysis is the process of classifying the opinion or feeling expressed in a text as positive, negative or neutral.
Capturing the exact sentiment of a review through text is a challenging task.
In this notebook, preprocessing techniques such as removal of HTML tags, URLs, punctuation, whitespace and special characters, together with stemming, are used to clean the reviews.
The preprocessed data is represented using feature extraction techniques such as term frequency-inverse document frequency (TF-IDF).
Classifiers such as Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR) and Naive Bayes (NB) are used to classify the sentiment of Amazon book reviews.
Finally, (i) the classifiers are compared on F1 score and accuracy, (ii) the selected model is tuned using grid search and (iii) classification is performed on unseen data.
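The cleaning steps listed above are not spelled out in the cells that follow, so here is a minimal sketch of what they could look like using only the standard library. The function name `clean_review` and the regular expressions are assumptions for illustration, not the notebook's own code. Stemming is left out because it would normally be delegated to a library stemmer such as NLTK's `PorterStemmer`.

```python
import html
import re
import string

# Hypothetical helper: the name and the exact patterns are illustrative assumptions.
_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_TAG_RE = re.compile(r"<[^>]+>")
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})

def clean_review(text):
    """Strip URLs, HTML tags, punctuation and extra whitespace; lowercase the rest."""
    text = html.unescape(text)            # turn entities such as &amp; into characters
    text = _URL_RE.sub(" ", text)         # drop URLs
    text = _TAG_RE.sub(" ", text)         # drop HTML tags
    text = text.translate(_PUNCT_TABLE)   # replace punctuation and special characters with spaces
    return " ".join(text.lower().split()) # collapse runs of whitespace

print(clean_review("<p>Great   book!! See http://example.com &amp; enjoy.</p>"))
```

Applied to the data, this would amount to calling `clean_review` on each review text before vectorization.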
import random
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"
# Instead of selecting the text and score by index, wrap each review in a class
class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:  # Score of 4 or 5
            return Sentiment.POSITIVE
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        return [x.text for x in self.reviews]

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]

    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
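`evenly_distribute` slices the positive reviews down to the number of negatives, which assumes that positives outnumber negatives. As a sketch of a more symmetric variant, the helper below trims both classes to the size of the smaller one. `balance_two_classes` and its `seed` parameter are hypothetical names, not part of the notebook, and unlike the method above it samples at random rather than keeping the first items.

```python
import random

def balance_two_classes(items, label_of, first, second, seed=None):
    """Return a shuffled list with equally many `first` and `second` items.

    Items with any other label (for example NEUTRAL) are dropped.
    """
    rng = random.Random(seed)
    group_a = [x for x in items if label_of(x) == first]
    group_b = [x for x in items if label_of(x) == second]
    size = min(len(group_a), len(group_b))
    balanced = rng.sample(group_a, size) + rng.sample(group_b, size)
    rng.shuffle(balanced)
    return balanced

# Toy data: (text, label) pairs standing in for Review objects.
toy = [("a", "POSITIVE"), ("b", "POSITIVE"), ("c", "POSITIVE"),
       ("d", "NEGATIVE"), ("e", "NEUTRAL")]
balanced = balance_two_classes(toy, lambda r: r[1], "POSITIVE", "NEGATIVE", seed=0)
print(len(balanced))
```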
import json
file_name = 'Books_small_10000.json'
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        print(review['reviewText'])
        print(review['overall'])
        break
I bought both boxed sets, books 1-5. Really a great series! Start book 1 three weeks ago and just finished book 5. Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved! Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page! These are books you won't be disappointed with.
5.0
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
reviews[5].sentiment
'POSITIVE'
from sklearn.model_selection import train_test_split
training, test = train_test_split(reviews, test_size=0.33, random_state=42)
train_container = ReviewContainer(training)
test_container = ReviewContainer(test)
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()
test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()
print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))
436
436
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# This book is great !
# This book was so bad
vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(test_x)
print(train_x[0])
print(train_x_vectors[0].toarray())
The book isn't in very good condition. The pages are yellowed. The spine is well worn. Not that good a shape.
[[0. 0. 0. ... 0. 0. 0.]]
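To make the numbers in such a vector less opaque, the sketch below recomputes TF-IDF weights by hand for the two example sentences from the comments above. It follows the formula that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_rows` is a hypothetical helper, not a library call.

```python
import math
import re
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalised TF-IDF weights."""
    # Lowercase and keep tokens of two or more word characters.
    tokenised = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["This book is great !", "This book was so bad"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term:>6}  {weight:.3f}")
```

Terms shared by both sentences ("this", "book") receive a lower weight than terms unique to one of them ("great", "bad"), which is what lets the classifiers focus on the distinguishing words.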
from sklearn import svm
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)
test_x[0]
clf_svm.predict(test_x_vectors[0])
array(['NEGATIVE'], dtype='<U8')
from sklearn.tree import DecisionTreeClassifier
clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
clf_dec.predict(test_x_vectors[0])
array(['POSITIVE'], dtype='<U8')
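from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
# GaussianNB needs dense input, so convert the sparse TF-IDF matrix first
to_dense = FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)
clf_gnb = make_pipeline(to_dense, GaussianNB())
clf_gnb.fit(train_x_vectors, train_y)
clf_gnb.predict(test_x_vectors[0])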
array(['POSITIVE'], dtype='<U8')
from sklearn.linear_model import LogisticRegression
clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)
clf_log.predict(test_x_vectors[0])
array(['NEGATIVE'], dtype='<U8')
# Mean Accuracy
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))
0.8076923076923077
0.6298076923076923
0.6370192307692307
0.8052884615384616
# F1 Scores
from sklearn.metrics import f1_score
f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])
#f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])
array([0.80582524, 0.80952381])
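With `average=None`, `f1_score` returns one score per label, in the order given by `labels`. The sketch below shows the underlying arithmetic for a single class on made-up labels; `f1_for_label` is a hypothetical helper, not a scikit-learn function.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # without true positives the score is taken to be 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels, purely for illustration.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
scores = [round(f1_for_label(y_true, y_pred, c), 3) for c in ("POSITIVE", "NEGATIVE")]
print(scores)  # [0.667, 0.5]
```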
test_set = ['very fun', "bad book do not buy", 'horrible waste of time']
new_test = vectorizer.transform(test_set)
clf_svm.predict(new_test)
array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')
from sklearn.model_selection import GridSearchCV
parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)
GridSearchCV(cv=5, estimator=SVC(), param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})
print(clf.score(test_x_vectors, test_y))
0.8100961538461539
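The parameter grid above is small enough to enumerate by hand, which helps when estimating how long a search will take. This sketch expands the same dictionary with `itertools.product`. With `cv=5` each candidate is fitted five times, and `GridSearchCV` refits the best candidate once more on the full training set by default (`refit=True`). The variable names `cv_folds`, `candidates` and `total_fits` are illustrative.

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}
cv_folds = 5

# Every combination of one value per parameter is one candidate model.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

total_fits = len(candidates) * cv_folds + 1  # +1 for the final refit on all training data
print(len(candidates), total_fits)  # 10 51
```

After fitting, `clf.best_params_` and `clf.best_score_` report which candidate won and its mean cross-validated accuracy.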
import pickle
with open('sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)
print(test_x[1])
loaded_clf.predict(test_x_vectors[1])
Some of the characters seem like an amalgamation of familiar video game characters and stereotypes but over Id like to live in this universe. Very good series.
array(['POSITIVE'], dtype='<U8')
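The pickled file contains only the classifier, yet `predict` expects TF-IDF vectors, so raw text cannot be scored after loading unless the fitted `vectorizer` is available too. One option, sketched below, is to store both in a single file. The helper names `save_bundle` and `load_bundle` and the file name are hypothetical, and plain strings stand in for the fitted objects so that the snippet runs on its own.

```python
import os
import pickle
import tempfile

def save_bundle(path, vectorizer, classifier):
    """Write the vectorizer and classifier to one pickle file."""
    with open(path, 'wb') as f:
        pickle.dump({'vectorizer': vectorizer, 'classifier': classifier}, f)

def load_bundle(path):
    """Return (vectorizer, classifier) from a file written by save_bundle."""
    with open(path, 'rb') as f:
        bundle = pickle.load(f)
    return bundle['vectorizer'], bundle['classifier']

# Placeholders stand in for the fitted TfidfVectorizer and GridSearchCV objects.
path = os.path.join(tempfile.mkdtemp(), 'sentiment_bundle.pkl')
save_bundle(path, 'fitted-vectorizer', 'fitted-classifier')
vec, model = load_bundle(path)
print(vec, model)
```

With the real objects, new text would be scored as `model.predict(vec.transform(['very fun']))`. Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.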
http://jmcauley.ucsd.edu/data/amazon/
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).