
Before we start building any model in Natural Language Processing, it is necessary to understand the dataset thoroughly. This post looks into individual features and combinations of features to get a better understanding of customer reviews. Let’s get started.
In this post we’ll perform Exploratory Data Analysis on the Amazon Customer Reviews dataset. The dataset can be obtained from here
Overview
- The Data
- Text Preprocessing & Cleaning
- Univariate Distribution of Features
- Distribution of n-grams
- Bivariate Distribution of Features
- Topic Modeling
- Word Cloud
- Avg Reading time of Reviews
The Data
The dataset contains reviews of various products manufactured by Amazon, like the Kindle, Fire TV, Echo, etc. It has over 34,000 rows, each containing the review text, username, product name, rating, and other information for each product.
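The snippets below assume the usual imports; a representative set (adjust module names and file paths to your own environment) looks like this:

# Representative imports assumed by the snippets in this post (adjust to your environment)
import re
import string
from collections import Counter
from textwrap import wrap

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import scattertext as st
import textstat
from textblob import TextBlob
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD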
data = pd.read_csv("./data/amazon1.csv")
data.shape
data.head()

Text Preprocessing & Cleaning
Drop the Unwanted Columns
data.drop(columns=['id', 'asins', 'keys', 'manufacturer', 'reviews.id',
                   'reviews.sourceURLs', 'reviews.userCity', 'reviews.userProvince',
                   'reviews.didPurchase', 'reviews.dateAdded', 'reviews.dateSeen'],
          inplace=True)
Remove the empty Rows
data = data[~data['reviews.text'].isnull()]
data = data[~data['reviews.doRecommend'].isnull()]
data = data[~data['reviews.numHelpful'].isnull()]
data = data[~data['reviews.rating'].isnull()]
Add Category id for easier Visualization
The categories column has lengthy category names for the products that have been reviewed. We’ll map each category to a numeric id so that it fits easily in graphs.
category_dict = {}
for i, key in enumerate(dict(data['categories'].value_counts()).keys()):
    category_dict[key] = i
data['cat_id'] = data['categories'].apply(lambda category: category_dict[category])
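As a side note, if the frequency-based ordering of the ids doesn’t matter, pd.factorize gives an equivalent one-line encoding (an alternative sketch, not part of the original pipeline):

# Alternative: ids assigned by order of appearance rather than by frequency
data['cat_id'], _ = pd.factorize(data['categories'])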
Check the data type of columns
data.dtypes

Format Date
data['reviews.date'] = pd.to_datetime(data['reviews.date'])
data['reviews.month'] = data['reviews.date'].dt.month
data['reviews.year'] = data['reviews.date'].dt.year
Encode the reviews.doRecommend column
data['reviews.doRecommend'] = data['reviews.doRecommend'].apply(lambda t:1 if t==True else 0)
Clean the reviews.text column
def preprocess(reviewtext):
    # regex=True is passed explicitly because the pandas default changed in 2.0
    reviewtext = reviewtext.str.replace("(<br/>)", "", regex=True)                               # HTML line breaks
    reviewtext = reviewtext.str.replace("\\w*\\d\\w*", "", regex=True)                           # digits & words containing digits
    reviewtext = reviewtext.str.replace("[%s]" % re.escape(string.punctuation), "", regex=True)  # punctuation
    reviewtext = reviewtext.str.replace(" +", " ", regex=True)                                   # collapse extra spaces
    return reviewtext

data['reviews.text'] = preprocess(data['reviews.text'])
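A quick check on a made-up review string (illustrative only, not from the dataset) confirms what the cleaning does:

# Illustrative example: the <br/> tag, digits, and punctuation are removed, extra spaces collapsed
sample = pd.Series(["Great tablet!<br/> Bought 2 of them in 2017, no issues."])
print(preprocess(sample)[0])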
Calculate polarity of Reviews using TextBlob
data['polarity']=data['reviews.text'].map(lambda text: TextBlob(text).sentiment.polarity)
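TextBlob’s polarity score ranges from -1 (most negative) to +1 (most positive). A couple of made-up sentences (illustrative only, not from the dataset) show the scale:

# Illustrative only: polarity is in [-1, 1]
print(TextBlob("I love this tablet, it works great").sentiment.polarity)   # clearly positive
print(TextBlob("Terrible battery, very disappointed").sentiment.polarity)  # clearly negative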
Add new features based on length & word count of reviews
data['review_len'] = data['reviews.text'].astype(str).apply(len)
data['word_count'] = data['reviews.text'].apply(lambda x: len(str(x).split()))
Check a few sample reviews
for index, text in enumerate(data['reviews.text'][30:35]):
    print('Review %d:\n' % (index+1), text)

Univariate Distribution of Features
Distribution of Polarity
sns.distplot(data["polarity"],hist=True)

From the chart it is clear that most of the reviews have positive polarity, clustered closer to 0.5 than to 1. There are very few negative reviews.
Distribution of review rating
plt.hist(data['reviews.rating'])
plt.show()

Most users have given a rating of 5 or 4, which clearly shows that users are happy with the product they have purchased.
Distribution of review Length
plt.hist(data['review_len'], bins=100)
plt.show()

Distribution of Product Categories
sns.countplot(data['cat_id'])
plt.show()

Most of the reviews have been written for category 0, which is Tablet Computers.
Distribution of reviews by month
df_y_m = data.groupby(['reviews.year', 'reviews.month'])['reviews.text'].agg('count').reset_index()
df_y_m['y_m'] = df_y_m[['reviews.year', 'reviews.month']].astype(str).agg('-'.join, axis=1)

fig, ax = plt.subplots(figsize=(16, 5))
fig = sns.barplot(x="y_m", y="reviews.text", data=df_y_m, estimator=sum, ci=None, ax=ax)
ax.set_xticklabels(labels=df_y_m['y_m'], rotation=45, ha='right')
plt.show()

Most of the reviews were written in the months of December and January for both 2015 and 2016. This may be due to the purchasing pattern of users in the month of December.
Distribution of Ratings Across Categories
df_cat = data[data['cat_id'] <= 10]
fig, ax = plt.subplots(figsize=(14, 6))
fig = sns.boxplot(x='cat_id', y='reviews.rating', data=df_cat)
plt.show()

Except for categories 2, 3 & 8, all the other categories’ median ratings are 5. Overall, the ratings are high and the sentiment is positive in this review dataset.
Categories 2, 3 & 8 have only five ratings each, without any outliers.
Distribution of Review length Across Categories
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(14, 6))
fig = sns.boxplot(x='cat_id', y='review_len', data=df_cat)

# Display the median value above each box
medians = df_cat.groupby(['cat_id'])['review_len'].median()
vertical_offset = df_cat['review_len'].median() * 0.05
for xtick in fig.get_xticks():
    fig.text(xtick, medians[xtick] + vertical_offset, medians[xtick],
             horizontalalignment='center', color='w', weight='semibold')
plt.ylim(0, 1000)
plt.show()

Except for categories 7 & 8, all categories have a median review length of about 100 characters.
Distribution of Sentiment Score Across Categories
fig, ax = plt.subplots(figsize=(16, 6))
fig = sns.boxplot(x='cat_id', y='polarity', data=df_cat)

# Display the median value above each box
medians = df_cat.groupby(['cat_id'])['polarity'].median()
vertical_offset = df_cat['polarity'].median() * 0.05
for xtick in fig.get_xticks():
    fig.text(xtick, medians[xtick] + vertical_offset, medians[xtick],
             horizontalalignment='center', color='w', weight='semibold')
plt.show()

n-grams
Distribution of top Unigrams
def get_top_ngrams(corpus, n=None, ngram_range=(1, 1), stop_words=None):
    # Count n-gram frequencies over the whole corpus and return the n most common.
    # Parameterised by ngram_range and stop_words so the same helper can be reused
    # for the bigram and trigram plots below.
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

def plot_top_ngrams(common_words):
    df_ngram = pd.DataFrame(common_words, columns=['word', 'count'])
    df_ngram = df_ngram.groupby('word').sum()['count'].sort_values(ascending=False).reset_index()
    fig, ax = plt.subplots(figsize=(14, 6))
    sns.barplot(x='word', y='count', data=df_ngram, ci=None, ax=ax)
    ax.set_xticklabels(labels=df_ngram['word'], rotation=45, ha='right')
    plt.show()

common_words = get_top_ngrams(data['reviews.text'], 25)
plot_top_ngrams(common_words)

The top unigrams are filled with English stop words like ‘the’, ‘it’, ‘and’, ‘to’, etc.
Distribution of top unigrams (after removing stop words)
common_words = get_top_ngrams(data['reviews.text'], 25, stop_words='english')
plot_top_ngrams(common_words)

The top unigrams mostly consist of positive words about the products.
Distribution of top bigrams (after removing stop words)
common_words = get_top_ngrams(data['reviews.text'], 25, ngram_range=(2, 2), stop_words='english')
plot_top_ngrams(common_words)

Distribution of top trigrams (before removing stop words)
common_words = get_top_ngrams(data['reviews.text'], 25, ngram_range=(3, 3))
plot_top_ngrams(common_words)

Distribution of top trigrams (after removing stop words)
common_words = get_top_ngrams(data['reviews.text'], 25, ngram_range=(3, 3), stop_words='english')
plot_top_ngrams(common_words)

Bivariate Visualization of Features
Distribution of sentiment polarity score by recommendations
x1 = data.loc[data['reviews.doRecommend'] == 1, 'polarity']
x0 = data.loc[data['reviews.doRecommend'] == 0, 'polarity']

fig, ax = plt.subplots()
for i in [x1, x0]:
    sns.distplot(i, ax=ax, kde=False)
plt.show()

The reviews that were not recommended to others have their polarity distributed around zero, while reviews that were recommended have their polarity distributed around 0.5.
Distribution of review lengths by recommendations
df_revlen = data[data['review_len'] < 500]
x1 = df_revlen.loc[df_revlen['reviews.doRecommend'] == 1, 'review_len']
x0 = df_revlen.loc[df_revlen['reviews.doRecommend'] == 0, 'review_len']

fig, ax = plt.subplots()
for i in [x1, x0]:
    sns.distplot(i, ax=ax, kde=False)
plt.show()

2D Density jointplot of sentiment polarity vs. rating
g = sns.jointplot(data=data, x="polarity", y="reviews.rating")
g.plot_joint(sns.kdeplot, color="r", zorder=0, levels=6)
g.plot_marginals(sns.rugplot, color="r", height=-.15, clip_on=False)

Most of the reviews are concentrated around a rating of 5 and a polarity between 0.25 and 0.5, giving a sense of overall positive reviews for the products purchased.
2D Density jointplot of sentiment polarity vs. Category
sns.jointplot(data=data, x="polarity", y="cat_id",kind='kde')

Most of the positive reviews come from category 0, and the polarity of such reviews is around 0.25 to 0.5. Other categories like 1, 2, and 3 have a similar distribution in terms of polarity.
Topic Modeling
Visualizing sentiments through Scattertext
Let’s analyze the words used in reviews according to the reviews.doRecommend column and surface some notable term associations.
nlp = spacy.load('en_core_web_sm')
corpus = st.CorpusFromPandas(data,
                             category_col='reviews.doRecommend',
                             text_col='reviews.text',
                             nlp=nlp).build()
print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))
corpus.get_term_freq_df()[:10]

This shows terms associated with both recommended and not recommended reviews.
Following are the top terms most associated with reviews in which users recommended the product to others.
term_freq_df = corpus.get_term_freq_df()
term_freq_df['Recc Score'] = corpus.get_scaled_f_scores(1.0)
print(list(term_freq_df.sort_values(by='Recc Score', ascending=False).index[:10]))

Following are the terms most associated with reviews in which users did NOT recommend the product to others.
term_freq_df['No Recc Score'] = corpus.get_scaled_f_scores(0.0)
print(list(term_freq_df.sort_values(by='No Recc Score', ascending=False).index[:10]))
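Scattertext can also render these associations as an interactive HTML scatter plot. The snippet below is a sketch using Scattertext’s standard produce_scattertext_explorer function; the category value, the labels, and the output file name are assumptions — check corpus.get_categories() to confirm which value corresponds to the recommended class.

# Sketch: render the term-association scatter plot to a standalone HTML file.
# The category value must be one of corpus.get_categories(); verify which one
# corresponds to "recommended" before relying on the plot.
recommended_cat = corpus.get_categories()[0]
html = st.produce_scattertext_explorer(corpus,
                                        category=recommended_cat,
                                        category_name='Recommended',
                                        not_category_name='Not recommended',
                                        width_in_pixels=1000)
open('reviews_scattertext.html', 'w').write(html)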

Topic modeling using LSA
reindexed_data = data['reviews.text']
tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True)
reindexed_data = reindexed_data.values
document_term_matrix = tfidf_vectorizer.fit_transform(reindexed_data)

n_topics = 10
lsa_model = TruncatedSVD(n_components=n_topics)
lsa_topic_matrix = lsa_model.fit_transform(document_term_matrix)
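As an optional sanity check (not part of the original analysis), TruncatedSVD exposes explained_variance_ratio_, which tells us how much of the TF-IDF variance the 10 components actually capture:

# Optional: total variance explained by the 10 LSA components
# (a low value is common for short, sparse review text)
print(lsa_model.explained_variance_ratio_.sum())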
# https://gist.github.com/susanli2016/3f88f5aab3f844cc53a44817386d06ce#file-topic_model_lsa-py
def get_keys(topic_matrix):
    '''returns an integer list of predicted topic categories for a given topic matrix'''
    keys = topic_matrix.argmax(axis=1).tolist()
    return keys

def keys_to_counts(keys):
    '''returns a tuple of topic categories and their accompanying magnitudes for a given list of keys'''
    count_pairs = Counter(keys).items()
    categories = [pair[0] for pair in count_pairs]
    counts = [pair[1] for pair in count_pairs]
    return (categories, counts)

def get_top_n_words(n, keys, document_term_matrix, tfidf_vectorizer):
    '''returns a list of n_topics strings, where each string contains the n most common words in a predicted category, in order'''
    top_word_indices = []
    for topic in range(n_topics):
        temp_vector_sum = 0
        for i in range(len(keys)):
            if keys[i] == topic:
                temp_vector_sum += document_term_matrix[i]
        temp_vector_sum = temp_vector_sum.toarray()
        top_n_word_indices = np.flip(np.argsort(temp_vector_sum)[0][-n:], 0)
        top_word_indices.append(top_n_word_indices)
    top_words = []
    for topic in top_word_indices:
        topic_words = []
        for index in topic:
            temp_word_vector = np.zeros((1, document_term_matrix.shape[1]))
            temp_word_vector[:, index] = 1
            the_word = tfidf_vectorizer.inverse_transform(temp_word_vector)[0][0]
            topic_words.append(the_word.encode('ascii').decode('utf-8'))
        top_words.append(" ".join(topic_words))
    return top_words

lsa_keys = get_keys(lsa_topic_matrix)
lsa_categories, lsa_counts = keys_to_counts(lsa_keys)

top_n_words_lsa = get_top_n_words(5, lsa_keys, document_term_matrix, tfidf_vectorizer)
for i in range(len(top_n_words_lsa)):
    print("Topic {}: ".format(i+1), top_n_words_lsa[i])

Let’s see the top 3 words used in each of the topics
top_3_words = get_top_n_words(3, lsa_keys, document_term_matrix, tfidf_vectorizer)
labels = ['Topic {}: \n'.format(i) + top_3_words[i] for i in lsa_categories]

fig, ax = plt.subplots(figsize=(16, 8))
ax.bar(lsa_categories, lsa_counts)
ax.set_xticks(lsa_categories)
ax.set_xticklabels(labels)
ax.set_ylabel('Number of review text')
ax.set_title('LSA topic counts')
plt.show()

The model has not been able to create distinct topics, and most of the reviews are associated with topic 0. Hence we can’t conclude much from this.
Word Cloud
Let’s create a word cloud for each product, grouping the reviews by product name.
df_grouped = data[['name', 'reviews.text']].groupby(by='name').agg(lambda x: ' '.join(x))

cv = CountVectorizer(analyzer='word', stop_words='english')
cv_data = cv.fit_transform(df_grouped['reviews.text'])
df_dtm = pd.DataFrame(cv_data.toarray(), columns=cv.get_feature_names())
df_dtm.index = df_grouped.index

def generate_wordcloud(data, title):
    wc = WordCloud(width=400, height=330, max_words=150, colormap="Dark2").generate_from_frequencies(data)
    plt.figure(figsize=(10, 8))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.title('\n'.join(wrap(title, 60)), fontsize=13)
    plt.show()

# Transposing the document-term matrix so each column is a product
df_dtm = df_dtm.transpose()
# Plotting a word cloud for each product
for index, product in enumerate(df_dtm.columns):
    generate_wordcloud(df_dtm[product].sort_values(ascending=False), product)

Avg Reading time of Reviews
Text Standard
data['text_standard'] = data['reviews.text'].apply(lambda x: textstat.text_standard(x))
print('Text Standard of upvoted reviews =>', data[data['reviews.numHelpful'] > 1]['text_standard'].mode())
print('Text Standard of not upvoted reviews =>', data[data['reviews.numHelpful'] <= 1]['text_standard'].mode())

Reading Time
data['reading_time'] = data['reviews.text'].apply(lambda x: textstat.reading_time(x))
print('Reading Time of upvoted reviews =>', data[data['reviews.numHelpful'] > 1]['reading_time'].mean())
print('Reading Time of not upvoted reviews =>', data[data['reviews.numHelpful'] <= 1]['reading_time'].mean())

The reading time of upvoted reviews is twice that of reviews that were not upvoted. This suggests that people usually find longer reviews helpful.
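This is an association rather than a causal claim; a quick way to probe it (an exploratory sketch, not part of the original post) is the rank correlation between review length and helpful votes:

# Spearman rank correlation between review length and number of helpful votes (exploratory check)
print(data[['review_len', 'reviews.numHelpful']].corr(method='spearman'))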
Conclusion
- The Amazon reviews dataset has around 34,000 reviews.
- Most of the reviews have positive polarity, clustered closer to 0.5 than to 1; very few reviews are negative. Most users have given a rating of 5 or 4, which clearly shows that users are happy with the products they have purchased. Most of the reviews have been written for category 0, which is Tablet Computers.
- Most of the reviews were written in the months of December and January for both 2015 and 2016. This may be due to the purchasing pattern of users in the month of December.
- Except for categories 2, 3 & 8, all the other categories’ median ratings are 5. Overall, the ratings are high and the sentiment is positive in this review dataset. Categories 2, 3 & 8 have only five ratings each, without any outliers.
- The top unigrams mostly consist of positive words about the products.
- Most of the reviews are concentrated around a rating of 5 and a polarity between 0.25 and 0.5, giving a sense of overall positive reviews for the products purchased. Most of the positive reviews come from category 0, and the polarity of such reviews is around 0.25 to 0.5. Other categories like 1, 2, and 3 have a similar distribution in terms of polarity.
- The terms associated with recommended reviews are ‘loves it’, ‘very easy’, ‘and easy’, ‘loves’, ‘she loves’, and ‘love this’. The terms associated with not-recommended reviews are ‘returning’, ‘returned’, ‘return’, ‘i returned’, ‘to return’, and ‘returned it’.
- The model has not been able to create distinct topics, and most of the reviews are associated with topic 0.
- The reading time of upvoted reviews is twice that of reviews that were not upvoted, which suggests that people usually find longer reviews helpful.
That’s it! You can find the entire code for the above analysis here
References
- https://dair.ai/Exploratory_Data_Analysis_for_Text_Data/
- https://www.kaggle.com/shivamb/seconds-from-disaster-text-eda-and-analysis
- https://medium.com/@Rahulvks/always-start-with-text-eda-in-classification-problem-8df73748701c
- https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/
- https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
- https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a
- https://medium.com/analytics-vidhya/visualizing-phrase-prominence-and-category-association-with-scattertext-and-pytextrank-f7a5f036d4d2