Exploratory Data Analysis (EDA) on NLP Text Data

Before we start building any model in Natural Language Processing, it is necessary to understand the dataset thoroughly. This post looks into different features, and combinations of features, to get a better understanding of customer reviews. Let’s get started.

In this post we’ll perform Exploratory Data Analysis on the Amazon Customer Reviews dataset. The dataset can be obtained from here.

Overview

  • The Data
  • Text Preprocessing & Cleaning
  • Univariate Distribution of Features
  • Distribution of n-grams
  • Bivariate Distribution of Features
  • Topic Modeling
  • Word Cloud
  • Avg Reading time of Reviews

The Data

The dataset contains reviews of various products manufactured by Amazon, like the Kindle, Fire TV, Echo, etc. It has around 34,000 rows, each containing the review text, username, product name, rating, and other information.
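
The code snippets in this post assume roughly the following imports, a minimal set based on the libraries used below (adjust to your environment):

import re
import string
from collections import Counter
from textwrap import wrap

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import scattertext as st
import textstat
from textblob import TextBlob
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD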

data = pd.read_csv("./data/amazon1.csv")
data.shape
data.head()

Text Preprocessing & Cleaning

Drop the Unwanted Columns

# Columns not needed for this analysis
cols_to_drop = ['id', 'asins', 'keys', 'manufacturer', 'reviews.id',
                'reviews.sourceURLs', 'reviews.userCity', 'reviews.userProvince',
                'reviews.didPurchase', 'reviews.dateAdded', 'reviews.dateSeen']

data.drop(columns=cols_to_drop, inplace=True)


Remove Rows with Missing Values

data = data[~data['reviews.text'].isnull()]
data = data[~data['reviews.doRecommend'].isnull()]
data = data[~data['reviews.numHelpful'].isnull()]
data = data[~data['reviews.rating'].isnull()]

Add Category id for easier Visualization

The categories column contains lengthy category names for the products that have been reviewed. We’ll map each category to a numeric id so that the labels fit easily in graphs.

# Map each category string to an integer id (the most frequent category gets id 0)
category_dict = {}
for i, key in enumerate(data['categories'].value_counts().index):
    category_dict[key] = i

data['cat_id'] = data['categories'].apply(lambda category: category_dict[category])

Check the data type of columns

data.dtypes

Format Date

data['reviews.date'] = pd.to_datetime(data['reviews.date'])
data['reviews.month'] = data['reviews.date'].dt.month
data['reviews.year'] = data['reviews.date'].dt.year

Encode the reviews.doRecommend Column

data['reviews.doRecommend'] = data['reviews.doRecommend'].apply(lambda t:1 if t==True else 0)

Clean the reviews.text Column

def preprocess(reviewtext):
    reviewtext = reviewtext.str.replace(r"(<br/>)", "", regex=True)  # HTML line breaks
    reviewtext = reviewtext.str.replace(r"\w*\d\w*", "", regex=True)  # digits & words containing digits
    reviewtext = reviewtext.str.replace("[%s]" % re.escape(string.punctuation), "", regex=True)  # punctuation
    reviewtext = reviewtext.str.replace(r" +", " ", regex=True)  # collapse extra spaces

    return reviewtext

data['reviews.text'] = preprocess(data['reviews.text'])

Calculate polarity of Reviews using TextBlob

data['polarity']=data['reviews.text'].map(lambda text: TextBlob(text).sentiment.polarity)

Add new features based on length & word count of reviews

data['review_len'] = data['reviews.text'].astype(str).apply(len)
data['word_count'] = data['reviews.text'].apply(lambda x: len(str(x).split()))

Check a few sample reviews

for index, text in enumerate(data['reviews.text'][30:35]):
    print('Review %d:\n' % (index+1), text)

Univariate Distribution of Features

Distribution of Polarity

sns.distplot(data["polarity"],hist=True)

The chart shows that most of the reviews have positive polarity, clustered around 0.5 rather than 1. Very few reviews are negative.
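
To quantify this beyond the histogram, a quick check of the share of negative, neutral and positive reviews (a small sketch using the polarity column computed above):

neg = (data['polarity'] < 0).mean()
neu = (data['polarity'] == 0).mean()
pos = (data['polarity'] > 0).mean()
print('Negative: %.1f%% | Neutral: %.1f%% | Positive: %.1f%%' % (neg*100, neu*100, pos*100))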

Distribution of review rating

plt.hist(data['reviews.rating'])
plt.show()

Most users have given ratings of 5 or 4, which clearly shows that users are happy with the products they have purchased.
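
The exact shares can be read off with a normalized value count (a quick sketch):

print(data['reviews.rating'].value_counts(normalize=True).round(3))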

Distribution of review Length

plt.hist(data['review_len'],bins=100)
plt.show()

Distribution of Product Categories

sns.countplot(data['cat_id'])
plt.show()

Most of the reviews have been written for category 0, which corresponds to tablet computers.
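
To confirm which category an id refers to, the mapping built earlier can be reversed (a sketch; category_dict is the dictionary defined in the preprocessing step):

id_to_category = {v: k for k, v in category_dict.items()}
print(id_to_category[0])  # the most frequent category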

Distribution of month of reviews

df_y_m=data.groupby(['reviews.year','reviews.month'])['reviews.text'].agg('count').reset_index()

df_y_m['y_m'] = df_y_m[['reviews.year','reviews.month']].astype(str).agg('-'.join, axis=1)

fig,ax =plt.subplots(figsize = (16,5))

fig = sns.barplot(x = "y_m", y = "reviews.text", data = df_y_m, 
                  estimator = sum, ci = None, ax=ax)

ax.set_xticklabels(labels=df_y_m['y_m'], rotation=45, ha='right')

plt.show()

Most of the reviews are written in the months of December and January for both 2015 and 2016. This may be due to the purchasing pattern of users in the month of December.
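
The busiest months can also be read directly off the aggregated frame (a quick sketch; the reviews.text column holds the counts after the aggregation above):

print(df_y_m.sort_values('reviews.text', ascending=False).head(5))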

Distribution of Ratings Across Categories

df_cat = data[data['cat_id']<=10]
fig,ax = plt.subplots(figsize=(14,6))

fig = sns.boxplot(x = 'cat_id',y='reviews.rating', data = df_cat)

plt.show()

Except for categories 2, 3 & 8, all the other categories have a median rating of 5. Overall, the ratings are high and the sentiment is positive in this review dataset.

Categories 2, 3 & 8 have only 5 ratings, without any outliers.
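
The medians and ranges behind the boxplot can be verified numerically (a quick sketch):

print(df_cat.groupby('cat_id')['reviews.rating'].agg(['median', 'min', 'max', 'count']))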

Distribution of Review length Across Categories

sns.set_style("whitegrid")
fig,ax = plt.subplots(figsize=(14,6))

fig = sns.boxplot(x = 'cat_id',y='review_len', data = df_cat)

#Displaying Median Value
medians = df_cat.groupby(['cat_id'])['review_len'].median()
vertical_offset = df_cat['review_len'].median() * 0.05

for xtick in fig.get_xticks():
    fig.text(xtick,medians[xtick] + vertical_offset,medians[xtick], 
            horizontalalignment='center',color='w',weight='semibold')

plt.ylim(0,1000)

plt.show()

Except for categories 7 & 8, all categories have a median review length of about 100.

Distribution of Sentiment Score Across Categories

fig,ax = plt.subplots(figsize=(16,6))

fig = sns.boxplot(x = 'cat_id',y='polarity', data = df_cat)

medians = df_cat.groupby(['cat_id'])['polarity'].median()
vertical_offset = df_cat['polarity'].median() * 0.05

for xtick in fig.get_xticks():
    fig.text(xtick,medians[xtick] + vertical_offset,medians[xtick], 
            horizontalalignment='center',color='w',weight='semibold')

plt.show()

n-grams

Distribution of top Unigrams

def get_top_ngrams(df,n=None):
    vec = CountVectorizer().fit(df)
    
    bag_of_words = vec.transform(df)
    
    sum_words = bag_of_words.sum(axis=0)
    
    words_freq = [(word,sum_words[0,i]) for word,i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
    
    return words_freq[:n]

common_words = get_top_ngrams(data['reviews.text'],25)

df_ngram = pd.DataFrame(common_words,columns=['word','count'])
df_ngram = df_ngram.groupby('word').sum()['count'].sort_values(ascending=False).reset_index()

fig,ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x = 'word',y='count',data=df_ngram,ci = None, ax=ax)

ax.set_xticklabels(labels=df_ngram['word'], rotation=45, ha='right')
plt.show()

The top unigrams are dominated by English stop words like “the”, “it”, “and”, “to”, etc.
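
Since the same counting logic is reused for unigrams, bigrams and trigrams below, the helper could also be written once with parameters instead of being redefined each time; a sketch (the per-section definitions are kept below for readability):

def get_top_ngrams(df, n=None, ngram_range=(1, 1), stop_words=None):
    # Count n-grams across all reviews and return the n most frequent ones
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words).fit(df)
    bag_of_words = vec.transform(df)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# Example: top 25 bigrams with English stop words removed
common_bigrams = get_top_ngrams(data['reviews.text'], 25, ngram_range=(2, 2), stop_words='english')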

Distribution of top unigrams (after removing stop words)

def get_top_ngrams(df,n=None):
    vec = CountVectorizer(stop_words='english').fit(df)
    
    bag_of_words = vec.transform(df)
    
    sum_words = bag_of_words.sum(axis=0)
    
    words_freq = [(word,sum_words[0,i]) for word,i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
    
    return words_freq[:n]

common_words = get_top_ngrams(data['reviews.text'],25)

df_ngram = pd.DataFrame(common_words,columns=['word','count'])
df_ngram = df_ngram.groupby('word').sum()['count'].sort_values(ascending=False).reset_index()

fig,ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x = 'word',y='count',data=df_ngram,ci = None, ax=ax)

ax.set_xticklabels(labels=df_ngram['word'], rotation=45, ha='right')
plt.show()

The top unigrams are mostly positive words about the products.

Distribution of top bigrams (after removing stop words)

def get_top_ngrams(df,n=None):
    vec = CountVectorizer(ngram_range=(2,2), stop_words='english').fit(df)
    
    bag_of_words = vec.transform(df)
    
    sum_words = bag_of_words.sum(axis=0)
    
    words_freq = [(word,sum_words[0,i]) for word,i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
    
    return words_freq[:n]

common_words = get_top_ngrams(data['reviews.text'],25)

df_ngram = pd.DataFrame(common_words,columns=['word','count'])
df_ngram = df_ngram.groupby('word').sum()['count'].sort_values(ascending=False).reset_index()

fig,ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x = 'word',y='count',data=df_ngram,ci = None, ax=ax)

ax.set_xticklabels(labels=df_ngram['word'], rotation=45, ha='right')
plt.show()

Distribution of top trigrams (before removing stop words)

def get_top_ngrams(df,n=None):
    vec = CountVectorizer(ngram_range=(3,3)).fit(df)
    
    bag_of_words = vec.transform(df)
    
    sum_words = bag_of_words.sum(axis=0)
    
    words_freq = [(word,sum_words[0,i]) for word,i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
    
    return words_freq[:n]

common_words = get_top_ngrams(data['reviews.text'],25)

df_ngram = pd.DataFrame(common_words,columns=['word','count'])
df_ngram = df_ngram.groupby('word').sum()['count'].sort_values(ascending=False).reset_index()

fig,ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x = 'word',y='count',data=df_ngram,ci = None, ax=ax)

ax.set_xticklabels(labels=df_ngram['word'], rotation=45, ha='right')
plt.show()

Distribution of top trigrams (after removing stop words)

def get_top_ngrams(df,n=None):
    vec = CountVectorizer(ngram_range=(3,3), stop_words='english').fit(df)
    
    bag_of_words = vec.transform(df)
    
    sum_words = bag_of_words.sum(axis=0)
    
    words_freq = [(word,sum_words[0,i]) for word,i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
    
    return words_freq[:n]

common_words = get_top_ngrams(data['reviews.text'],25)

df_ngram = pd.DataFrame(common_words,columns=['word','count'])
df_ngram = df_ngram.groupby('word').sum()['count'].sort_values(ascending=False).reset_index()

fig,ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x = 'word',y='count',data=df_ngram,ci = None, ax=ax)

ax.set_xticklabels(labels=df_ngram['word'], rotation=45, ha='right')
plt.show()

Bivariate Visualization of Features

Distribution of sentiment polarity score by recommendations

x1 = data.loc[data['reviews.doRecommend']==1,'polarity']
x0 = data.loc[data['reviews.doRecommend']==0,'polarity']

fig, ax = plt.subplots()

for i in [x1,x0]:
    sns.distplot(i, ax=ax, kde=False)
    
plt.show()

The reviews that were not recommended to others have their polarity distributed around zero, while reviews that were recommended to others have their polarity distributed around 0.5.
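
The gap can be summarized with group statistics (a quick sketch):

print(data.groupby('reviews.doRecommend')['polarity'].agg(['mean', 'median', 'count']))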

Distribution of review lengths by recommendations

df_revlen = data[data['review_len']<500]

x1 = df_revlen.loc[df_revlen['reviews.doRecommend']==1,'review_len']
x0 = df_revlen.loc[df_revlen['reviews.doRecommend']==0,'review_len']

fig, ax = plt.subplots()

for i in [x1,x0]:
    sns.distplot(i, ax=ax, kde=False)
    
plt.show()

2D Density jointplot of sentiment polarity vs. rating

g = sns.jointplot(data=data, x="polarity", y="reviews.rating")
g.plot_joint(sns.kdeplot, color="r", zorder=0, levels=6)
g.plot_marginals(sns.rugplot, color="r", height=-.15, clip_on=False)

Most of the reviews are concentrated around a rating of 5 and a polarity of 0.25 to 0.5, giving a sense of overall positive reviews for the products purchased.
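
The relationship can also be summarized with a single correlation number (a sketch; Spearman is used here since ratings are ordinal):

print(data['polarity'].corr(data['reviews.rating'], method='spearman'))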

2D Density jointplot of sentiment polarity vs. Category

sns.jointplot(data=data, x="polarity", y="cat_id",kind='kde')

Most of the positive reviews come from category 0, and the polarity of these reviews is around 0.25 to 0.5. Other categories like 1, 2 and 3 have similar distributions in terms of polarity.

Topic Modeling

Visualizing sentiments through Scattertext

Let’s analyze the words used in reviews according to the reviews.doRecommend column and output some notable term associations.

nlp = spacy.load('en_core_web_sm')
corpus = st.CorpusFromPandas(data, category_col='reviews.doRecommend', text_col='reviews.text', nlp=nlp).build()
print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))

corpus.get_term_freq_df()[:10]

This shows terms associated with both recommended and not recommended reviews.

The following are the top terms most strongly associated with reviews that users recommended to others.

term_freq_df = corpus.get_term_freq_df()
term_freq_df['Recc Score'] = corpus.get_scaled_f_scores(1.0)
print(list(term_freq_df.sort_values(by='Recc Score', ascending=False).index[:10]))

The following are the terms most strongly associated with reviews that users did NOT recommend to others.

term_freq_df['No Recc Score'] = corpus.get_scaled_f_scores(0.0)
print(list(term_freq_df.sort_values(by='No Recc Score', ascending=False).index[:10]))
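
Scattertext can also render its interactive term plot as an HTML file. A minimal sketch, assuming the recommend flag value 1 is used as the category label (as in the f-score calls above); the output file name is illustrative:

html = st.produce_scattertext_explorer(corpus,
                                       category=1,
                                       category_name='Recommended',
                                       not_category_name='Not Recommended',
                                       width_in_pixels=1000)
with open('reviews_scattertext.html', 'w', encoding='utf-8') as f:
    f.write(html)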

Topic modeling using LSA

reindexed_data = data['reviews.text']
tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True)
reindexed_data = reindexed_data.values
document_term_matrix = tfidf_vectorizer.fit_transform(reindexed_data)

n_topics = 10
lsa_model = TruncatedSVD(n_components=n_topics)
lsa_topic_matrix = lsa_model.fit_transform(document_term_matrix)

# Source: https://gist.github.com/susanli2016/3f88f5aab3f844cc53a44817386d06ce#file-topic_model_lsa-py

def get_keys(topic_matrix):
    '''
    returns an integer list of predicted topic 
    categories for a given topic matrix
    '''
    keys = topic_matrix.argmax(axis=1).tolist()
    return keys

def keys_to_counts(keys):
    '''
    returns a tuple of topic categories and their 
    accompanying magnitudes for a given list of keys
    '''
    count_pairs = Counter(keys).items()
    categories = [pair[0] for pair in count_pairs]
    counts = [pair[1] for pair in count_pairs]
    return (categories, counts)
    

def get_top_n_words(n, keys, document_term_matrix, tfidf_vectorizer):
    '''
    returns a list of n_topic strings, where each string contains the n most common 
    words in a predicted category, in order
    '''
    top_word_indices = []
    for topic in range(n_topics):
        temp_vector_sum = 0
        for i in range(len(keys)):
            if keys[i] == topic:
                temp_vector_sum += document_term_matrix[i]
        temp_vector_sum = temp_vector_sum.toarray()
        top_n_word_indices = np.flip(np.argsort(temp_vector_sum)[0][-n:],0)
        top_word_indices.append(top_n_word_indices)   
    top_words = []
    for topic in top_word_indices:
        topic_words = []
        for index in topic:
            temp_word_vector = np.zeros((1,document_term_matrix.shape[1]))
            temp_word_vector[:,index] = 1
            the_word = tfidf_vectorizer.inverse_transform(temp_word_vector)[0][0]
            topic_words.append(the_word.encode('ascii').decode('utf-8'))
        top_words.append(" ".join(topic_words))         
    return top_words

lsa_keys = get_keys(lsa_topic_matrix)
lsa_categories, lsa_counts = keys_to_counts(lsa_keys)

top_n_words_lsa = get_top_n_words(5, lsa_keys, document_term_matrix, tfidf_vectorizer)

for i in range(len(top_n_words_lsa)):
    print("Topic {}: ".format(i+1), top_n_words_lsa[i])

Let’s see the top 3 words used in each of the topics

top_3_words = get_top_n_words(3, lsa_keys, document_term_matrix, tfidf_vectorizer)
labels = ['Topic {}: \n'.format(i) + top_3_words[i] for i in lsa_categories]
fig, ax = plt.subplots(figsize=(16,8))
ax.bar(lsa_categories, lsa_counts);
ax.set_xticks(lsa_categories);
ax.set_xticklabels(labels);
ax.set_ylabel('Number of review text');
ax.set_title('LSA topic counts');
plt.show();

The model has not been able to create distinct topics, and most of the reviews are associated with topic 0. Hence we can’t conclude much from this.
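
The imbalance can be confirmed from the topic assignments themselves (a quick sketch; this is the same information the bar chart encodes):

print(sorted(Counter(lsa_keys).items(), key=lambda x: x[1], reverse=True))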

Word Cloud

Let’s create a word cloud for each reviewed product.

df_grouped=data[['name','reviews.text']].groupby(by='name').agg(lambda x:' '.join(x))

cv = CountVectorizer(analyzer='word',stop_words='english')
cv_data = cv.fit_transform(df_grouped['reviews.text'])
df_dtm = pd.DataFrame(cv_data.toarray(),columns=cv.get_feature_names())
df_dtm.index = df_grouped.index

def generate_wordcloud(data,title):
    wc = WordCloud(width=400, height=330, max_words=150,colormap="Dark2").generate_from_frequencies(data)
    plt.figure(figsize=(10,8))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.title('\n'.join(wrap(title,60)),fontsize=13)
    plt.show()
    
#Transposing document term matrix
df_dtm=df_dtm.transpose()

# Plotting word cloud for each product
for index,product in enumerate(df_dtm.columns):
    generate_wordcloud(df_dtm[product].sort_values(ascending=False),product)

Avg Reading time of Reviews

Text Standard

data['text_standard']=data['reviews.text'].apply(lambda x: textstat.text_standard(x))

print('Text Standard of upvoted reviews=>',data[data['reviews.numHelpful']>1]['text_standard'].mode())
print('Text Standard of not upvoted reviews=>',data[data['reviews.numHelpful']<=1]['text_standard'].mode())

Reading Time

data['reading_time']=data['reviews.text'].apply(lambda x: textstat.reading_time(x))

print('Reading Time of upvoted reviews=>',data[data['reviews.numHelpful']>1]['reading_time'].mean())
print('Reading Time of not upvoted reviews=>',data[data['reviews.numHelpful']<=1]['reading_time'].mean())

The average reading time of upvoted reviews is about twice that of reviews without upvotes, which suggests that people tend to find longer reviews helpful.

Conclusion

  • The Amazon reviews dataset has around 34,000 reviews.
  • Most of the reviews have positive polarity, clustered around 0.5 rather than 1; very few reviews are negative. Most users have given ratings of 5 or 4, which clearly shows that users are happy with the products they have purchased. Most of the reviews have been written for category 0, which corresponds to tablet computers.
  • Most of the reviews are written in the months of December and January for both 2015 and 2016. This may be due to the purchasing pattern of users in the month of December.
  • Except for categories 2, 3 & 8, all the other categories have a median rating of 5. Overall, the ratings are high and the sentiment is positive in this review dataset. Categories 2, 3 & 8 have only 5 ratings, without any outliers.
  • The top unigrams are mostly positive words about the products.
  • Most of the reviews are concentrated around a rating of 5 and a polarity of 0.25 to 0.5, giving a sense of overall positive reviews for the products purchased. Most of the positive reviews come from category 0, and the polarity of these reviews is around 0.25 to 0.5. Other categories like 1, 2 and 3 have similar distributions in terms of polarity.
  • The terms associated with recommended reviews are ‘loves it’, ‘very easy’, ‘and easy’, ‘loves’, ‘she loves’, ‘love this’. The terms associated with not recommended reviews are ‘returning’, ‘returned’, ‘return’, ‘i returned’, ‘to return’, ‘returned it’.
  • The model has not been able to create distinct topics, and most of the reviews are associated with topic 0.
  • The average reading time of upvoted reviews is about twice that of reviews without upvotes, which suggests that people tend to find longer reviews helpful.

That’s it! You can find the entire code for the above analysis here.
