Explore the power of Natural Language Processing (NLP) and Data Science in uncovering valuable insights from Flipkart product reviews. This presentation covers the methodology, tools, and techniques used to analyze customer sentiment, identify trends, and extract actionable intelligence from a large volume of textual data. From understanding customer preferences to improving product offerings, discover how NLP and Data Science are revolutionizing the way businesses leverage consumer feedback on Flipkart. Visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. Introduction to Natural Language Processing (NLP)
• According to industry estimates, only 21% of the
available data is present in structured form. Data
is being generated as we speak, tweet, and send
messages on WhatsApp, and in various other
activities.
• Although this data is high-dimensional, the
information in it is not directly accessible unless it is
processed (read and understood) manually or
analyzed by an automated system.
• To produce significant and actionable insights
from text data, it is important to get acquainted
with the techniques and principles of Natural
Language Processing (NLP).
3. What is Sentiment Analysis?
• Sentiment Analysis, as the name suggests, means identifying the view or emotion expressed in a
piece of text.
• Humans communicate in a variety of languages, and any language is just a medium through
which we express ourselves. Whatever we say has a sentiment associated with it: positive,
negative, or neutral.
• Sentiment Analysis is a sub-field of NLP that uses machine learning techniques to identify and
extract the sentiment expressed in text.
• Let’s look at an example below to get a clear view of Sentiment Analysis:
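As a toy illustration of the idea, the sketch below labels a sentence by counting positive and negative cue words. The word lists and the function name toy_sentiment are hypothetical stand-ins chosen for this example; a real lexicon such as VADER's is far larger and also weighs intensity and negation.

```python
# Toy lexicon-based sentiment scorer (illustrative only; the word lists are
# hypothetical and far smaller than a real sentiment lexicon).
POSITIVE_WORDS = {"good", "great", "love", "amazing", "excellent"}
NEGATIVE_WORDS = {"bad", "terrible", "waste", "disappointed", "poor"}


def toy_sentiment(text: str) -> str:
    """Label text as positive, negative, or neutral by counting cue words."""
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    words = [w.strip(".,!?;:\"'") for w in text.lower().split()]
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


examples = [
    "Great phone, I love the camera!",
    "Terrible quality, a complete waste of money.",
    "The product was delivered on Monday.",
]
for review in examples:
    print(f"{toy_sentiment(review):8s} <- {review}")
```

The three sentences come out as positive, negative, and neutral respectively, which mirrors the three labels used later in the dataset.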
4. Challenges faced by NLP in real world
1) Ambiguity and Context: NLP struggles with understanding the multiple meanings of words
and phrases in different contexts.
2) Data Quality and Quantity: NLP models need large amounts of high-quality data, but
obtaining and labeling it can be challenging.
3) Domain Adaptation: Models trained in one domain often fail to generalize well to others,
requiring adaptation for real-world use.
4) Ethical and Bias Concerns: Biases in data can lead to unfair outcomes, necessitating
measures to address ethical concerns and mitigate biases.
5) Interpretability and Trust: Complex NLP models are difficult to interpret, making it hard to
trust their decisions without explanation.
5. Real-life applications of NLP
1) Virtual Assistants: Siri, Alexa, and Google Assistant, aiding in tasks such as setting reminders,
answering questions, and controlling smart devices.
2) Email Filtering and Categorization: Sorting emails into folders or labeling them as spam based
on their content.
3) Language Translation Apps: Tools such as Google Translate help users understand and
communicate in different languages.
4) Customer Support Chatbots: Providing instant responses to customer queries on websites or
messaging platforms.
5) Social Media Monitoring: Analyzing trends, sentiments, and customer feedback on platforms
like Twitter and Facebook for brand reputation management.
6. Basic Libraries of Python
1) NumPy: For numerical computing with large
arrays and mathematical operations.
2) Pandas: For data manipulation and
analysis, especially with structured data.
3) Matplotlib: For creating various types of
plots and visualizations.
4) scikit-learn: For machine learning tasks like
classification, regression, and clustering.
7. Important Libraries for NLP
1) NLTK: Offers sentiment analysis via the VADER
sentiment analyzer.
2) TextBlob: Provides simple functions for
sentiment polarity.
3) scikit-learn: Offers machine learning
algorithms for sentiment classification.
4) spaCy: Supports sentiment analysis via
rule-based components or third-party extensions.
5) VADER: Specifically tuned for sentiment
analysis in social media text.
6) Gensim: Python library for topic modeling
and document similarity analysis,
including LSA and LDA.
8. Dataset
• This dataset contains information about Product name, Product price, Rate, Reviews, Summary,
and Sentiment in CSV format. There are 104 different types of products on flipkart.com such as
electronics items, clothing for men, women, and kids, Home decor items, Automated systems, and
so on. It has 205053 rows and 6 columns.
• The dataset has multiclass sentiment labels: positive, neutral, and negative. The
labels were first generated from the Summary column using NLP and the VADER model. We
then checked the labels manually and moved them into the appropriate category: if a summary
contained text such as “okay” or “just ok”, or mixed one positive and one negative remark, we
labeled it as neutral so that the labels better reflect how people actually write.
• Data was collected through web scraping using the library called Beautiful Soup from flipkart.com.
9. First 5 rows of data
Shape of data
There are 205052 rows and 6 features. From the above table, we can see that the Sentiment
column is our target variable since we have to classify whether the Reviews are positive,
negative, or neutral.
10. Checking the type of columns
All the columns in the data are of Object type.
11. Checking the null values in the data
Review and Summary have null values present.
After dropping the null values, there are 841 unique products available in Flipkart
data.
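The checks on the last two slides can be sketched with pandas as follows. This is a minimal sketch on a made-up miniature of the data: the column names are assumptions based on the dataset description, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the dataset; column names are assumed from the
# dataset description and the values are made up for illustration.
df = pd.DataFrame(
    {
        "Product_name": ["Cooler", "Cooler", "Kurta", "Lamp"],
        "Product_price": ["3999", "3999", "499", "1299"],
        "Rate": ["5", "1", "4", "3"],
        "Review": ["super!", None, "nice", "okay"],
        "Summary": ["great cooler", "stopped working", None, "just ok"],
        "Sentiment": ["positive", "negative", "positive", "neutral"],
    }
)

print(df.dtypes)                 # inspect the type of each column
null_counts = df.isnull().sum()  # number of missing values per column
print(null_counts)

clean = df.dropna().copy()       # drop rows that have any missing value
# Text-typed numeric columns must be converted before numeric analysis;
# values that cannot be parsed become NaN.
clean["Product_price"] = pd.to_numeric(clean["Product_price"], errors="coerce")
clean["Rate"] = pd.to_numeric(clean["Rate"], errors="coerce")
print(clean["Product_name"].nunique(), "unique products after dropping nulls")
```

Because every column is stored as text, converting price and rate to numbers is a prerequisite for the distribution and correlation plots that follow.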
12. Top 10 products in the data
The product name column contained many punctuation marks and Cyrillic text, which created
noise in the data. After removing punctuation marks and converting the Cyrillic text into a
human-readable format, these are the 10 products that appear most frequently in the data.
13. Distribution of Price
From the KDE plot, we can see that most products fall in the 0 to 1000 price
range. The minimum product price is 59 and the maximum is 86990.
15. Top 10 Frequently Used Words in Review
These are the 10 words used most frequently in product reviews. These words reflect
positive sentiment about the products. We also saw that most people have given a 5-star
rating.
17. Relationship between Sentiment and Rate
This is a count plot of Sentiment against Rate. For positive sentiment the most common
ratings are 5 and 4, for negative sentiment the most common rating is 1, and for neutral
sentiment the ratings are distributed fairly evenly. The same can be seen in the line plot.
18. Relationship between Product price and Rate
The correlation between product price and rate is 0.062, which is very weak. The plot shows
ratings concentrated in the low price range compared with higher price ranges.
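For readers who want to see how such a coefficient is obtained, here is a minimal sketch of the Pearson correlation in plain Python. The prices and rates lists are made-up values, and the helper name pearson is hypothetical; the slide does not state which correlation method was used, so Pearson is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Made-up example values, not taken from the Flipkart data.
prices = [59, 499, 1299, 3999, 86990]
rates = [4, 5, 3, 4, 5]
r = pearson(prices, rates)
print(f"correlation between price and rate: {r:.3f}")
```

The coefficient always lies between -1 and 1. A value close to 0, such as 0.062, indicates almost no linear relationship between the two columns.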
19. Plotting the Word Cloud for Sentiment columns
1) Positive Sentiment 2) Negative Sentiment
20. Data Preprocessing
Now, we will pre-process the data before converting it into vectors and passing it to the machine
learning model.
We will create a function for the pre-processing of data.
1) First, we will iterate through each record and split the text into individual words or tokens.
2) Then, we will convert each token to lowercase, because the word “Good” would otherwise be
treated as different from the word “good”.
3) Then, we will check for stopwords in the data and remove them. Stopwords are commonly
used words such as “the”, “an”, and “to” that do not add much value.
4) Finally, we will perform lemmatization on each word, i.e. change the different forms of a word
into a single item called a lemma. A lemma is the base form of a word. For example, “run”,
“running”, and “runs” are all forms of the same lexeme, of which “run” is the lemma. Hence, we
convert all occurrences of the same lexeme to their lemma.
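The steps above can be sketched as a single function. This is an illustrative sketch, not the function used in the project: STOPWORDS and LEMMAS are small hand-written stand-ins, whereas a real pipeline would typically use a full stopword list and a dictionary-based lemmatizer from a library such as NLTK.

```python
import re

# Hand-written stand-ins for illustration; a real pipeline would use a full
# stopword list and a proper lemmatizer instead of these tiny tables.
STOPWORDS = {"the", "a", "an", "to", "is", "are", "and", "of", "in", "it", "this"}
LEMMAS = {
    "running": "run",
    "runs": "run",
    "ran": "run",
    "phones": "phone",
    "batteries": "battery",
    "loved": "love",
}


def preprocess(text: str) -> str:
    """Tokenize, lowercase, drop stopwords, and map words to their lemma."""
    tokens = re.findall(r"[A-Za-z]+", text)             # 1) split into word tokens
    tokens = [t.lower() for t in tokens]                # 2) convert to lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # 3) remove stopwords
    tokens = [LEMMAS.get(t, t) for t in tokens]         # 4) look up the lemma
    return " ".join(tokens)


cleaned = preprocess("The phone runs fast and the battery is Good")
print(cleaned)
```

The function returns a cleaned string rather than a token list so that its output can be passed directly to a vectorizer.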
23. Topic Modelling using Latent Dirichlet Allocation (LDA)
• Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for extracting topics from a given
corpus. The word latent means hidden or concealed: the topics are not observed directly.
• LDA represents each topic as a probability distribution over words and each document as a
mixture of topics.
• Any corpus, which is a collection of documents, can be represented as a document-word matrix
(or document-term matrix), also known as a DTM.
24. Vectorization
To convert text data into numerical data, we use techniques known as vectorization
(word embeddings are one family of such techniques in NLP).
Count Vectorizer
• It creates a document-term matrix in which each row is a document and each column is
a word in the corpus.
• The count vectorizer fits on the corpus to learn the word vocabulary and then builds the
document-term matrix. Each cell holds the frequency of that word in a particular document,
which is also known as term frequency.
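To make the structure of a document-term matrix concrete, here is a small sketch in plain Python on three made-up documents. It mirrors what a count vectorizer produces but is not the library implementation; whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

# Made-up documents, already preprocessed.
docs = ["good phone good camera", "bad battery", "good battery life"]

tokenized = [doc.split() for doc in docs]

# Learn the vocabulary: one column per distinct word, in sorted order.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Build the matrix: one row per document, cells hold word counts.
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

The first row has a 2 in the column for "good" because that word occurs twice in the first document, and 0 in every column whose word does not occur in it.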
25. TF-IDF Vectorization
Term frequency-inverse document frequency (TF-IDF) is a measure of the
importance of a word based on how frequently it occurs in a document and across the corpus.
Term Frequency
Term frequency denotes the frequency of a word in a document.
26. Inverse Document Frequency
It measures how informative a word is in the corpus, based on how common the
word is across all the documents: the more documents contain the word, the lower its idf.
For example, in any corpus, a few words like ‘is’ or ‘and’ are very common, and most likely
they will be present in almost every document.
Let’s say the word ‘is’ is present in all the documents in a corpus of 1000 documents. Its idf
would be:
idf(‘is’) = log(1000/1000) = log 1 = 0
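The calculation above can be reproduced in a few lines. This sketch assumes the plain formula idf = log(N / df) with the natural logarithm, and a term frequency defined as the word count divided by the document length; both are common textbook choices, and library implementations often use smoothed variants. The function names are hypothetical.

```python
import math


def idf(n_documents: int, n_documents_with_term: int) -> float:
    """Plain inverse document frequency: log(N / df), natural logarithm."""
    return math.log(n_documents / n_documents_with_term)


def tf(term: str, document: list) -> float:
    """Term frequency as the share of the document's words equal to the term."""
    return document.count(term) / len(document)


# 'is' appears in all 1000 documents, so it carries no weight.
idf_is = idf(1000, 1000)
# A word found in only 10 of the 1000 documents is far more informative.
idf_rare = idf(1000, 10)
print(idf_is, round(idf_rare, 3))

# TF-IDF multiplies the two parts together.
document = ["battery", "is", "good", "battery"]
tfidf_battery = tf("battery", document) * idf_rare
tfidf_is = tf("is", document) * idf_is
print(round(tfidf_battery, 3), tfidf_is)
```

A word that appears in every document ends up with a TF-IDF weight of 0 no matter how often it occurs in a single document, which is why very common words contribute little after this transformation.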
28. Machine Learning Model
• This is a classification problem in machine learning, where the goal is to predict the
sentiment based on reviews. To do this, we fitted the Multinomial Naive Bayes, Random Forest,
and XGBoost classifiers.
• We will evaluate our models using metrics such as Accuracy, Precision, Recall,
F1-score, and the Confusion Matrix, and plot a ROC curve to visualize how each model
performed.
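To show what these metrics measure, the sketch below computes accuracy and per-class precision, recall, and F1-score by hand for a three-class problem. The labels in y_true and y_pred are made up, and the helper name per_class_metrics is hypothetical; in practice a library such as scikit-learn provides equivalent functions.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return accuracy and a dict of precision, recall, and F1 per class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    metrics = {}
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # everything predicted as label
        actual = sum(t == label for t in y_true)     # everything truly of this label
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, metrics


# Made-up true and predicted sentiment labels.
y_true = ["positive", "positive", "positive", "negative",
          "negative", "neutral", "neutral", "positive"]
y_pred = ["positive", "positive", "neutral", "negative",
          "positive", "neutral", "negative", "positive"]

accuracy, metrics = per_class_metrics(
    y_true, y_pred, ["positive", "negative", "neutral"]
)
print(f"accuracy: {accuracy:.3f}")
for label, scores in metrics.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Per-class scores matter here because most reviews are positive: a model can reach a high overall accuracy while still performing poorly on the negative and neutral classes.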
37. Conclusion
1. The majority of the reviews (59%) were rated 5 out of 5, indicating a high level of customer
satisfaction.
2. Positive sentiment was the most common sentiment in the reviews, followed by neutral and
negative sentiment.
3. There was only a very weak positive correlation (0.062) between product price and rate, so
price had little linear relationship with the rating customers gave.
4. The most frequently used words in positive reviews included "good", "great", "love", and
"amazing", while the most frequently used words in negative reviews included "bad",
"terrible", "waste", and "disappointed".
5. The topic modeling analysis identifies several key topics in the reviews, including product
quality, customer service, value for money, and shipping.
6. The Multinomial Naive Bayes classifier achieves an accuracy of around 70% on both count
vectorizer and TF-IDF vectorizer, suggesting that it is a suitable model for sentiment analysis
on this dataset.
7. The Random Forest classifier achieves an accuracy of around 75% on both count vectorizer
and TF-IDF vectorizer, outperforming the Multinomial Naive Bayes classifier.
9. The XGBoost classifier achieves an accuracy of around 80% on the TF-IDF vectorizer,
outperforming both the Multinomial Naive Bayes and Random Forest classifiers.
10. Hyperparameter tuning further improves the performance of the XGBoost classifier, achieving
an accuracy of around 85% on the TF-IDF vectorizer.
11. The analysis suggests that customers tend to be more satisfied with products that are of good
quality, offer good value for money, and have a good customer service experience.
12. The insights gained from this project can be used by Flipkart to make data-driven decisions to
improve its business and provide a better customer experience.