The document discusses evaluating the scalability of the Naive Bayes classifier for sentiment analysis on large datasets. It presents the Naive Bayes classification method, which uses Bayes' theorem with independence assumptions between features. It then describes implementing Naive Bayes in Hadoop for sentiment classification of movie reviews at scale, including preprocessing data, calculating word frequencies, and predicting sentiment. An experimental study tested the implementation on a Hadoop cluster with over 1,000 positive and 1,000 negative reviews for training.
IEEE Big Data Conference 2013: Naive Bayes Sentiment Classification
1. 2013 IEEE International Conference on Big Data
Scalable Sentiment Classification for Big
Data Analysis Using Naive Bayes Classifier
Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen
3. Introduction
A typical method to obtain valuable information is
to extract the sentiment or opinion from a message.
This paper aims to evaluate the scalability of the
Naive Bayes classifier (NBC) on large datasets.
4. Introduction
NBC is able to scale up to analyze the sentiment of
millions of movie reviews with increasing throughput.
The accuracy of NBC improves and approaches 82%.
5. Naive Bayes Classification
Naive Bayes classifiers are simple probabilistic
classifiers based on applying Bayes' theorem with
strong (naive) independence assumptions between
the features.
A popular method for text categorization
(the problem of judging documents as belonging to one
category or another).
12. Naive Bayes Classification
N is the total number of documents, Nc is the number
of documents in class c, and
Nwi is the frequency of a word wi in class c.
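The slide that carries the formulas is not part of this transcript, so the following is only a minimal sketch of how the counts N, Nc and Nwi are commonly turned into probabilities in a multinomial Naive Bayes model. The add-one (Laplace) smoothing and the names `prior`, `word_likelihood`, `total_words_in_class` and `vocab_size` are assumptions made for illustration, not taken from the slides.

```python
def prior(n_c: int, n: int) -> float:
    """Class prior P(c), estimated as Nc / N."""
    return n_c / n


def word_likelihood(n_wi: int, total_words_in_class: int, vocab_size: int) -> float:
    """P(wi | c) with add-one smoothing (an assumption, not from the slides).

    n_wi is the frequency of word wi in class c; the denominator adds the
    vocabulary size so that unseen words still get a non-zero probability.
    """
    return (n_wi + 1) / (total_words_in_class + vocab_size)
```

With 1,000 documents in a class out of 2,000 in total, `prior(1000, 2000)` gives 0.5.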
15. Implementation of Naive Bayes in Hadoop
(word,posSum,negSum)
the frequency of each word across all positive and all negative documents
e.g. (excellent,1000,10)
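As a rough illustration of how the (word,posSum,negSum) records could be produced, here is a single-machine sketch of the aggregation. It is not the Hadoop job itself: the function name `word_class_sums`, the whitespace tokenizer and the `"pos"`/`"neg"` labels are assumptions made for the example.

```python
from collections import defaultdict


def word_class_sums(labeled_docs):
    """Aggregate word frequencies per class.

    labeled_docs: iterable of (label, text) pairs, label is "pos" or "neg".
    Returns a sorted list of (word, posSum, negSum) tuples.
    """
    sums = defaultdict(lambda: [0, 0])  # word -> [posSum, negSum]
    for label, text in labeled_docs:
        idx = 0 if label == "pos" else 1
        for word in text.lower().split():  # naive whitespace tokenizer
            sums[word][idx] += 1
    return [(w, p, n) for w, (p, n) in sorted(sums.items())]


training = [("pos", "excellent excellent film"), ("neg", "terrible film")]
model_rows = word_class_sums(training)
```

In a MapReduce setting the inner loop would correspond to the map phase (emit one count per word and class) and the summation to the reduce phase.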
16. Implementation of Naive Bayes in Hadoop
(word,posSum,negSum) joined with (word,count,docID) on word
gives (docID,count,word,posSum,negSum)
e.g. (excellent,1000,10) joined with (excellent,20,5)
gives (5,20,excellent,1000,10)
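The record shapes on this slide suggest a join on the word key between the trained model and the per-document word counts. A minimal in-memory sketch of that join follows; the function name `join_with_model` is hypothetical, and silently dropping words that are absent from the model is an assumption, since the slides do not say how such words are handled.

```python
def join_with_model(model_rows, test_rows):
    """Join (word, posSum, negSum) with (word, count, docID) on word.

    Returns (docID, count, word, posSum, negSum) tuples.
    Words missing from the model are skipped (an assumption).
    """
    model = {word: (pos_sum, neg_sum) for word, pos_sum, neg_sum in model_rows}
    joined = []
    for word, count, doc_id in test_rows:
        if word in model:
            pos_sum, neg_sum = model[word]
            joined.append((doc_id, count, word, pos_sum, neg_sum))
    return joined


joined = join_with_model([("excellent", 1000, 10)], [("excellent", 20, 5)])
```

For the slide's example records this yields `[(5, 20, "excellent", 1000, 10)]`.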
17. Implementation of Naive Bayes in Hadoop
(docID,count,word,posSum,negSum)
(5,10,excellent,20,5)
(5,2,terrible,5,20)
positive score: 10 × log(20) + 2 × log(5)
negative score: 10 × log(5) + 2 × log(20)
(docID,predict,correct)
(5,pos,true)
(6,neg,false)
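The two log sums on this slide can be reproduced with a short sketch: each class score is the sum over a document's words of count × log(class frequency), and the larger score wins. The function name `classify_doc` is hypothetical. The sketch omits class priors, as the slide's sums do, assumes every frequency is greater than zero (for example after smoothing), and breaks ties in favor of "pos" arbitrarily.

```python
import math


def classify_doc(rows, true_label):
    """Score one document from (docID, count, word, posSum, negSum) rows.

    Returns (docID, predict, correct), matching the slide's output record.
    Assumes all rows share one docID and all sums are > 0.
    """
    doc_id = rows[0][0]
    pos_score = sum(count * math.log(pos_sum) for _, count, _, pos_sum, _ in rows)
    neg_score = sum(count * math.log(neg_sum) for _, count, _, _, neg_sum in rows)
    predict = "pos" if pos_score >= neg_score else "neg"  # tie goes to "pos"
    return (doc_id, predict, predict == true_label)


rows = [(5, 10, "excellent", 20, 5), (5, 2, "terrible", 5, 20)]
result = classify_doc(rows, "pos")
```

For the slide's document 5, the positive score 10 × log(20) + 2 × log(5) exceeds the negative score 10 × log(5) + 2 × log(20), so the prediction is "pos".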
18. Experimental study
The Hadoop cluster has 7 nodes:
one name node and six data nodes.
Each VM is allocated two virtual CPUs and 4 GB of memory.
The host is a Dell server with 12 Intel Xeon E5-2630
2.3 GHz cores and 32 GB of memory.
Xen Cloud Platform (XCP) 1.6 is used as the hypervisor.