Statistical Learning for Text Classification
with scikit-learn and NLTK

Olivier Grisel
1. Statistical Learning for Text Classification with scikit-learn and NLTK
   Olivier Grisel (http://twitter.com/ogrisel), PyCon 2011
2. Outline
   ● Why text classification?
   ● What is text classification?
   ● How?
     ● scikit-learn
     ● NLTK
     ● Google Prediction API
   ● Some results
3. Applications of Text Classification

   Task                                      Predicted outcome
   Spam filtering                            Spam, Ham, Priority
   Language guessing                         English, Spanish, French, ...
   Sentiment Analysis for Product Reviews    Positive, Neutral, Negative
   News Feed Topic Categorization            Politics, Business, Technology, Sports, ...
   Pay-per-click optimal ads placement       Will yield clicks / Won't
   Recommender systems                       Will I buy this book? / I won't
4. Supervised Learning Overview
   ● Convert training data to a set of feature vectors
   ● Build a model based on the statistical properties of the features
     in the training set, e.g.
     ● Naïve Bayesian Classifier
     ● Logistic Regression
     ● Support Vector Machines
   ● For each new text document to classify:
     ● Extract features
     ● Ask the model to predict the most likely outcome
5. Summary
   [Diagram: training documents (text, images, sounds, ...) are converted
   to feature vectors and fed, together with their labels, to a machine
   learning algorithm; for a new document, the resulting predictive model
   maps its feature vector to an expected label.]
6. Bags of Words
   ● Tokenize document: list of uni-grams
     ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
   ● Binary occurrences / counts: {'the': True, 'quick': True...}
   ● Frequencies: {'the': 0.22, 'quick': 0.11, 'brown': 0.11, 'fox': 0.11…}
   ● TF-IDF: {'the': 0.001, 'quick': 0.05, 'brown': 0.06, 'fox': 0.24…}
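   As a rough illustration of these representations, a minimal
   standard-library sketch (not from the slides; tokenization here is
   naive whitespace splitting):

       from collections import Counter

       tokens = "the quick brown fox jumps over the lazy dog".split()
       occurrences = dict((tok, True) for tok in tokens)   # binary occurrences
       counts = Counter(tokens)                            # raw counts
       total = float(len(tokens))
       # term frequencies: 'the' appears 2 times in 9 tokens, i.e. ~0.22
       frequencies = dict((tok, c / total) for tok, c in counts.items())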
7. Better than frequencies: TF-IDF
   ● Term Frequency
   ● Inverse Document Frequency
   ● Non-informative words such as “the” are scaled down
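   The formula images on this slide did not survive extraction; one common
   formulation (among several variants), in LaTeX, is:

       \mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}
       \qquad
       \mathrm{idf}(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|}
       \qquad
       \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

   where $n_{t,d}$ is the number of occurrences of term $t$ in document $d$
   and $|D|$ is the number of documents in the corpus.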
8. Even better features
   ● bi-grams of words:
     ● “New York”, “very bad”, “not good”
   ● n-grams of chars:
     ● “the”, “ed ”, “ a ” (useful for language guessing)
   ● Combine with:
     ● Binary occurrences
     ● Frequencies
     ● TF-IDF
9. scikit-learn

10. scikit-learn
    ● BSD
    ● numpy / scipy / cython / C++ wrappers
    ● Many state-of-the-art implementations
    ● A new release every 3 months
    ● 17 contributors on release 0.7
    ● Not just for text classification
11. Features Extraction in scikit-learn

    from scikits.learn.features.text import WordNGramAnalyzer

    # French: "I ate kangaroo at lunch, it wasn't very good."
    text = (u"J'ai mang\xe9 du kangourou ce midi,"
            u" c'\xe9tait pas tr\xe8s bon.")

    WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
    [u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi', u'etait',
     u'pas', u'tres', u'bon', u'ai mange', u'mange du', u'du kangourou',
     u'kangourou ce', u'ce midi', u'midi etait', u'etait pas',
     u'pas tres', u'tres bon']
12. Features Extraction in scikit-learn

    from scikits.learn.features.text import CharNGramAnalyzer

    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    char_ngrams = analyzer.analyze(text)
    print char_ngrams[:5] + char_ngrams[-5:]
    [u"j'a", u"'ai", u'ai ', u'i m', u' ma',
     u's tres', u' tres ', u'tres b', u'res bo', u'es bon']
13. TF-IDF features & SVMs

    import pickle

    from scikits.learn.features.text.sparse import Vectorizer
    from scikits.learn.svm.sparse import LinearSVC

    vec = Vectorizer(analyzer=analyzer)
    features = vec.fit_transform(list_of_documents)
    clf = LinearSVC(C=100).fit(features, labels)

    # trained models survive a pickle round-trip, hence are deployable
    clf2 = pickle.loads(pickle.dumps(clf))
    predicted_labels = clf2.predict(features_of_new_docs)
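    Note: the scikits.learn import paths above date from the 0.7 release
    shown in these 2011 slides and no longer exist. A rough sketch of the
    same pipeline under current scikit-learn naming (list_of_documents,
    labels and new_documents are placeholders, as above):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC

        vec = TfidfVectorizer()
        features = vec.fit_transform(list_of_documents)  # sparse TF-IDF matrix
        clf = LinearSVC(C=100).fit(features, labels)
        predicted_labels = clf.predict(vec.transform(new_documents))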
14. [Diagram: the summary pipeline of slide 5 annotated with API calls:
    vec.transform(docs) turns training documents into feature vectors X,
    clf.fit(X, y) trains the model on X and the labels y, and
    clf.predict(vec.transform(docs_new)) returns the expected labels for
    new documents.]
15. NLTK
    ● Code: ASL 2.0 & Book: CC-BY-NC-ND
    ● Tokenizers, Stemmers, Parsers, Classifiers, Clusterers, Corpus Readers
16. NLTK Corpus Downloader

    >>> import nltk
    >>> nltk.download()
17. Using an NLTK corpus

    >>> from nltk.corpus import movie_reviews as reviews
    >>> pos_ids = reviews.fileids('pos')
    >>> neg_ids = reviews.fileids('neg')
    >>> len(pos_ids), len(neg_ids)
    (1000, 1000)
    >>> reviews.words(pos_ids[0])
    ['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
18. Common data cleanup operations
    ● Lower-case & remove accented chars:

      import unicodedata
      s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
                  if unicodedata.category(c) != 'Mn')

    ● Extract only word tokens of at least 2 chars
      ● Using NLTK tokenizers & stemmers
      ● Using a simple regexp:

        re.compile(r"\b\w\w+\b", re.U).findall(s)
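    Putting the two steps together as one helper; a minimal sketch, with
    clean_tokens as a hypothetical name:

        import re
        import unicodedata

        def clean_tokens(s):
            # lower-case, NFD-decompose, drop combining marks (accents)
            s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
                        if unicodedata.category(c) != 'Mn')
            # keep only word tokens of at least 2 characters
            return re.findall(r"\b\w\w+\b", s, re.U)

        # clean_tokens(u"J'ai mang\xe9 du kangourou")
        # -> [u'ai', u'mange', u'du', u'kangourou']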
19. Feature Extraction with NLTK: Unigram features

    def word_features(words):
        return dict((word, True) for word in words)
20. Feature Extraction with NLTK: Bigram Collocations

    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures as BAM
    from itertools import chain

    def bigram_features(words, score_fn=BAM.chi_sq):
        bg_finder = BigramCollocationFinder.from_words(words)
        bigrams = bg_finder.nbest(score_fn, 100000)
        return dict((bg, True) for bg in chain(words, bigrams))
21. The NLTK Naïve Bayes Classifier

    from nltk.classify import NaiveBayesClassifier

    # features is word_features or bigram_features from the previous slides
    neg_examples = [(features(reviews.words(i)), 'neg') for i in neg_ids]
    pos_examples = [(features(reviews.words(i)), 'pos') for i in pos_ids]
    train_set = pos_examples + neg_examples

    classifier = NaiveBayesClassifier.train(train_set)
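    A small sketch of how such a classifier is typically evaluated with
    nltk.classify.util.accuracy; the 750/250 split below is illustrative,
    not from the slides:

        from nltk.classify.util import accuracy

        # hold out the last 250 documents of each class for testing
        train_set = pos_examples[:750] + neg_examples[:750]
        test_set = pos_examples[750:] + neg_examples[750:]
        classifier = NaiveBayesClassifier.train(train_set)
        print(accuracy(classifier, test_set))  # ~0.7 with unigram features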
22. Most informative features

    >>> classifier.show_most_informative_features()
     magnificent = True    pos : neg = 15.0 : 1.0
     outstanding = True    pos : neg = 13.6 : 1.0
       insulting = True    neg : pos = 13.0 : 1.0
      vulnerable = True    pos : neg = 12.3 : 1.0
       ludicrous = True    neg : pos = 11.8 : 1.0
          avoids = True    pos : neg = 11.7 : 1.0
     uninvolving = True    neg : pos = 11.7 : 1.0
      astounding = True    pos : neg = 10.3 : 1.0
     fascination = True    pos : neg = 10.3 : 1.0
         idiotic = True    neg : pos =  9.8 : 1.0
23. Training NLTK classifiers
    ● Try nltk-trainer
    ● python train_classifier.py --instances paras
        --classifier NaiveBayes --bigrams
        --min_score 3 movie_reviews
24. REST services

25. NLTK – Online demos
26. NLTK – REST APIs

    % curl -d "text=Inception is the best movie ever" \
        http://text-processing.com/api/sentiment/
    {
      "probability": {
        "neg": 0.36647424288117808,
        "pos": 0.63352575711882186
      },
      "label": "pos"
    }
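    The same call from Python 3's standard library, assuming the
    text-processing.com endpoint is still reachable:

        import json
        import urllib.parse
        import urllib.request

        data = urllib.parse.urlencode(
            {"text": "Inception is the best movie ever"}).encode()
        with urllib.request.urlopen(
                "http://text-processing.com/api/sentiment/", data=data) as resp:
            print(json.loads(resp.read()))  # {'probability': ..., 'label': 'pos'}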
27. Google Prediction API
28. Typical performance results: movie reviews
    ● nltk:
      ● unigram occurrences
      ● Naïve Bayesian Classifier: ~ 70%
    ● Google Prediction API: ~ 83%
    ● scikit-learn:
      ● TF-IDF unigram features
      ● LinearSVC: ~ 87%
    ● nltk:
      ● Collocation feature selection
      ● Naïve Bayesian Classifier: ~ 97%
29. Typical results: newsgroups topic classification
    ● 20 newsgroups dataset
      ● ~ 19K short text documents
      ● 20 categories
      ● By-date train / test split
    ● Bigram TF-IDF + LinearSVC: ~ 87%
30. Confusion Matrix (20 newsgroups)
    00 alt.atheism
    01 comp.graphics
    02 comp.os.ms-windows.misc
    03 comp.sys.ibm.pc.hardware
    04 comp.sys.mac.hardware
    05 comp.windows.x
    06 misc.forsale
    07 rec.autos
    08 rec.motorcycles
    09 rec.sport.baseball
    10 rec.sport.hockey
    11 sci.crypt
    12 sci.electronics
    13 sci.med
    14 sci.space
    15 soc.religion.christian
    16 talk.politics.guns
    17 talk.politics.mideast
    18 talk.politics.misc
    19 talk.religion.misc
    [The confusion matrix figure itself did not survive extraction.]
31. Typical results: Language Identification
    ● 15 Wikipedia articles
    ● [p.text_content() for p in html_tree.findall('//p')]
    ● CharNGramAnalyzer(min_n=1, max_n=3)
    ● TF-IDF
    ● LinearSVC
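    A hedged sketch of the same pipeline with the current scikit-learn API
    (paragraphs and language_labels are hypothetical placeholders):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC

        # char n-grams of length 1-3 with TF-IDF weighting
        vec = TfidfVectorizer(analyzer='char', ngram_range=(1, 3))
        X = vec.fit_transform(paragraphs)          # list of unicode strings
        clf = LinearSVC().fit(X, language_labels)  # e.g. 'en', 'fr', 'de', ...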
32. Typical results: Language Identification
    [This slide's results figure did not survive extraction.]
33. Scaling to many possible outcomes
    ● Example: possible outcomes are all the categories of Wikipedia (565,108)
    ● From Document Categorization to Information Retrieval:
      ● Fulltext index for TF-IDF similarity queries
      ● Smart way to find the top 30 search keywords
      ● Use Apache Lucene / Solr MoreLikeThisQuery
34. Some pointers
    ● http://scikit-learn.sf.net (doc & examples)
      http://github.com/scikit-learn (code)
    ● http://www.nltk.org (code, doc & PDF book)
    ● http://streamhacker.com/
      ● Jacob Perkins' blog on NLTK & APIs
    ● https://github.com/japerk/nltk-trainer
    ● http://www.slideshare.net/ogrisel (these slides)
    ● http://twitter.com/ogrisel / http://github.com/ogrisel

    Questions?