Statistical Learning for Text Classification
         with scikit-learn and NLTK

                Olivier Grisel
         http://twitter.com/ogrisel

              PyCon – 2011
Outline
●   Why text classification?
●   What is text classification?
●   How?
    ●   scikit-learn
    ●   NLTK
    ●   Google Prediction API
●   Some results
Applications of Text Classification
             Task                     Predicted outcome

Spam filtering                  Spam, Ham, Priority

Language guessing               English, Spanish, French, ...

Sentiment Analysis for Product
                               Positive, Neutral, Negative
Reviews
News Feed Topic                 Politics, Business, Technology,
Categorization                  Sports, ...
Pay-per-click optimal ads
                                Will yield clicks / Won't
placement
Recommender systems             Will I buy this book? / I won't
Supervised Learning Overview
●   Convert training data to a set of vectors of features
●   Build a model based on the statistical properties of
    features in the training set, e.g.
    ●   Naïve Bayesian Classifier
    ●   Logistic Regression
    ●   Support Vector Machines
●   For each new text document to classify
    ●   Extract features
    ●   Ask the model to predict the most likely outcome
Summary
[Diagram] Training: text documents, images, sounds... -> feature vectors; feature vectors + labels -> machine learning algorithm
[Diagram] Prediction: new text document, image, sound -> features vector -> predictive model -> expected label
Bags of Words
●   Tokenize document: list of uni-grams
      ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
●   Binary occurrences / counts:
      {'the': True, 'quick': True...}
●   Frequencies:
       {'the': 0.22, 'quick': 0.11, 'brown': 0.11, 'fox': 0.11…}
●   TF-IDF
      {'the': 0.001, 'quick': 0.05, 'brown': 0.06, 'fox': 0.24…}
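A minimal sketch of the first three representations above, using only the Python standard library (the variable names are illustrative and not from the slides):

```python
from collections import Counter

# Tokenize: split the document into uni-grams
tokens = "the quick brown fox jumps over the lazy dog".split()

# Binary occurrences: is the word present at all?
occurrences = {word: True for word in tokens}

# Counts: how many times does each word appear?
counts = Counter(tokens)

# Frequencies: counts normalised by the document length
total = len(tokens)
frequencies = {word: n / total for word, n in counts.items()}

print(counts["the"])                  # 'the' appears twice
print(round(frequencies["the"], 2))   # 2 / 9
print(round(frequencies["quick"], 2)) # 1 / 9
```

The frequencies match the values shown on the slide ('the': 0.22, 'quick': 0.11); TF-IDF needs more than one document and is covered separately.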
Better than frequencies: TF-IDF
●   Term Frequency



●   Inverse Document Frequency




●   Non-informative words such as “the” are scaled down
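The formulas on this slide did not survive extraction, so the sketch below assumes one common textbook variant: tf = count / document length and idf = log(N / document frequency). Libraries typically use smoothed variants, so exact values will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (n / len(doc)) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return weighted

# Toy corpus, made up for illustration
docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the fox".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus “the” occurs in every document, so its idf is log(3/3) = 0 and its weight vanishes, while “quick” (one document) outweighs “fox” (two documents).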
Even better features
●   bi-grams of words:
    ●   “New York”, “very bad”, “not good”
●   n-grams of chars:
    ●   “the”, “ed ”, “ a ” (useful for language guessing)
●   Combine with:
    ●   Binary occurrences
    ●   Frequencies
    ●   TF-IDF
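Both kinds of n-grams can be sketched in a few lines of plain Python. The helper names `word_ngrams` and `char_ngrams` are hypothetical and are not the scikit-learn analyzers shown later.

```python
def word_ngrams(tokens, n):
    """Join every run of n consecutive tokens into one feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, min_n, max_n):
    """Return all character substrings of length min_n..max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

bigrams = word_ngrams("this is not good".split(), 2)
trigrams = char_ngrams("the cat", 3, 3)

print(bigrams)
print(trigrams)
```

The resulting lists can then be turned into binary occurrences, frequencies or TF-IDF weights exactly like uni-grams.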
scikit-learn
●   BSD
●   numpy / scipy / cython / c++ wrappers
●   Many state of the art implementations
●   A new release every 3 months
●   17 contributors on release 0.7
●   Not just for text classification
Features Extraction in scikit-learn
from scikits.learn.features.text import WordNGramAnalyzer
text = (u"J'ai mang\xe9 du kangourou ce midi,"
        u" c'\xe9tait pas tr\xe8s bon.")

WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
[u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi', u'etait',
 u'pas', u'tres', u'bon', u'ai mange', u'mange du', u'du kangourou',
 u'kangourou ce', u'ce midi', u'midi etait', u'etait pas', u'pas tres',
 u'tres bon']
Features Extraction in scikit-learn
from scikits.learn.features.text import CharNGramAnalyzer


analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
char_ngrams = analyzer.analyze(text)


print char_ngrams[:5] + char_ngrams[-5:]
[u"j'a", u"'ai", u'ai ', u'i m', u' ma', u's tres', u' tres ', u'tres b',
u'res bo', u'es bon']
TF-IDF features & SVMs
import pickle

from scikits.learn.features.text.sparse import Vectorizer
from scikits.learn.sparse.svm.sparse import LinearSVC

# analyzer is the CharNGramAnalyzer built on the previous slide
vec = Vectorizer(analyzer=analyzer)
features = vec.fit_transform(list_of_documents)

clf = LinearSVC(C=100).fit(features, labels)

# Round-trip through pickle to show that a fitted model can be persisted
clf2 = pickle.loads(pickle.dumps(clf))
predicted_labels = clf2.predict(features_of_new_docs)
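The import paths above are those of the 2011 `scikits.learn` package. As a hedged sketch, the same flow in a recent scikit-learn release might look like the following; `TfidfVectorizer` stands in for the slide's `Vectorizer`, and the four toy documents and their labels are made up for illustration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training data (hypothetical, not from the talk)
docs = [
    "great movie, loved it",
    "wonderful and moving film",
    "terrible plot, awful acting",
    "boring and bad movie",
]
labels = ["pos", "pos", "neg", "neg"]

# Character n-grams of length 3 to 6, weighted by TF-IDF
vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 6))
features = vec.fit_transform(docs)

clf = LinearSVC(C=100).fit(features, labels)

# A fitted model survives a pickle round-trip
clf2 = pickle.loads(pickle.dumps(clf))

# New documents must go through the same fitted vectorizer
new_features = vec.transform(["awful boring film"])
predicted = clf2.predict(new_features)
print(predicted)
```

Note that new documents are passed to `vec.transform`, not `fit_transform`, so that they are projected onto the vocabulary learned from the training set.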
[Diagram] Training: text documents, images, sounds... -> vec.transform(docs) -> feature vectors; feature vectors + labels -> clf.fit(X, y) -> machine learning algorithm
[Diagram] Prediction: new text document, image, sound -> vec.transform(docs_new) -> features vector -> clf.predict(X_new) on the predictive model -> expected label
NLTK
●   Code: ASL 2.0 & Book: CC-BY-NC-ND
●   Tokenizers, Stemmers, Parsers, Classifiers,
    Clusterers, Corpus Readers
NLTK Corpus Downloader
>>> import nltk
>>> nltk.download()
Using an NLTK corpus
>>> from nltk.corpus import movie_reviews as reviews


>>> pos_ids = reviews.fileids('pos')
>>> neg_ids = reviews.fileids('neg')
>>> len(pos_ids), len(neg_ids)
(1000, 1000)


>>> reviews.words(pos_ids[0])
['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
Common data cleanup operations
●   Lowercase & strip accents:
import unicodedata
s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
             if unicodedata.category(c) != 'Mn')
●   Extract only word tokens of at least 2 chars
    ●   Using NLTK tokenizers & stemmers
    ●   Using a simple regexp:
        re.compile(r"\b\w\w+\b", re.U).findall(s)
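The two cleanup steps can be combined into one small tokenizer. This is a Python 3 sketch (string patterns are Unicode-aware by default, so `re.U` is not needed); the function names are illustrative.

```python
import re
import unicodedata

# Words of at least 2 characters
TOKEN_RE = re.compile(r"\b\w\w+\b")

def strip_accents(s):
    """Decompose characters (NFD) and drop the combining marks (Mn)."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def tokenize(s):
    """Lowercase, strip accents, then extract word tokens."""
    return TOKEN_RE.findall(strip_accents(s.lower()))

tokens = tokenize("J'ai mang\u00e9 du kangourou ce midi")
print(tokens)
```

The single-character token "j" is dropped by the regexp, which matches the output of `WordNGramAnalyzer` shown earlier.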
Feature Extraction with NLTK
                   Unigram features




def word_features(words):
   return dict((word, True) for word in words)
Feature Extraction with NLTK
                     Bigram Collocations
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures as BAM
from itertools import chain


def bigram_features(words, score_fn=BAM.chi_sq):
   bg_finder = BigramCollocationFinder.from_words(words)
   bigrams = bg_finder.nbest(score_fn, 100000)
   return dict((bg, True) for bg in chain(words, bigrams))
The NLTK Naïve Bayes Classifier
from nltk.classify import NaiveBayesClassifier

# features is one of the extractors defined on the previous slides
features = word_features  # or bigram_features

neg_examples = [(features(reviews.words(i)), 'neg') for i in neg_ids]
pos_examples = [(features(reviews.words(i)), 'pos') for i in pos_ids]
train_set = pos_examples + neg_examples

classifier = NaiveBayesClassifier.train(train_set)
Most informative features
>>> classifier.show_most_informative_features()
     magnificent = True              pos : neg    =     15.0 : 1.0
     outstanding = True              pos : neg    =     13.6 : 1.0
       insulting = True              neg : pos    =     13.0 : 1.0
      vulnerable = True              pos : neg    =     12.3 : 1.0
       ludicrous = True              neg : pos    =     11.8 : 1.0
          avoids = True              pos : neg    =     11.7 : 1.0
     uninvolving = True              neg : pos    =     11.7 : 1.0
      astounding = True              pos : neg    =     10.3 : 1.0
     fascination = True              pos : neg    =     10.3 : 1.0
         idiotic = True              neg : pos    =      9.8 : 1.0
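Each ratio above compares how likely a feature is under one label versus the other. The sketch below assumes additive smoothing with 0.5 pseudo-counts over two feature values (present / absent); the exact estimator NLTK applies may differ, and the document counts are made up.

```python
def likelihood_ratio(n_with_a, n_a, n_with_b, n_b, gamma=0.5, bins=2):
    """P(feature | label a) / P(feature | label b) with additive smoothing.

    n_with_x: documents of label x that contain the feature
    n_x:      total documents of label x
    gamma, bins: smoothing parameters (assumed, see the note above)
    """
    p_a = (n_with_a + gamma) / (n_a + gamma * bins)
    p_b = (n_with_b + gamma) / (n_b + gamma * bins)
    return p_a / p_b

# Hypothetical counts: a word seen in 29 of 1000 negative reviews
# and in 2 of 1000 positive reviews
ratio = likelihood_ratio(29, 1000, 2, 1000)
print(round(ratio, 1))
```

With these counts the ratio is 29.5 / 2.5 = 11.8, which would be reported as "neg : pos = 11.8 : 1.0".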
Training NLTK classifiers
●   Try nltk-trainer

●   python train_classifier.py --instances paras \
    --classifier NaiveBayes --bigrams \
    --min_score 3 movie_reviews
REST services
NLTK – Online demos
NLTK – REST APIs
% curl -d "text=Inception is the best movie ever" \
            http://text-processing.com/api/sentiment/


{
    "probability": {
         "neg": 0.36647424288117808,
         "pos": 0.63352575711882186
    },
    "label": "pos"
}
Google Prediction API
Typical performance results: movie reviews
●   nltk: unigram occurrences + Naïve Bayesian Classifier               ~ 70%
●   Google Prediction API                                               ~ 83%
●   scikit-learn: TF-IDF unigram features + LinearSVC                   ~ 87%
●   nltk: collocation feature selection + Naïve Bayesian Classifier     ~ 97%
Typical results:
    newsgroups topics classification

●   20 newsgroups dataset
    ●   ~ 19K short text documents
    ●   20 categories
    ●   By date train / test split

●   Bigram TF-IDF + LinearSVC:       ~ 87%
Confusion Matrix (20 newsgroups)
00 alt.atheism
01 comp.graphics
02 comp.os.ms-windows.misc
03 comp.sys.ibm.pc.hardware
04 comp.sys.mac.hardware
05 comp.windows.x
06 misc.forsale
07 rec.autos
08 rec.motorcycles
09 rec.sport.baseball
10 rec.sport.hockey
11 sci.crypt
12 sci.electronics
13 sci.med
14 sci.space
15 soc.religion.christian
16 talk.politics.guns
17 talk.politics.mideast
18 talk.politics.misc
19 talk.religion.misc
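The matrix itself was an image and is not reproduced here. As a reminder of how such a matrix is built, here is a plain-Python sketch on three of the categories, with made-up true and predicted labels.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of documents of class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = ["alt.atheism", "sci.space", "talk.religion.misc"]

# Hypothetical predictions, for illustration only
y_true = ["alt.atheism", "alt.atheism", "sci.space",
          "talk.religion.misc", "talk.religion.misc"]
y_pred = ["alt.atheism", "talk.religion.misc", "sci.space",
          "alt.atheism", "talk.religion.misc"]

cm = confusion_matrix(y_true, y_pred, labels)

# Accuracy is the diagonal mass divided by the number of documents
accuracy = sum(cm[i][i] for i in range(len(labels))) / len(y_true)

for label, row in zip(labels, cm):
    print(label, row)
print(accuracy)
```

Off-diagonal cells show which pairs of categories get mistaken for each other.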
Typical results:
           Language Identification


●   15 Wikipedia articles
●   [p.text_content() for p in html_tree.findall('//p')]
●   CharNGramAnalyzer(min_n=1, max_n=3)
●   TF-IDF
●   LinearSVC
Scaling to many possible outcomes
●   Example: possible outcomes are all the
    categories of Wikipedia (565,108)
●   From Document Categorization
                           to Information Retrieval
●   Fulltext index for TF-IDF similarity queries
●   Smart way to find the top 30 search keywords
●   Use Apache Lucene / Solr MoreLikeThisQuery
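The retrieval idea can be sketched in plain Python: keep the highest-scoring TF-IDF terms of the new document as a query, then rank indexed documents by how many of those terms they contain. This is a toy stand-in for a full-text index, not the Lucene / Solr MoreLikeThis API; the function names and the three-document index are hypothetical.

```python
import math
from collections import Counter

def top_keywords(tokens, df, n_docs, k=30):
    """Keep the k terms of the document with the highest TF-IDF score."""
    counts = Counter(tokens)
    scores = {t: c * math.log(n_docs / df[t])
              for t, c in counts.items() if t in df}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

def suggest_categories(tokens, index, k=30, top_n=2):
    """Rank the categories of indexed documents by keyword overlap."""
    n_docs = len(index)
    df = Counter(t for _, doc in index for t in set(doc))
    query = set(top_keywords(tokens, df, n_docs, k))
    votes = Counter()
    for category, doc in index:
        overlap = len(query & set(doc))
        if overlap:
            votes[category] += overlap
    return [c for c, _ in votes.most_common(top_n)]

# Tiny made-up index of (category, tokens) pairs
index = [
    ("Astronomy", "planet star orbit telescope".split()),
    ("Cooking", "recipe oven flour sugar".split()),
    ("Spaceflight", "rocket orbit launch astronaut".split()),
]
new_doc = "the rocket reached orbit after launch".split()
suggested = suggest_categories(new_doc, index)
print(suggested)
```

Because the query is a short list of keywords, the cost depends on the index lookup rather than on the number of possible categories.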
Some pointers
●   http://scikit-learn.sf.net               doc & examples
    http://github.com/scikit-learn                     code
●   http://www.nltk.org              code & doc & PDF book
●   http://streamhacker.com/
    ●   Jacob Perkins' blog on NLTK & APIs
●   https://github.com/japerk/nltk-trainer
●   http://www.slideshare.net/ogrisel             these slides
●   http://twitter.com/ogrisel / http://github.com/ogrisel

                             Questions?

More Related Content

What's hot

support vector machine 1.pptx
support vector machine 1.pptxsupport vector machine 1.pptx
support vector machine 1.pptxsurbhidutta4
 
Best Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar DatabasesBest Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar DatabasesDATAVERSITY
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkDatabricks
 
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Edureka!
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine LearningJoel Graff
 
Deep learning crash course
Deep learning crash courseDeep learning crash course
Deep learning crash courseVishwas N
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.butest
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine LearningKnoldus Inc.
 
SRV308 Deep Dive on Amazon Aurora
SRV308 Deep Dive on Amazon AuroraSRV308 Deep Dive on Amazon Aurora
SRV308 Deep Dive on Amazon AuroraAmazon Web Services
 
Migrating monolithic applications with the strangler pattern - FSV303 - New Y...
Migrating monolithic applications with the strangler pattern - FSV303 - New Y...Migrating monolithic applications with the strangler pattern - FSV303 - New Y...
Migrating monolithic applications with the strangler pattern - FSV303 - New Y...Amazon Web Services
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxElasticsearch
 

What's hot (20)

Vista lógica
Vista lógicaVista lógica
Vista lógica
 
support vector machine 1.pptx
support vector machine 1.pptxsupport vector machine 1.pptx
support vector machine 1.pptx
 
Best Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar DatabasesBest Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar Databases
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache Spark
 
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
 
Introduction to Sagemaker
Introduction to SagemakerIntroduction to Sagemaker
Introduction to Sagemaker
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine Learning
 
Deep learning crash course
Deep learning crash courseDeep learning crash course
Deep learning crash course
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
 
SRV308 Deep Dive on Amazon Aurora
SRV308 Deep Dive on Amazon AuroraSRV308 Deep Dive on Amazon Aurora
SRV308 Deep Dive on Amazon Aurora
 
Migrating monolithic applications with the strangler pattern - FSV303 - New Y...
Migrating monolithic applications with the strangler pattern - FSV303 - New Y...Migrating monolithic applications with the strangler pattern - FSV303 - New Y...
Migrating monolithic applications with the strangler pattern - FSV303 - New Y...
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 
LSTM
LSTMLSTM
LSTM
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolbox
 

Similar to Statistical Machine Learning for Text Classification with scikit-learn and NLTK

Statistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnStatistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnOlivier Grisel
 
Simple APIs and innovative documentation
Simple APIs and innovative documentationSimple APIs and innovative documentation
Simple APIs and innovative documentationPyDataParis
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimizationg3_nittala
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata londonkperi
 
Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.Piotr Milanowski
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...Chetan Khatri
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...Positive Hack Days
 
Learning with classification and clustering, neural networks
Learning with classification and clustering, neural networksLearning with classification and clustering, neural networks
Learning with classification and clustering, neural networksShaun D'Souza
 
Software tookits for machine learning and graphical models
Software tookits for machine learning and graphical modelsSoftware tookits for machine learning and graphical models
Software tookits for machine learning and graphical modelsbutest
 
Data Summer Conf 2018, “How we build Computer vision as a service (ENG)” — Ro...
Data Summer Conf 2018, “How we build Computer vision as a service (ENG)” — Ro...Data Summer Conf 2018, “How we build Computer vision as a service (ENG)” — Ro...
Data Summer Conf 2018, “How we build Computer vision as a service (ENG)” — Ro...Provectus
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
TensorFlow in Your Browser
TensorFlow in Your BrowserTensorFlow in Your Browser
TensorFlow in Your BrowserOswald Campesato
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetupsource{d}
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ssSatoshi Kume
 
Tesseract. Recognizing Errors in Recognition Software
Tesseract. Recognizing Errors in Recognition SoftwareTesseract. Recognizing Errors in Recognition Software
Tesseract. Recognizing Errors in Recognition SoftwareAndrey Karpov
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonRalf Gommers
 
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...ETS Asset Management Factory
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdfRohanBorgalli
 

Similar to Statistical Machine Learning for Text Classification with scikit-learn and NLTK (20)

Statistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnStatistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learn
 
Simple APIs and innovative documentation
Simple APIs and innovative documentationSimple APIs and innovative documentation
Simple APIs and innovative documentation
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata london
 
Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Text analysis using python
Text analysis using pythonText analysis using python
Text analysis using python
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
 
Learning with classification and clustering, neural networks
Learning with classification and clustering, neural networksLearning with classification and clustering, neural networks
Learning with classification and clustering, neural networks
 
Software tookits for machine learning and graphical models
Software tookits for machine learning and graphical modelsSoftware tookits for machine learning and graphical models
Software tookits for machine learning and graphical models
 
Data Summer Conf 2018, “How we build Computer vision as a service (ENG)” — Ro...
Data Summer Conf 2018, “How we build Computer vision as a service (ENG)” — Ro...Data Summer Conf 2018, “How we build Computer vision as a service (ENG)” — Ro...
Data Summer Conf 2018, “How we build Computer vision as a service (ENG)” — Ro...
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
TensorFlow in Your Browser
TensorFlow in Your BrowserTensorFlow in Your Browser
TensorFlow in Your Browser
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ss
 
Tesseract. Recognizing Errors in Recognition Software
Tesseract. Recognizing Errors in Recognition SoftwareTesseract. Recognizing Errors in Recognition Software
Tesseract. Recognizing Errors in Recognition Software
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdf
 

More from Olivier Grisel

Strategies and Tools for Parallel Machine Learning in Python
Strategies and Tools for Parallel Machine Learning in PythonStrategies and Tools for Parallel Machine Learning in Python
Strategies and Tools for Parallel Machine Learning in PythonOlivier Grisel
 
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...Olivier Grisel
 
Nuxeo 5.3 and Semantic R&D
Nuxeo 5.3 and Semantic R&DNuxeo 5.3 and Semantic R&D
Nuxeo 5.3 and Semantic R&DOlivier Grisel
 
Hadoop MapReduce - OSDC FR 2009
Hadoop MapReduce - OSDC FR 2009Hadoop MapReduce - OSDC FR 2009
Hadoop MapReduce - OSDC FR 2009Olivier Grisel
 

More from Olivier Grisel (6)

Strategies and Tools for Parallel Machine Learning in Python
Strategies and Tools for Parallel Machine Learning in PythonStrategies and Tools for Parallel Machine Learning in Python
Strategies and Tools for Parallel Machine Learning in Python
 
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
 
Nuxeo Iks 2009 11 13
Nuxeo Iks 2009 11 13Nuxeo Iks 2009 11 13
Nuxeo Iks 2009 11 13
 
Nuxeo 5.3 and Semantic R&D
Nuxeo 5.3 and Semantic R&DNuxeo 5.3 and Semantic R&D
Nuxeo 5.3 and Semantic R&D
 
Hadoop MapReduce - OSDC FR 2009
Hadoop MapReduce - OSDC FR 2009Hadoop MapReduce - OSDC FR 2009
Hadoop MapReduce - OSDC FR 2009
 
Programming the PS3
Programming the PS3Programming the PS3
Programming the PS3
 

Recently uploaded

Radiance Majestic Valasaravakkam Chennai.pdf
Radiance Majestic Valasaravakkam Chennai.pdfRadiance Majestic Valasaravakkam Chennai.pdf
Radiance Majestic Valasaravakkam Chennai.pdfashiyadav24
 
DLF Plots Sriperumbudur in Chennai E Brochure Pdf
DLF Plots Sriperumbudur in Chennai E Brochure PdfDLF Plots Sriperumbudur in Chennai E Brochure Pdf
DLF Plots Sriperumbudur in Chennai E Brochure Pdfashiyadav24
 
MADHUGIRI FARM LAND BROCHURES (11)_compressed (1).pdf
MADHUGIRI FARM LAND BROCHURES (11)_compressed (1).pdfMADHUGIRI FARM LAND BROCHURES (11)_compressed (1).pdf
MADHUGIRI FARM LAND BROCHURES (11)_compressed (1).pdfknoxdigital1
 
_Navigating Inflation's Influence on Commercial Real Estate (CRE) Investing I...
_Navigating Inflation's Influence on Commercial Real Estate (CRE) Investing I..._Navigating Inflation's Influence on Commercial Real Estate (CRE) Investing I...
_Navigating Inflation's Influence on Commercial Real Estate (CRE) Investing I...SyndicationPro, LLC
 
How to Navigate the Eviction Process in Pennsylvania: A Landlord's Guide
How to Navigate the Eviction Process in Pennsylvania: A Landlord's GuideHow to Navigate the Eviction Process in Pennsylvania: A Landlord's Guide
How to Navigate the Eviction Process in Pennsylvania: A Landlord's GuideezLandlordForms
 
Experion Elements Sector 45 Noida_Brochure.pdf.pdf
Experion Elements Sector 45 Noida_Brochure.pdf.pdfExperion Elements Sector 45 Noida_Brochure.pdf.pdf
Experion Elements Sector 45 Noida_Brochure.pdf.pdfkratirudram
 
Brigade Neopolis Kokapet, Hyderabad E- Brochure
Brigade Neopolis Kokapet, Hyderabad E- BrochureBrigade Neopolis Kokapet, Hyderabad E- Brochure
Brigade Neopolis Kokapet, Hyderabad E- Brochurefaheemali990101
 
Affordable and Quality Construction SVTN's Signature Blend of Cost-Efficiency...
Affordable and Quality Construction SVTN's Signature Blend of Cost-Efficiency...Affordable and Quality Construction SVTN's Signature Blend of Cost-Efficiency...
Affordable and Quality Construction SVTN's Signature Blend of Cost-Efficiency...AditiAlishetty
 
Kolte Patil Mirabilis at Horamavu Road, Bangalore E brochure.pdf
Statistical Machine Learning for Text Classification with scikit-learn and NLTK

  • 1. Statistical Learning for Text Classification with scikit-learn and NLTK Olivier Grisel http://twitter.com/ogrisel PyCon – 2011
  • 2. Outline ● Why text classification? ● What is text classification? ● How? ● scikit-learn ● NLTK ● Google Prediction API ● Some results
  • 3. Applications of Text Classification
       Task                                      Predicted outcome
       Spam filtering                            Spam, Ham, Priority
       Language guessing                         English, Spanish, French, ...
       Sentiment Analysis for Product Reviews    Positive, Neutral, Negative
       News Feed Topic Categorization            Politics, Business, Technology, Sports, ...
       Pay-per-click optimal ads placement       Will yield clicks / Won't
       Recommender systems                       Will I buy this book? / I won't
  • 4. Supervised Learning Overview ● Convert training data to a set of vectors of features ● Build a model based on the statistical properties of features in the training set, e.g. ● Naïve Bayesian Classifier ● Logistic Regression ● Support Vector Machines ● For each new text document to classify ● Extract features ● Ask the model to predict the most likely outcome
  • 5. Summary
       Training: Text Documents, Images, Sounds... → features vectors + Labels → Machine Learning Algorithm → Predictive Model
       Prediction: New Text Document, Image, Sound → features vector → Predictive Model → Expected Label
  • 6. Bags of Words ● Tokenize document: list of uni-grams ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'] ● Binary occurrences / counts: {'the': True, 'quick': True...} ● Frequencies: {'the': 0.22, 'quick': 0.11, 'brown': 0.11, 'fox': 0.11…} ● TF-IDF {'the': 0.001, 'quick': 0.05, 'brown': 0.06, 'fox': 0.24…}
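The first three representations on this slide can be sketched with the standard library alone. This is a minimal illustration, not code from the talk; the helper name `bag_of_words` is an assumption made for the example.

```python
from collections import Counter

def bag_of_words(tokens):
    """Return binary occurrences, raw counts and relative frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    occurrences = {tok: True for tok in counts}
    frequencies = {tok: n / total for tok, n in counts.items()}
    return occurrences, counts, frequencies

# Same sentence as on the slide, already lowercased and tokenized.
tokens = "the quick brown fox jumps over the lazy dog".split()
occurrences, counts, frequencies = bag_of_words(tokens)

print(round(frequencies["the"], 2))    # 'the' is 2 of the 9 tokens
print(round(frequencies["quick"], 2))  # 'quick' is 1 of the 9 tokens
```

The rounded frequencies agree with the 0.22 and 0.11 values shown on the slide.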
  • 7. Better than frequencies: TF-IDF ● Term Frequency ● Inverse Document Frequency ● Non-informative words such as “the” are scaled down
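The slide names the two factors without giving a formula. The sketch below assumes one common weighting, term frequency times log(N / document frequency); it is only one of several variants, and library implementations typically add smoothing and normalization. The function name `tf_idf` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (n / len(doc)) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return weights

docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the fox and the dog".split(),
]
weights = tf_idf(docs)
# 'the' occurs in every document, so log(3 / 3) = 0 removes it entirely;
# 'quick' occurs in a single document and gets the largest weight.
print(weights[0])
```

With this variant a word present in every document is weighted to exactly zero, which is the "scaled down" effect described on the slide taken to its limit.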
  • 8. Even better features ● bi-grams of words: ● “New York”, “very bad”, “not good” ● n-grams of chars: ● “the”, “ed ”, “ a ” (useful for language guessing) ● Combine with: ● Binary occurrences ● Frequencies ● TF-IDF
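Both kinds of n-grams reduce to a sliding window over a sequence. The functions below are a plain-Python sketch of that idea; the names `word_ngrams` and `char_ngrams` and their parameters are assumptions for illustration and are not the scikit-learn analyzers.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

def char_ngrams(text, min_n=3, max_n=6):
    """Return all character n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            ngrams.append(text[i:i + n])
    return ngrams

words = word_ngrams("new york is not very bad".split())
chars = char_ngrams("the cat", min_n=3, max_n=3)
print(words)
print(chars)
```

Character n-grams keep spaces, so a window such as "he " also captures word-boundary information, which is part of why they help with language guessing.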
  • 10. scikit-learn ● BSD ● numpy / scipy / cython / c++ wrappers ● Many state of the art implementations ● A new release every 3 months ● 17 contributors on release 0.7 ● Not just for text classification
  • 11. Feature Extraction in scikit-learn
       from scikits.learn.features.text import WordNGramAnalyzer
       text = (u"J'ai mang\xe9 du kangourou ce midi,"
               u" c'\xe9tait pas tr\xe8s bon.")
       WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
       [u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi', u'etait', u'pas', u'tres', u'bon',
        u'ai mange', u'mange du', u'du kangourou', u'kangourou ce', u'ce midi', u'midi etait',
        u'etait pas', u'pas tres', u'tres bon']
  • 12. Feature Extraction in scikit-learn
       from scikits.learn.features.text import CharNGramAnalyzer
       analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
       char_ngrams = analyzer.analyze(text)
       print char_ngrams[:5] + char_ngrams[-5:]
       [u"j'a", u"'ai", u'ai ', u'i m', u' ma', u's tres', u' tres ', u'tres b', u'res bo', u'es bon']
  • 13. TF-IDF features & SVMs
       import pickle
       from scikits.learn.features.text.sparse import Vectorizer
       from scikits.learn.sparse.svm.sparse import LinearSVC
       vec = Vectorizer(analyzer=analyzer)
       features = vec.fit_transform(list_of_documents)
       clf = LinearSVC(C=100).fit(features, labels)
       # A fitted model can be serialized and restored with pickle
       clf2 = pickle.loads(pickle.dumps(clf))
       predicted_labels = clf2.predict(features_of_new_docs)
  • 14. (Diagram: the pipeline annotated with the scikit-learn calls)
       Training: Text Documents, Images, Sounds... → vec.transform(docs) → features vectors + Labels → clf.fit(X, y) → Machine Learning Algorithm
       Prediction: New Text Document, Image, Sound → vec.transform(docs_new) → features vector → clf.predict(X_new) → Predictive Model → Expected Label
  • 15. NLTK ● Code: ASL 2.0 & Book: CC-BY-NC-ND ● Tokenizers, Stemmers, Parsers, Classifiers, Clusterers, Corpus Readers
  • 16. NLTK Corpus Downloader
       >>> import nltk
       >>> nltk.download()
  • 17. Using an NLTK corpus
       >>> from nltk.corpus import movie_reviews as reviews
       >>> pos_ids = reviews.fileids('pos')
       >>> neg_ids = reviews.fileids('neg')
       >>> len(pos_ids), len(neg_ids)
       (1000, 1000)
       >>> reviews.words(pos_ids[0])
       ['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
  • 18. Common data cleanup operations
       ● Lowercase & remove accented characters:
           import unicodedata
           s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
                       if unicodedata.category(c) != 'Mn')
       ● Extract only word tokens of at least 2 characters
           ● Using NLTK tokenizers & stemmers
           ● Using a simple regexp:
               re.compile(r"\b\w\w+\b", re.U).findall(s)
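The two cleanup steps on this slide combine naturally into one tokenizer. The following is a Python 3 sketch under that assumption; the function names `strip_accents` and `tokenize` are introduced here for illustration only.

```python
import re
import unicodedata

# Word tokens of at least 2 characters, as in the regexp on the slide.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b", re.U)

def strip_accents(s):
    """Lowercase s and drop the combining marks left by NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", s.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def tokenize(s):
    """Return the cleaned word tokens of at least 2 characters."""
    return TOKEN_PATTERN.findall(strip_accents(s))

text = "J'ai mang\u00e9 du kangourou ce midi, c'\u00e9tait pas tr\u00e8s bon."
tokens = tokenize(text)
print(tokens)
```

Single-letter fragments such as the "J" and "c" before the apostrophes are dropped by the two-character minimum, so the result matches the unigrams listed on slide 11.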
  • 19. Feature Extraction with NLTK: Unigram features
       def word_features(words):
           return dict((word, True) for word in words)
  • 20. Feature Extraction with NLTK: Bigram Collocations
       from nltk.collocations import BigramCollocationFinder
       from nltk.metrics import BigramAssocMeasures as BAM
       from itertools import chain
       def bigram_features(words, score_fn=BAM.chi_sq):
           bg_finder = BigramCollocationFinder.from_words(words)
           bigrams = bg_finder.nbest(score_fn, 100000)
           return dict((bg, True) for bg in chain(words, bigrams))
  • 21. The NLTK Naïve Bayes Classifier
       from nltk.classify import NaiveBayesClassifier
       # features: a feature extractor such as word_features or bigram_features from slides 19 and 20
       neg_examples = [(features(reviews.words(i)), 'neg') for i in neg_ids]
       pos_examples = [(features(reviews.words(i)), 'pos') for i in pos_ids]
       train_set = pos_examples + neg_examples
       classifier = NaiveBayesClassifier.train(train_set)
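To show what training on `(feature_dict, label)` pairs involves, here is a small standard-library sketch of a naïve Bayes classifier over binary features. It is not NLTK's implementation: the add-one smoothing, the function names and the four toy examples are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def word_features(words):
    return dict((word, True) for word in words)

def train_naive_bayes(train_set):
    """train_set is a list of (feature_dict, label) pairs."""
    label_counts = Counter(label for _, label in train_set)
    feature_counts = defaultdict(Counter)  # label -> feature name -> count
    vocabulary = set()
    for features, label in train_set:
        for name in features:
            feature_counts[label][name] += 1
            vocabulary.add(name)
    return label_counts, feature_counts, vocabulary

def classify(model, features):
    """Return the label with the highest log-posterior score."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # log prior
        for name in features:
            if name not in vocabulary:
                continue  # ignore features never seen in training
            # Add-one smoothing keeps unseen (feature, label) pairs finite.
            p = (feature_counts[label][name] + 1) / (n_label + 2)
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up miniature training set, not the movie_reviews corpus.
train_set = [
    (word_features(["magnificent", "outstanding", "film"]), "pos"),
    (word_features(["astounding", "film"]), "pos"),
    (word_features(["insulting", "ludicrous", "film"]), "neg"),
    (word_features(["idiotic", "plot"]), "neg"),
]
model = train_naive_bayes(train_set)
print(classify(model, word_features(["magnificent", "outstanding", "twist"])))
print(classify(model, word_features(["idiotic", "ludicrous"])))
```

Working in log space avoids numeric underflow when a document has many features, which matters once the feature dictionaries hold thousands of words.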
  • 22. Most informative features
       >>> classifier.show_most_informative_features()
       magnificent = True    pos : neg = 15.0 : 1.0
       outstanding = True    pos : neg = 13.6 : 1.0
       insulting = True      neg : pos = 13.0 : 1.0
       vulnerable = True     pos : neg = 12.3 : 1.0
       ludicrous = True      neg : pos = 11.8 : 1.0
       avoids = True         pos : neg = 11.7 : 1.0
       uninvolving = True    neg : pos = 11.7 : 1.0
       astounding = True     pos : neg = 10.3 : 1.0
       fascination = True    pos : neg = 10.3 : 1.0
       idiotic = True        neg : pos = 9.8 : 1.0
  • 23. Training NLTK classifiers ● Try nltk-trainer ● python train_classifier.py --instances paras --classifier NaiveBayes --bigrams --min_score 3 movie_reviews
  • 26. NLTK – REST APIs
       % curl -d "text=Inception is the best movie ever" http://text-processing.com/api/sentiment/
       {
         "probability": {
           "neg": 0.36647424288117808,
           "pos": 0.63352575711882186
         },
         "label": "pos"
       }
  • 30. Typical performance results: movie reviews
       ● nltk:
           ● unigram occurrences
           ● Naïve Bayesian Classifier ~ 70%
       ● Google Prediction API ~ 83%
       ● scikit-learn:
           ● TF-IDF unigram features
           ● LinearSVC ~ 87%
       ● nltk:
           ● Collocation features selection
           ● Naïve Bayesian Classifier ~ 97%
  • 31. Typical results: newsgroups topics classification ● 20 newsgroups dataset ● ~ 19K short text documents ● 20 categories ● By date train / test split ● Bigram TF-IDF + LinearSVC: ~ 87%
  • 32. Confusion Matrix (20 newsgroups) 00 alt.atheism 01 comp.graphics 02 comp.os.ms-windows.misc 03 comp.sys.ibm.pc.hardware 04 comp.sys.mac.hardware 05 comp.windows.x 06 misc.forsale 07 rec.autos 08 rec.motorcycles 09 rec.sport.baseball 10 rec.sport.hockey 11 sci.crypt 12 sci.electronics 13 sci.med 14 sci.space 15 soc.religion.christian 16 talk.politics.guns 17 talk.politics.mideast 18 talk.politics.misc 19 talk.religion.misc
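The matrix image itself did not survive in this transcript, but the way such a matrix is built is simple to sketch. The labels below are invented for the example and are not results from the talk; three of the listed categories are remapped to the indices 0, 1 and 2 to keep the output small.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Return m where m[i][j] counts class-i documents predicted as class j."""
    pairs = Counter(zip(true_labels, predicted_labels))
    return [[pairs[(i, j)] for j in range(n_classes)] for i in range(n_classes)]

# Hypothetical labels: 0 = alt.atheism, 1 = soc.religion.christian,
# 2 = talk.religion.misc (made-up predictions, for illustration only).
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 2, 1, 1, 2, 2, 0]

matrix = confusion_matrix(y_true, y_pred, 3)
# Correct predictions sit on the diagonal.
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)

for row in matrix:
    print(row)
print(accuracy)
```

Off-diagonal cells show which pairs of categories the classifier mixes up, which is more informative than a single accuracy figure when categories overlap in topic.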
  • 33. Typical results: Language Identification ● 15 Wikipedia articles ● [p.text_content() for p in html_tree.findall('//p')] ● CharNGramAnalyzer(min_n=1, max_n=3) ● TF-IDF ● LinearSVC
  • 35. Scaling to many possible outcomes ● Example: possible outcomes are all the categories of Wikipedia (565,108) ● From Document Categorization to Information Retrieval ● Fulltext index for TF-IDF similarity queries ● Smart way to find the top 30 search keywords ● Use Apache Lucene / Solr MoreLikeThisQuery
  • 36. Some pointers
       ● http://scikit-learn.sf.net doc & examples
         http://github.com/scikit-learn code
       ● http://www.nltk.org code & doc & PDF book
       ● http://streamhacker.com/
           ● Jacob Perkins' blog on NLTK & APIs
       ● https://github.com/japerk/nltk-trainer
       ● http://www.slideshare.net/ogrisel these slides
       ● http://twitter.com/ogrisel / http://github.com/ogrisel
       Questions?