Quick Tour of Text Mining
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
About Speaker
陳宜欣 Yi-Shin Chen
▷ Currently
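•  Associate Professor, Department of Computer Science, National Tsing Hua University
•  Head of the Intelligent Data Engineering and Applications Lab (IDEA Lab)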
▷ Education
•  Ph.D. in Computer Science, USC, USA
•  M.B.A. in Information Management, NCU, TW
•  B.B.A. in Information Management, NCU, TW
▷ Courses (all in English)
•  Research and Presentation Skills
•  Introduction to Database Systems
•  Advanced Database Systems
•  Data Mining: Concepts, Techniques, and
@ Yi-Shin Chen, Text Mining Overview
Research Focus from 2000
Free Resources
▷ Free data
•  Public web or private network, any data that can be legally downloaded is a good resource
Past and Current Studies
▷ Location identification
▷ Interest identification
▷ Event identification
▷ Extract semantic relationships
▷ Unsupervised multilingual sentiment analysis
▷ Keyword extraction and summary
▷ Emotion analysis
▷ Mental illness detection
Text Processing
Text Mining Overview
Data (Text vs. Non-Text)
World Sensor Data
Interpret by Report
Thermometer, Hygrometer
24°C, 55%
Location GPS
37°N, 123°E
Body Sphygmometer, MRI, etc.
 126/70 mmHg
World To be or not to be..
Data Mining vs. Text Mining
Non-text data
•  Numerical
•  Precise
•  Objective
Text data
•  Text
•  Ambiguous
•  Subjective
Data Mining
•  Clustering
•  Classification
•  Association Rules
•  …
Text Processing
(including NLP)
Text Mining
Preprocessing in Reality
Data Collection
▷ Align /Classify the attributes correctly
Who post this message Mentioned User
Shared URL
Language Detection
▷ Detect the language (or the possible languages)
in which the given text is written
▷ Difficulties
•  Short message
•  Different languages in one statement
•  Noisy
你好 現在幾點鐘
apa kabar sekarang jam
berapa ?
Traditional Chinese (zh-tw)
Indonesian (id)
Wrong Detection Examples
▷ Twitter examples
@sayidatynet top song #LailaGhofran
shokran ya garh new album #listen
#ChineseTaipei #Sochi #2014冬奧
Before / after removing noise 
en - id
it - zh-tw
en - ja
Removing Noise
▷ Removing noise before detection
•  HTML files: tags
•  Twitter: hashtags, mentions, URLs
meta name=twitter:description
(張貼悔過書 .../
首頁(張貼悔過書 ...
Data Cleaning
▷ Special character
▷ Utilize regular expressions to clean data
Unicode emoticons  ☺, ♥…
Symbol icon  ☏, ✉…
Currency symbol  €, £, $...
Tweet URL
Filter out everything except letters, spaces,
punctuation, and digits
 ◕‿◕ Friendship is everything ♥ ✉
I added a video to a @YouTube playlist http:// Jamie Riepe
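The cleaning steps above can be sketched with Python's `re` module. This is a minimal sketch, not the pipeline used in the talk: the pattern names and the `clean_tweet` helper are hypothetical, and the whitelist assumes English text (for Japanese or Chinese, the allowed character ranges would have to include kana and CJK ideographs).

```python
import re

# Hypothetical patterns for Twitter-specific noise (assumptions, not from the slides).
URL_RE = re.compile(r"https?://\S*")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Anything that is not an ASCII letter, digit, whitespace, or basic punctuation.
NOT_ALLOWED_RE = re.compile(r"[^A-Za-z0-9\s.,!?'\"-]")


def clean_tweet(text):
    """Strip URLs, mentions, hashtags, and special characters, then squeeze spaces."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NOT_ALLOWED_RE.sub(" ", text)
    return " ".join(text.split())


print(clean_tweet("◕‿◕ Friendship is everything ♥ ✉"))
print(clean_tweet("I added a video to a @YouTube playlist http:// Jamie Riepe"))
```

Removing the noise before language detection keeps hashtags and user names from voting for the wrong language.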
Japanese Examples
▷ Use regular expressions to remove all special characters
•  うふふふふ(*^^*)楽しむ!ありがとうございま
す^o^ アイコン、ラブラブ(-_-)♡
•  うふふふふ 楽しむ ありがとうございます ア
イコン ラブラブ
Part-of-speech (POS) Tagging
▷ Processing text and assigning parts of
speech to each word
▷ Twitter POS tagging
•  Noun (N), Adjective (A), Verb (V), URL (U)…
Happy Easter! I went to work and came home to an empty house now im
going for a quick run
Happy_A Easter_N !_, I_O went_V to_P work_N and_ came_V home_N
to_P an_D empty_A house_N now_R im_L going_V for_P a_D quick_A
▷ @DirtyDTran gotta be caught up for
tomorrow nights episode
▷ @ASVP_Jaykey for some reasons I found
this very amusing
•  @DirtyDTran gotta be catch up for tomorrow night episode
•  @ASVP_Jaykey for some reason I find this very amusing
RT @kt_biv : @caycelynnn loving and missing you! we are
still looking for Lucy
Hashtag Segmentation
▷ By using Microsoft Web N-Gram Service
(or by using Viterbi algorithm)
#pray #for #boston
Wow! explosion at a boston race ... #prayforboston
#citizen #science
#boston #marathon
#good #things #are #coming
#low #blood #pressure
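Hashtag segmentation can be sketched as Viterbi-style dynamic programming over a unigram model. This is only an illustration: the `COUNTS` table is a made-up toy vocabulary standing in for real n-gram statistics (such as those a web n-gram service would supply), and the unknown-word penalty is an assumption.

```python
import math

# Hypothetical toy unigram counts; a real system would use large n-gram statistics.
COUNTS = {
    "pray": 50, "for": 500, "boston": 40, "marathon": 20, "good": 300,
    "things": 120, "are": 600, "coming": 80, "citizen": 30, "science": 60,
    "low": 90, "blood": 70, "pressure": 60,
}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20


def word_logprob(word):
    """Log-probability of a word; unknown words are penalized by their length."""
    count = COUNTS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(hashtag):
    """Return the most probable split of a hashtag into words."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] = (best score of text[:i], start index of the last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + word_logprob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the string.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("#prayforboston"))
print(segment("#goodthingsarecoming"))
```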
More Preprocesses for Different Web
▷ Extract source code without javascript
▷ Removing html tags
Extract Source Code Without Javascript
▷ Javascript code should be considered as an exception
•  it may contain hidden content
Remove Html Tags
▷ Removing html tags to extract meaningful content
More Preprocesses for Different Languages
▷ Chinese Simplified/Traditional Conversion
▷ Word segmentation
Chinese Simplified/Traditional Conversion
▷ Word conversion
•  请乘客从后门落车 → 請乘客從後門下車
▷ One-to-many mapping
•  @shinrei 出去旅游还是崩坏 → @shinrei 出去旅游還是崩壞
游 (zh-cn) → 游|遊 (zh-tw)
▷ Wrong segmentation
•  人体内存在很多微生物 → 內存: 人體 記憶體 在很多微生物
→ 存在: 人體內 存在 很多微生物
Wrong Chinese Word Segmentation
▷ Wrong segmentation
•  這(Nep) 地面(Nc) 積(VJ) 還(D) 真(D) 不(D) 小(VH)
▷ Wrong word
•  @iamzeke 實驗(Na) 室友(Na) 多(Dfa) 危險(VH) 你(Nh) 不(D) 知道(VK) 嗎
(T) ?
▷ Wrong order
•  人體(Na) 存(VC) 內在(Na) 很多(Neqa) 微生物(Na)
▷ Unknown word
•  半夜(Nd) 逛團(Na) 購(VC) 看到(VE) 太(Dfa) 吸引人(VH) !!
Unknown word: 團購 (group buying)
Back to Text Mining
Let’s come back to Text Mining
Data Mining vs. Text Mining
Non-text data
•  Numerical
•  Precise
•  Objective
Text data
•  Text
•  Ambiguous
•  Subjective
Data Mining
•  Clustering
•  Classification
•  Association Rules
•  …
Text Processing
(including NLP)
Text Mining
Landscape of Text Mining
World Sensor Data
Interpret by Report
24°C, 55%
World To be or not to be..
Perceived by Express
Mining knowledge
about Languages
Natural Language Processing and Text Representation;
Word Association and Mining
Landscape of Text Mining
World Sensor Data
Interpret by Report
24°C, 55%
World To be or not to be..
Perceived by Express
Mining content about the observers
Opinion Mining and Sentiment Analysis
Landscape of Text Mining
World Sensor Data
Interpret by Report
24°C, 55%
World To be or not to be..
Perceived by Express
Mining content about the World
Topic Mining , Contextual Text Mining
Structure of Information Extraction System
local text analysis
lexical analysis
name recognition
partial syntactic analysis
scenario pattern matching
discourse analysis
Co-reference analysis
template generation
extracted templates
Basic Concepts in NLP
This is the best thing happened in my life.
Verb VerbAdj
Lexical analysis
(Part-of Speech
Noun Phrase
Prep Phrase
Prep Phrase
Noun Phrase
Syntactic analysis
This? (t1)
Best thing (t2)
My (m1)
Happened (t1, t2, m1)
Semantic Analysis
Happy (x) if Happened (t1, ‘Best’, m1) Happy
Basic Concepts in NLP
This is the best thing happened in my
Verb VerbAdj
String of Characters
This is the best thing happened in my
String of Words
POS Tags
Best thing
My life
Entity Period
The writer loves his new born baby Understanding
(Logic predicates)
Deeper NLP
Less accurate
Closer to knowledge
NLP vs. Text Mining
▷ Text Mining objectives
•  Overview
•  Know the trends
•  Accept noise
▷ NLP objectives
•  Understanding
•  Ability to answer
•  Immaculate
Basic Data Model Concepts
Let’s learn from giants
Data Models
▷ Data model/Language: a collection of concepts for describing data
▷ Schema/Structured observation: a description of a particular
collection of data, using a given data model
▷ Data instance/Statements
Using a model Using ER Model
Schema that
represents the World Car PeopleDrive
E-R Model
▷ Introduced by Peter Chen; ACM TODS, March 1976
•  Additional Readings
→  Peter Chen. English Sentence Structure and Entity-Relationship Diagram.
Information Sciences, Vol. 1, No. 1, Elsevier, May 1983, Pages 127-149
→  Peter Chen. A Preliminary Framework for Entity-Relationship Models.
Entity-Relationship Approach to Information Modeling and Analysis, North-
Holland (Elsevier), 1983, Pages 19 - 28
E-R Model Basics -Entity
▷ Based on a perception of the real world, which consists of:
•  A set of basic objects ⇒ Entities
•  Relationships among objects
▷ Entity: Real-world object distinguishable from other objects
▷ Entity Set: A collection of similar entities. E.g., all employees.
•  Presented as:
Animals Time People
This is the best thing happened in my life.
Dogs love their owners.
E-R: Relationship Sets
▷ Relationship: Association among two or more entities
▷ Relationship Set: Collection of similar relationships.
•  Relationship sets are presented as:
•  The relationship cannot exist without having corresponding entities
actionAnimal People
Dogs love their owners.
High-Level Entity
▷ High-level entity: Abstracted from a group of
interconnected low-level entity and relationship types
Alice loves me.
This is the best thing happened in my life.
People action
This is the best thing
My life
Word Relations
Back to text
Word Relations
▷ Paradigmatic: can be substituted for each other (similar)
•  E.g., cat and dog, run and walk
▷ Syntagmatic: can be combined with each other
•  E.g., Cat and fights, dog and barks
→ These two basic and complementary relations can be generalized to
describe relations of any items in a language
Animals Act
Mining Word Associations
▷ Paradigmatic
•  Represent each word by its context
•  Compute context similarity
•  Words with high context similarity
▷ Syntagmatic
•  Count the number of times two words occur together in a context
•  Compare the co-occurrences with the corresponding individual
•  Words with high co-occurrences but relatively low individual
Paradigmatic Word Associations
John’s cat eats fish in Saturday
Mary’s dog eats meat in Sunday
John’s cat drinks milk in Sunday
Mary’s dog drinks beer in Tuesday
John’s cat
In Saturday
John’s --- eats fish in Saturday
Mary’s --- eats meat in Sunday
John’s --- drinks milk in Sunday
Mary’s --- drinks beer in Tuesday
Similar left content
Similar right content
Similar general content
How similar are context (“cat”) and context (“dog”)?
How similar are context (“cat”) and context (“John”)?
→ Expected Overlap of Words in Context (EOWC)
Overlap (“cat”, “dog”)
Overlap (“cat”, “John”)
Vector Space Model (Bag of Words)
▷ Represent the keywords of objects using a term vector
•  Term: basic concept, e.g., keywords to describe an object
•  Each term represents one dimension in a vector
•  N total terms define an N-element vector
•  Values of each term in a vector corresponds to the
importance of that term
▷ Measure similarity by the vector distances
Document 1
Document 2
Document 3
3 0 5 0 2 6 0 2 0 2
7 0 2 1 0 0 3 0 0
1 0 0 1 2 2 0 3 0
Common Approach for EOWC:
Cosine Similarity
▷ If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 • d2) / ||d1|| ||d2|| ,
where • indicates vector dot product and || d || is the length of vector d.
▷ Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = .3150
→ Overlap (“John”, “Cat”) =.3150
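The worked example above can be checked with a few lines of Python. This is a minimal sketch using plain lists; the function name `cosine_similarity` is an illustrative choice, and the vectors are the `d1` and `d2` from the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```

Because the score depends only on the angle between the vectors, a long context and a short context with the same word proportions come out as identical.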
Quality of EOWC?
▷ The more overlap the two context documents
have, the higher the similarity would be
▷ However:
•  It favors matching one frequent term very well over matching
more distinct terms
•  It treats every word equally (overlap on “the” should not be as
meaningful as overlap on “eats”)
Term Frequency and Inverse
Document Frequency (TFIDF)
▷ Since not all objects in the vector space are equally
important, we can weight each term using its
occurrence probability in the object description
•  Term frequency: TF(d,t)
→  number of times t occurs in the object description d
•  Inverse document frequency: IDF(t)
→  to scale down the terms that occur in many descriptions
Normalizing Term Frequency
▷ nij represents the number of times a term ti occurs
in a description dj . tfij can be normalized using the
total number of terms in the document
•  $tf_{ij} = \dfrac{n_{ij}}{\text{NormalizedValue}}$
▷ Normalized value could be:
•  Sum of all frequencies of terms
•  Max frequency value
•  Any other value that keeps $tf_{ij}$ between 0 and 1
•  BM25*: $tf_{ij} = \dfrac{n_{ij} \times (k+1)}{n_{ij} + k}$
Inverse Document Frequency
▷ IDF seeks to scale down the coordinates of terms
that occur in many object descriptions
•  For example, some stop words(the, a, of, to, and…) may
occur many times in a description. However, they should
be considered as non-important in many cases
•  $idf_i = \log\left(\dfrac{N}{df_i} + 1\right)$
→  where dfi (document frequency of term ti) is the
number of descriptions in which ti occurs
▷ IDF can be replaced with ICF (inverse class frequency) and
many other concepts based on applications
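A small sketch of TF-IDF weighting follows. It makes two assumptions: term frequency is normalized by the total number of terms in the description, and the IDF formula is read as log(N / df + 1). The `tf_idf` helper and the toy documents (taken from the cat/dog sentences used earlier in the talk) are for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term] + 1)
            for term, count in counts.items()
        })
    return weights


docs = [s.lower().split() for s in [
    "John's cat eats fish in Saturday",
    "Mary's dog eats meat in Sunday",
    "John's cat drinks milk in Sunday",
    "Mary's dog drinks beer in Tuesday",
]]
weights = tf_idf(docs)
for term in ("fish", "eats", "in"):
    print(term, round(weights[0][term], 3))
```

In the first sentence, "in" occurs in every document and receives the lowest weight, while "fish" occurs in only one and receives the highest.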
Reasons of Log
▷ Each distribution can indicate the hidden force
•  Time
•  Independent
•  Control
Power-law distribution Normal distribution
 Normal distribution
Mining Word Associations
▷ Paradigmatic
•  Represent each word by its context
•  Compute context similarity
•  Words with high context similarity
▷ Syntagmatic
•  Count the number of times two words occur together in a context
•  Compare the co-occurrences with the corresponding individual
•  Words with high co-occurrences but relatively low individual
Syntagmatic Word Associations
John’s cat eats fish in Saturday
Mary’s dog eats meat in Sunday
John’s cat drinks milk in Sunday
Mary’s dog drinks beer in Tuesday
John’s cat
In Saturday
John’s *** eats *** in Saturday
Mary’s *** eats *** in Sunday
John’s --- drinks --- in Sunday
Mary’s --- drinks --- in Tuesday
What words tend to occur to the left of “eats”
What words to the right?
Whenever “eats” occurs, what other words also tend to occur?
Correlated occurrences
P(dog | eats) = ? ; P(cats | eats) = ?
Word Prediction
Prediction Question: Is word W present (or absent) in this segment?
Text Segment (any unit, e.g., sentence, paragraph, document)
Predict the occurrence of word
W1 = ‘meat’ W2 = ‘a’ W3 = ‘unicorn’
Word Prediction: Formal Definition
▷ Binary random variable {0,1}
•  $x_w = \begin{cases} 1 & w \text{ is present} \\ 0 & w \text{ is absent} \end{cases}$
•  $P(x_w = 1) + P(x_w = 0) = 1$
▷ The more random $x_w$ is, the more difficult the
prediction is
▷ How do we quantitatively measure the randomness?
▷ Entropy measures the amount of randomness or
surprise or uncertainty
▷ Entropy is defined as:
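$H(x_w) = -\sum_{v \in \{0,1\}} p(x_w = v) \log_2 p(x_w = v)$
[Figure: entropy versus probability (horizontal axis from 0 to 1), annotated with "entropy = 0", "entropy = 1", and "difficult"]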
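The entropy of a binary word-presence variable is easy to compute directly. In this sketch the three probabilities are invented for illustration: a very common word ("a"), a very rare one ("unicorn"), and one in between ("meat").

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary variable with P(x = 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Hypothetical presence probabilities, not measured from any corpus.
for word, p in [("a", 0.99), ("unicorn", 0.001), ("meat", 0.4)]:
    print(word, round(binary_entropy(p), 3))
```

The words that are almost always present or almost always absent have entropy near 0 and are easy to predict; entropy peaks at 1 bit when p = 0.5.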
Conditional Entropy
Know nothing about the segment: $p(x_{meat} = 1)$, $p(x_{meat} = 0)$
Know “eats” is present ($x_{eats} = 1$): $p(x_{meat} = 1 \mid x_{eats} = 1)$, $p(x_{meat} = 0 \mid x_{eats} = 1)$
$H(x_{meat}) = -p(x_{meat} = 0) \log_2 p(x_{meat} = 0) - p(x_{meat} = 1) \log_2 p(x_{meat} = 1)$
$H(x_{meat} \mid x_{eats} = 1) = -p(x_{meat} = 0 \mid x_{eats} = 1) \log_2 p(x_{meat} = 0 \mid x_{eats} = 1) - p(x_{meat} = 1 \mid x_{eats} = 1) \log_2 p(x_{meat} = 1 \mid x_{eats} = 1)$
Mining Syntagmatic Relations
▷ For each word W1
•  For every word W2, compute the conditional entropy $H(x_{w1} \mid x_{w2})$
•  Sort all the candidate words in ascending order of $H(x_{w1} \mid x_{w2})$
•  Take the top-ranked candidate words, subject to a given threshold
▷ However
•  $H(x_{w1} \mid x_{w2})$ and $H(x_{w1} \mid x_{w3})$ are comparable
•  $H(x_{w1} \mid x_{w2})$ and $H(x_{w3} \mid x_{w2})$ are not
→  Because the upper bounds are different
▷ Conditional entropy is not symmetric
Mutual Information
▷ $I(x; y) = H(x) - H(x \mid y) = H(y) - H(y \mid x)$
▷ Properties:
•  Symmetric
•  Non-negative
•  I(x;y)=0 iff x and y are independent
▷ Allow us to compare different (x,y) pairs
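Mutual information between two word-presence variables can be estimated by counting over text segments. This sketch uses the equivalent form I(x; y) = Σ p(x, y) log2[p(x, y) / (p(x) p(y))] and treats the four example sentences from the syntagmatic slides as segments; the helper name and the maximum-likelihood estimates without smoothing are assumptions made for brevity.

```python
import math


def mutual_information(segments, w1, w2):
    """I(x_w1; x_w2) in bits, estimated from presence/absence counts."""
    n = len(segments)
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            p_a = sum(1 for s in segments if (w1 in s) == a) / n
            p_b = sum(1 for s in segments if (w2 in s) == b) / n
            p_ab = sum(1 for s in segments
                       if (w1 in s) == a and (w2 in s) == b) / n
            if p_ab > 0:  # 0 * log(0) is taken as 0
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


segments = [set(s.lower().split()) for s in [
    "John's cat eats fish in Saturday",
    "Mary's dog eats meat in Sunday",
    "John's cat drinks milk in Sunday",
    "Mary's dog drinks beer in Tuesday",
]]
print(round(mutual_information(segments, "eats", "fish"), 3))
print(round(mutual_information(segments, "eats", "in"), 3))
```

"in" appears in every segment, so knowing about "eats" tells nothing about it and the mutual information is 0, while "eats" and "fish" share a positive amount.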
Topic Mining
Assume we already know the word relationships
Landscape of Text Mining
World Sensor Data
Interpret by Report
24°C, 55%
World To be or not to be..
Perceived by Express
Mining knowledge
about Languages
Natural Language Processing and Text Representation;
Word Association and Mining
Topic Mining: Motivation
▷ Topic: key idea in text data
•  Theme/subject
•  Different granularities (e.g., sentence, article)
▷ Motivated applications, e.g.:
•  Hot topics during the debates in 2016 presidential election
•  What do people like about Windows 10
•  What are Facebook users talking about today?
•  What are the most watched news?
Tasks of Topic Mining
Text Data
 Topic 1
Topic 2
Topic 3
Topic 4
Topic n
Doc1 Doc2
Formal Definition of Topic Mining
▷ Input
•  A collection of N text documents $S = \{d_1, d_2, d_3, \ldots, d_N\}$
•  Number of topics: k
▷ Output
•  k topics: $\{\theta_1, \theta_2, \theta_3, \ldots, \theta_k\}$
•  Coverage of topics in each $d_i$: $\{\mu_{i1}, \mu_{i2}, \mu_{i3}, \ldots, \mu_{ik}\}$
▷ How to define topic $\theta_i$?
•  Topic = term (word)?
Tasks of Topic Mining (Terms as Topics)
Text Data
Doc1 Doc2
Problems with “Terms as Topics”
▷ Not generic
•  Can only represent simple/general topics
•  Cannot represent complicated topics
→  E.g., “uber issue”: political or transportation related?
▷ Incompleteness in coverage
•  Cannot capture variation of vocabulary
▷ Word sense ambiguity
•  E.g., Hollywood star vs. stars in the sky; apple watch
vs. apple recipes
Improved Ideas
▷ Idea 1 (Probabilistic topic models): topic = word distribution
•  E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005),
(play, 0.003), (NBA, 0.01)…}
•  √: generic, easy to implement
▷ Idea 2 (Concept topic models): topic = concept
•  Maintain concepts (manually or automatically)
→  E.g., ConceptNet
Possible Approaches for Probabilistic
Topic Models
▷ Bag-of-words approach:
•  Mixture of unigram language model
•  Expectation-maximization algorithm
•  Probabilistic latent semantic analysis
•  Latent Dirichlet allocation (LDA) model
▷ Graph-based approach :
•  TextRank (Mihalcea and Tarau, 2004)
•  Reinforcement Approach (Xiaojun et al., 2007)
•  CollabRank (Xiaojun et al., 2008)
Bag-of-words Assumption
▷ Word order is ignored
▷ “bag-of-words” – exchangeability
▷ Theorem (De Finetti, 1935): if $(x_1, x_2, \ldots, x_n)$ are infinitely exchangeable, then the joint
probability $p(x_1, x_2, \ldots, x_n)$ has a
representation as a mixture:
▷ $p(x_1, x_2, \ldots, x_n) = \int p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \, d\theta$
for some random variable θ
Latent Dirichlet Allocation
▷ Latent Dirichlet Allocation (D. M. Blei, A. Y. Ng, M. I. Jordan, 2003)
(not to be confused with Linear Discriminant Analysis, which shares the abbreviation LDA)
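$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$
$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta$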
LDA Assumption
▷ Assume:
•  When writing a document, you
1.  Decide how many words
2.  Decide the distributions (P = Dir(α), P = Dir(β))
3.  Choose topics (Dirichlet)
4.  Choose words for topics (Dirichlet)
5.  Repeat 3
•  Example
1.  5 words in document
2.  50% food, 50% cute animals
3.  1st word - food topic, gives you the word “bread”.
4.  2nd word - cute animals topic, “adorable”.
5.  3rd word - cute animals topic, “dog”.
6.  4th word - food topic, “eating”.
7.  5th word - food topic, “banana”.
“bread adorable dog eating banana”
Choice of topics and words
LDA Learning (Gibbs)
▷ How many topics do you think there are?
▷ Randomly assign words to topics
▷ Check and update topic assignments (Iterative)
•  p(topic t | document d)
•  p(word w | topic t)
•  Reassign w a new topic, p(topic t | document d) * p(word w | topic t)
I eat fish and vegetables.
Dog and fish are pets.
My kitten eats fish.
Number of topics: 2 (red and purple denote the two topics)
p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.67; p(purple|2)=0.33; p(red|3)=0.67; p(purple|3)=0.33;
p(eat|red)=0.17; p(eat|purple)=0.33; p(fish|red)=0.33; p(fish|purple)=0.33; p(vegetable|red)=0.17; p(dog|purple)=0.33; p(pet|red)=0.17; p(kitten|red)=0.17;
p(purple|2)*p(fish|purple)=0.5*0.33=0.165; p(red|2)*p(fish|red)=0.5*0.2=0.1;
p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.50; p(purple|2)=0.50; p(red|3)=0.67; p(purple|3)=0.33;
p(eat|red)=0.20; p(eat|purple)=0.33; p(fish|red)=0.20; p(fish|purple)=0.33; p(vegetable|red)=0.20; p(dog|purple)=0.33; p(pet|red)=0.20; p(kitten|red)=0.20;
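The reassignment step can be sketched as follows: for one word, multiply p(topic | document) by p(word | topic) for every topic, then normalize so the scores can be used as sampling probabilities. The numbers are the ones shown for "fish" in the second sentence (0.5 × 0.33 and 0.5 × 0.2); the function name is hypothetical, and a real collapsed Gibbs sampler would also include the Dirichlet smoothing terms α and β, which this simplified version omits.

```python
def reassignment_distribution(p_topic_given_doc, p_word_given_topic):
    """Normalized p(topic | doc) * p(word | topic) for a single word."""
    scores = {
        topic: p_topic_given_doc[topic] * p_word_given_topic[topic]
        for topic in p_topic_given_doc
    }
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


dist = reassignment_distribution(
    {"red": 0.5, "purple": 0.5},    # p(topic | sentence 2)
    {"red": 0.2, "purple": 0.33},   # p("fish" | topic)
)
print({topic: round(p, 3) for topic, p in dist.items()})
```

A sampler would draw the new topic for the word from `dist` (for example with `random.choices`) and then update the counts before moving to the next word.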
Related Work – Topic Model (LDA)
▷ I eat fish and vegetables.
▷ Dog and fish are pets.
▷ My kitten eats fish.
Sentence 1: 14.67% Topic 1, 85.33% Topic 2
Sentence 2: 85.44% Topic 1, 14.56% Topic 2
Sentence 3: 19.95% Topic 1, 80.05% Topic 2
Topic 1
0.268 fish
0.210 pet
0.210 dog
0.147 kitten
Topic 2
0.296 eat
0.265 fish
0.189 vegetable
0.121 kitten
Possible Approaches for Probabilistic
Topic Models
▷ Bag-of-words approach:
•  Mixture of unigram language model
•  Expectation-maximization algorithm
•  Probabilistic latent semantic analysis
•  Latent Dirichlet allocation (LDA) model
▷ Graph-based approach :
•  TextRank (Mihalcea and Tarau, 2004)
•  Reinforcement Approach (Xiaojun et al., 2007)
•  CollabRank (Xiaojun et al., 2008)
Construct Graph
▷ Directed graph
▷ Elements in the graph
→  Terms
→  Phrases
→  Sentences
Connect Term Nodes
▷ Connect terms based on their slop.
I love the new ipod shuffle. 
It is the smallest ipod.
Connect Phrase Nodes
▷ Connect phrases to
•  Compound words
•  Neighbor words
I love the new
ipod shuffle. 
It is the
smallest ipod.
ipod shuffle
Connect Sentence Nodes
▷ Connect to
•  Neighbor sentences
•  Compound terms
•  Compound phrase
I love the new ipod shuffle.
It is the smallest ipod.
ipod shuffle
I love the new ipod shuffle.
It is the smallest ipod.
Edge Weight Types
I love the new ipod shuffle.
It is the smallest ipod.
ipod shuffle
Graph-based Ranking
▷ Scores for each node (TextRank 2004)
y_the = (1 − 0.85) + 0.85 × (0.1 × 0.5 + 0.2 × 0.6 + 0.3 × 0.7)
Score from parent nodes
damping factor
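The score for the node "the" can be reproduced in a few lines. This sketch assumes each product in the sum pairs an edge weight with the current score of a parent node (which factor is which is an assumption), and it uses the damping factor 0.85 from the formula.

```python
def node_score(parent_terms, damping=0.85):
    """One TextRank-style update: (1 - d) + d * sum(weight * parent score)."""
    return (1 - damping) + damping * sum(w * s for w, s in parent_terms)


# (edge weight, parent score) pairs taken from the example for the node "the".
y_the = node_score([(0.1, 0.5), (0.2, 0.6), (0.3, 0.7)])
print(round(y_the, 3))  # 0.473
```

In a full ranking run this update is applied to every node repeatedly until the scores stop changing.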
Graph-based Extraction
•  Pros
→  Structure and syntax information
→  Mutual influence
•  Cons
→  Common words get higher scores
Summary: Probabilistic Topic Models
▷ Probabilistic topic models: topic = word distribution
•  E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005),
(play, 0.003), (NBA, 0.01)…}
•  √: generic, easy to implement
•  ?: Not easy to understand/communicate
•  ?: Not easy to construct semantic relationship between topics
 Topic= {(Crooked, 0.02), (dishonest, 0.001), (News,
0.0008), (totally, 0.0009), (total, 0.000009), (failed,
0.0006), (bad, 0.0015), (failing, 0.00001),
(presidential, 0.0000008), (States, 0.0000004),
(terrible, 0.0000085),(failed, 0.000021), (lightweight,
0.00001),(weak, 0.0000075), ……}
Improved Ideas
▷ Idea 1 (Probabilistic topic models): topic = word distribution
•  E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005),
(play, 0.003), (NBA, 0.01)…}
•  √: generic, easy to implement
▷ Idea 2 (Concept topic models): topic = concept
•  Maintain concepts (manually or automatically)
→  E.g., ConceptNet
NLP Related Approach:
Named Entity Recognition
▷ Find and classify all the named entities in a text.
▷ What’s a named entity?
•  A mention of an entity using its name.
→  Kansas Jayhawks
•  This is a subset of the possible mentions...
→  Kansas, Jayhawks, the team, it, they
▷ Find means identify the exact span of the mention
▷ Classify means determine the category of the entity
being referred to
Named Entity Recognition Approaches
▷ As with partial parsing and chunking there are two
basic approaches (and hybrids)
•  Rule-based (regular expressions)
→  Lists of names
→  Patterns to match things that look like names
→  Patterns to match the environments that classes of names
tend to occur in.
•  Machine Learning-based approaches
→  Get annotated training data
→  Extract features
→  Train systems to replicate the annotation
Rule-Based Approaches
▷ Employ regular expressions to extract data
▷ Examples:
•  Telephone number: (\d{3}[-. ()]){1,2}[\dA-Z]{4}
→  800-865-1125
→  800.865.1125
→  (800)865-CARE
•  Software name extraction: ([A-Z][a-z]*\s*)+
→  Installation Designer v1.1
▷ Once you have captured the entities in a text you
might want to ascertain how they relate to one another
•  Here we’re just talking about explicitly stated relations
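The telephone-number and software-name patterns from the rule-based examples can be tried out with Python's `re` module. The backslashes in `\d` and `\s` are restored here on the assumption that they were lost when the slide text was extracted; the variable names are illustrative.

```python
import re

# Patterns from the rule-based examples, with \d and \s restored (an assumption).
PHONE_RE = re.compile(r"(\d{3}[-. ()]){1,2}[\dA-Z]{4}")
SOFTWARE_RE = re.compile(r"([A-Z][a-z]*\s*)+")

for text in ["800-865-1125", "800.865.1125", "(800)865-CARE"]:
    match = PHONE_RE.search(text)
    print(text, "->", match.group(0) if match else None)

name = SOFTWARE_RE.search("Installation Designer v1.1")
print(name.group(0).strip())
```

Note that the phone pattern does not consume the opening parenthesis of "(800)865-CARE", so the matched span starts at the first digit. Getting the exact span right is part of what makes rule-based extraction fiddly.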
Relation Types
▷ As with named entities, the list of relations is
application specific. For generic news texts...
Bootstrapping Approaches
▷ What if you don’t have enough annotated text to
train on?
•  But you might have some seed tuples
•  Or you might have some patterns that work pretty well
▷ Can you use those seeds to do something useful?
•  Co-training and active learning use the seeds to train
classifiers to tag more data to train better classifiers...
•  Bootstrapping tries to learn directly (populate a relation)
through direct use of the seeds
Bootstrapping Example: Seed Tuple
▷ Mark Twain, Elmira Seed tuple
•  Grep (google)
•  “Mark Twain is buried in Elmira, NY.”
→  X is buried in Y
•  “The grave of Mark Twain is in Elmira”
→  The grave of X is in Y
•  “Elmira is Mark Twain’s final resting place”
→  Y is X’s final resting place.
▷ Use those patterns to grep for new tuples that you
don’t already know
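One round of this bootstrapping loop can be sketched with regular expressions standing in for the learned patterns. Everything here is illustrative: the three patterns are hand-written versions of the ones induced from the seed tuple, the `extract_tuples` helper is hypothetical, and the small corpus is made-up text, not search results.

```python
import re

NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"   # one or more capitalized words
PLACE = r"[A-Z]\w+"

# Patterns induced from the seed tuple (Mark Twain, Elmira).
PATTERNS = [
    re.compile(rf"(?P<x>{NAME}) is buried in (?P<y>{PLACE})"),
    re.compile(rf"[Tt]he grave of (?P<x>{NAME}) is in (?P<y>{PLACE})"),
    re.compile(rf"(?P<y>{PLACE}) is (?P<x>{NAME})[’']s final resting place"),
]


def extract_tuples(sentences):
    """Apply every pattern to every sentence and collect (X, Y) tuples."""
    found = set()
    for sentence in sentences:
        for pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                found.add((match.group("x"), match.group("y")))
    return found


# Hypothetical corpus used only to exercise the patterns.
corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "Emily Dickinson is buried in Amherst.",
    "The grave of Jane Austen is in Winchester.",
    "Elmira is a city in New York.",
]
known = {("Mark Twain", "Elmira")}
new_tuples = extract_tuples(corpus) - known
print(sorted(new_tuples))
```

The new tuples would then be used as seeds to search for further patterns, which is where errors can start to compound if a noisy pattern is accepted.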
Bootstrapping Relations
Wikipedia Infobox
▷ Infoboxes are kept in a namespace separate from articles
•  Namespace examples: Special:SpecialPages; Wikipedia:List of infoboxes
•  Example:
{{Infobox person
|name = Casanova
|image = Casanova_self_portrait.jpg
|caption = A self portrait of Casanova
|website = }}
Concept-based Model
▷ ESA (Egozi, Markovitch, 2011)
Every Wikipedia article represents a concept
TF-IDF weighting is used to infer concepts from a document
Manually-maintained knowledge base
▷ YAGO: A Core of Semantic Knowledge Unifying WordNet and
Wikipedia, WWW 2007
▷ Unification of Wikipedia and WordNet
▷ Make use of rich structures and information
•  Infoboxes, Category Pages, etc.
Mining Concepts from User-Generated
Web Data
▷ Concept: in a word sequence, its meaning is known
by a group of people and everyone within the group
refers to the same meaning
Pei-Ling Hsu, Hsiao-Shan Hsieh, Jheng-He Liang, and Yi-
Shin Chen*, Mining Various Semantic Relationships from
Unstructured User-Generated Web Data, Journal of Web
Semantics, 2015
Sentence-based, Keyword-based, Article-based
Concepts in Word Sequences
▷ Word sequences are likely meaningless and noisy
→ A word sequence is a candidate to be a concept if it is:
→  About nouns
•  A noun, e.g., Basketball
•  A sequence of nouns, e.g., Christmas Eve
•  A noun and a number, e.g., 311 earthquake
•  An adjective and a noun, e.g., Desperate housewives
→  In a special format
•  A word sequence containing “of”, e.g., Cover of harry potter
Concept Modeling
•  Concept: in a word sequence, its meaning is known
by a group of people and everyone within the group
refers to the same meaning.
→ Try to detect a word sequence that is known by a group of people
Concept Modeling
•  Frequent Concept:
•  A word sequence mentioned frequently from a single source.
The mentioned number is normalized in each data source
[Figure: example word sequences: apple; apple, companies, color, of, ipod, nano, 1gb; ipod, nano]
Concept Modeling
▷ Some data sources are public, sharing data with other users.
▷ These data sources contain frequently seen word sequences.
→  These data sources provide concepts with higher confidence.
▷ Confident value: every data source has a confident value
▷ Confident concept: a word sequence that comes from data sources with higher
confident values.
Opinion Mining
How do people feel?
Landscape of Text Mining
World Sensor Data
Interpret by Report
24°C, 55%
World To be or not to be..
Perceived by Express
Mining content about the observers
Opinion Mining and Sentiment Analysis
▷ a subjective statement describing a person's
perspective about something
Objective statement or Factual statement: can be proved to be right or wrong
Opinion holder: Personalized / customized
Depends on background, culture, context
Opinion Representation
▷ Opinion holder: user
▷ Opinion target: object
▷ Opinion content: keywords?
▷ Opinion context: time, location, others?
▷ Opinion sentiment (emotion): positive/negative,
happy or sad
Sentiment Analysis
▷ Input: An opinionated text object
▷ Output: Sentiment tag/Emotion label
•  Polarity analysis: {positive, negative, neutral}
•  Emotion analysis: happy, sad, anger
▷ Naive approach:
•  Apply classification, clustering for extracted text features
Text Features
▷ Character n-grams
•  Usually for spelling/recognition proof
•  Less meaningful
▷ Word n-grams
•  n should be bigger than 1 for sentiment analysis
▷ POS tag n-grams
•  Can mix words and POS tags
→  E.g., “adj noun”, “sad noun”
More Text Features
▷ Word classes
•  Thesaurus: LIWC
•  Ontology: WordNet, Yago, DBPedia
•  Recognized entities: DBPedia, Yago
▷ Frequent patterns in text
•  Could utilize pattern discovery algorithms
•  Optimizing the tradeoff between coverage and
specificity is essential
▷ Linguistic Inquiry and Word Count (LIWC)
•  LIWC2015
▷ Home page:
▷ 70 classes
▷ Developed by researchers with interests in social,
clinical, health, and cognitive psychology
▷ Cost: US$89.95
Emotion Analysis: Pattern Approach
▷ Carlos Argueta, Fernando Calderon, and Yi-Shin Chen,
Multilingual Emotion Classifier using Unsupervised Pattern
Extraction from Microblog Data, Intelligent Data Analysis - An
International Journal, 2016
Collect Emotion Data
Collect Emotion Data Wait!
Not-Emotion Data
Preprocessing Steps
▷ Hints: Remove troublesome ones
o  Too short
→  Too short to get important features
o  Contain too many hashtags
→  Too much information to process
o  Are retweets
→  Increase the complexity
o  Have URLs
→  Too troublesome to collect the page data
o  Convert user mentions to usermention and hashtags to
→  Remove the identification. We should not peek answers!
Basic Guidelines
▷ Identify the commonalities and differences between
the experimental and control groups
•  Analyze the frequency of words
→  TF•IDF (Term frequency, inverse document frequency)
•  Analyze the co-occurrence between words/patterns
→  Co-occurrence
•  Analyze the importance between words
→  Centrality
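As a small worked example of the first guideline, TF•IDF can be computed in a few lines. This sketch assumes the common variant tf = count / document length and idf = log(N / df); other weightings exist and the slides do not say which one was used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return scores

docs = [["i", "love", "this"], ["i", "hate", "this"], ["i", "love", "cats"]]
scores = tf_idf(docs)
```

A term that occurs in every document, such as "i" here, gets a score of zero, while a term confined to one document, such as "hate", scores highest in that document.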
Graph Construction
▷ Construct two graphs
•  E.g.
→  Emotion one: I love the World of Warcraft new game :)
→  Not-emotion one: 3,000 killed in the world by Ebola
Graph Processes
▷ Remove the elements common to the two graphs
•  Keep only the significant ones that appear solely in the
emotion graph
▷ Analyze the centrality of words
•  Betweenness, Closeness, Eigenvector, Degree, Katz
→  Can use free/open software, e.g., Gephi, GraphDB
▷ Analyze the degree of clustering
•  Clustering Coefficient
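A rough sketch of the graph steps, using the two example sentences from the Graph Construction slide. It assumes that nodes are lowercased words, that an edge links each word to the word that follows it, and that "common ones" means shared edges; the slides do not spell these details out. Only degree centrality is shown; a tool such as Gephi would supply the other measures.

```python
from collections import defaultdict

def build_graph(sentences):
    """Directed word graph: weight of (a, b) = times b directly follows a."""
    edges = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            edges[(a, b)] += 1
    return dict(edges)

def remove_common(emotion, not_emotion):
    """Keep only edges that appear in the emotion graph alone."""
    return {e: w for e, w in emotion.items() if e not in not_emotion}

def degree_centrality(edges):
    """Number of incident edges per node, normalized by (number of nodes - 1)."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    denom = max(len(degree) - 1, 1)
    return {node: d / denom for node, d in degree.items()}

emotion = build_graph(["I love the World of Warcraft new game :)"])
not_emotion = build_graph(["3,000 killed in the world by ebola"])
kept = remove_common(emotion, not_emotion)   # drops the shared edge ("the", "world")
centrality = degree_centrality(kept)
```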
Essence Only
Only key phrases
→emotion patterns
Emotion Patterns Extraction
o The goal:
o  Language independent extraction – not based on grammar or
manual templates
o  More representative set of features - balance between generality
and specificity
o  High recall/coverage – adapt to unseen words
o  Requiring only a relatively small number – high reliability
o  Efficient – fast extraction and utilization
o  Meaningful - even if there are no recognizable emotion words in
Patterns Definition
o Constructed from two types of elements:
o  Surface tokens: hello, :), lol, house, …
o  Wildcards: * (matches every word)
o Contains at least 2 elements
o Contains at least one of each type of element
Pattern | Matches
* this * | “Hate this weather”, “love this drawing”
* * :) | “so happy :)”, “to me :)”
luv my * | “luv my gift”, “luv my price”
* that | “want that”, “love that”, “hate that”
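The definition can be checked mechanically. The sketch below assumes a pattern is matched against a whole phrase, token by token and case-insensitively; matching inside longer tweets would need a sliding window on top of this. Function names are illustrative.

```python
def is_valid_pattern(pattern):
    """At least 2 elements, with at least one surface token and one wildcard."""
    elements = pattern.split()
    has_wildcard = "*" in elements
    has_token = any(e != "*" for e in elements)
    return len(elements) >= 2 and has_wildcard and has_token

def matches(pattern, phrase):
    """True if every pattern element is '*' or equals the word at that position."""
    elements = pattern.split()
    words = phrase.lower().split()
    if len(elements) != len(words):
        return False
    return all(e == "*" or e == w for e, w in zip(elements, words))

examples = {
    "* this *": ["Hate this weather", "love this drawing"],
    "luv my *": ["luv my gift", "luv my price"],
    "* that": ["want that", "love that", "hate that"],
}
all_match = all(matches(p, phrase) for p, phrases in examples.items()
                for phrase in phrases)
```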
Patterns Construction
o Constructed from instances
o An instance is a sequence of 2 or more words from CW
(connector words) and SW (subject words)
o Contains at least one CW and one SW
“hate this weather”
“so happy :)”
“luv my gift”
“love this drawing”
“luv my price”
“to me :)”
“kill this idiot”
“finish this task”
Patterns Construction (2)
o Find all instances in a corpus with their frequency
o Aggregate counts by grouping them based on length
and position of matching CW
Instances | Count
“hate this weather” | 5
“so happy :)” | 4
“luv my gift” | 7
“love this drawing” | 2
“luv my price” | 1
“to me :)” | 3
“kill this idiot” | 1
“finish this task” | 4
Groups | Count
“Hate this weather”, “love this drawing”, “kill this idiot”, “finish this task” | 12
“so happy :)”, “to me :)” | 7
“luv my gift”, “luv my price” | 8
… | …
Patterns Construction (3)
o Replace all the SWs by a wildcard * and keep the CWs to
convert all instances into the representing pattern
o The wildcard matches any word and is used for term
o Infrequent patterns are filtered out
Pattern | Groups | Count
* this * | “Hate this weather”, “love this drawing”, “kill this idiot”, “finish this task” | 12
* * :) | “so happy :)”, “to me :)” | 7
luv my * | “luv my gift”, “luv my price” | 8
… | … | …
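The three construction steps can be reproduced on the slide's own counts. In this sketch the CW set is chosen by hand so that it yields the patterns shown (with the smiley written as ":)"); how CWs and SWs are actually selected, and the frequency cutoff, are not given here and are assumptions.

```python
from collections import Counter

CW = {"this", "my", "luv", ":)"}   # hypothetical CW set for this example only

def to_pattern(instance, cw=CW):
    """Keep CWs as surface tokens and replace every other word (SW) by '*'."""
    return " ".join(t if t in cw else "*" for t in instance.lower().split())

def build_patterns(instance_counts, min_count=2):
    """Aggregate instance counts per pattern and drop infrequent patterns."""
    totals = Counter()
    for instance, count in instance_counts.items():
        totals[to_pattern(instance)] += count
    return {p: c for p, c in totals.items() if c >= min_count}

instance_counts = {
    "hate this weather": 5, "so happy :)": 4, "luv my gift": 7,
    "love this drawing": 2, "luv my price": 1, "to me :)": 3,
    "kill this idiot": 1, "finish this task": 4,
}
patterns = build_patterns(instance_counts)
```

Grouping by the resulting pattern string covers both the length and the position of the matching CWs, since two instances map to the same string only if they agree on both.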
Ranking Emotion Patterns
▷ Ranking the emotion patterns for each emotion
•  Frequency, exclusiveness, diversity
•  One ranked list for each emotion
(Figure: ranked pattern lists for Sad, Joy, and Anger)
Contextual Text Mining
Basic Concepts
▷ Text usually has rich context information
•  Direct context (meta-data): time, location, author
•  Indirect context: social networks of authors,
other text related to the same source
•  Any other related text
▷ Context could be used for:
•  Partitioning the data
•  Providing extra features
Contextual Text Mining
▷ Query log + User = Personalized search
▷ Tweet + Time = Event identification
▷ Tweet + Location-related patterns = Location identification
▷ Tweet + Sentiment = Opinion mining
▷ Text Mining + Context → Contextual Text Mining
Partition Text
(Figure: text partitioned by context: by user (User 1, User 2, …, User n), by user age (users above age 65, users under age 12), by time (1998–2006, e.g., data within year 2000), and by content (posts containing #sad))
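Partitioning by context amounts to grouping posts by a key derived from their metadata or content. A minimal sketch, assuming each post is a dict with hypothetical `user`, `age`, `year`, and `text` fields:

```python
from collections import defaultdict

def partition(posts, key):
    """Group post texts by the value that `key` returns for each post."""
    parts = defaultdict(list)
    for post in posts:
        parts[key(post)].append(post["text"])
    return dict(parts)

def age_group(post):
    """Age buckets taken from the figure; everything else goes to 'other'."""
    if post["age"] > 65:
        return "above 65"
    if 12 > post["age"]:
        return "under 12"
    return "other"

posts = [  # made-up records for illustration
    {"user": "User 1", "age": 70, "year": 2000, "text": "feeling #sad today"},
    {"user": "User 2", "age": 10, "year": 2000, "text": "new game!"},
    {"user": "User 1", "age": 70, "year": 2003, "text": "sunny morning"},
]
by_year = partition(posts, lambda p: p["year"])
by_age = partition(posts, age_group)
by_tag = partition(posts, lambda p: "#sad" in p["text"])
```

Each partition can then be mined separately and the results compared across partitions.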
Generative Model of Text
(Figure: example documents “I eat fish and …”, “Dog and … are pets.”, “My kitten eats fish.”; figure labels: Analyze, Model)
$P(\text{word} \mid \text{Model})$
Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten
Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten
Contextualized Models of Text
(Figure: example documents “I eat fish and …”, “Dog and … are pets.”, “My kitten eats fish.”; figure labels: Analyze, Model)
$P(\text{word} \mid \text{Model}, \text{Context})$
Naïve Contextual Topic Model
(Figure: example documents “I eat fish and …”, “Dog and … are pets.”, “My kitten eats fish.”, with one set of topics per context)
$$P(w) = \sum_{j=1}^{C} \sum_{i=1}^{K} P(c = j)\, P(z = i \mid \text{Context}_j)\, P(w \mid \text{Topic}_i, \text{Context}_j)$$
Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten
Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten
(the topic tables appear once per context in the figure)
How do we estimate it? → Different approaches for different contextual data and problems
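To make the mixture concrete, the sketch below evaluates P(w) for two contexts and two topics. The word probabilities for "fish" are the ones shown on the slide; the context prior and the per-context topic weights are made-up numbers for illustration only.

```python
def word_probability(word, p_context, p_topic_given_context, p_word):
    """P(w) = sum_j sum_i P(c=j) * P(z=i | Context_j) * P(w | Topic_i, Context_j)."""
    total = 0.0
    for j, pc in enumerate(p_context):
        for i, pz in enumerate(p_topic_given_context[j]):
            total += pc * pz * p_word[j][i].get(word, 0.0)
    return total

# Top words per topic, taken from the slide (truncated lists, so they do not sum to 1).
topic1 = {"fish": 0.268, "pet": 0.210, "dog": 0.210, "kitten": 0.147}
topic2 = {"eat": 0.296, "fish": 0.265, "vegetable": 0.189, "kitten": 0.121}

p_context = [0.5, 0.5]                            # assumed P(c = j)
p_topic_given_context = [[0.7, 0.3], [0.4, 0.6]]  # assumed P(z = i | Context_j)
p_word = [[topic1, topic2], [topic1, topic2]]     # same topics in both contexts here

p_fish = word_probability("fish", p_context, p_topic_given_context, p_word)
```

Estimating these distributions from data, rather than fixing them by hand as done here, is the part that differs between approaches.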
Contextual Probabilistic Latent Semantic
Analysis (CPLSA) (Mei, Zhai, KDD 2006)
▷ An extension of the PLSA model ([Hofmann 99]) by
•  Introducing context variables
•  Modeling views of topics
•  Modeling coverage variations of topics
▷ Process of contextual text mining
•  Instantiation of CPLSA (context, views, coverage)
•  Fit the model to text data (EM algorithm)
•  Compare a topic from different views
•  Compute strength dynamics of topics from coverages
•  Compute other probabilistic topic patterns
The Probabilistic Model
•  A probabilistic model explaining the generation of a
document D and its context features C: if an author
wants to write such a document, the author will
–  Choose a view $v_i$ according to the view distribution $p(v_i \mid D, C)$
–  Choose a coverage $\kappa_j$ according to the coverage distribution $p(\kappa_j \mid D, C)$
–  Choose a theme $\theta_{il}$ according to the coverage $\kappa_j$
–  Generate a word using $\theta_{il}$
–  The log-likelihood of the document collection is:
$$\log p(\mathcal{D}) = \sum_{(D,C) \in \mathcal{D}} \sum_{w \in V} c(w, D) \log \sum_{i=1}^{n} p(v_i \mid D, C) \sum_{j=1}^{m} p(\kappa_j \mid D, C) \sum_{l=1}^{k} p(l \mid \kappa_j)\, p(w \mid \theta_{il})$$
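The generative steps translate directly into a nested sum over views, coverages, and themes. The sketch below evaluates the word probability and the log-likelihood contribution of one document for fixed, made-up parameters; it does not perform the EM fitting, and all names and numbers are illustrative assumptions.

```python
import math

def cplsa_word_prob(word, p_view, p_cov, p_theme_given_cov, theta):
    """sum_i p(v_i|D,C) sum_j p(k_j|D,C) sum_l p(l|k_j) p(w|theta_il)."""
    total = 0.0
    for i, pv in enumerate(p_view):
        for j, pk in enumerate(p_cov):
            for l, pl in enumerate(p_theme_given_cov[j]):
                total += pv * pk * pl * theta[i][l].get(word, 0.0)
    return total

def doc_log_likelihood(word_counts, p_view, p_cov, p_theme_given_cov, theta):
    """sum_w c(w, D) * log p(w | D, C) for a single document."""
    return sum(count * math.log(cplsa_word_prob(w, p_view, p_cov,
                                                 p_theme_given_cov, theta))
               for w, count in word_counts.items())

# Toy setting: one view, one coverage, two themes (made-up numbers).
p_view = [1.0]
p_cov = [1.0]
p_theme_given_cov = [[0.5, 0.5]]
theta = [[{"fish": 0.4, "pet": 0.6}, {"fish": 0.8, "eat": 0.2}]]

p_fish = cplsa_word_prob("fish", p_view, p_cov, p_theme_given_cov, theta)
ll = doc_log_likelihood({"fish": 2}, p_view, p_cov, p_theme_given_cov, theta)
```

Summing `doc_log_likelihood` over all (D, C) pairs gives the collection-level quantity that EM would maximize.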
Contextual Text Mining Example 1
Event Identification
▷ Event definition:
•  Something (non-trivial) happening in a certain
place at a certain time (Yang et al. 1998)
▷ Features of events:
•  Something (non-trivial)
•  Certain time
•  Certain place
▷ Identify events from social streams
▷ Events have these characteristics
•  Content
→  Many words related to one topic
•  Time of occurrence
→  Time dependent
•  Influenced users
→  Transmitted to others
Evolving Social Graphs
Keyword Selection
▷ Well-noticed criterion
•  Compared to the past, if a word is suddenly mentioned by
many users, it is well-noticed
•  Time Frame – a unit of time period
•  Sliding Window – a certain number of past time frames
Keyword Selection
▷ Compare the proportion (probability) of this word's count in the
current time frame with that in the past sliding window
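One way to read the well-noticed criterion is as a ratio test between the current time frame and the average of the sliding window. The sketch below assumes each frame is a dict mapping a word to the number of users who mentioned it, and uses a hypothetical threshold `ratio`; the slides do not state the exact test.

```python
def is_well_noticed(word, current_frame, past_frames, ratio=2.0):
    """True if the word's share of mentions now is at least `ratio` times
    its average share over the past sliding window."""
    def proportion(frame):
        total = sum(frame.values())
        return frame.get(word, 0) / total if total else 0.0

    now = proportion(current_frame)
    window = [proportion(f) for f in past_frames]
    past = sum(window) / len(window) if window else 0.0
    return now > 0 and now >= ratio * past

past_frames = [{"boston": 1, "lunch": 9}, {"boston": 1, "lunch": 9}]
current_frame = {"boston": 8, "lunch": 2}
selected = [w for w in current_frame
            if is_well_noticed(w, current_frame, past_frames)]
```

Using proportions rather than raw counts keeps the test from firing merely because overall traffic went up.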
Event Candidate Recognition
▷ Concept-based event: all keywords about the same
•  Use the co-occurrence of words in tweets to connect keywords
A huge number of keywords end up connected as one event
(Figure: keyword co-occurrence graph with nodes such as explosion, confirm, prayer, bombing, boston, victim, afghanistan)
Event Candidate Recognition
▷ Idea: group one keyword with its most relevant keywords into
one event candidate
▷ How do we decide which keyword to start grouping from?
(Figure: keyword co-occurrence graph with nodes such as explosion, confirm, prayer, bombing, boston, victim, afghanistan)
Event Candidate Recognition
▷ TextRank: rank importance of each
•  Number of edges
•  Edge weights (word count and word co-occurrence)
$$\text{weight} = \frac{\text{cooccur}(k_1, k_2)}{\text{count}(k_1)}$$
Boston marathon
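The edge weight and a TextRank-style ranking can be sketched in plain Python. This assumes counts are taken per tweet, that the weight is directed from k1 to k2, and that ranking follows the usual weighted PageRank update with damping 0.85; these choices and the sample tweets are assumptions, not details from the slides.

```python
from collections import Counter
from itertools import permutations

def edge_weights(tweets, keywords):
    """weight(k1 -> k2) = cooccur(k1, k2) / count(k1), counted per tweet."""
    count, cooccur = Counter(), Counter()
    for tweet in tweets:
        present = sorted(set(tweet.lower().split()) & keywords)
        count.update(present)
        cooccur.update(permutations(present, 2))   # both directions
    return {(k1, k2): c / count[k1] for (k1, k2), c in cooccur.items()}

def text_rank(weights, damping=0.85, iterations=50):
    """Weighted PageRank over the keyword graph."""
    nodes = {n for edge in weights for n in edge}
    out_sum = Counter()
    for (k1, _), w in weights.items():
        out_sum[k1] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        rank = {n: (1 - damping) / len(nodes) + damping * sum(
                    rank[k1] * w / out_sum[k1]
                    for (k1, k2), w in weights.items() if k2 == n)
                for n in nodes}
    return rank

tweets = ["explosion at boston marathon", "boston marathon bombing",
          "boston explosion", "marathon training"]      # made-up examples
keywords = {"boston", "marathon", "explosion", "bombing"}
weights = edge_weights(tweets, keywords)
rank = text_rank(weights)
```

The highest-ranked keyword is then a natural place to start grouping.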
Event Candidate Recognition
1.  Pick the most relevant neighbor keywords
(Figure: neighbor keywords labeled with normalized edge weights, e.g., 0.07 and 0.01)
Event Candidate Recognition
2.  Check neighbor keywords with the
secondary keyword
Event Candidate Recognition
3.  Group the remaining keywords as one event
Event Candidate
Evolving Social Graph Analysis
▷ Bring in social relationships to estimate
information propagation
▷ Social Relation Graph
•  Vertex: users that mentioned one or more keywords in the
candidate event
•  Edge: a “following” relationship between users
(Figure: social relation graph for the keywords Boston, marathon, explosion, bomb)
Evolving Social Graph Analysis
▷ Add the idea of evolution (time frame
increments): graph sequences
Evolving Social Graph Analysis
▷ Information decay:
•  Vertex weight, edge weight
•  Decay mechanism
▷ Concept-Based Evolving Graph Sequences (cEGS): a
sequence of directed graphs that demonstrate information
Methodology – Evolving Social Graph Analysis
▷ Hidden link – construct hidden relationship
•  To better model the interaction between users
•  Sample data
Evolving Social Graph Analysis
▷ Analysis of cEGS
•  Number of vertices (nV)
→  The number of users who mentioned this event candidate
•  Number of edges (nE)
→  The number of following relationships in this cEGS
•  Number of connected components (nC)
→  The number of communities in this cEGS
•  Reciprocity (R)
→  The degree of mutual connection in this cEGS
$$\text{Reciprocity} = \frac{\text{number of edges}}{\text{number of possible edges}}$$
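The four measures can be computed for one graph in the sequence as follows. The sketch assumes a directed graph without self-loops, so that the number of possible edges is nV × (nV − 1), and counts components while ignoring edge direction; both are assumptions, since the slides do not define them.

```python
def graph_metrics(vertices, edges):
    """vertices: set of users; edges: set of (follower, followee) pairs."""
    n_v, n_e = len(vertices), len(edges)

    # Connected components, treating edges as undirected.
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, n_c = set(), 0
    for v in vertices:
        if v in seen:
            continue
        n_c += 1
        stack = [v]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(neighbors[u] - seen)

    possible = n_v * (n_v - 1)          # assumed: directed, no self-loops
    reciprocity = n_e / possible if possible else 0.0
    return {"nV": n_v, "nE": n_e, "nC": n_c, "R": reciprocity}

metrics = graph_metrics({"a", "b", "c", "d"},
                        {("a", "b"), ("b", "a"), ("c", "a")})
```

Tracking these values per time frame gives the curves used to separate event types.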
Event Identification
▷ Type1 Event: One-shot event
•  An event that receives attention in a short period of time
→  Number of users, number of followings, number of
connected components suddenly increase
▷ Type2 Event: Long-run event
•  An event that attracts much discussion over a period of time
→  Reciprocity
▷ Non-event
Experimental Results
▷ April 15, “library jfk blast bostonmarathon prayforboston pd
possible area boston police social explosion bomb bombing
marathon confirm incident”
(Figure: daily values for April 1–30 of the number of users, number of followings, number of connected components, reciprocity, and hidden link reciprocity; x-axis: days in April)
Contextual Text Mining Example 2
Location Identification
Twitter and Geo-Tagging
▷ Geospatial features in Twitter have been available
•  User’s Profile location
•  Per tweet geo-tagging
▷ Users have been slow to adopt these features
•  On a 1-million-record sample, 0.30% of tweets were geo-tagged
▷ Locations are not specific
Our Goal
▷ Identify the location of a particular Twitter
user at a given time
•  Using exclusively the content of his/her tweets
Data Sources
▷ Twitter Dataset
•  13 million tweets, 1.53 million profiles
•  From Nov. ’11 to Apr. ’12
▷ Local Cache
•  Geospatial Resources
→  GeoNames Spatial DB
→  Wikiplaces DB
•  Lexical Resources
→  WordNet
▷ Web Resources
Baseline Classification
▷ We are interested only in tweets that might
suggest a location.
▷ Tweet Classification
•  Direct Subject
→  Has a first-person personal pronoun (I, me, myself, …) – “I love New York”
•  Anonymous Subject
→  Starts with a verb – “Rocking the bar tonight!”
→  Composed of only nouns and adjectives – “Perfect
day at the beach”
•  Others
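These rules can be written as a small classifier over POS-tagged tweets. In this sketch the tags are assigned by hand and the tag names are hypothetical; ignoring determiners and prepositions in the "only nouns and adjectives" test is an assumption made so that the slide's own example passes.

```python
FIRST_PERSON = {"i", "me", "myself"}      # list from the slide; it continues with "…"
CONTENT_TAGS = {"NOUN", "ADJ"}
IGNORED_TAGS = {"DET", "ADP", "PUNCT"}    # assumed: function words are ignored

def classify(tagged):
    """tagged: list of (word, tag) pairs for one tweet."""
    words = [w.lower() for w, _ in tagged]
    if any(w in FIRST_PERSON for w in words):
        return "direct subject"
    if tagged and tagged[0][1] == "VERB":
        return "anonymous subject"
    content = [t for _, t in tagged if t not in IGNORED_TAGS]
    if content and all(t in CONTENT_TAGS for t in content):
        return "anonymous subject"
    return "others"

examples = [
    [("I", "PRON"), ("love", "VERB"), ("New", "PROPN"), ("York", "PROPN")],
    [("Rocking", "VERB"), ("the", "DET"), ("bar", "NOUN"), ("tonight", "NOUN")],
    [("Perfect", "ADJ"), ("day", "NOUN"), ("at", "ADP"), ("the", "DET"),
     ("beach", "NOUN")],
    [("It", "PRON"), ("rains", "VERB")],
]
labels = [classify(t) for t in examples]
```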
Rule Generation
▷ By identifying certain association rules, we can
identify certain locations
▷ Combinations of certain verbs and nouns (bigrams)
imply some locations
•  “Wait” + “Flight” → Airport
•  “Go” + “Funeral” → Funeral Home
▷ Create combinations of all synonyms, meronyms
and hyponyms related to a location
▷ Number of occurrences must be greater than K
▷ Each rule must not overlap with other categories
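The last two conditions, a minimum support K and exclusivity across location categories, can be sketched as a filter over observed (verb, noun) pairs. The input format, a list of pairs already associated with a location category, and the sample data are assumptions for illustration.

```python
from collections import Counter, defaultdict

def generate_rules(observations, k):
    """observations: list of ((verb, noun), location) items.
    Keep pairs seen more than k times that occur under exactly one location."""
    counts = Counter(observations)
    locations = defaultdict(set)
    for pair, location in counts:
        locations[pair].add(location)
    return {pair: location
            for (pair, location), n in counts.items()
            if n > k and len(locations[pair]) == 1}

observations = (                      # made-up training data
    [(("wait", "flight"), "Airport")] * 3
    + [(("go", "funeral"), "Funeral Home")] * 3
    + [(("go", "party"), "Bar")] * 3
    + [(("go", "party"), "Home")]
    + [(("buy", "ticket"), "Airport")]
)
rules = generate_rules(observations, k=2)
```

Here ("go", "party") is dropped because it occurs under two categories, and ("buy", "ticket") because it is too rare.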
Hyponyms: an airfield is a kind of airport
Meronyms: a gate is a part of an airport
(Figure label: Restaurant)
N-gram Combinations
▷ Identify likely location candidates
▷ Contiguous sequences of n items
▷ Only nouns and adjectives
▷ Extracted from Direct Subject and Anonymous
Subject categories.
Tweet Types
▷ Coordinates
•  Tweet has geographical coordinates
▷ Explicit Specific
•  “I Love Hsinchu”
→  Toponyms
▷ Explicit General
•  “Going to the Gym”
→  Through Hypernyms
▷ Implicit
•  “Buying a new scarf” - Department Store
•  Emphasis on actions
•  Identified through a web search
Country Discovery
▷ Identify the user’s country
▷ Identified with OPTICS algorithm
▷ Cluster all previously marked n-grams
▷ Most significant cluster is retrieved
•  User’s country
Inner Region Discovery
•  Identify user’s Hometown
•  Remove toponyms
•  Clustered with OPTICS
•  Only locations within ± 2 σ
•  Most representative cluster
–  Reverse geocoded
–  Most representative city
Coordinates: Lat 41.3948029987514, Long -73.472126500681
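The ± 2 σ step can be illustrated on its own. This sketch keeps only points whose latitude and longitude both lie within two standard deviations of the mean; applying the filter per axis, and using the population standard deviation, are assumptions. The OPTICS clustering and reverse geocoding steps are not shown.

```python
from statistics import mean, pstdev

def within_two_sigma(points):
    """points: list of (lat, lon). Drop points farther than 2 sigma from the
    mean on either axis."""
    if not points:
        return []
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    m_lat, s_lat = mean(lats), pstdev(lats)
    m_lon, s_lon = mean(lons), pstdev(lons)
    return [(lat, lon) for lat, lon in points
            if 2 * s_lat >= abs(lat - m_lat) and 2 * s_lon >= abs(lon - m_lon)]

# Nine made-up points near the slide's example coordinates, plus one outlier.
points = [(41.39 + 0.001 * i, -73.47 + 0.001 * i) for i in range(9)]
points.append((25.0, -80.0))
kept = within_two_sigma(points)
```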
Web Search
▷ Use web services
▷ Example :
•  “What's the weather like outside? I haven't left the library in three
•  Search: Library near Danbury, Connecticut, US
Timeline Sorting
Tweet A (9 AM): “I have to agree that Washington is so nice at this time of the year”
Tweet B (11 AM): “Heat Wave in Texas”
(Figure: timeline running from (Tweet A) – X days, through Tweet A and Tweet B, to (Tweet B) + X days)
Location Inferred
▷ Users are classified according to the particular
findings :
•  No Country (No Information)
•  Just Country
•  Timeline
→  Current and past locations
•  Timeline with Hometown
→  Current and past locations
→  User’s Hometown
→  General Locations
General Statistics
Some Remarks
Some but not all
Knowledge Discovery (KDD) Process
Data Cleaning
Data Integration
Data Mining
Always Remember
▷ Have a good and solid objective
•  No goal no gold
•  Know the relationships between them

While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017

Último (20)

NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2

[系列活動] 文字探勘者的入門心法 ([Event Series] An Introduction to the Essentials of Text Mining)

  • 1. Quick Tour of Text Mining Yi-Shin Chen Institute of Information Systems and Applications Department of Computer Science National Tsing Hua University
  • 2. About Speaker 陳宜欣 Yi-Shin Chen ▷ Currently •  Associate Professor, Department of Computer Science, National Tsing Hua University •  Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab) ▷ Education •  Ph.D. in Computer Science, USC, USA •  M.B.A. in Information Management, NCU, TW •  B.B.A. in Information Management, NCU, TW ▷ Courses (all in English) •  Research and Presentation Skills •  Introduction to Database Systems •  Advanced Database Systems •  Data Mining: Concepts, Techniques, and Applications @ Yi-Shin Chen, Text Mining Overview
  • 3. Research Focus from 2000 [Figure: database (DB) research areas: Storage, Index, Optimization, Query, Mining]
  • 5. Past and Current Studies (Text Processing) ▷ Location identification ▷ Interest identification ▷ Event identification ▷ Extract semantic relationships ▷ Unsupervised multilingual sentiment analysis ▷ Keyword extraction and summary ▷ Emotion analysis ▷ Mental illness detection
  • 6. Text Mining Overview
  • 7. Data (Text vs. Non-Text) ▷ Objective data: the world is interpreted by sensors, which report data •  Weather: thermometer, hygrometer → 24°C, 55% •  Location: GPS → 37°N, 123°E •  Body: sphygmometer, MRI, etc. → 126/70 mmHg ▷ Subjective data: the world is interpreted by humans, who report text •  To be or not to be..
  • 8. Data Mining vs. Text Mining ▷ Non-text data •  Numerical, categorical, relational •  Precise •  Objective ▷ Text data •  Text •  Ambiguous •  Subjective ▷ Data Mining •  Clustering •  Classification •  Association Rules •  … [Diagram: non-text data → Preprocessing → Data Mining; text data → Text Processing (including NLP) → Data Mining; the text path as a whole is Text Mining]
  • 10. Data Collection ▷ Align/classify the attributes correctly •  Who posted this message •  Mentioned user •  Hashtag •  Shared URL
  • 11. Language Detection ▷ Detect the language (or the possible languages) in which the given text is written ▷ Difficulties •  Short messages •  Different languages in one statement •  Noisy text ▷ Examples •  你好 現在幾點鐘 → Traditional Chinese (zh-tw) •  apa kabar sekarang jam berapa ? → Indonesian (id)
  • 12. Wrong Detection Examples ▷ Twitter examples (detected language before → after removing noise) •  @sayidatynet top song #LailaGhofran shokran ya garh new album #listen (en → id) •  中華隊的服裝挺特別的,好藍。。。 #ChineseTaipei #Sochi #2014冬奧 (it → zh-tw) •  授業前の雪合戦w d9b5peaq7J (en → ja)
  • 13. Removing Noise ▷ Remove noise before detection •  HTML files: tags •  Twitter: hashtags, mentions, URLs ▷ Example •  Before, detected as English (en): meta name=twitter:description content=觸犯法國隱私法〔駐歐洲特派記者胡蕙寧、國際新聞中心/綜合報導〕網路搜尋引擎巨擘Google8日在法文版首頁(張貼悔過書 .../ •  After, detected as Traditional Chinese (zh-tw): 觸犯法國隱私法〔駐歐洲特派記者胡蕙寧、國際新聞中心/綜合報導〕網路搜尋引擎巨擘Google8日在法文版首頁(張貼悔過書 ...
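The noise-removal step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four patterns below are simple stand-ins chosen for this example, not the rules used in the talk, and `remove_noise` is a hypothetical helper name.

```python
import re

# Illustrative patterns (assumptions, not the speaker's exact rules).
TAG_RE = re.compile(r"<[^>]+>")        # HTML tags
URL_RE = re.compile(r"https?://\S+")   # URLs
MENTION_RE = re.compile(r"@\w+")       # Twitter mentions
HASHTAG_RE = re.compile(r"#\w+")       # Twitter hashtags


def remove_noise(text: str) -> str:
    """Drop HTML tags, URLs, mentions, and hashtags, then collapse whitespace."""
    for pattern in (TAG_RE, URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


tweet = "中華隊的服裝挺特別的,好藍。。。 #ChineseTaipei #Sochi #2014冬奧"
print(remove_noise(tweet))
```

Tags are removed before URLs so that a URL pattern ending in `\S+` cannot swallow an adjacent tag. The cleaned string would then be passed to whatever language detector is in use.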
  • 14. Data Cleaning ▷ Special characters •  Unicode emoticons: ☺, ♥… •  Symbol icons: ☏, ✉… •  Currency symbols: €, £, $... •  Tweet URLs ▷ Utilize regular expressions to clean data •  Remove tweet URLs: (^|\s*)http(\S+)?(\s*|$) •  Filter out everything that is not a letter, space, punctuation mark, or digit: (\p{L}+)|(\p{Z}+)|(\p{Punct}+)|(\p{Digit}+) ▷ Example inputs •  ◕‿◕ Friendship is everything ♥ ✉ •  I added a video to a @YouTube playlist http:// Jamie Riepe
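A rough Python equivalent of the two cleaning rules is sketched below. Python's built-in `re` module does not support `\p{...}` classes, so this sketch swaps in `unicodedata.category` for the letter, separator, and digit tests and `string.punctuation` for ASCII punctuation; that substitution and the helper names `is_kept` and `clean_text` are assumptions made for illustration.

```python
import re
import string
import unicodedata

# URL pattern from the slide, with the backslashes written out.
URL_RE = re.compile(r"(^|\s*)http(\S+)?(\s*|$)")


def is_kept(ch: str) -> bool:
    """Keep letters, separators (spaces), decimal digits, and ASCII punctuation."""
    category = unicodedata.category(ch)
    return (
        category.startswith("L")
        or category.startswith("Z")
        or category == "Nd"
        or ch in string.punctuation
    )


def clean_text(text: str) -> str:
    """Remove URLs, replace every other special character with a space, and tidy up."""
    text = URL_RE.sub(" ", text)
    kept = "".join(ch if is_kept(ch) else " " for ch in text)
    return " ".join(kept.split())


print(clean_text("◕‿◕ Friendship is everything ♥ ✉"))
```

Dropped characters are replaced by a space rather than deleted so that two words separated only by a symbol do not get glued together.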
  • 15. Japanese Examples ▷ Use a regular expression (\W) to remove all special characters •  Before: うふふふふ(*^^*)楽しむ!ありがとうございます^o^ アイコン、ラブラブ(-_-)♡ •  After: うふふふふ 楽しむ ありがとうございます アイコン ラブラブ
  • 16. Part-of-speech (POS) Tagging ▷ Process text and assign a part of speech to each word ▷ Twitter POS tagging •  Noun (N), Adjective (A), Verb (V), URL (U)… ▷ Example •  Input: Happy Easter! I went to work and came home to an empty house now im going for a quick run •  Output: Happy_A Easter_N !_, I_O went_V to_P work_N and_ came_V home_N to_P an_D empty_A house_N now_R im_L going_V for_P a_D quick_A run_N
  • 17. Stemming ▷ @DirtyDTran gotta be caught up for tomorrow nights episode → @DirtyDTran gotta be catch up for tomorrow night episode ▷ @ASVP_Jaykey for some reasons I found this very amusing → @ASVP_Jaykey for some reason I find this very amusing ▷ RT @kt_biv : @caycelynnn loving and missing you! we are still looking for Lucy → love, miss, be, look
  • 18. Hashtag Segmentation ▷ By using Microsoft Web N-Gram Service (or by using the Viterbi algorithm) ▷ Example: Wow! explosion at a boston race ... #prayforboston •  #prayforboston → #pray #for #boston •  #citizenscience → #citizen #science •  #bostonmarathon → #boston #marathon •  #goodthingsarecoming → #good #things #are #coming •  #lowbloodpressure → #low #blood #pressure
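The Viterbi option can be illustrated with a small dynamic program that picks the most probable split under a unigram model. The word counts below are made up for this sketch; a real system would take its statistics from a large corpus or an n-gram service, and `segment_hashtag` is a hypothetical function name.

```python
import math

# Toy unigram counts: an assumption for illustration only.
UNIGRAM_COUNTS = {
    "pray": 50, "for": 500, "boston": 80, "marathon": 30,
    "low": 90, "blood": 60, "pressure": 40,
}
TOTAL = sum(UNIGRAM_COUNTS.values())
MAX_WORD_LEN = max(map(len, UNIGRAM_COUNTS))


def word_logprob(word: str) -> float:
    """Log-probability of a word; unknown words are impossible in this toy model."""
    count = UNIGRAM_COUNTS.get(word)
    return math.log(count / TOTAL) if count else float("-inf")


def segment_hashtag(tag: str) -> list:
    """Viterbi search for the highest-probability segmentation of a hashtag."""
    text = tag.lstrip("#").lower()
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == float("-inf"):
        return [text]  # no segmentation found: keep the hashtag whole
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


print(segment_hashtag("#prayforboston"))
```

Falling back to the unsegmented text for out-of-vocabulary hashtags is a design choice of this sketch; smoothing unknown words is another common option.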
  • 19. More Preprocesses for Different Web Data ▷ Extract source code without JavaScript ▷ Remove HTML tags
  • 20. Extract Source Code Without JavaScript ▷ JavaScript code should be treated as an exception •  It may contain hidden content
  • 21. Remove HTML Tags ▷ Remove HTML tags to extract meaningful content
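One way to sketch both steps with only the Python standard library is an `html.parser.HTMLParser` subclass that collects text while skipping `<script>` and `<style>` bodies. The class name `TextExtractor` and the helper `html_to_text` are invented for this example, and dropping script content entirely is an assumption; the slides note that scripts may carry hidden content worth inspecting separately.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = "<html><head><script>var x = '<b>hidden</b>';</script></head><body><p>Hello <b>world</b></p></body></html>"
print(html_to_text(page))
```

A parser is used instead of a tag-stripping regular expression because script bodies can themselves contain text that looks like markup.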
  • 22. More Preprocesses for Different Languages ▷ Chinese Simplified/Traditional conversion ▷ Word segmentation
  • 23. Chinese Simplified/Traditional Conversion ▷ Word conversion •  请乘客从后门落车 → 請乘客從後門下車 ▷ One-to-many mapping •  @shinrei 出去旅游还是崩坏 → @shinrei 出去旅游還是崩壞 •  游 (zh-cn) → 游|遊 (zh-tw) ▷ Wrong segmentation (內存 "memory" | 存在 "exist") •  人体内存在很多微生物 → segmented on 內存: 人體 記憶體 在很多微生物 •  人体内存在很多微生物 → segmented on 存在: 人體內 存在 很多微生物
  • 24. Wrong Chinese Word Segmentation ▷ Wrong segmentation (地面 "ground" | 面積 "area") •  這(Nep) 地面(Nc) 積(VJ) 還(D) 真(D) 不(D) 小(VH) ▷ Wrong word (實驗室 "laboratory" | 室友 "roommate") •  @iamzeke 實驗(Na) 室友(Na) 多(Dfa) 危險(VH) 你(Nh) 不(D) 知道(VK) 嗎(T) ? ▷ Wrong order (存在 "exist" | 內在 "inner") •  人體(Na) 存(VC) 內在(Na) 很多(Neqa) 微生物(Na) ▷ Unknown word (團購 "group buying") •  半夜(Nd) 逛團(Na) 購(VC) 看到(VE) 太(Dfa) 吸引人(VH) !!
  • 25. Back to Text Mining: Let’s come back to Text Mining
  • 26. Data Mining vs. Text Mining ▷ Non-text data •  Numerical, categorical, relational •  Precise •  Objective ▷ Text data •  Text •  Ambiguous •  Subjective ▷ Data Mining •  Clustering •  Classification •  Association Rules •  … [Diagram: non-text data → Preprocessing → Data Mining; text data → Text Processing (including NLP) → Data Mining; the text path as a whole is Text Mining]
  • 27. Landscape of Text Mining [Figure: the world is perceived by devices (sensors), which report objective non-text data (context), e.g., 24°C, 55%, and by humans, who express subjective text, e.g., To be or not to be..] ▷ Mining knowledge about languages: Natural Language Processing; Text Representation; Word Association and Mining
  • 28. Landscape of Text Mining [Same figure] ▷ Mining content about the observers: Opinion Mining and Sentiment Analysis
  • 29. Landscape of Text Mining [Same figure] ▷ Mining content about the world: Topic Mining, Contextual Text Mining
  • 30. Structure of Information Extraction System ▷ Input: document ▷ Local text analysis •  Lexical analysis •  Name recognition •  Partial syntactic analysis •  Scenario pattern matching ▷ Discourse analysis •  Co-reference analysis •  Inference •  Template generation ▷ Output: extracted templates
  • 31. Basic Concepts in NLP ▷ Example: This is the best thing happened in my life. ▷ Lexical analysis (part-of-speech tagging): each word gets a tag such as Det, Verb, Adj, N, Prep, PN ▷ Syntactic analysis (parsing): noun phrases and prepositional phrases combine into a sentence ▷ Semantic analysis: This? (t1), Best thing (t2), My (m1), Happened (t1, t2, m1) ▷ Inference (emotion analysis): Happy (x) if Happened (t1, ‘Best’, m1) → Happy
  • 32. Basic Concepts in NLP ▷ Levels of analysis for: This is the best thing happened in my life. •  String of characters •  String of words •  POS tags (Det, Verb, Adj, N, Prep, PN) •  Entities: This?, Best thing, My life, Period •  Relationships: Happened •  Emotion: Happy •  Understanding (logic predicates): The writer loves his newborn baby ▷ The deeper the NLP, the less accurate it is, but the closer it is to knowledge
  • 33. NLP vs. Text Mining ▷ Text Mining objectives •  Overview •  Know the trends •  Accept noise ▷ NLP objectives •  Understanding •  Ability to answer •  Immaculate
  • 34. Basic Data Model Concepts: Let’s learn from giants
  • 35. Data Models ▷ Data model/Language: a collection of concepts for describing data ▷ Schema/Structured observation: a description of a particular collection of data, using a given data model ▷ Data instance/Statements ▷ Example: using the ER model, a schema that represents the world: People, Drive, Car
  • 36. E-R Model ▷ Introduced by Peter Chen; ACM TODS, March 1976 •  Additional Readings →  Peter Chen. English Sentence Structure and Entity-Relationship Diagram. Information Sciences, Vol. 1, No. 1, Elsevier, May 1983, Pages 127-149 →  Peter Chen. A Preliminary Framework for Entity-Relationship Models. Entity-Relationship Approach to Information Modeling and Analysis, North-Holland (Elsevier), 1983, Pages 19-28
  • 37. E-R Model Basics: Entity ▷ Based on a perception of the real world, which consists of •  A set of basic objects ⇒ Entities •  Relationships among objects ▷ Entity: a real-world object distinguishable from other objects ▷ Entity Set: a collection of similar entities, e.g., all employees •  Presented as: [Figure: entity sets Animals, Time, People, Things, illustrated with the sentences "This is the best thing happened in my life." and "Dogs love their owners."]
  • 38. E-R: Relationship Sets ▷ Relationship: an association among two or more entities ▷ Relationship Set: a collection of similar relationships •  Relationship sets are presented as: [Figure: Animal, action, People, illustrated with "Dogs love their owners."] •  A relationship cannot exist without its corresponding entities
  • 39. High-Level Entity ▷ High-level entity: abstracted from a group of interconnected low-level entity and relationship types [Figure: entity sets Time and People and the relationship action, applied to "Alice loves me. This is the best thing happened in my life."; "Alice loves me" is abstracted into the high-level entity "This"]
  • 40. Word Relations: Back to text
  • 41. Word Relations ▷ Paradigmatic: can be substituted for each other (similar) •  E.g., cat and dog, run and walk ▷ Syntagmatic: can be combined with each other (correlated) •  E.g., cat and fights, dog and barks → These two basic and complementary relations can be generalized to describe relations of any items in a language [Figure: Animals, Act]
  • 42. Mining Word Associations ▷ Paradigmatic •  Represent each word by its context •  Compute context similarity •  Words with high context similarity are likely to be paradigmatically related ▷ Syntagmatic •  Count the number of times two words occur together in a context •  Compare the co-occurrences with the corresponding individual occurrences •  Words with high co-occurrence but relatively low individual occurrence are likely to be syntagmatically related
  • 43. Paradigmatic Word Associations ▷ Example sentences •  John’s cat eats fish in Saturday •  Mary’s dog eats meat in Sunday •  John’s cat drinks milk in Sunday •  Mary’s dog drinks beer in Tuesday ▷ Underlying structure: Human (John) Own Animals (Cat); John’s cat, Act (Eat), Food (Fish), Time (In Saturday) ▷ Contexts with the target word blanked out •  John’s --- eats fish in Saturday •  Mary’s --- eats meat in Sunday •  John’s --- drinks milk in Sunday •  Mary’s --- drinks beer in Tuesday ▷ Kinds of context: similar left context, similar right context, similar general context ▷ How similar are context(“cat”) and context(“dog”)? How similar are context(“cat”) and context(“John”)? → Expected Overlap of Words in Context (EOWC): Overlap(“cat”, “dog”), Overlap(“cat”, “John”)
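The idea of comparing contexts can be tried out directly on the four example sentences. The sketch below is an illustration under assumptions: the context of a word is taken to be its immediate left and right neighbours (a window of 1), features are tagged with their side, and the overlap is the dot product of the two normalized context vectors. The function names `context_vector` and `eowc` are hypothetical.

```python
from collections import Counter

SENTENCES = [
    "john's cat eats fish in saturday",
    "mary's dog eats meat in sunday",
    "john's cat drinks milk in sunday",
    "mary's dog drinks beer in tuesday",
]


def context_vector(target: str, window: int = 1) -> dict:
    """Relative frequencies of left (L:) and right (R:) neighbours of a word."""
    counts = Counter()
    for sentence in SENTENCES:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            for offset in range(1, window + 1):
                if i - offset >= 0:
                    counts["L:" + tokens[i - offset]] += 1
                if i + offset < len(tokens):
                    counts["R:" + tokens[i + offset]] += 1
    total = sum(counts.values())
    return {feature: c / total for feature, c in counts.items()} if total else {}


def eowc(word1: str, word2: str) -> float:
    """Expected overlap: dot product of the two normalized context vectors."""
    v1, v2 = context_vector(word1), context_vector(word2)
    return float(sum(p * v2.get(feature, 0.0) for feature, p in v1.items()))


print(eowc("cat", "dog"), eowc("cat", "john's"))
```

With this window, "cat" and "dog" share the right-hand neighbours "eats" and "drinks", while "cat" and "john's" share no neighbours, which matches the intuition that the first pair is paradigmatically related.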
  • 44. Vector Space Model (Bag of Words) ▷ Represent the keywords of objects using a term vector •  Term: basic concept, e.g., keywords to describe an object •  Each term represents one dimension in a vector •  N total terms define an N-element vector •  The value of each term in a vector corresponds to the importance of that term ▷ Measure similarity by the vector distances ▷ Example term counts (terms: team, coach, play, ball, score, game, win, lost, timeout, season) •  Document 1: 3 0 5 0 2 6 0 2 0 2 •  Document 2: 0 0 7 0 2 1 0 0 3 0 •  Document 3: 0 1 0 0 1 2 2 0 3 0
  • 45. Common Approach for EOWC: Cosine Similarity ▷ If d1 and d2 are two document vectors, then $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d ▷ Example: •  d1 = 3 2 0 5 0 0 0 2 0 0 •  d2 = 1 0 0 0 0 0 0 1 0 2 •  $d_1 \cdot d_2 = 3 \cdot 1 + 2 \cdot 0 + 0 \cdot 0 + 5 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 2 = 5$ •  $\|d_1\| = (3^2 + 2^2 + 0^2 + 5^2 + 0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2)^{0.5} = 42^{0.5} = 6.481$ •  $\|d_2\| = (1^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2)^{0.5} = 6^{0.5} = 2.449$ •  $\cos(d_1, d_2) = 0.3150$ → Overlap(“John”, “Cat”) = 0.3150
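The worked example can be checked with a few lines of Python. This is a minimal sketch; the function name `cosine_similarity` and the convention of returning 0 for an all-zero vector are choices made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an all-zero vector (an assumption)
    return dot / (norm_a * norm_b)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))
```

The dot product is 5 and the two lengths are the square roots of 42 and 6, so the value agrees with the 0.3150 on the slide.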
  • 46. Quality of EOWC? ▷ The more overlap the two context documents have, the higher the similarity is ▷ However: •  It favors matching one frequent term very well over matching more distinct terms •  It treats every word equally (overlap on “the” should not be as meaningful as overlap on “eats”)
  • 47. Term Frequency and Inverse Document Frequency (TFIDF) ▷ Since not all objects in the vector space are equally important, we can weight each term using its occurrence probability in the object description •  Term frequency: TF(d,t) →  number of times t occurs in the object description d •  Inverse document frequency: IDF(t) →  to scale down the terms that occur in many descriptions
  • 48. Normalizing Term Frequency ▷ $n_{ij}$ represents the number of times a term $t_i$ occurs in a description $d_j$. $tf_{ij}$ can be normalized using the total number of terms in the document •  $tf_{ij} = \frac{n_{ij}}{\text{NormalizedValue}}$ ▷ The normalizing value could be: •  The sum of all term frequencies •  The maximum frequency value •  Any other value that keeps $tf_{ij}$ between 0 and 1 •  BM25: $tf_{ij} = \frac{n_{ij} \times (k + 1)}{n_{ij} + k}$
  • 49. Inverse Document Frequency ▷ IDF seeks to scale down the coordinates of terms that occur in many object descriptions •  For example, some stop words (the, a, of, to, and…) may occur many times in a description. However, they should be considered unimportant in many cases •  $idf_i = \log\left(\frac{N}{df_i} + 1\right)$ →  where N is the total number of descriptions and $df_i$ (the document frequency of term $t_i$) is the number of descriptions in which $t_i$ occurs ▷ IDF can be replaced with ICF (inverse class frequency) and many other concepts, depending on the application
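A small sketch can tie the term-frequency and IDF slides together. It assumes the reading idf = log(N / df + 1) of the IDF formula, normalizes term frequency by the maximum count in each document, and offers the BM25-style transform as an alternative with k = 1.2; the value of k, the whitespace tokenizer, and the function names are all assumptions for illustration.

```python
import math
from collections import Counter


def bm25_tf(count: int, k: float = 1.2) -> float:
    """BM25-style saturating term frequency: n * (k + 1) / (n + k)."""
    return count * (k + 1) / (count + k)


def tfidf_vectors(documents, tf_scheme: str = "max"):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count + 1) for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if not counts:
            vectors.append({})
            continue
        max_count = max(counts.values())
        vector = {}
        for term, n in counts.items():
            tf = bm25_tf(n) if tf_scheme == "bm25" else n / max_count
            vector[term] = tf * idf[term]
        vectors.append(vector)
    return vectors


docs = [
    "john's cat eats fish in saturday",
    "mary's dog eats meat in sunday",
    "john's cat drinks milk in sunday",
    "mary's dog drinks beer in tuesday",
]
weights = tfidf_vectors(docs)
print(sorted(weights[0].items(), key=lambda item: -item[1]))
```

In this toy corpus "in" occurs in every sentence and therefore gets the lowest weight, while words such as "fish" that occur in a single sentence get the highest.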
  • 50. Reasons for the Log ▷ Each distribution can indicate the hidden force •  Time •  Independent •  Control [Figures: one power-law distribution and two normal distributions]
  • 51. Mining Word Associations ▷ Paradigmatic •  Represent each word by its context •  Compute context similarity •  Words with high context similarity are likely to be paradigmatically related ▷ Syntagmatic •  Count the number of times two words occur together in a context •  Compare the co-occurrences with the corresponding individual occurrences •  Words with high co-occurrence but relatively low individual occurrence are likely to be syntagmatically related
  • 52. Syntagmatic Word Associations ▷ Example sentences •  John’s cat eats fish in Saturday •  Mary’s dog eats meat in Sunday •  John’s cat drinks milk in Sunday •  Mary’s dog drinks beer in Tuesday ▷ Underlying structure: Human (John) Own Animals (Cat); John’s cat, Act (Eat), Food (Fish), Time (In Saturday) ▷ Contexts around the verb •  John’s *** eats *** in Saturday •  Mary’s *** eats *** in Sunday •  John’s --- drinks --- in Sunday •  Mary’s --- drinks --- in Tuesday ▷ What words tend to occur to the left of “eats”? What words to the right? ▷ Whenever “eats” occurs, what other words also tend to occur? ▷ Correlated occurrences: P(dog | eats) = ?; P(cats | eats) = ?
  • 53. Word Prediction ▷ Prediction question: Is word W present (or absent) in this segment? •  Text segment: any unit, e.g., sentence, paragraph, document ▷ Predict the occurrence of each word •  W1 = ‘meat’ •  W2 = ‘a’ •  W3 = ‘unicorn’
  • 54. Word Prediction: Formal Definition ▷ Binary random variable $x_w \in \{0, 1\}$ •  $x_w = \begin{cases} 1 & \text{if } w \text{ is present} \\ 0 & \text{if } w \text{ is absent} \end{cases}$ •  $P(x_w = 1) + P(x_w = 0) = 1$ ▷ The more random $x_w$ is, the more difficult the prediction is ▷ How do we quantitatively measure the randomness?
  • 55. Entropy ▷ Entropy measures the amount of randomness, surprise, or uncertainty ▷ Entropy is defined as: $H(p_1, \ldots, p_n) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log_2 p_i$, where $\sum_{i=1}^{n} p_i = 1$ •  entropy = 0: easy to predict •  entropy = 1: difficult to predict [Figure: Entropy(p, 1−p) plotted for p from 0 to 1]
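The definition translates almost literally into code. The sketch below uses base-2 logarithms, so a fair binary variable has entropy 1; the function name `entropy` is chosen for this illustration.

```python
import math


def entropy(probabilities) -> float:
    """Shannon entropy in bits: sum of p * log2(1 / p), skipping p = 0."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


# Entropy(p, 1 - p) for a few values of p, as in the plot on the slide.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(entropy([p, 1 - p]), 3))
```

A word that is almost always present or almost always absent (p near 1 or 0) has entropy near 0 and is easy to predict; a word with p = 0.5 is the hardest case.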
  • 56. Conditional Entropy ▷ Know nothing about the segment: $p(x_{meat} = 1)$, $p(x_{meat} = 0)$ •  $H(x_{meat}) = -p(x_{meat} = 0) \log_2 p(x_{meat} = 0) - p(x_{meat} = 1) \log_2 p(x_{meat} = 1)$ ▷ Know “eats” is present ($x_{eats} = 1$): $p(x_{meat} = 1 \mid x_{eats} = 1)$, $p(x_{meat} = 0 \mid x_{eats} = 1)$ •  $H(x_{meat} \mid x_{eats} = 1) = -p(x_{meat} = 0 \mid x_{eats} = 1) \log_2 p(x_{meat} = 0 \mid x_{eats} = 1) - p(x_{meat} = 1 \mid x_{eats} = 1) \log_2 p(x_{meat} = 1 \mid x_{eats} = 1)$ ▷ In general: $H(Y \mid X) = \sum_{x \in X} p(x) \, H(Y \mid X = x)$
  • 57. Mining Syntagmatic Relations ▷ For each word W1 •  For every word W2, compute the conditional entropy $H(x_{w1} \mid x_{w2})$ •  Sort all the candidate words in ascending order of $H(x_{w1} \mid x_{w2})$ •  Take the top-ranked candidate words with some given threshold ▷ However •  $H(x_{w1} \mid x_{w2})$ and $H(x_{w1} \mid x_{w3})$ are comparable •  $H(x_{w1} \mid x_{w2})$ and $H(x_{w3} \mid x_{w2})$ are not →  Because the upper bounds, $H(x_{w1})$ and $H(x_{w3})$, are different ▷ Conditional entropy is not symmetric
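The ranking procedure above can be sketched on the four example sentences, treating each sentence as one segment. Probabilities are plain relative frequencies with no smoothing, which is an assumption that only makes sense for a toy example; the function names are hypothetical.

```python
import math

SEGMENTS = [set(s.split()) for s in (
    "john's cat eats fish in saturday",
    "mary's dog eats meat in sunday",
    "john's cat drinks milk in sunday",
    "mary's dog drinks beer in tuesday",
)]


def binary_entropy(p: float) -> float:
    """Entropy in bits of a binary variable with P(1) = p."""
    return sum(q * math.log2(1 / q) for q in (p, 1 - p) if q > 0)


def conditional_entropy(w1: str, w2: str) -> float:
    """H(x_w1 | x_w2) = sum over v of p(x_w2 = v) * H(x_w1 | x_w2 = v)."""
    result = 0.0
    for present in (True, False):
        subset = [s for s in SEGMENTS if (w2 in s) == present]
        if not subset:
            continue
        p_w1 = sum(w1 in s for s in subset) / len(subset)
        result += len(subset) / len(SEGMENTS) * binary_entropy(p_w1)
    return result


# Rank candidate words for "meat" in ascending order of conditional entropy.
vocabulary = sorted(set().union(*SEGMENTS) - {"meat"})
ranking = sorted(vocabulary, key=lambda w2: conditional_entropy("meat", w2))
print(ranking[:3])
```

Knowing that "eats" is present lowers the uncertainty about "meat", whereas "in" appears in every segment and therefore leaves the entropy of "meat" unchanged.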
  • 58. Mutual Information ▷ $I(x; y) = H(x) - H(x \mid y) = H(y) - H(y \mid x)$ ▷ Properties: •  Symmetric •  Non-negative •  $I(x; y) = 0$ iff x and y are independent ▷ Allows us to compare different (x, y) pairs [Figure: Venn diagram of H(x) and H(y) within H(x, y); the overlap is I(x; y) and the non-overlapping parts are H(x|y) and H(y|x)]
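Mutual information between two word-presence variables can be computed from their joint distribution, which is equivalent to the entropy difference above. The sketch again treats the four example sentences as segments and uses unsmoothed relative frequencies; both choices, and the function name `mutual_information`, are assumptions for illustration.

```python
import math

SEGMENTS = [set(s.split()) for s in (
    "john's cat eats fish in saturday",
    "mary's dog eats meat in sunday",
    "john's cat drinks milk in sunday",
    "mary's dog drinks beer in tuesday",
)]


def mutual_information(w1: str, w2: str) -> float:
    """I(x_w1; x_w2) = sum over (a, b) of p(a, b) * log2(p(a, b) / (p(a) * p(b)))."""
    n = len(SEGMENTS)
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            p_ab = sum((w1 in s) == a and (w2 in s) == b for s in SEGMENTS) / n
            p_a = sum((w1 in s) == a for s in SEGMENTS) / n
            p_b = sum((w2 in s) == b for s in SEGMENTS) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


print(round(mutual_information("meat", "eats"), 4))
print(round(mutual_information("meat", "in"), 4))
```

Unlike conditional entropy, the score is the same in both directions, so pairs such as ("meat", "eats") and ("fish", "cat") can be compared on one scale.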
  • 59. Topic Mining: Assume we already know the word relationships
  • 60. Landscape of Text Mining [Figure: the world is perceived by devices (sensors), which report objective non-text data (context), e.g., 24°C, 55%, and by humans, who express subjective text, e.g., To be or not to be..] ▷ Mining knowledge about languages: Natural Language Processing; Text Representation; Word Association and Mining
  • 61. Topic Mining: Motivation ▷ Topic: key idea in text data •  Theme/subject •  Different granularities (e.g., sentence, article) ▷ Motivating applications, e.g.: •  Hot topics during the debates in the 2016 presidential election •  What do people like about Windows 10? •  What are Facebook users talking about today? •  What is the most-watched news?
  • 62. Tasks of Topic Mining [Figure: text data is mapped to Topic 1, Topic 2, Topic 3, Topic 4, and so on up to Topic n, and each document (Doc1, Doc2) is linked to the topics it covers]
  • 63. Formal Definition of Topic Mining ▷ Input •  A collection of N text documents $S = \{d_1, d_2, d_3, \ldots, d_N\}$ •  Number of topics: k ▷ Output •  k topics: $\{\theta_1, \theta_2, \theta_3, \ldots, \theta_k\}$ •  Coverage of topics in each $d_i$: $\{\mu_{i1}, \mu_{i2}, \mu_{i3}, \ldots, \mu_{ik}\}$ ▷ How to define topic $\theta_i$? •  Topic = term (word)?
  • 64. Tasks of Topic Mining (Terms as Topics) [Figure: text data is mapped to the term topics Politics, Weather, Sports, Travel, Technology, and each document (Doc1, Doc2) is linked to the topics it covers]
  • 65. Problems with “Terms as Topics” ▷ Not generic •  Can only represent simple/general topics •  Cannot represent complicated topics →  E.g., “uber issue”: political or transportation related? ▷ Incompleteness in coverage •  Cannot capture variation of vocabulary ▷ Word sense ambiguity •  E.g., Hollywood star vs. stars in the sky; apple watch vs. apple recipes
  • 66. Improved Ideas ▷ Idea 1 (Probabilistic topic models): topic = word distribution •  E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005), (play, 0.003), (NBA, 0.01)…} •  √: generic, easy to implement ▷ Idea 2 (Concept topic models): topic = concept •  Maintain concepts (manually or automatically) →  E.g., ConceptNet
  • 67. Possible Approaches for Probabilistic Topic Models ▷ Bag-of-words approach: •  Mixture of unigram language model •  Expectation-maximization algorithm •  Probabilistic latent semantic analysis •  Latent Dirichlet allocation (LDA) model ▷ Graph-based approach: •  TextRank (Mihalcea and Tarau, 2004) •  Reinforcement Approach (Xiaojun et al., 2007) •  CollabRank (Xiaojun et al., 2008)
  • 68. Bag-of-words Assumption ▷ Word order is ignored ▷ “bag-of-words”: exchangeability ▷ Theorem (De Finetti, 1935): if $(x_1, x_2, \ldots, x_n)$ are infinitely exchangeable, then the joint probability $p(x_1, x_2, \ldots, x_n)$ has a representation as a mixture: ▷ $p(x_1, x_2, \ldots, x_n) = \int d\theta \, p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)$ for some random variable θ
  • 69. Latent Dirichlet Allocation ▷ Latent Dirichlet Allocation (D. M. Blei, A. Y. Ng, 2003), not to be confused with Linear Discriminant Analysis •  (2) $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)$ •  (3) $p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta$ •  $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$
  • 70. LDA Assumption ▷ Assume that, when writing a document, you 1.  Decide how many words 2.  Decide the distributions (topics per document: P = Dir(α); words per topic: P = Dir(β)) 3.  Choose a topic (from the Dirichlet-distributed topic mixture) 4.  Choose a word for that topic (from the Dirichlet-distributed word distribution of the topic) 5.  Repeat from step 3 for each remaining word ▷ Example 1.  5 words in document 2.  50% food, 50% cute animals 3.  1st word: food topic, gives you the word “bread” 4.  2nd word: cute animals topic, “adorable” 5.  3rd word: cute animals topic, “dog” 6.  4th word: food topic, “eating” 7.  5th word: food topic, “banana” ▷ Resulting document (choice of topics and words): “bread adorable dog eating banana”
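The generative story can be simulated with the standard library alone. In this sketch the per-topic word distributions are fixed by hand to mirror the food and cute-animals example instead of being drawn from Dir(β), the Dirichlet draw for the topic mixture is built from gamma variates, and all names (`TOPICS`, `sample_dirichlet`, `generate_document`) are hypothetical.

```python
import random

# Hand-picked word distributions per topic (an assumption; LDA would draw
# these from a Dirichlet prior with parameter beta).
TOPICS = {
    "food": {"bread": 0.4, "eating": 0.3, "banana": 0.3},
    "cute animals": {"adorable": 0.5, "dog": 0.5},
}


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(num_words, alpha, rng):
    """Return the topic mixture and a list of (topic, word) pairs."""
    names = list(TOPICS)
    theta = sample_dirichlet(alpha, rng)  # step 2: topic mixture of the document
    document = []
    for _ in range(num_words):
        topic = rng.choices(names, weights=theta)[0]  # step 3: choose a topic
        words = TOPICS[topic]
        word = rng.choices(list(words), weights=list(words.values()))[0]  # step 4
        document.append((topic, word))
    return theta, document


rng = random.Random(0)
theta, document = generate_document(5, alpha=[1.0, 1.0], rng=rng)
print(theta)
print(" ".join(word for _, word in document))
```

Learning reverses this process: given only the words, it infers the topic mixture of each document and the word distribution of each topic.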
  • 71. LDA Learning (Gibbs) ▷ Decide how many topics you think there are ▷ Randomly assign words to topics ▷ Check and update topic assignments (iteratively) •  p(topic t | document d) •  p(word w | topic t) •  Reassign w to a new topic according to p(topic t | document d) × p(word w | topic t) ▷ Example with #Topic: 2 (red, purple) and three documents: (1) I eat fish and vegetables. (2) Dog and fish are pets. (3) My kitten eats fish. •  Initial random assignment: p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.67; p(purple|2)=0.33; p(red|3)=0.67; p(purple|3)=0.33; p(eat|red)=0.17; p(eat|purple)=0.33; p(fish|red)=0.33; p(fish|purple)=0.33; p(vegetable|red)=0.17; p(dog|purple)=0.33; p(pet|red)=0.17; p(kitten|red)=0.17 •  With the current word “fish” in document 2 left out: p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.50; p(purple|2)=0.50; p(red|3)=0.67; p(purple|3)=0.33; p(eat|red)=0.20; p(eat|purple)=0.33; p(fish|red)=0.20; p(fish|purple)=0.33; p(vegetable|red)=0.20; p(dog|purple)=0.33; p(pet|red)=0.20; p(kitten|red)=0.20 •  Update for “fish” in document 2: p(purple|2) × p(fish|purple) = 0.5 × 0.33 = 0.165 (√); p(red|2) × p(fish|red) = 0.5 × 0.2 = 0.1, so “fish” is reassigned to purple
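A compact collapsed Gibbs sampler shows the remove, score, and reassign loop in code. Unlike the hand calculation above, this sketch adds small smoothing constants alpha and beta so that no probability is exactly zero; those constants, the iteration count, and the function name `lda_gibbs` are assumptions rather than anything stated in the slides.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]              # n(d, t)
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # n(t, w)
    topic_total = [0] * num_topics                            # n(t)
    assignments = []
    # Random initial assignment of every word occurrence to a topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(num_topics)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Leave the current occurrence out of the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Score each topic: p(topic | document) * p(word | topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return assignments, doc_topic, topic_word


docs = [["eat", "fish", "vegetable"], ["dog", "fish", "pet"], ["kitten", "eat", "fish"]]
assignments, doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
print(doc_topic)
```

With only three tiny documents the resulting topics vary from seed to seed, so the sketch is meant to show the mechanics of the update, not to reproduce the numbers on the slides.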
  • 72. Related Work – Topic Model (LDA) ▷ I eat fish and vegetables. ▷ Dog and fish are pets. ▷ My kitten eats fish. ▷ LDA output per sentence: Sentence 1: 14.67% Topic 1, 85.33% Topic 2; Sentence 2: 85.44% Topic 1, 14.56% Topic 2; Sentence 3: 19.95% Topic 1, 80.05% Topic 2 ▷ Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten ▷ Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten
  • 73. Possible Approaches for Probabilistic Topic Models ▷ Bag-of-words approach: •  Mixture of unigram language model •  Expectation-maximization algorithm •  Probabilistic latent semantic analysis •  Latent Dirichlet allocation (LDA) model ▷ Graph-based approach: •  TextRank (Mihalcea and Tarau, 2004) •  Reinforcement Approach (Xiaojun et al., 2007) •  CollabRank (Xiaojun et al., 2008)
  • 74. Construct Graph ▷ Directed graph ▷ Elements in the graph →  Terms →  Phrases →  Sentences 74 @ Yi-Shin Chen, Text Mining Overview
  • 75. Connect Term Nodes ▷ Connect terms based on their slop. ▷ Example: “I love the new ipod shuffle. It is the smallest ipod.” [Figure: term graphs over the nodes I, love, the, new, ipod, shuffle, it, is, smallest, drawn for Slop=1 and Slop=0]
  • 76. Connect Phrase Nodes ▷ Connect phrases to •  Compound words •  Neighbor words ▷ Example: “I love the new ipod shuffle. It is the smallest ipod.” [Figure: term nodes I, love, the, new, ipod, shuffle, it, is, smallest connected to the phrase node “ipod shuffle”]
  • 77. Connect Sentence Nodes ▷ Connect to •  Neighbor sentences •  Compound terms •  Compound phrases ▷ Example: “I love the new ipod shuffle. It is the smallest ipod.” [Figure: the two sentence nodes connected to the term nodes ipod, new, shuffle, it, smallest and the phrase node “ipod shuffle”]
  • 78. Edge Weight Types ▷ Example: “I love the new ipod shuffle. It is the smallest ipod.” [Figure: graph of term nodes and the phrase node “ipod shuffle” showing the different edge weight types]
  • 79. Graph-based Ranking ▷ Scores for each node (TextRank 2004): $WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$, where d is the damping factor ▷ Example: $y_{the} = (1 - 0.85) + 0.85 \times (0.1 \times 0.5 + 0.2 \times 0.6 + 0.3 \times 0.7)$, where 0.1, 0.2, 0.3 are the weights of the edges into “the” and 0.5, 0.6, 0.7 are the scores from its parent nodes ($y_{new}$, $y_{love}$, $y_{is}$)
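The worked example for the node “the” can be checked numerically. A minimal sketch of a single score update, assuming the edge weights 0.1, 0.2, 0.3 are already normalized by each parent's outgoing weight; the function name `textrank_score` is hypothetical.

```python
def textrank_score(parents, d=0.85):
    """One TextRank update: (1 - d) + d * sum(normalized_weight * parent_score)."""
    return (1 - d) + d * sum(weight * score for weight, score in parents)

# (normalized edge weight, parent score) pairs as paired in the slide's formula.
parents_of_the = [(0.1, 0.5), (0.2, 0.6), (0.3, 0.7)]
y_the = textrank_score(parents_of_the)
print(round(y_the, 3))  # 0.473
```

In a full ranking, this update is applied to every node repeatedly until the scores stop changing.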
  • 80. Result 80 the 1.45 ipod 1.21 new 1.02 is 1.00 shuffle 0.99 it 0.98 smallest 0.77 love 0.57 @ Yi-Shin Chen, Text Mining Overview
  • 81. Graph-based Extraction •  Pros →  Structure and syntax information →  Mutual influence •  Cons →  Common words get higher scores 81 @ Yi-Shin Chen, Text Mining Overview
  • 82. Summary: Probabilistic Topic Models ▷ Probabilistic topic models: topic = word distribution •  E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005), (play, 0.003), (NBA, 0.01)…} •  √: generic, easy to implement •  ?: Not easy to understand/communicate •  ?: Not easy to construct semantic relationships between topics ▷ Which topic is this? Topic = {(Crooked, 0.02), (dishonest, 0.001), (News, 0.0008), (totally, 0.0009), (total, 0.000009), (failed, 0.0006), (bad, 0.0015), (failing, 0.00001), (presidential, 0.0000008), (States, 0.0000004), (terrible, 0.0000085), (failed, 0.000021), (lightweight, 0.00001), (weak, 0.0000075), ……}
  • 83. Improved Ideas ▷ Idea 1 (Probabilistic topic models): topic = word distribution •  E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005), (play, 0.003), (NBA, 0.01)…} •  √: generic, easy to implement ▷ Idea 2 (Concept topic models): topic = concept •  Maintain concepts (manually or automatically) →  E.g., ConceptNet
  • 84. NLP Related Approach: Named Entity Recognition ▷ Find and classify all the named entities in a text. ▷ What’s a named entity? •  A mention of an entity using its name. →  Kansas Jayhawks •  This is a subset of the possible mentions... →  Kansas, Jayhawks, the team, it, they ▷ Find means identify the exact span of the mention ▷ Classify means determine the category of the entity being referred to 84
  • 85. Named Entity Recognition Approaches ▷ As with partial parsing and chunking there are two basic approaches (and hybrids) •  Rule-based (regular expressions) →  Lists of names →  Patterns to match things that look like names →  Patterns to match the environments that classes of names tend to occur in. •  Machine Learning-based approaches →  Get annotated training data →  Extract features →  Train systems to replicate the annotation 85
  • 86. Rule-Based Approaches ▷ Employ regular expressions to extract data ▷ Examples: •  Telephone number: (\d{3}[-. ()]){1,2}[\dA-Z]{4} →  800-865-1125 →  800.865.1125 →  (800)865-CARE •  Software name extraction: ([A-Z][a-z]*\s*)+ →  Installation Designer v1.1
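The two patterns can be tried directly with Python's `re` module. This sketch assumes the patterns are used with `search` (so the leading parenthesis in “(800)865-CARE” is simply skipped) and that the backslashes in `\d` and `\s` are intended.

```python
import re

# Telephone pattern: one or two 3-digit groups followed by a separator,
# then four digits or capital letters.
PHONE = re.compile(r"(\d{3}[-. ()]){1,2}[\dA-Z]{4}")
# Software-name pattern: one or more capitalized words.
SOFTWARE = re.compile(r"([A-Z][a-z]*\s*)+")

phone_hits = [
    PHONE.search(text).group(0)
    for text in ["800-865-1125", "800.865.1125", "(800)865-CARE"]
]
software_hit = SOFTWARE.search("Installation Designer v1.1").group(0).strip()

print(phone_hits)
print(software_hit)
```

Note that the software pattern stops before “v1.1”, because the version string does not start with a capital letter.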
  • 87. Relations ▷ Once you have captured the entities in a text you might want to ascertain how they relate to one another. •  Here we’re just talking about explicitly stated relations 87
  • 88. Relation Types ▷ As with named entities, the list of relations is application specific. For generic news texts... 88
  • 89. Bootstrapping Approaches ▷ What if you don’t have enough annotated text to train on? •  But you might have some seed tuples •  Or you might have some patterns that work pretty well ▷ Can you use those seeds to do something useful? •  Co-training and active learning use the seeds to train classifiers to tag more data to train better classifiers... •  Bootstrapping tries to learn directly (populate a relation) through direct use of the seeds
  • 90. Bootstrapping Example: Seed Tuple ▷ Mark Twain, Elmira Seed tuple •  Grep (google) •  “Mark Twain is buried in Elmira, NY.” →  X is buried in Y •  “The grave of Mark Twain is in Elmira” →  The grave of X is in Y •  “Elmira is Mark Twain’s final resting place” →  Y is X’s final resting place. ▷ Use those patterns to grep for new tuples that you don’t already know 90
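One round of this bootstrapping loop can be sketched as follows. It is an illustration only: the seed sentences are simplified versions of the slide's examples, the corpus sentences and the names in them are invented, and the helper names (`induce_patterns`, `pattern_to_regex`, `harvest`) are hypothetical.

```python
import re

def induce_patterns(seed, sentences):
    """Turn sentences that mention both seed values into patterns with {X}/{Y} slots."""
    x, y = seed
    return [s.replace(x, "{X}").replace(y, "{Y}")
            for s in sentences if x in s and y in s]

def pattern_to_regex(pattern):
    """Escape the literal text and turn the two slots into named capture groups."""
    escaped = re.escape(pattern)
    escaped = escaped.replace(re.escape("{X}"), r"(?P<X>.+?)")
    escaped = escaped.replace(re.escape("{Y}"), r"(?P<Y>.+?)")
    return re.compile(escaped)

def harvest(patterns, corpus):
    """Apply every pattern to every sentence and collect the new (X, Y) tuples."""
    found = set()
    for pattern in patterns:
        regex = pattern_to_regex(pattern)
        for sentence in corpus:
            match = regex.fullmatch(sentence)
            if match:
                found.add((match.group("X"), match.group("Y")))
    return found

seed = ("Mark Twain", "Elmira")
seed_sentences = [
    "Mark Twain is buried in Elmira",
    "The grave of Mark Twain is in Elmira",
    "Elmira is Mark Twain's final resting place",
]
patterns = induce_patterns(seed, seed_sentences)

# Invented sentences standing in for the results of a new search.
corpus = [
    "Jane Doe is buried in Springfield",
    "The grave of John Roe is in Shelbyville",
]
new_tuples = harvest(patterns, corpus)
print(patterns)
print(new_tuples)
```

The harvested tuples could then serve as new seeds for another round, which is where noisy patterns start to need filtering.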
  • 92. Wikipedia Infobox ▷ Infoboxes are kept in a namespace separate from articles •  Namespace example: Special:SpecialPages; Wikipedia:List of infoboxes •  Example: {{Infobox person |name = Casanova |image = Casanova_self_portrait.jpg |caption = A self portrait of Casanova ... |website = }}
  • 93. Concept-based Model ▷ ESA (Egozi, Markovitch, 2011) •  Every Wikipedia article represents a concept •  Uses TF-IDF to infer concepts from a document •  Manually maintained knowledge base
  • 94. Yago ▷ YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007 ▷ Unification of Wikipedia and WordNet ▷ Makes use of rich structures and information •  Infoboxes, Category Pages, etc.
  • 95. Mining Concepts from User-Generated Web Data ▷ Concept: a word sequence whose meaning is known by a group of people, where everyone within the group refers to the same meaning ▷ Sources: sentence-based, keyword-based, article-based ▷ Pei-Ling Hsu, Hsiao-Shan Hsieh, Jheng-He Liang, and Yi-Shin Chen*, Mining Various Semantic Relationships from Unstructured User-Generated Web Data, Journal of Web Semantics, 2015
  • 96. Concepts in Word Sequences ▷ Many word sequences are meaningless noise, e.g., “southernfood.about.comaboutodapplecrisprbl50616a”, “buy” → A word sequence is a candidate concept if it has one of these forms →  About nouns •  A noun (e.g., Basketball) •  A sequence of nouns (e.g., Christmas Eve) •  A noun and a number (e.g., 311 earthquake) •  An adjective and a noun (e.g., Desperate housewives) →  Special format •  A word sequence containing “of” (e.g., Cover of harry potter)
  • 97. Concept Modeling •  Concept: in a word sequence, its meaning is known by a group of people and everyone within the group refers to the same meaning. → Try to detect a word sequence that is known by a group of people. 97 @ Yi-Shin Chen, Text Mining Overview
  • 98. Concept Modeling •  Frequent concept: a word sequence mentioned frequently by a single source. •  The number of mentions is normalized within each data source. [Figure: data sources mentioning word sequences such as “apple”, “apple, companies”, “color, of, ipod, nano, 1gb”, “ipod”, “ipod, nano”]
  • 99. Concept Modeling ▷ Some data sources are public and share data with other users. ▷ These data sources contain frequently seen word sequences. →  These data sources provide concepts with higher confidence. ▷ Confidence value: every data source has a confidence value ▷ Confident concept: a word sequence that comes from data sources with higher confidence values. [Figure: data sources mentioning “apple inc.”, “ipod, nano”, “apple”, “ipod”]
  • 100. Opinion Mining: How do people feel?
  • 101. Landscape of Text Mining 101 World Sensor Data Interpret by Report World devices 24。C, 55% World To be or not to be.. human Non-text (Context) Text Subjective Objective Perceived by Express Mining content about the observers Opinion Mining and Sentiment Analysis @ Yi-Shin Chen, Text Mining Overview
  • 102. Opinion ▷ a subjective statement describing a person's perspective about something 102 Objective statement or Factual statement: can be proved to be right or wrong Opinion holder: Personalized / customized Depends on background, culture, context Target @ Yi-Shin Chen, Text Mining Overview
  • 103. Opinion Representation ▷ Opinion holder: user ▷ Opinion target: object ▷ Opinion content: keywords? ▷ Opinion context: time, location, others? ▷ Opinion sentiment (emotion): positive/negative, happy or sad 103 @ Yi-Shin Chen, Text Mining Overview
  • 104. Sentiment Analysis ▷ Input: An opinionated text object ▷ Output: Sentiment tag/Emotion label •  Polarity analysis: {positive, negative, neutral} •  Emotion analysis: happy, sad, anger ▷ Naive approach: •  Apply classification, clustering for extracted text features 104 @ Yi-Shin Chen, Text Mining Overview
  • 105. Text Features ▷ Character n-grams •  Usually used for robustness against spelling/recognition errors •  Less meaningful ▷ Word n-grams •  n should be bigger than 1 for sentiment analysis ▷ POS tag n-grams •  Words and POS tags can be mixed →  E.g., “adj noun”, “sad noun”
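Character and word n-grams are both sliding windows over a sequence. A minimal sketch, assuming whitespace tokenization; the example sentence is made up and shows why n > 1 matters for sentiment (the bigram “not happy” carries the negation that the unigram “happy” loses).

```python
def char_ngrams(text, n):
    """All contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    """All contiguous token sequences of length n, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not happy at all".split()
bigrams = word_ngrams(tokens, 2)
trigrams_of_sad = char_ngrams("sadly", 3)
print(bigrams)
print(trigrams_of_sad)
```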
  • 106. More Text Features ▷ Word classes •  Thesaurus: LIWC •  Ontology: WordNet, Yago, DBPedia •  Recognized entities: DBPedia, Yago ▷ Frequent patterns in text •  Could utilize pattern discovery algorithms •  Optimizing the tradeoff between coverage and specificity is essential 106 @ Yi-Shin Chen, Text Mining Overview
  • 107. LIWC ▷ Linguistic Inquiry and Word Count •  LIWC2015 ▷ Home page: ▷ 70 classes ▷ Developed by researchers with interests in social, clinical, health, and cognitive psychology ▷ Cost: US$89.95
  • 108. Emotion Analysis: Pattern Approach ▷ Carlos Argueta, Fernando Calderon, and Yi-Shin Chen, Multilingual Emotion Classifier using Unsupervised Pattern Extraction from Microblog Data, Intelligent Data Analysis - An International Journal, 2016 108 @ Yi-Shin Chen, Text Mining Overview
  • 109. Collect Emotion Data 109 @ Yi-Shin Chen, Text Mining Overview
  • 110. Collect Emotion Data Wait! Need Control Group 110 @ Yi-Shin Chen, Text Mining Overview
  • 111. Not-Emotion Data 111 @ Yi-Shin Chen, Text Mining Overview
  • 112. Preprocessing Steps ▷ Hints: Remove troublesome tweets that o  Are too short →  Too short to yield important features o  Contain too many hashtags →  Too much information to process o  Are retweets →  Increase the complexity o  Have URLs →  Too troublesome to collect the page data ▷ Convert user mentions to usermention and hashtags to hashtag →  Removes the identification. We should not peek at the answers!
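These filtering hints can be sketched as a single function. The thresholds (`min_tokens`, `max_hashtags`), the `RT ` prefix test for retweets, and the regular expressions are assumptions chosen for illustration; the slides do not specify them.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def preprocess(tweet, min_tokens=3, max_hashtags=3):
    """Return the normalized tweet, or None if the tweet should be discarded."""
    text = tweet.strip()
    # Drop retweets and tweets with URLs.
    if text.startswith("RT ") or URL_RE.search(text):
        return None
    # Drop tweets with too many hashtags.
    if len(HASHTAG_RE.findall(text)) > max_hashtags:
        return None
    # Hide identifying tokens so the classifier cannot peek at the answers.
    text = MENTION_RE.sub("usermention", text)
    text = HASHTAG_RE.sub("hashtag", text)
    # Drop tweets that are too short to carry useful features.
    if len(text.split()) < min_tokens:
        return None
    return text

print(preprocess("@bob I feel great today #happy"))
```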
  • 113. Basic Guidelines ▷ Identify the commonalities and differences between the experimental and control groups •  Analyze the frequency of words →  TF•IDF (Term frequency, inverse document frequency) •  Analyze the co-occurrence between words/patterns →  Co-occurrence •  Analyze the importance of words →  Centrality Graph
  • 114. Graph Construction ▷ Construct two graphs •  E.g. →  Emotion one: I love the World of Warcraft new game ☺ →  Not-emotion one: 3,000 killed in the world by ebola [Figure: two weighted word graphs, one over the emotion sentence (I, love, the, World, of, Warcraft, new, game, ☺) and one over the not-emotion sentence (3,000, killed, in, the, world, by, ebola), with edge weights between 0.12 and 0.93]
  • 115. Graph Processes ▷ Remove the parts common to both graphs •  Keep only the significant ones that appear only in the emotion graph ▷ Analyze the centrality of words •  Betweenness, Closeness, Eigenvector, Degree, Katz →  Can use free/open software, e.g., Gephi, GraphDB ▷ Analyze the cluster degrees •  Clustering Coefficient ▷ Graph → Key patterns
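The first two graph processes can be sketched without any graph library. The edges and weights below are hypothetical (the slide's figure does not survive well enough to recover which weight belongs to which edge), and degree centrality stands in for the longer list of centrality measures.

```python
def emotion_only_edges(emotion_edges, control_edges):
    """Keep the edges that appear in the emotion graph but not in the control graph."""
    return {edge: w for edge, w in emotion_edges.items() if edge not in control_edges}

def degree_centrality(edges):
    """Degree of each node divided by the number of other nodes."""
    nodes = {node for edge in edges for node in edge}
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    denom = max(len(nodes) - 1, 1)
    return {node: deg / denom for node, deg in degree.items()}

# Hypothetical weighted edges for the two graphs.
emotion = {("i", "love"): 0.9, ("love", "the"): 0.84, ("the", "world"): 0.65}
control = {("in", "the"): 0.93, ("the", "world"): 0.83}

remaining = emotion_only_edges(emotion, control)
centrality = degree_centrality(remaining)
print(remaining)
print(centrality)
```

Here “love” ends up as the most central word once the shared edge (“the”, “world”) is removed.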
  • 116. Essence Only Only key phrases →emotion patterns 116 @ Yi-Shin Chen, Text Mining Overview
  • 117. Emotion Patterns Extraction o The goal: o  Language independent extraction – not based on grammar or manual templates o  More representative set of features - balance between generality and specificity o  High recall/coverage – adapt to unseen words o  Requiring only a relatively small number – high reliability o  Efficient— fast extraction and utilization o  Meaningful - even if there are no recognizable emotion words in it 117 @ Yi-Shin Chen, Text Mining Overview
  • 118. Patterns Definition o Constructed from two types of elements: o  Surface tokens: hello, ☺, lol, house, … o  Wildcards: * (matches every word) o Contains at least 2 elements o Contains at least one of each type of element o Examples (pattern → matches): * this * → “Hate this weather”, “love this drawing”; * * ☺ → “so happy ☺”, “to me ☺”; luv my * → “luv my gift”, “luv my price”; * that → “want that”, “love that”, “hate that”
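Matching a pattern against a text reduces to a token-by-token comparison. A minimal sketch, assuming each wildcard matches exactly one word, that matching is case-insensitive, and that patterns are written in lowercase; the function name `matches` is hypothetical.

```python
def matches(pattern, text):
    """True if every pattern element equals the token at its position or is a wildcard."""
    pattern_tokens = pattern.split()
    text_tokens = text.lower().split()
    if len(pattern_tokens) != len(text_tokens):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_tokens, text_tokens))

print(matches("* this *", "Hate this weather"))  # True
print(matches("* this *", "luv my gift"))        # False
```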
  • 119. Patterns Construction o Constructed from instances o An instance is a sequence of 2 or more words from CW (connector words) and SW (subject words) o Contains at least one CW and one SW o Examples: Subject Words: love, hate, gift, weather, …; Connector Words: this, luv, my, ☺, …; Instances: “hate this weather”, “so happy ☺”, “luv my gift”, “love this drawing”, “luv my price”, “to me ☺”, “kill this idiot”, “finish this task”
  • 120. Patterns Construction (2) o Find all instances in a corpus with their frequency o Aggregate counts by grouping them based on length and position of matching CW o Connector Words: this, luv, my, ☺, … o Instances (count): “hate this weather” (5), “so happy ☺” (4), “luv my gift” (7), “love this drawing” (2), “luv my price” (1), “to me ☺” (3), “kill this idiot” (1), “finish this task” (4) o Groups (count): “Hate this weather”, “love this drawing”, “kill this idiot”, “finish this task” (12); “so happy ☺”, “to me ☺” (7); “luv my gift”, “luv my price” (8); …
  • 121. Patterns Construction (3) o Replace all the SWs by a wildcard * and keep the CWs to convert all instances into the representing pattern o The wildcard matches any word and is used for term generalization o Infrequent patterns are filtered out o Connector Words: this, got, my, pain, … o Pattern: groups (count): * this *: “Hate this weather”, “love this drawing”, “kill this idiot”, “finish this task” (12); * * ☺: “so happy ☺”, “to me ☺” (7); luv my *: “luv my gift”, “luv my price” (8); …
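The three construction slides can be condensed into one sketch that reproduces the counts 12, 7, and 8. It merges the grouping and wildcard steps, since instances with connector words in the same positions map to the same pattern. The smiley emoticon token of the slide's examples is written as `:)` here, and the function name and `min_count` threshold are assumptions.

```python
from collections import Counter

def build_patterns(instance_counts, connector_words, min_count=1):
    """Replace every non-connector word by '*' and sum the counts per pattern."""
    totals = Counter()
    for instance, count in instance_counts.items():
        tokens = instance.lower().split()
        pattern = " ".join(t if t in connector_words else "*" for t in tokens)
        totals[pattern] += count
    # Filter out infrequent patterns.
    return {p: c for p, c in totals.items() if c >= min_count}

connector_words = {"this", "luv", "my", ":)"}
instance_counts = {
    "hate this weather": 5, "so happy :)": 4, "luv my gift": 7,
    "love this drawing": 2, "luv my price": 1, "to me :)": 3,
    "kill this idiot": 1, "finish this task": 4,
}
patterns = build_patterns(instance_counts, connector_words)
print(patterns)
```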
  • 122. Ranking Emotion Patterns ▷ Ranking the emotion patterns for each emotion •  Frequency, exclusiveness, diversity •  One ranked list for each emotion SadJoy Anger 122 @ Yi-Shin Chen, Text Mining Overview
  • 123. Contextual Text Mining Basic Concepts 123 @ Yi-Shin Chen, Text Mining Overview
  • 124. Context ▷ Text usually has rich context information •  Direct context (meta-data): time, location, author •  Indirect context: social networks of authors, other text related to the same source •  Any other related text ▷ Context could be used for: •  Partition the data •  Provide extra features 124 @ Yi-Shin Chen, Text Mining Overview
  • 125. Contextual Text Mining ▷ Query log + User = Personalized search ▷ Tweet + Time = Event identification ▷ Tweet + Location-related patterns = Location identification ▷ Tweet + Sentiment = Opinion mining ▷ Text Mining +Context → Contextual Text Mining 125 @ Yi-Shin Chen, Text Mining Overview
  • 126. Partition Text [Figure: posts by users (User 1, User 2, …, User n) on a timeline from 1998 to 2006, partitioned by context, e.g., users above age 65, users under age 12, data within year 2000, posts containing #sad]
  • 127. Generative Model of Text ▷ Text: I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. ▷ Generation: words are drawn from $P(word \mid Model)$ ▷ Analysis: the model is estimated from the text as $P(word \mid Topic)$ and $P(Topic \mid Document)$ ▷ Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten ▷ Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten
  • 128. Contextualized Models of Text ▷ Text: I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. ▷ Context: Year=2008, Location=Taiwan, Source=FB, emotion=happy, Gender=Man ▷ Generation and analysis now use $P(word \mid Model, Context)$
  • 129. Naïve Contextual Topic Model ▷ Text: I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. ▷ Generation with one set of topics per context (Year=2008, Year=2007): $P(w) = \sum_{j=1..C} \sum_{i=1..K} P(c=j)\, P(z=i \mid Context_j)\, P(w \mid Topic_i, Context_j)$ ▷ Topics shown for each context: Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten; Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten ▷ How do we estimate it? → Different approaches for different contextual data and problems
  • 130. Contextual Probabilistic Latent Semantic Analysis (CPLSA) (Mei, Zhai, KDD2006) ▷ An extension of PLSA model ([Hofmann 99]) by •  Introducing context variables •  Modeling views of topics •  Modeling coverage variations of topics ▷ Process of contextual text mining •  Instantiation of CPLSA (context, views, coverage) •  Fit the model to text data (EM algorithm) •  Compare a topic from different views •  Compute strength dynamics of topics from coverages •  Compute other probabilistic topic patterns
  • 131. The Probabilistic Model •  A probabilistic model explaining the generation of a document D and its context features C: if an author wants to write such a document, he will –  Choose a view $v_i$ according to the view distribution $p(v_i \mid D, C)$ –  Choose a coverage $\kappa_j$ according to the coverage distribution $p(\kappa_j \mid D, C)$ –  Choose a theme $\theta_{il}$ according to the coverage $\kappa_j$ –  Generate a word using $\theta_{il}$ –  The likelihood of the document collection is: $\log p(\mathcal{D}) = \sum_{(D,C) \in \mathcal{D}} \sum_{w \in V} c(w, D) \log \left( \sum_{i=1}^{n} p(v_i \mid D, C) \sum_{j=1}^{m} p(\kappa_j \mid D, C) \sum_{l=1}^{k} p(l \mid \kappa_j)\, p(w \mid \theta_{il}) \right)$
  • 132. Contextual Text Mining Example 1 Event Identification 132 @ Yi-Shin Chen, Text Mining Overview
  • 133. Introduction ▷ Event definition: •  Something (non-trivial) happening in a certain place at a certain time (Yang et al. 1998) ▷ Features of events: •  Something (non-trivial) •  Certain time •  Certain place 133 content temporal location user location @ Yi-Shin Chen, Text Mining Overview
  • 134. Goal ▷ Identify events from social streams ▷ Events contain these characteristics 134 Event What WhenWho •  Content •  Many words related to one topic •  Happened time •  Time dependent •  Influenced users •  Transmit to others Concept-Based Evolving Social Graphs @ Yi-Shin Chen, Text Mining Overview
  • 135. Keyword Selection ▷ Well-noticed criterion •  A word is well-noticed if, compared to the past, it is suddenly mentioned by many users •  Time Frame – a unit of time period •  Sliding Window – a certain number of past time frames [Figure: timeline of time frames tf0, tf1, tf2, tf3, tf4]
  • 136. Keyword Selection ▷ Compare the probability proportion of this word count with the past sliding window 136
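The exact formula on this slide did not survive extraction, so the following is only one plausible way to compare a word's current proportion with its proportion over the past sliding window. The ratio form, the function names, and the threshold of 3.0 are all assumptions, not the authors' definition.

```python
def burst_ratio(current_count, current_total, past_counts, past_totals, eps=1e-9):
    """Ratio of the word's proportion in the current time frame to its
    proportion over the past sliding window (hypothetical criterion)."""
    current_prop = current_count / max(current_total, 1)
    past_prop = sum(past_counts) / max(sum(past_totals), 1)
    return current_prop / (past_prop + eps)

def is_well_noticed(ratio, threshold=3.0):
    """A word counts as well-noticed when its ratio exceeds the threshold."""
    return ratio > threshold

# Hypothetical counts: 50 mentions among 1000 now, 5 among 1000 in each of 4 past frames.
bursty = burst_ratio(50, 1000, [5, 5, 5, 5], [1000] * 4)
steady = burst_ratio(5, 1000, [5, 5, 5, 5], [1000] * 4)
print(round(bursty, 2), is_well_noticed(bursty))
print(round(steady, 2), is_well_noticed(steady))
```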
  • 137. Event Candidate Recognition ▷ Concept-based event: all keywords about the same event •  Use the co-occurrence of words in tweets to connect keywords 137 Huge amount of keywords connected as one event boston explosion confirm prayerbombing boston- marathon threat iraq jfk hospital victim afghanistan bomb america @ Yi-Shin Chen, Text Mining Overview
  • 138. Event Candidate Recognition ▷ Idea: group one keyword with its most relevant keywords into one event candidate ▷  How do we decide which keyword to start grouping? 138 boston explosion confirm prayerbombing boston- marathon threat iraq jfk hospital victim afghanistan bomb america @ Yi-Shin Chen, Text Mining Overview
  • 139. Event Candidate Recognition ▷ TextRank: rank importance of each keyword •  Number of edges •  Edge weights (word count and word co-occurrence): $weight = \frac{cooccur(k_1, k_2)}{count(k_1)}$ ▷ Example (figure labels): Boston, Boston marathon; 49438, 21518; 16566/49438, 16566/21528
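The edge weight is directional: the same co-occurrence count is divided by the count of the source keyword. A minimal sketch over a handful of invented tweets, assuming that counts are the number of tweets containing a keyword; the function name is hypothetical.

```python
from collections import Counter
from itertools import permutations

def cooccurrence_weights(tweets):
    """weight(k1 -> k2) = cooccur(k1, k2) / count(k1), counted per tweet."""
    count = Counter()
    cooccur = Counter()
    for tokens in tweets:
        unique = set(tokens)
        count.update(unique)
        # Ordered pairs, so each co-occurrence yields one edge per direction.
        cooccur.update(permutations(sorted(unique), 2))
    return {(k1, k2): c / count[k1] for (k1, k2), c in cooccur.items()}

# Invented tweets, already reduced to their keywords.
tweets = [
    ["boston", "marathon"],
    ["boston", "marathon", "explosion"],
    ["boston", "prayer"],
    ["boston"],
]
weights = cooccurrence_weights(tweets)
print(weights[("boston", "marathon")], weights[("marathon", "boston")])

# The first weight shown on the slide: 16566 co-occurrences over 49438 mentions.
slide_weight = 16566 / 49438
print(round(slide_weight, 3))  # 0.335
```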
  • 140. Event Candidate Recognition 1.  Pick the most relevant neighbor keywords 140 boston explosion prayer safe 0.08 0.07 0.01 thought 0.05 marathon 0.13 Normalized weight @ Yi-Shin Chen, Text Mining Overview
  • 141. Event Candidate Recognition 2.  Check neighbor keywords with the secondary keyword 141 boston explosion prayer thought marathon 0.1 0.026 0.006 @ Yi-Shin Chen, Text Mining Overview
  • 142. Event Candidate Recognition 3.  Group the remaining keywords as one event candidate 142 boston explosion prayer marathon Event Candidate @ Yi-Shin Chen, Text Mining Overview
  • 143. Evolving Social Graph Analysis ▷ Bring in social relationships to estimate information propagation ▷ Social Relation Graph •  Vertex: users that mentioned one or more keywords in the candidate event •  Edge: the following relationship between users 143 Boston marathon explosion bomb Boston marathon Boston bomb Marathon explosion Marathon bomb @ Yi-Shin Chen, Text Mining Overview
  • 144. Evolving Social Graph Analysis ▷ Add in evolving idea (time frame increment): graph sequences 144 tf1 tf2 @ Yi-Shin Chen, Text Mining Overview
  • 145. Evolving Social Graph Analysis ▷ Information decay: •  Vertex weight, edge weight •  Decay mechanism ▷ Concept-Based Evolving Graph Sequences (cEGS): a sequence of directed graphs that demonstrate information propagation 145 tf1 tf2 tf3 @ Yi-Shin Chen, Text Mining Overview
  • 146. Methodology – Evolving Social Graph Analysis ▷ Hidden link – construct hidden relationship •  To model better interaction between users •  Sample data 146 @ Yi-Shin Chen, Text Mining Overview
  • 147. Evolving Social Graph Analysis ▷ Analysis of cEGS •  Number of vertices (nV) →  The number of users who mentioned this event candidate •  Number of edges (nE) →  The number of following relationships in this cEGS •  Number of connected components (nC) →  The number of communities in this cEGS •  Reciprocity (R) →  The degree of mutual connection in this cEGS: $Reciprocity = \frac{\text{number of edges}}{\text{number of possible edges}}$
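The four measures can be computed for one graph of the sequence with plain Python. In this sketch, connected components ignore edge direction, and the number of possible edges is taken as nV × (nV − 1) for a directed graph without self-loops; both are assumptions, as is the small example graph.

```python
def graph_metrics(vertices, edges):
    """Return (nV, nE, nC, R) for a directed graph given as vertex and edge collections."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    n_v = len(vertices)
    n_e = len(edges)
    n_c = len({find(v) for v in vertices})
    possible = n_v * (n_v - 1)
    reciprocity = n_e / possible if possible else 0.0
    return n_v, n_e, n_c, reciprocity

# Hypothetical users and following relationships (follower, followee).
users = ["a", "b", "c", "d"]
follows = {("a", "b"), ("b", "a"), ("c", "b")}
n_v, n_e, n_c, r = graph_metrics(users, follows)
print(n_v, n_e, n_c, r)
```

Tracking these four values across time frames gives the curves used to separate one-shot events, long-run events, and non-events.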
  • 148. Event Identification ▷ Type 1 Event: One-shot event •  An event that receives attention in a short period of time →  Number of users, number of followings, number of connected components suddenly increase ▷ Type 2 Event: Long-run event •  An event that attracts much discussion for a period of time →  Reciprocity ▷ Non-event
  • 149. Experimental Results ▷ April 15, “library jfk blast bostonmarathon prayforboston pd possible area boston police social explosion bomb bombing marathon confirm incident” [Figure: number of users, number of followings, and number of connected components (axis 0 to 12000) together with reciprocity and hidden link reciprocity (axis 0 to 0.12), plotted over days 1 to 30 in April]
  • 150. Contextual Text Mining Example 2 Location Identification 150
  • 151. Twitter and Geo-Tagging ▷ Geospatial features in Twitter have been available •  User’s Profile location •  Per tweet geo-tagging ▷ Users have been slow to adopt these features •  On a 1-million-record sample, 0.30% of tweets were geo-tagged ▷ Locations are not specific 151 @ Yi-Shin Chen, Text Mining Overview
  • 152. Our Goal ▷ Identify the location of a particular Twitter user at a given time •  Using exclusively the content of his/her tweets 152 @ Yi-Shin Chen, Text Mining Overview
  • 153. Data Sources ▷ Twitter Dataset •  13 million tweets, 1.53 million profiles •  From Nov. ‘11 to Apr. ’12 ▷ Local Cache •  Geospatial Resources •  GeoNames Spatial DB •  Wikiplaces DB •  Lexical Resources •  WordNet ▷ Web Resources 153 @ Yi-Shin Chen, Text Mining Overview
  • 154. Baseline Classification ▷ We are interested only in tweets that might suggest a location. ▷ Tweet Classification •  Direct Subject •  Has a first person personal pronoun – “I love New York” →  (I, me, myself,…) •  Anonymous Subject →  Starts with Verbs – “Rocking the bar tonight!” →  Composed of only nouns and adjectives – “Perfect day at the beach” •  Others 154 @ Yi-Shin Chen, Text Mining Overview
  • 155. Rule Generation ▷ By identifying certain association rules, we can identify certain locations ▷ Combinations of certain verbs and nouns (bigrams) imply some locations •  “Wait” + “Flight” → Airport •  “Go” + “Funeral” → Funeral Home ▷ Create combinations of all synonyms, meronyms and hyponyms related to a location ▷ Number of occurrences must be greater than K ▷ Each rule does not overlap with other categories 155 @ Yi-Shin Chen, Text Mining Overview
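The two filtering conditions (more than K occurrences, no overlap between categories) can be sketched as follows. The observations are invented, the `("go", "dinner")` bigram and the “Home” category are hypothetical, and the synonym, meronym, and hyponym expansion step is left out.

```python
from collections import Counter, defaultdict

def generate_rules(observations, k):
    """Keep (verb, noun) bigrams seen more than k times that point to one location only."""
    counts = Counter(observations)
    by_bigram = defaultdict(dict)
    for (verb, noun, location), n in counts.items():
        by_bigram[(verb, noun)][location] = n

    rules = {}
    for bigram, locations in by_bigram.items():
        if len(locations) != 1:
            continue  # the bigram overlaps several categories
        location, n = next(iter(locations.items()))
        if n > k:
            rules[bigram] = location
    return rules

# Invented (verb, noun, location) observations.
observations = (
    [("wait", "flight", "Airport")] * 3
    + [("go", "funeral", "Funeral Home")] * 2
    + [("go", "dinner", "Restaurant"), ("go", "dinner", "Home")] * 2
)
rules = generate_rules(observations, k=1)
print(rules)
```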
  • 156. Examples [Figure: word graph around Airport with the labels Airfield, planes, Control Tower, Gates, Airline, Helicopter, Check-in, Helipad, Fly, Heliport, Restaurant, Gym, Wait, Flight] ▷ Hyponyms: Airfield is a kind of airport ▷ Meronyms: Gates is a part of airport
  • 157. N-gram Combinations ▷ Identify likely location candidates ▷ Contiguous sequences of n-items ▷ Only nouns and adjectives ▷ Extracted from Direct Subject and Anonymous Subject categories. 157 @ Yi-Shin Chen, Text Mining Overview
  • 158. Tweet Types ▷ Coordinates •  Tweet has geographical coordinates ▷ Explicit Specific •  “I Love Hsinchu” →  Toponyms ▷ Explicit General •  “Going to the Gym” →  Through Hypernyms ▷ Implicit •  “Buying a new scarf” - Department Store •  Emphasis on actions •  Identified through a Web Search
  • 159. Country Discovery ▷ Identify the user’s country ▷ Identified with OPTICS algorithm ▷ Cluster all previously marked n-grams ▷ Most significant cluster is retrieved •  User’s country 159
  • 160. Inner Region Discovery ▷ Identify user’s Hometown ▷ Remove toponyms ▷ Clustered with OPTICS ▷ Only locations within ± 2 σ 160 @ Yi-Shin Chen, Text Mining Overview
  • 161. Inner Region Discovery 161 •  Identify user’s Hometown •  Remove toponyms •  Clustered with OPTICS •  Only locations within ± 2 σ •  Most Representative cluster – Reversely Geocoded – Most representative cityCoordinates Lat: 41.3948029987514 Long : -73.472126500681 Reverse Geocoding @ Yi-Shin Chen, Text Mining Overview
  • 162. Web Search ▷ Use web services ▷ Example : •  “What's the weather like outside? I haven't left the library in three hours” •  Search: Library near Danbury, Connecticut , US 162
  • 163. Timeline Sorting 163 9AM “I have to agree that Washington is so nice at this time of the year” 11AM “Heat Wave in Texas” @ Yi-Shin Chen, Text Mining Overview
  • 164. Tweet A: 9AM “I have to agree that Washington is so nice at this time of the year” Tweet B: 11AM “Heat Wave in Texas” [Figure: timeline with Tweet A and Tweet B, spanning from (Tweet A) – X days to (Tweet B) + X days]
  • 165. Location Inferred ▷ Users are classified according to the particular findings : •  No Country (No Information) •  Just Country •  Timeline →  Current and past locations •  Timeline with Hometown →  Current and past locations →  User’s Hometown →  General Locations 165 @ Yi-Shin Chen, Text Mining Overview
  • 166. General Statistics @ Yi-Shin Chen, Text Mining Overview 166
  • 167. Some Remarks Some but not all 167 @ Yi-Shin Chen, Text Mining Overview
  • 168. Knowledge Discovery (KDD) Process 168 Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 169. Always Remember ▷ Have a good and solid objective •  No goal no gold •  Know the relationships between them 169 @ Yi-Shin Chen, Text Mining Overview