The document discusses mining Twitter data in real-time for trend and information discovery. It describes two works: 1) Classifying emerging trending topics on Twitter with 78.4% accuracy using social features of tweets rather than text. 2) Summarizing live events tweeted on Twitter, such as soccer games, with over 80% precision and recall by detecting sub-events and selecting representative tweets. The outlook discusses further analyzing trend types and evaluating the summarization approach on other event types.
Mining Twitter for Real-Time Trend and Information Discovery
1. Mining Twitter for real-time trend and information
discovery
Yahoo! Research Barcelona
Arkaitz Zubiaga
NLP & IR Group @ UNED
December 19th, 2011
2. Motivation
Index
1. Motivation
2. Our Work (I): Classification of Trending Topics
3. Our Work (II): Real-Time Summarization of Events
4. Outlook
Arkaitz Zubiaga (UNED)
Real-time mining of Twitter
December 19th, 2011
2 / 43
3. Motivation
Twitter
Twitter is a microblogging service with over 200 million users.
Users share short messages of up to 140 characters (tweets).
7. Motivation
Increase of activity on Twitter
As of October 2011, Twitter received 250 million tweets per day.
9. Motivation
Usefulness of Twitter
Twitter provides:
1. large amounts of data in real time,
2. from a wide variety of sources,
3. with the ability to spread rapidly.
11. Motivation
Using Twitter for... following events
(1) Live-tweeting about and following events.
12. Motivation
Using Twitter for... helping others
(2) Helping others, as in natural disasters.
13. Motivation
Using Twitter for... finding out about news
and (3) Finding out about breaking news.
14. Motivation
Twitter in the media
Lots of researchers are analyzing tweets.
15. Motivation
Trends on Twitter
The news about the Japan earthquake broke on Twitter.
17. Motivation
Research on Twitter
Most research on Twitter focuses on analyzing streams after the events
have happened.
Very little research deals with the real-time analysis of streams.
Our goal: How can we mine Twitter streams to acquire real-time
knowledge about events and trends?
18. Our Work (I): Classification of Trending Topics
19. Our Work (I): Classification of Trending Topics
Trending Topics on Twitter
Trending topics reflect the top conversations, those being discussed on
Twitter more than usual.
20. Our Work (I): Classification of Trending Topics
What produces trending topics?
What kinds of events trigger those trending topics?
21. Our Work (I): Classification of Trending Topics
Typology of Trending Topics
News: Japan earthquake.
Current events: a soccer game.
Memes: funny and viral ideas.
Commemoratives: World AIDS Day.
22. Our Work (I): Classification of Trending Topics
Goal
Find out the type of a trending topic as soon as it emerges.
23. Our Work (I): Classification of Trending Topics
Dataset
1,036 unique trending topics, each with up to 1,500 associated
tweets collected as soon as the topic trended.
Manual classification of trending topics:
616 current events.
251 memes.
142 news.
27 commemoratives.
24. Our Work (I): Classification of Trending Topics
Experiment Settings
Support Vector Machines (one-against-all)
500 trends for the training set.
10 runs.
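The evaluation protocol above (500 trends for training, repeated over 10 runs) can be sketched as follows. This is only a minimal sketch: the random split, the fixed seed, and the majority-class stand-in classifier are assumptions for illustration and are not the SVM setup used in the talk. Only the class counts and the 500 / 10 settings come from the slides.

```python
import random
from collections import Counter

# Class counts taken from the dataset slide. The label list stands in for
# real feature vectors, which are not available here.
CLASS_COUNTS = {
    "current events": 616,
    "memes": 251,
    "news": 142,
    "commemoratives": 27,
}


def build_labels(class_counts):
    """Expand per-class counts into one label per trending topic."""
    return [label for label, n in class_counts.items() for _ in range(n)]


def majority_baseline_run(labels, train_size, rng):
    """One run: random split, fit a majority-class stand-in, return test accuracy."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    train, test = shuffled[:train_size], shuffled[train_size:]
    # Hypothetical stand-in for the classifier: always predict the most
    # frequent training label. A real experiment would train an SVM here.
    predicted = Counter(train).most_common(1)[0][0]
    return sum(1 for y in test if y == predicted) / len(test)


def average_accuracy(labels, train_size=500, runs=10, seed=0):
    """Average test accuracy over several random splits."""
    rng = random.Random(seed)
    scores = [majority_baseline_run(labels, train_size, rng) for _ in range(runs)]
    return sum(scores) / len(scores)


labels = build_labels(CLASS_COUNTS)
baseline = average_accuracy(labels)
```

Swapping the stand-in for a trained classifier keeps the same split-and-average structure; the majority-class predictor is included only so the sketch runs without external libraries.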
25. Our Work (I): Classification of Trending Topics
Representation of Trending Topics
2 different representation approaches:
Twitter features: 15 straightforward language-independent
features that rely on the social spread of trends.
Bag-of-words: Text of tweets (TF).
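As an illustration of the bag-of-words representation, the sketch below builds a term-frequency (TF) vector for one trending topic from its tweets. The whitespace tokenizer, the lowercasing, and the sample tweets are assumptions, since the slides do not specify the preprocessing. The 15 Twitter features are not listed in the slides, so they are not reproduced here.

```python
from collections import Counter


def tf_vector(tweets):
    """Return raw term frequencies over all tweets of one trending topic."""
    counts = Counter()
    for tweet in tweets:
        # Assumed tokenization: lowercase and split on whitespace.
        counts.update(tweet.lower().split())
    return counts


# Hypothetical tweets for a single trending topic.
sample_tweets = ["Goal for Uruguay", "what a goal", "Uruguay leads"]
vector = tf_vector(sample_tweets)
```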
26. Our Work (I): Classification of Trending Topics
Results
27. Our Work (I): Classification of Trending Topics
Results
28. Our Work (I): Classification of Trending Topics
Main findings
Trending topics can be categorized accurately (78.4%) using social
features:
Outperforming use of textual content.
Without making use of external data.
In real-time as the trending topic emerges.
Arkaitz Zubiaga, Damiano Spina, Víctor Fresno, and Raquel Martínez.
2011. Classifying trending topics: a typology of conversation triggers on
Twitter. CIKM 2011.
29. Our Work (II): Real-Time Summarization of Events
30. Our Work (II): Real-Time Summarization of Events
Events on Twitter
When users live-tweet about events:
They produce vast amounts of tweets about events.
Users want to follow what others say.
Users cannot follow the overwhelming amounts of tweets.
31. Our Work (II): Real-Time Summarization of Events
Stream summarization
Can we summarize streams of tweets in such a way that:
Users receive a reduced stream that they can follow?
Users do not miss any key sub-event that occurred during the event?
32. Our Work (II): Real-Time Summarization of Events
Study of soccer games
Copa America 2011 (July 1-26, 2011):
26 soccer games.
11k-70k tweets per game.
Tweets are written in 30 languages.
33. Our Work (II): Real-Time Summarization of Events
Gold Standard
Live reports gathered from Yahoo! Sports.
Yahoo! journalists provide annotations for:
Goals.
Penalties.
Red Cards.
Disallowed Goals.
Game Starts, Ends, Stops & Resumptions.
34. Our Work (II): Real-Time Summarization of Events
Histogram of a Soccer Game
[Figure: histogram of the tweet rate (y-axis, ticks from 500 to 2500) over time elapsed (x-axis, timestamps from 1310854000 to 1310864000).]
35. Our Work (II): Real-Time Summarization of Events
Summarization of soccer games
2-step summarization:
1. Sub-event detection.
2. Tweet selection.
[Diagram: the real-time tweets stream feeds Sub-event Detection, then Tweet Selection, which produces the summary.]
36. Our Work (II): Real-Time Summarization of Events
1st Step: Sub-event Detection
Increase [Zhao et al., 2011]: a sub-event is detected when the tweeting
rate increases suddenly (reaching 1.7 times the previous rate).
Outliers: learns from the audience. Tweeting rates that are high compared
to the rates seen so far are considered sub-events (above the 90th
percentile).
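The two detectors described above can be sketched as follows. This is a minimal sketch under stated assumptions: the tweeting rate is taken as a list of counts per fixed time interval, the percentile uses a nearest-rank definition, and the `warmup` parameter (how many intervals to observe before the outlier detector starts) is hypothetical. Only the 1.7 ratio and the 90th percentile come from the slides.

```python
def detect_increase(rates, ratio=1.7):
    """Flag interval i when rates[i] is at least `ratio` times rates[i - 1]."""
    return [
        i
        for i in range(1, len(rates))
        if rates[i - 1] > 0 and rates[i] >= ratio * rates[i - 1]
    ]


def percentile(values, pct):
    """Nearest-rank percentile (an assumed definition), using integer math."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


def detect_outliers(rates, pct=90, warmup=5):
    """Flag interval i when rates[i] exceeds the pct-th percentile of rates[:i]."""
    detected = []
    for i in range(warmup, len(rates)):
        # Only rates seen so far are used, so the detector works on a stream.
        if rates[i] > percentile(rates[:i], pct):
            detected.append(i)
    return detected


# Hypothetical tweets-per-interval counts for one game.
rates = [100, 110, 105, 120, 115, 400, 380, 130, 125, 900, 140]
increase_hits = detect_increase(rates)
outlier_hits = detect_outliers(rates)
```

The increase detector compares each interval only with its predecessor, while the outlier detector compares it with the whole history of the game so far, which is what lets it adapt to the size of the audience.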
37. Our Work (II): Real-Time Summarization of Events
1st Step: Results
            P      R      F1     # sub-events
Increase    0.29   0.81   0.41   45.4
Outliers    0.51   0.84   0.63   25.6
The increase-based approach reports more sub-events, many of them false
positives (it favors recall).
The outlier-based approach, which relies on outstanding tweeting rates,
improves both P and R.
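For reference, precision (P), recall (R) and F1 for a single game can be computed as in the sketch below. It assumes that detected and reference sub-events have already been mapped to comparable identifiers (here, hypothetical minute marks); the slides do not state the matching rule used against the Yahoo! Sports annotations.

```python
def precision_recall_f1(detected, reference):
    """Compute P, R and F1 from detected and reference sub-event identifiers."""
    detected, reference = set(detected), set(reference)
    true_positives = len(detected & reference)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical minute marks for one game.
p, r, f1 = precision_recall_f1(
    detected=[12, 30, 44, 58, 71],
    reference=[12, 44, 80],
)
```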
38. Our Work (II): Real-Time Summarization of Events
2nd Step: Tweet Selection
Each term appearing in the tweets of a given timeframe is given a weight
according to:
Frequency (TF).
Language Models (KLD, Kullback-Leibler divergence).
These weights make it possible to choose a representative tweet: the tweet
with the highest sum of the weights of its terms.
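The selection step above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the whitespace tokenizer, summing over the distinct terms of each tweet, the add-one smoothing, and the choice of the previous timeframes as the background language model are all assumptions, and the sample tweets are hypothetical.

```python
import math
from collections import Counter


def tokenize(tweet):
    """Assumed tokenization: lowercase and split on whitespace."""
    return tweet.lower().split()


def tf_weights(current_tweets):
    """Weight each term by its raw frequency in the current timeframe."""
    return Counter(term for tweet in current_tweets for term in tokenize(tweet))


def kld_weights(current_tweets, previous_tweets):
    """Weight each term by its contribution to KL(current || previous).

    Add-one smoothing over the joint vocabulary is an assumption, made so
    that terms unseen in the previous timeframes do not divide by zero.
    """
    cur = tf_weights(current_tweets)
    prev = tf_weights(previous_tweets)
    vocab = set(cur) | set(prev)
    cur_total = sum(cur.values()) + len(vocab)
    prev_total = sum(prev.values()) + len(vocab)
    weights = {}
    for term in cur:
        p = (cur[term] + 1) / cur_total
        q = (prev[term] + 1) / prev_total
        weights[term] = p * math.log(p / q)
    return weights


def select_tweet(current_tweets, weights):
    """Return the tweet whose distinct terms have the highest total weight."""
    return max(
        current_tweets,
        key=lambda tweet: sum(weights.get(t, 0.0) for t in set(tokenize(tweet))),
    )


# Hypothetical timeframes around a goal.
previous = ["uruguay vs paraguay kicks off", "come on paraguay", "uruguay looking sharp"]
current = ["goal suarez scores for uruguay", "goal goal goal", "suarez what a goal"]

best_tf = select_tweet(current, tf_weights(current))
best_kld = select_tweet(current, kld_weights(current, previous))
```

In this toy example the two weightings pick different tweets: TF rewards every frequent term, while the KLD weighting penalizes terms that were already common before the sub-event (here, "uruguay").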
40. Our Work (II): Real-Time Summarization of Events
Main findings
Use of state-of-the-art text analysis methods generates accurate
summaries:
With precision and recall values above 80% (100% for key
sub-events).
In real-time as the game is being played.
In 3 different languages (es, en, pt).
Without need of external data.
Damiano Spina, Arkaitz Zubiaga, Enrique Amigó, Julio Gonzalo. Towards
Real-Time Summarization of Events from Twitter Streams. To Appear.
41. Outlook
42. Outlook
Outlook
Work 1:
Dig further into each type of trending topic to look for
subtypes of trends.
Work 2:
Evaluate the performance of the summarizer on other kinds of
scheduled events (award ceremonies, keynote talks, ...).
Evaluate the novelty of the information garnered from tweets.