Twitter has become the de facto medium for consumption of news in real time, and billions of events are generated and analyzed on a daily basis. To analyze these events, Twitter designed its own next-generation streaming system, Heron. Arun Kejariwal and Karthik Ramasamy walk you through how Heron is used to detect anomalies in real-time data streams. Although there’s been over 75 years of prior work in anomaly detection, most of the techniques cannot be used off the shelf because they’re not suitable for high-velocity data streams. Arun and Karthik explain how to make trade-offs between accuracy and speed and discuss incremental approaches that marry sampling with robust measures such as median and MCD for anomaly detection.
3. 3
DATA @ MZ
An Overview
GOW AND MOBILE STRIKE
Peaked at 1M events/sec
MARKETING
Serve >1B impressions/day worldwide
Integrated with >150 distinct advertising channels
POTPOURRI
~35B messages/day
Writes: 20TB/day
4. 4
SENSORS
Monitoring
Smartwatches, Refrigerators
Wearables
ACTUATORS
Automation
Manufacturing
Robotics
DRONES
Expanding the scope
Delivery, Real Estate
Power Transmission Lines
MOBILE
Life’s Remote Control
Personalization
Productivity
EXPLOSION IN DATA VELOCITY AND VOLUME
5. 5
MANUFACTURING
HEALTH
Care
POWER
Grid
GAS
Pipelines
SECURITY
OPERATIONS
ROBOTICS
# TWEETS
per minute
ANOMALY DETECTION: WHY BOTHER?
DIGITAL
Marketing
CONNECTED
Cars
8. 8
RESEARCHED FOR >100 YEARS
Manufacturing
Econometrics
Networking
Image Processing
Computer Vision
(Cyber) Security
Text Mining
Signal Processing
Finance
Experimental Social Psychology
Web Operations
Statistics (and Time Series Analysis)
Data Fidelity
Astronomy
ANOMALY DETECTION: APPLICATION DOMAINS
10. 10
FALSE Positive Rate
FALSE Negative Rate
SCALE
Data Granularity
WHY NOT USE OFF-THE-SHELF?
Anomalies are CONTEXTUAL
11. 11
Severity
Different Actions
Page or not
Data Characteristics
Stationarity, Normal Distribution
Data Fidelity
Missing Data
Data Corruption
MOSTLY UNSUPERVISED
13. 13
MEAN AND STANDARD DEVIATION
Mean: Compute incrementally
Not robust in the presence of anomalies
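The two points above can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' implementation: the class name `RunningStats` and the sample values are assumptions made for the example, and the update rule is Welford's method, one standard way to compute the mean and standard deviation incrementally.

```python
import math


class RunningStats:
    """Tracks mean and standard deviation one point at a time (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; reported as 0.0 until two points have arrived.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


# Hypothetical readings: identical except for one anomalous last value.
clean = RunningStats()
for value in [10, 11, 9, 10, 12, 10]:
    clean.update(value)

spiked = RunningStats()
for value in [10, 11, 9, 10, 12, 1000]:
    spiked.update(value)

print(clean.mean, clean.std)
print(spiked.mean, spiked.std)
```

Each update costs constant time and memory, which is why the mean suits streams. A single anomalous value still drags both the mean and the standard deviation far from the bulk of the data, which is the lack of robustness noted above.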
COMMONLY USED STATISTICS
TRIMMED MEAN
Robust in the presence of anomalies
Small samples?
How to handle asymmetric distributions?
Results in a biased estimator
What should be the trimming boundaries?
WINSORIZED MEAN
L-ESTIMATORS
Linear combinations of order statistics
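A short sketch of the trimmed and Winsorized means, both of which are L-estimators (linear combinations of order statistics). The function names, the 10% trimming fraction, and the sample values are assumptions chosen for illustration; the slides leave the choice of trimming boundaries open.

```python
def trimmed_mean(values, fraction):
    """Drop the lowest and highest `fraction` of points, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of points cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, fraction):
    """Clamp the lowest and highest `fraction` of points to the nearest kept value."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * fraction)
    low, high = ordered[k], ordered[n - 1 - k]
    clamped = [min(max(v, low), high) for v in ordered]
    return sum(clamped) / n


# Hypothetical sample with one anomalous reading (1000). fraction must stay below 0.5.
sample = [10, 11, 9, 10, 12, 1000, 10, 11, 9, 10]

plain = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.1)
winsorized = winsorized_mean(sample, 0.1)
print(plain, trimmed, winsorized)
```

With symmetric trimming, an asymmetric distribution loses more mass from one tail than the other, which is one source of the bias mentioned above.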
14. 14
ROBUST STATISTICS
MEDIAN AND MEDIAN ABSOLUTE DEVIATION (MAD)
Robust in the presence of anomalies
Not amenable to incremental computation
Use q-digest, t-digest
What if MAD is zero?
A sample with many similar values
BROADENED MEDIAN, M-ESTIMATORS, SN AND QN
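A minimal sketch of median/MAD scoring, including one possible answer to the "MAD is zero" question. The fallback to the mean absolute deviation, the constants 1.4826 and 1.2533 (which make each scale comparable to a standard deviation under normality), and the 3.5 cutoff are assumptions taken from common practice, not from the slides. The batch `statistics.median` call stands in for the q-digest or t-digest a streaming system would use.

```python
import statistics


def robust_scores(values):
    """Score each point by its distance from the median, in robust scale units."""
    center = statistics.median(values)
    abs_dev = [abs(v - center) for v in values]
    mad = statistics.median(abs_dev)
    if mad > 0:
        scale = 1.4826 * mad
    else:
        # Assumed fallback: more than half the points equal the median, so MAD is 0.
        scale = 1.2533 * (sum(abs_dev) / len(abs_dev))
    if scale == 0:
        return [0.0 for _ in values]  # every value is identical
    return [(v - center) / scale for v in values]


def flag(values, cutoff=3.5):
    """Indexes of points whose absolute score exceeds the (assumed) cutoff."""
    return [i for i, s in enumerate(robust_scores(values)) if abs(s) > cutoff]


series = [10, 11, 9, 10, 12, 1000, 10]  # hypothetical data, MAD is 1
flat = [5, 5, 5, 5, 5, 5, 9]            # hypothetical data, MAD is 0

print(flag(series))
print(flag(flat))
```

Without the fallback, the `flat` series would force a division by zero, even though the value 9 is clearly unusual relative to its neighbors.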
15. 15
ANALYZE INDIVIDUAL TIME SERIES
Too many alerts
Not actionable
Alert Fatigue
MULTIPLE TIME SERIES
Methods
MINIMUM COVARIANCE DETERMINANT (MCD)
Proposed by Rousseeuw, 1984
Mahalanobis distance [1]
FastMCD
[1] “On the generalised distance in statistics”, by P. C. Mahalanobis, 1936.
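To make the MCD idea concrete, here is a brute-force sketch for two time series sampled at the same instants. It enumerates every subset of h points, keeps the one whose covariance matrix has the smallest determinant, and scores all points by Mahalanobis distance from that fit. This exhaustive search only works for tiny inputs; FastMCD exists precisely because it does not scale. The function names, the data, and the choice h = 8 are assumptions for illustration, and the consistency correction and reweighting step used by production implementations are omitted.

```python
from itertools import combinations
import math


def mean_and_cov(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def determinant(cov):
    sxx, sxy, syy = cov
    return sxx * syy - sxy * sxy


def exact_mcd(points, h):
    """Brute force: the h-subset whose covariance has the smallest determinant."""
    fits = (mean_and_cov(subset) for subset in combinations(points, h))
    return min(fits, key=lambda fit: determinant(fit[1]))


def mahalanobis(point, center, cov):
    sxx, sxy, syy = cov
    dx, dy = point[0] - center[0], point[1] - center[1]
    # Quadratic form d^T * inverse(cov) * d, written out for the 2x2 case.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / determinant(cov)
    return math.sqrt(q)


# Hypothetical paired readings: eight that move together, then two that do not.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1),
          (6.0, 6.2), (7.0, 6.8), (8.0, 8.1), (2.0, 9.0), (9.0, 1.0)]

center, cov = exact_mcd(points, h=8)
distances = [mahalanobis(p, center, cov) for p in points]
for p, d in zip(points, distances):
    print(p, round(d, 2))
```

Because the location and scatter are estimated from the most concentrated subset, the two discordant points cannot inflate the covariance and mask themselves, which is what happens with the classical mean and covariance.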
16. 16
MULTIPLE TIME SERIES
Other Methods
CORRELATION
Direction
Magnitude
nxn Correlation Matrix?
Bake in context
Exploit topology
17. 17
CHALLENGES
Susceptible to Anomalies
Data Skew
Missing Data
Speed
MULTIPLE TIME SERIES
Other Methods
TECHNIQUES
Robust Correlation
Cross Correlation
Intersection Analysis
Trade-off between speed and accuracy
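The slide does not say which robust correlation is meant. As one assumed stand-in, the sketch below compares Pearson correlation with Spearman rank correlation on two hypothetical series that move together except for a single corrupted reading. All names and values are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [2, 4, 6, -500, 10, 12, 14, 16]  # one corrupted reading at index 3

r_pearson = pearson(a, b)
r_spearman = spearman(a, b)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Ranking costs a sort per window, so the robustness is paid for in speed, an instance of the trade-off named on the slide.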
20. 20
RTplatform
Cloud-based platform built for connecting, processing,
and reacting to live data.
+ Extreme scale
+ High performance
+ Unprecedented reliability
+ Natively serverless
22. 22
Live Stream Bots
A backbone for live data:
Free Messaging for publishers and subscribers
Filter, analyze and transform messages in live stream
Notify
Anomaly detection
RTplatform
MESSAGING Real-time Pub/Sub with ultra-low latency and high fanout
QUERYING Filter, analyze, and transform messages live, in-stream
BOTS Deploy rule-based bots for real-time anomaly detection/reaction
25. 25
HERON DESIGN GOALS
Task isolation
Ease of debug-ability/isolation/profiling
Support for back pressure
Topologies should be self-adjusting
Efficiency
Reduce resource consumption
Off-the-shelf schedulers
Unmanaged - Apache YARN/Mesos
Managed - Apache Aurora, Amazon ECS
Use of mainstream languages
C++, Java and Python
Batching of tuples
Amortizing the cost of transferring tuples
30. 30
BACKPRESSURE
Stragglers are the norm in multi-tenant distributed systems
BAD HOST
EXECUTION SKEW
INADEQUATE PROVISIONING
31. 31
SENDERS TO STRAGGLER: DROP DATA
BACKPRESSURE
Approaches to Handle Stragglers
DETECT STRAGGLERS AND RESCHEDULE THEM
SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER
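Two of these approaches, dropping data and slowing the sender, can be contrasted with a small deterministic simulation. This is a toy model under stated assumptions, not Heron's mechanism: the function `simulate`, the tick-based timing, the rates, and the buffer capacity are all invented for the example.

```python
from collections import deque


def simulate(policy, ticks, produce_per_tick, consume_per_tick, capacity):
    """Run a fast sender against a slow receiver that share a bounded buffer."""
    buffer = deque()
    emitted = delivered = dropped = 0
    for _ in range(ticks):
        for _ in range(produce_per_tick):
            if len(buffer) < capacity:
                buffer.append(emitted)
                emitted += 1
            elif policy == "drop":
                emitted += 1
                dropped += 1  # buffer full: the tuple is discarded
            else:
                break  # "backpressure": the sender stalls until there is room
        for _ in range(min(consume_per_tick, len(buffer))):
            buffer.popleft()
            delivered += 1
    return {"emitted": emitted, "delivered": delivered,
            "dropped": dropped, "queued": len(buffer)}


# Hypothetical rates: the sender is three times faster than the straggler.
drop_result = simulate("drop", ticks=10, produce_per_tick=3,
                       consume_per_tick=1, capacity=4)
backpressure_result = simulate("backpressure", ticks=10, produce_per_tick=3,
                               consume_per_tick=1, capacity=4)
print(drop_result)
print(backpressure_result)
```

In both runs the straggler delivers the same number of tuples. The difference is where the excess goes: the drop policy loses it, while under backpressure the sender's emit rate settles at the straggler's rate once the buffer fills.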
37. 37
IN MOST SCENARIOS BACK PRESSURE RECOVERS
Without any manual intervention
BACKPRESSURE
In Practice
SOMETIMES USER PREFERS DROPPING OF DATA
Care about only latest data
SUSTAINED BACK PRESSURE
Irrecoverable GC cycles, Bad or faulty host
40. 40
PLUG AND PLAY COMPONENTS
As environment changes, core does not change
MULTI LANGUAGE INSTANCES
Support multiple language APIs with native instances
MULTIPLE PROCESSING SEMANTICS
Efficient stream managers for each semantics
EASE OF DEVELOPMENT
Faster development of components with little dependency
HERON: EXTENSIBLE STREAMING ENGINE
41. 41
REPEATED SERIALIZATION
Java objects -> Byte Arrays -> Protocol Buffers
EAGER DESERIALIZATION
Stream manager deserializes entire tuple even though full contents are not examined
IMMUTABILITY
Stream manager does not reuse any ProtoBuf objects
OPTIMIZING HERON
42. 42
HERON: PERFORMANCE
At most once semantics
[Figure: THROUGHPUT. Million tuples/min vs. spout parallelism (25, 100, 200), without and with optimizations.]
[Figure: THROUGHPUT PER CORE. Million tuples/min vs. spout parallelism (25, 100, 200), without and with optimizations.]
43. 43
HERON: PERFORMANCE
At least once semantics
[Figure: THROUGHPUT. Million tuples/min vs. spout parallelism (25, 100, 200), without and with optimizations.]
[Figure: LATENCY. Milliseconds vs. spout parallelism (25, 100, 200), without and with optimizations.]
44. 44
HERON: PERFORMANCE
At least once semantics - Impact of Cache Drain Frequency
[Figure: THROUGHPUT VS CACHE DRAIN FREQUENCY. Million tuples/min vs. cache drain frequency (ms), series labeled 200, 100, and 25.]
[Figure: LATENCY VS CACHE DRAIN FREQUENCY. Latency (ms) vs. cache drain frequency (ms), series labeled 200, 100, and 25.]
45. 45
WE ARE HIRING!
HALBERT Nakagawa
Co-Founder & CTO
FRANCOIS Orsini
CTO
JOSH Lulewicz
Head of Data Platform
KARTHIK Ramasamy
Manager
47. 47
READINGS
STORM @ TWITTER
A. Toshniwal et al., SIGMOD 2014.
TWITTER HERON: STREAM PROCESSING AT SCALE
S. Kulkarni et al., SIGMOD 2015.
STREAMING @ TWITTER
M. Fu, 2016.
TWITTER HERON: TOWARDS EXTENSIBLE STREAMING ENGINES
M. Fu, ICDE 2017.
48. 48
READINGS
LIMIT THEOREMS FOR THE MEDIAN DEVIATION
P. Hall and A. H. Welsh, 1985.
ALTERNATIVES TO THE MEDIAN ABSOLUTE DEVIATION
P. J. Rousseeuw and C. Croux, 1993.
ASYMPTOTIC INDEPENDENCE OF MEDIAN AND MAD
M. Falk, 1997.
BAHADUR REPRESENTATIONS FOR THE MEDIAN ABSOLUTE DEVIATION AND ITS MODIFICATIONS
S. Mazumder and R. Serfling, 2009.
THE MINIMUM REGULARIZED COVARIANCE DETERMINANT ESTIMATOR
K. Boudt, P. J. Rousseeuw, S. Vanduffel and T. Verdonck, 2017.