A quick tour through Spark Streaming and some of the debugging and tuning tips we've discovered over the past few years, plus how to use Storm to get low latency when you need it.
2. www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence/data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics code that operates on the data as it comes in, and design dashboards,
feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho). We will also create initial reports and provide training to
the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
● Quick overview of Spark Streaming
● Reasons why Spark Streaming can be tricky in practice
● Performance and tuning tips we’ve learnt over the past two years
● …and when to pack it all in and use Storm instead
What This Talk Is About
● Spark & Storm - millions of requests / second on commodity
hardware
● Different problems at different scales!
Beyond Web Scale
● Directed Acyclic Graph Data Processing Engine
● Based around the Resilient Distributed Dataset (RDD) primitive
Spark
● Streaming by running batches very quickly!
● Batch interval can be as low as 0.5 s
● Every X seconds, get Y records as an RDD (a DStream is the sequence of these RDDs)
Spark Streaming — Overview
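As a rough illustration of "every X seconds, get Y records", here is a minimal sketch of a Spark Streaming job with a 0.5-second batch interval. The socket source on localhost:9999, the app name, and the `local[2]` master are assumptions for the example, not part of the talk.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf().setAppName("micro-batch-sketch").setMaster("local[2]")

    // X = 0.5 seconds: a new batch (one RDD) is cut every 500 ms.
    val ssc = new StreamingContext(conf, Milliseconds(500))

    // Hypothetical source: newline-delimited text on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Y = however many records arrived during that interval.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```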
● Using same implementation (mostly) for batch and stream
processing (Lambda Architecture hipster points ahoy!)
● Access to the rest of Spark - DataFrames, MLlib, GraphX, etc.
Spark Streaming — Good Things
● What happens if you can’t process Y records in X seconds?
● What happens if you require sub-second latency?
Spark Streaming — Bad Things!
● What happens if you can’t process Y records in X seconds?
● Data builds up in executors
● Executors run out of memory…
Spark Streaming — Bad Things!
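The arithmetic behind that build-up can be sketched in plain Scala with no Spark involved. This is a deliberately simplified model: the fixed arrival rate, the fixed per-batch capacity, and the function name `backlog` are all assumptions for illustration.

```scala
object BacklogSketch {
  /** Records still waiting after each batch interval, given a fixed
    * arrival rate and a fixed number of records processed per interval. */
  def backlog(arrivalsPerBatch: Long, capacityPerBatch: Long, batches: Int): Seq[Long] =
    (1 to batches).scanLeft(0L) { (queued, _) =>
      // Whatever could not be processed carries over to the next interval.
      math.max(0L, queued + arrivalsPerBatch - capacityPerBatch)
    }.tail

  def main(args: Array[String]): Unit = {
    // 1000 records arrive per interval but only 800 can be processed:
    // the queue grows by 200 every interval (200, 400, 600, 800, 1000).
    println(backlog(1000L, 800L, 5))
    // With headroom (700 arriving, 800 capacity) the queue stays at zero.
    println(backlog(700L, 800L, 5))
  }
}
```

Under this model the backlog grows without bound whenever arrivals exceed capacity, which is the point at which executors start running out of memory.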
● “Hey, we forgot to tell you Ops people that we have a major new
client adding stuff into the firehose sometime today. That’s fine,
right?”
Spark Streaming — Bad Things!
● Do you need low latency?
● If so, a 10-minute nap is advisable!
● Everybody else, let’s dive in…
Spark Streaming — Tuning
● Easiest method — alter the batch window until it’s all fine!
● Tiny batches provide tight execution times!
Spark Streaming — Down In The Hole
● Use Kafka.
● Data source with the most love (e.g. exactly-once semantics
without Write Ahead Logs, and receiver-less operation, in Spark 1.3+)
● (other sources get the features…eventually)
Spark Streaming — Tuning
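For reference, the receiver-less ("direct") Kafka stream introduced in Spark 1.3 can be created roughly as below. This sketch assumes the original `spark-streaming-kafka` (Kafka 0.8) integration module; the broker address and the topic name `events` are hypothetical.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch {
  def directStream(ssc: StreamingContext): InputDStream[(String, String)] = {
    // Hypothetical broker list and topic; replace with your own.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("events")

    // No receiver: each batch reads a range of Kafka offsets directly,
    // so the data stays in Kafka until the batch that needs it runs.
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
  }
}
```

Exactly-once results still depend on the output side: writes need to be idempotent or transactional for the guarantee to hold end to end.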
● Use Scala.
● CPython = slower in execution
● PyPy is much faster…but…
● New features always come to Scala first.
Spark Streaming — Tuning
● Spark Streaming = data receivers + Spark
● spark.cores.max = x * number of receivers
● For Great Data Locality and Parallelism!
Spark Streaming — Cores
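The slide does not define x. One reading is sketched below, assuming x is a whole-number multiplier of at least 2: each receiver permanently occupies one core, so a larger multiplier leaves cores free for processing. The helper name `coresMax` is made up for the example.

```scala
object CoresSketch {
  /** spark.cores.max = x * number of receivers.
    * Assumption: x >= 2, because every receiver pins one core and
    * the remaining cores are what actually process the batches. */
  def coresMax(receivers: Int, x: Int): Int = {
    require(receivers >= 1, "need at least one receiver")
    require(x >= 2, "x must leave cores free for processing")
    x * receivers
  }

  def main(args: Array[String]): Unit = {
    // 3 receivers with x = 4 gives 12 cores. The value could then be
    // passed to Spark as: --conf spark.cores.max=12
    println(s"spark.cores.max=${coresMax(3, 4)}")
  }
}
```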
● Are you using a foreachRDD loop?
dstream.foreachRDD { rdd =>
  rdd.cache()      // keep the batch in memory while it is reused
  …
  rdd.unpersist()  // release it once the batch is done
}
Spark Streaming — Caching
● If you are routing to multiple stores or iterating over an RDD multiple
times, using cache() is a quick win
● It really shouldn’t work so well…
Spark Streaming — Caching
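A slightly fuller sketch of the multiple-stores case follows. The two sink functions are hypothetical stand-ins for real writers, and the `try`/`finally` is a choice made here so the batch is released even if a write fails.

```scala
import org.apache.spark.streaming.dstream.DStream

object CacheSketch {
  // Hypothetical sinks: stand-ins for real writers (e.g. a database client).
  def writeToStoreA(records: Iterator[String]): Unit = records.foreach(_ => ())
  def writeToStoreB(records: Iterator[String]): Unit = records.foreach(_ => ())

  def routeToStores(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
      // Without cache(), each action below recomputes the batch from its source.
      rdd.cache()
      try {
        rdd.foreachPartition(writeToStoreA)
        rdd.filter(_.nonEmpty).foreachPartition(writeToStoreB)
      } finally {
        // Free the cached blocks before the next batch arrives.
        rdd.unpersist()
      }
    }
}
```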
● Hurrah for Spark 1.5!
● spark.streaming.backpressure.enabled = true
● Spark dynamically alters incoming data rates (keeping the data in
Kafka rather than in the executors)
● Works for all data sources (for once!)
Spark Streaming — Backpressure
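Enabling backpressure is a one-line setting. The sketch below sets it on a `SparkConf`; the app name and the optional per-partition rate cap of 1000 are illustrative values, not recommendations from the talk.

```scala
import org.apache.spark.SparkConf

object BackpressureSketch {
  def conf(): SparkConf =
    new SparkConf()
      .setAppName("backpressure-sketch")
      // Spark 1.5+: adapt the ingestion rate to how fast batches complete.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional upper bound for the direct Kafka stream, in records per
      // second per partition (1000 is an arbitrary example value).
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
}
```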
● Unbounded stream of tuples
● Tuples are defined via schema (usual base types plus custom
serializers)
Storm — Streams
● Sources of tuples in a topology
● Read from external sources (e.g. Kafka) and emit tuples into the topology
● Can emit multiple streams from a spout!
Storm — Spouts
● Where your processing happens
● Roll your own aggregations / filtering / windowing
● Bolts can feed into other bolts
● Potentially easier to test than Spark Streaming
● Many bolt connectors for external systems (e.g. Cassandra,
Redis, Hive)
Storm — Bolts
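To give a feel for "roll your own filtering", here is a minimal bolt sketch. It assumes the `org.apache.storm` package names used from Storm 1.0 onward (earlier releases used `backtype.storm`); the class name and the field name `line` are hypothetical.

```scala
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

/** Drops blank lines and passes trimmed lines downstream. */
class NonEmptyFilterBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    // Assumes the upstream component declared a field called "line".
    val line = input.getStringByField("line")
    if (line != null && line.trim.nonEmpty) collector.emit(new Values(line.trim))
  }

  // The schema of the tuples this bolt emits.
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("line"))
}
```

`BaseBasicBolt` acknowledges each input tuple automatically after `execute` returns, which keeps a simple bolt like this short.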
● The DAG of the spouts and bolts
● Built programmatically in code and submitted to the Storm
cluster
● Flux - Do It In YAML (and then complain about whitespace)
Storm — Topologies
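Building the DAG in code might look like the sketch below. The component ids (`source`, `filter`, `sink`) are made up, the spout and bolts are passed in rather than defined here, and the `org.apache.storm` package names assume Storm 1.0 or later.

```scala
import org.apache.storm.generated.StormTopology
import org.apache.storm.topology.{IBasicBolt, IRichSpout, TopologyBuilder}

object TopologySketch {
  /** Wires spout -> filter -> sink into a linear DAG. */
  def build(source: IRichSpout, filter: IBasicBolt, sink: IBasicBolt): StormTopology = {
    val builder = new TopologyBuilder
    builder.setSpout("source", source)
    // shuffleGrouping spreads tuples randomly across the consuming bolt's tasks.
    builder.setBolt("filter", filter).shuffleGrouping("source")
    builder.setBolt("sink", sink).shuffleGrouping("filter")
    builder.createTopology()
  }
}
```

The resulting topology would then be handed to the cluster with `StormSubmitter.submitTopology`, along with a name and a `Config`.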
● Each bolt or spout runs 'tasks' across the cluster
● Tasks are how parallelism works in Storm
● Task counts are set when the topology is submitted
Storm — Tasks
● Where the topology runs
● 1 worker = 1 JVM
● Tasks run as threads on a worker
● Storm distributes tasks evenly across the cluster
Storm — Workers
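The parallelism knobs can be sketched as follows. In Storm's own terminology the parallelism hint sets the number of executor threads, `setNumTasks` sets the number of tasks spread over those threads, and `setNumWorkers` sets the number of worker JVMs. The specific numbers and component ids are arbitrary, and the package names assume Storm 1.0 or later.

```scala
import org.apache.storm.Config
import org.apache.storm.topology.{IBasicBolt, IRichSpout, TopologyBuilder}

object ParallelismSketch {
  def configure(builder: TopologyBuilder, source: IRichSpout, bolt: IBasicBolt): Config = {
    // Spout: 2 executor threads (one task each by default).
    builder.setSpout("source", source, Int.box(2))

    // Bolt: 4 executor threads sharing 8 tasks.
    builder.setBolt("filter", bolt, Int.box(4))
      .setNumTasks(Int.box(8))
      .shuffleGrouping("source")

    // 2 worker processes (JVMs) for the whole topology.
    val conf = new Config
    conf.setNumWorkers(2)
    conf
  }
}
```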
● True Streaming
● Tuples processed as they enter topology - low latency
● Scales far beyond Spark Streaming (currently)
Storm — Good Things
● Battle-tested at Twitter & Yahoo!
● Yahoo! has 300-node clusters and is working to support 1000+
nodes
● Single node clocked at over 1.5m tuples / second at Twitter
Storm — Good Things
● Very DIY (bring your own aggregations, ML, etc)
● Your DAG construction may not be optimal
● Operationally more complex (and Storm WebUI is more primitive)
● Where’s Me REPL?
Storm — Bad Things
● For almost anything else? Spark.
● High-level vs. Low-level
● Each new version of Spark delivers improvements!
Spark or Storm?
● Other frameworks that show promise:
○ Flink
○ Apex
○ Samza
○ Heron (Twitter’s not-public Storm replacement)
Other Listing Magazines Are Available