All Things Open - Spark & Storm - Where & When?

www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence/ data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics that operate on the data as it comes in, and design dashboards, feeds or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho). We will also create initial reports and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)

● Lead Consultant on all things DevOps and Spark
● @carsondial on Twitter
Me!

● Quick overview of Spark Streaming
● Reasons why Spark Streaming can be tricky in practice
● Performance and tuning tips we’ve learnt over the past two years
● …and when to pack it all in and use Storm instead
What This Talk Is About

This IS WEB SCALE!

● I kid, Rails!
● (mostly)
Beyond Web Scale

● Spark & Storm - millions of requests / second on commodity
hardware
● Different problems at different scales!
Beyond Web Scale

● Directed Acyclic Graph Data Processing Engine
● Based around the Resilient Distributed Dataset (RDD) primitive
Spark

Spark Streaming — Overview

Spark Streaming — In Production?
● Yes!
● (Alibaba, AutoTrader, Cisco, Netflix, etc.)

● Streaming by running batches very quickly!
● Batch length: can be as low as 0.5s / batch
● Every X seconds, get Y records (DStream/RDDs)
Spark Streaming — Overview
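
To make the "every X seconds, get Y records" idea concrete, here is a minimal plain-Scala sketch (no Spark involved) that groups timestamped records into fixed batch windows. The `Record` type, the `microBatches` helper and the sample timestamps are illustrative assumptions, not Spark APIs.

```scala
// Plain-Scala illustration of micro-batching; none of these names are Spark APIs.
final case class Record(timestampMs: Long, payload: String)

object MicroBatchSketch {
  // Assign each record to the batch window its timestamp falls in,
  // then return the non-empty batches in time order.
  def microBatches(records: Seq[Record], batchIntervalMs: Long): Seq[Seq[Record]] =
    records
      .groupBy(r => r.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Record(100, "a"), Record(400, "b"),   // first 0.5s window
      Record(600, "c"),                     // second 0.5s window
      Record(1200, "d"), Record(1400, "e")  // third 0.5s window
    )
    microBatches(records, batchIntervalMs = 500).zipWithIndex.foreach {
      case (batch, i) => println(s"batch $i: ${batch.map(_.payload).mkString(",")}")
    }
  }
}
```

Unlike this sketch, a real streaming job still runs a batch for a window that received no records; the simplification is only there to keep the example short.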

● Using same implementation (mostly) for batch and stream
processing (Lambda Architecture hipster points ahoy!)
● Access to rest of Spark - Dataframes, MLLib, GraphX, etc.
Spark Streaming — Good Things

● What happens if you can’t process Y records in X seconds?
● What happens if you require sub-second latency?
Spark Streaming — Bad Things!

Spark Streaming — I’m so sorry.

● What happens if you can’t process Y records in X seconds?
● Data builds up in executors
● Executors run out of memory…
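
A back-of-the-envelope sketch of why this ends badly: if each batch takes longer to process than the batch interval, the queue of waiting batches grows for as long as the job runs. The numbers are made up for illustration, and the model is deliberately crude (one batch at a time, no backpressure, no scheduling overhead).

```scala
object BacklogSketch {
  // Rough count of batches still waiting after `elapsedMs`, assuming one batch
  // arrives per interval and batches are processed strictly one after another.
  def pendingBatches(batchIntervalMs: Long, processingMs: Long, elapsedMs: Long): Long = {
    val arrived = elapsedMs / batchIntervalMs
    // Cannot complete more batches than have arrived.
    val completed = math.min(arrived, elapsedMs / processingMs)
    arrived - completed
  }

  def main(args: Array[String]): Unit = {
    // 1s batches that each take 1.5s to process: the backlog keeps growing.
    println(pendingBatches(batchIntervalMs = 1000, processingMs = 1500, elapsedMs = 60000))
    // 1s batches that each take 0.8s to process: the job keeps up.
    println(pendingBatches(batchIntervalMs = 1000, processingMs = 800, elapsedMs = 60000))
  }
}
```

Every pending batch is data sitting in executor memory, which is how a slow job turns into an out-of-memory job.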

● “Hey, we forgot to tell you Ops people that we have a major new
client adding stuff into the firehose sometime today. That’s fine,
right?”

Spark Streaming — It Will Be Okay

● As a former Ops person:
● WE WILL REMEMBER.

● Do you need low-latency?
● If so, a 10-minute nap is advisable!
● Everybody else, let’s dive in…
Spark Streaming — Tuning

Spark Streaming — Down In The Hole

● Easiest method — alter the batch window until it’s all fine!
● Tiny batches provide tight execution times!
Spark Streaming — Down In The Hole

● Use Kafka.
● Data source with the most love (e.g. exactly-once semantics
without Write Ahead Logs, and receiver-less operation, in Spark 1.3+)
● (other sources get the features…eventually)

● Use Scala.
● CPython = slower in execution
● PyPy is much faster…but…
● New features always come to Scala first.

● (or Java if you really must)

● Spark Streaming = data receivers + Spark
● spark.cores.max = x * number of receivers (x > 1: each receiver occupies a core of its own)
● For Great Data Locality and Parallelism!
Spark Streaming — Cores
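
A small sketch of the core-budget arithmetic, in plain Scala so it runs without Spark. The `coresMax` helper and the example values (2 receivers, multiplier of 4) are assumptions for illustration; the right multiplier depends on the workload.

```scala
object CoreBudgetSketch {
  // Each receiver pins one core for the lifetime of the job, so the total
  // must exceed the receiver count or nothing is left for processing.
  def coresMax(receivers: Int, coresPerReceiver: Int): Int = {
    require(receivers > 0, "need at least one receiver")
    require(coresPerReceiver > 1, "a multiplier of 1 leaves no cores for processing")
    receivers * coresPerReceiver
  }

  def main(args: Array[String]): Unit = {
    val cores = coresMax(receivers = 2, coresPerReceiver = 4)
    // Printed in spark-defaults.conf style (key, whitespace, value).
    println(s"spark.cores.max $cores")
  }
}
```

The same value could equally be passed with `--conf` on `spark-submit` or set on a `SparkConf`.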

● Are you using a foreachRDD loop?
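dstream.foreachRDD { rdd =>
  rdd.cache()      // keep this batch in memory across several passes
  …
  rdd.unpersist()  // release it once the batch has been handled
}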
Spark Streaming — Caching

● If routing to multiple stores or iterating over an RDD multiple
times, using cache() is a quick win
● It really shouldn’t work so well…
Spark Streaming — Caching
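
A fuller sketch of the cache-then-unpersist pattern inside `foreachRDD`, assuming a job launched with `spark-submit` and a text source on a local socket. The app name, host, port and the two `count()` passes (stand-ins for writes to two different stores) are all illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1s batches
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      rdd.cache() // the batch is read twice below, so keep it in memory

      val total = rdd.count()                               // first pass
      val errors = rdd.filter(_.contains("ERROR")).count()  // second pass
      println(s"batch: $total lines, $errors errors")

      rdd.unpersist() // free the memory before the next batch arrives
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Without the `cache()` call, each pass would recompute the batch from its source.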

● Hurrah for Spark 1.5!
● spark.streaming.backpressure.enabled = true
● Spark dynamically alters incoming data rates (keeping the data in
Kafka rather than in the executors)
● Works for all data sources (for once!)
Spark Streaming — Backpressure
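
A minimal configuration sketch for switching backpressure on, assuming Spark 1.5 or later and a job launched with `spark-submit`. The app name, the 1-second batch interval and the rate-cap value are illustrative assumptions; the optional cap shown applies to the direct Kafka stream.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BackpressureSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("backpressure-sketch")
      // Let Spark adjust the ingestion rate to what the job can process.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional hard ceiling per Kafka partition (records/second); value is arbitrary.
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")

    val ssc = new StreamingContext(conf, Seconds(1))
    // Define input streams and output operations here, then call
    // ssc.start() and ssc.awaitTermination().
  }
}
```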

● I really need that low-latency response!
Storm

● Directed Acyclic Graph Data Processing Engine
Storm

Spark
“Very Good, Sir”

Storm
“Here you go!”

● Stream of tuples
● Bolts
● Spouts
● Topologies
Storm Concepts

● Unbounded stream of tuples
● Tuples are defined via schema (usual base types plus custom
serializers)
Storm — Streams

● Sources of tuples in a topology
● Read from external sources (e.g. Kafka) and emit tuples
● Can emit multiple streams from a spout!
Storm — Spouts

● Where your processing happens
● Roll your own aggregations / filtering / windowing
● Bolts can feed into other bolts
● Potentially easier to test than Spark Streaming
● Many Bolt connectors for external sources (e.g. Cassandra,
Redis, Hive, etc)
Storm — Bolts

● The DAG of the spouts and bolts
● Built programmatically in code and submitted to the Storm
cluster
● Flux - Do It In YAML (and then complain about whitespace)
Storm — Topologies

● Each bolt or spout runs 'tasks' across the cluster
● How parallelism works in Storm
● Set in topology submission
Storm — Tasks

● Where the topology runs
● 1 worker = 1 JVM
● Tasks run as threads on a worker
● Storm distributes tasks evenly across cluster
Storm — Workers
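
The pieces above (spouts, bolts, topology, tasks, workers) can be sketched in a few lines of Scala against Storm's Java API. This assumes a Storm 1.x or 2.x dependency (`org.apache.storm` packages); `UppercaseBolt`, the component ids and all the parallelism numbers are hypothetical. In Storm's API the parallelism hint sets the number of executor threads for a component, and `setNumTasks` sets how many tasks are spread across them.

```scala
import org.apache.storm.Config
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

// Hypothetical bolt: upper-cases each incoming word and emits it downstream.
class UppercaseBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(input.getString(0).toUpperCase))

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object TopologySketch {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder

    // Spout: Storm's bundled test spout, run by 2 executor threads.
    builder.setSpout("words", new TestWordSpout(), Int.box(2))

    // Bolt: 4 executor threads sharing 8 tasks, fed by the spout.
    builder.setBolt("upper", new UppercaseBolt, Int.box(4))
      .setNumTasks(Int.box(8))
      .shuffleGrouping("words")

    val conf = new Config
    conf.setNumWorkers(2) // 2 worker JVMs across the cluster

    val topology = builder.createTopology()
    // On a real cluster this would be handed to
    // StormSubmitter.submitTopology("uppercase-sketch", conf, topology).
    println("topology built")
  }
}
```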

● True Streaming
● Tuples processed as they enter topology - low latency
● Scales far beyond Spark Streaming (currently)
Storm — Good Things

● Battle-tested at Twitter & Yahoo!
● Yahoo! has 300-node clusters and is working to support 1000+
nodes
● Single node clocked at over 1.5m tuples / second at Twitter
Storm — Good Things

● Very DIY (bring your own aggregations, ML, etc)
● Your DAG construction may not be optimal
● Operationally more complex (and Storm WebUI is more primitive)
● Where’s Me REPL?
Storm — Bad Things

Spark or Storm?

● SLA on latency?
Spark or Storm?

● Storm!
● (though simply because it’s possible doesn’t mean you’ll get it!)
Spark or Storm?

● Insane data needs (e.g. ~100m records/second?)
Spark or Storm?

● Storm!
● (though, again, it’s not a magic bullet!)
Spark or Storm?

● For almost anything else? Spark.
● High-level vs. Low-level
● Each new version of Spark delivers improvements!
Spark or Storm?

● Other frameworks that show promise:
○ Flink
○ Apex
○ Samza
○ Heron (Twitter’s not-public Storm replacement)
Other Listing Magazines Are Available

Questions?
