3. We are in the Era of Big Data
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
Facebook has 36 PB of user data + 80-90 TB/day (6/2010)
CERN’s LHC: 15 PB a year (any day now)
LSST: 6-10 PB a year (~2015)
From http://www.umiacs.umd.edu/~jimmylin/
4. Who are the “Big Data” Practitioners?
Data analysts
Report generation, data mining, ad optimization, …
Computational scientists
Computational biology, economics, journalism, …
Statisticians and machine-learning researchers
Systems researchers, developers, and testers
Distributed systems, networking, security, …
You!
5. Practitioners want a MAD System
Magnetic system
Users want to get fresh new data into the system quickly
Data may be of multiple formats, with missing fields, etc.
Agile system and analytics
Change (data, workload, needs) is constant, make it easy
Complex data gathering & processing pipelines (real-time)
Deep analytics
Sophisticated aggregation/statistical analysis
Users want to use interfaces they are familiar with or the best available: SQL, MapReduce, Java, Python, R, …
6. Hadoop is as MAD as it gets!
Magnetic:
Load data into HDFS as files
Load first, ask questions later
Agile:
Hadoop is extremely malleable: pluggable data formats, storage engines/filesystems, scheduler, instrumentation, …
Not just a querying tool: supports the end-to-end data pipeline
Built for elastic computing: fine-grained scheduler, highly fault tolerant, dynamic node addition and dropping
Deep:
Well integrated with programming languages
MapReduce is a powerful programming model, plus other interfaces (Pig Latin, HiveQL, JAQL) on top
7. MAD + Good Performance
Users want good performance without having to understand and tune system internals
Performance is multidimensional: time, cost, scalability
Learn from the troubled history of database tuning
Tuning a MAD system is highly challenging
Data is opaque until it is accessed
Data loaded/accessed as files (vs. organized DB stores)
MapReduce programs pose different challenges than SQL
Simpler in some ways, more complex in others
Heavy use of programming languages (e.g., Java/Python)
Elasticity is wonderful, but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking)
Terabyte-scale data cycles
8. The Starfish Philosophy
Goal: A high-performance MAD system
Build on Hadoop’s strengths
Hadoop is MAD & has a rapidly growing user base
How can users get good performance automatically?
Without having to understand & tune system internals
Recall: Perf. is multidimensional (time, cost, scalability)
9. Starfish: Self-Tuning System
Our goal: Provide good performance automatically
NOT our goal: Improve Hadoop’s peak performance
[Architecture stack: clients (Java Client, Pig, Hive, Oozie, Elastic MR, …) on top of the analytics system; Starfish sits between the analytics system and Hadoop (MapReduce Execution Engine over a Distributed File System).]
11. Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a MapReduce job
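The map/reduce contract on this slide can be sketched in plain Python, with an in-memory shuffle standing in for Hadoop's distributed one (a toy word-count illustration, not the Hadoop API):

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for every word in an input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts for one key.
    return (word, sum(counts))

def run_job(lines):
    # Shuffle: group map outputs by key (Hadoop does this across the cluster).
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_job(["big data big deal", "big cluster"]))
# → {'big': 3, 'data': 1, 'deal': 1, 'cluster': 1}
```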
13. Lifecycle of a MapReduce Job
[Timeline: input splits feed map tasks (waves 1 and 2), whose output feeds reduce tasks (waves 1 and 2).]
How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
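For the first of these, a simplified sketch of how the number of splits (and hence map tasks) falls out of input size and block size; real Hadoop also honors configurable min/max split sizes:

```python
import math

# Default HDFS block size assumed here: 64 MB, the common default in
# Hadoop releases of this era (newer releases default to 128 MB).
def num_splits(input_bytes, block_bytes=64 * 1024 * 1024):
    # Hadoop creates roughly one input split per HDFS block,
    # and schedules one map task per split.
    return max(1, math.ceil(input_bytes / block_bytes))

# A 1 GB input with 64 MB blocks yields 16 splits, hence 16 map tasks.
print(num_splits(1024 * 1024 * 1024))  # → 16
```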
16. Challenges faced by Practitioners
• Joe Public can now provision a 100-node Hadoop cluster in minutes. Joe may need answers to:
– How many reduce tasks to use in MapReduce job J for getting the best perf. on my 8-node production cluster?
– My current cluster needs more than 6 hours to process 1 day's worth of data. I want to reduce that to under 3 hours.
• Are the MapReduce job workflows running optimally?
• How many and what type of Amazon EC2 nodes to use?
17. [Workflow diagram: datasets Users (username, age, ipaddr), GeoInfo (ipaddr, region), and Clicks (username, url, value) are copied in from Amazon S3 storage. Jobs filter Clicks on value > 0, partition Users by age into <20, ≤25, ≤35, >35, and join the datasets; downstream jobs count users per age <20, users per region with age > 25, clicks per url type (after filtering age > 35 and url of type "Sports"), clicks per region and age, and clicks per age. Outputs I-VI are written back to S3 storage.]
18. Performance Vs. Price Tradeoff
[Bar charts: execution time (sec, 0-12000) and cost ($0.00-$6.00) for clusters of 2, 4, and 6 nodes, across node types m1.small, m1.large, and m1.xlarge on Amazon EC2.]
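The time/cost tradeoff behind these charts can be framed with a small cost model. Prices and runtimes below are made up for illustration (not actual EC2 rates or measurements); EC2 billed per instance-hour at the time, rounding partial hours up:

```python
import math

def job_cost(exec_time_sec, num_nodes, price_per_node_hour):
    # EC2 bills per instance-hour; partial hours round up.
    hours = math.ceil(exec_time_sec / 3600)
    return hours * num_nodes * price_per_node_hour

# Hypothetical numbers: bigger nodes can finish faster yet cost more overall.
small = job_cost(10000, 4, 0.10)  # ~2.8 h on 4 cheap nodes
large = job_cost(4000, 4, 0.40)   # ~1.1 h on 4 pricier nodes
print(round(small, 2), round(large, 2))  # → 1.2 3.2
```

So the fastest configuration is not automatically the cheapest, which is why performance here is treated as multidimensional.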
22. Job Configuration Parameters
Over 190 parameters
Many affect performance in complex ways
Impact depends on Job, Data, and Cluster properties
23. Current Approaches
Rules of thumb
mapred.reduce.tasks = 0.9 * number_of_reduce_slots
io.sort.record.percent = 16 / (16 + average_record_size)
Rules of thumb may not suffice
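The two heuristics above can be wrapped in a tiny calculator; both parameter names are real Hadoop settings, while the cluster numbers below are illustrative:

```python
def rules_of_thumb(num_reduce_slots, avg_record_size_bytes):
    # The two rule-of-thumb formulas from the slide, as Hadoop settings.
    return {
        "mapred.reduce.tasks": int(0.9 * num_reduce_slots),
        "io.sort.record.percent": 16.0 / (16 + avg_record_size_bytes),
    }

# Illustrative cluster: 100 reduce slots, 48-byte average map-output records.
print(rules_of_thumb(100, 48))
# → {'mapred.reduce.tasks': 90, 'io.sort.record.percent': 0.25}
```

Note that neither formula looks at the job's actual behavior, which is exactly why rules of thumb may not suffice.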
24. Just-in-Time Job Optimization
Just-in-Time Optimizer
Searches through the space of parameter settings
Profiler
Collects information about MapReduce job executions
Sampler
Collects statistics about input, intermediate, and output
key-value spaces of MapReduce jobs
What-if Engine
Uses mix of simulation and model-based estimation
Code is ready for release! Demo after the talk
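A toy flavor of what-if estimation, using a hypothetical Amdahl-style model (not Starfish's actual cost models) to answer "what if I change the number of reduce tasks?":

```python
def whatif_runtime(base_time_sec, base_reducers, new_reducers, parallel_frac=0.7):
    # Hypothetical model: only the parallel fraction of the job's
    # measured time shrinks as reduce tasks are added.
    parallel = base_time_sec * parallel_frac  # work that scales with reducers
    serial = base_time_sec - parallel         # fixed overhead (setup, merge, ...)
    return serial + parallel * base_reducers / new_reducers

# "What if I double the reducers on a 1000-second job?"
print(whatif_runtime(1000, base_reducers=8, new_reducers=16))  # → 650.0
```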
25. Job Profiler
Dynamic instrumentation
Monitors phases of MapReduce job execution
Benefits
Zero overhead when turned off
Works with unmodified MapReduce programs
Used to construct a job profile
Concise representation of job execution
Allows for in-depth analysis of job behavior
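One way to picture a "concise representation of job execution": a per-phase record of timings and data sizes. Field names here are made up for this sketch; the actual profile schema is richer:

```python
# Illustrative job profile, as collected by dynamic instrumentation.
profile = {
    "map":     {"time_sec": 120.0, "input_mb": 1024, "output_mb": 512},
    "shuffle": {"time_sec": 45.0,  "moved_mb": 512},
    "reduce":  {"time_sec": 80.0,  "output_mb": 64},
}

def total_time(p):
    # In-depth analysis starts from simple aggregates like this one.
    return sum(phase["time_sec"] for phase in p.values())

print(total_time(profile))  # → 245.0
```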
26. Insights from Job Profiles
WordCount A: few, large spills; combiner gave high data reduction; combiner made Mappers CPU bound
WordCount B: many, small spills; combiner gave smaller data reduction; better resource utilization in Mappers
27. Estimates from the What-if Engine
[Figure: the true response surface vs. the surface estimated by the What-if Engine.]
29. MapReduce Job Workflows
Producer-Consumer relationships among jobs
Data layout crucial for later jobs in the workflow
Avoid unbalanced data layouts
Make effective use of parallelism
30. Workflow-Aware Optimizer
Goal: Optimize overall performance of workflow
Select best data layout + job parameters
Overall approach is the same as the job-level optimizer's, but with a larger space of options
We hope to support Amazon Elastic MapReduce in the near future (summer project for a Duke undergrad)
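Selecting a data layout plus job parameters can be pictured as a search over a configuration space. A grid-search sketch with a made-up cost model (the real optimizer must handle far larger spaces than exhaustive enumeration allows):

```python
import itertools

def best_config(candidates, estimate):
    # Grid search: score every (layout, reducers) pair with a cost
    # estimator and keep the cheapest.
    return min(candidates, key=estimate)

layouts = ["unpartitioned", "partition_by_age"]
reducers = [8, 16, 32]

def estimate(cfg):
    # Made-up cost model: partitioned layouts help downstream joins;
    # more reducers add per-task overhead.
    layout, r = cfg
    base = 1000.0 if layout == "unpartitioned" else 700.0
    return base / r + 5.0 * r

print(best_config(list(itertools.product(layouts, reducers)), estimate))
# → ('partition_by_age', 16)
```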
34. Starfish: Self-Tuning System
Focus simultaneously on
Different workload granularities
Workload
Workflows
Jobs (procedural & declarative)
Across various decision points
Provisioning
Optimization
Scheduling
Data layout
We welcome your collaboration!