2. CONTENTS
• Introduction to Big data
• Hadoop
• Tuning problems
• Starfish Architecture
• Usage of Starfish
• Conclusion
3. INTRODUCTION TO BIG DATA
Big data is the term for data sets so large and complex
that they become difficult to process using traditional data
management tools or processing applications.
What are the tools of Big data?
Features of Big data Analytics
4. BIG DATA PRACTITIONERS
• Data analysts
Report generation, data mining, ad optimization
• Computational scientists
Computational biology, economics, journalism
• Statisticians and machine-learning researchers
• Systems researchers, developers, and testers
Distributed systems, networking, security, …
5. Practitioners want a MAD system: HADOOP
Hadoop is as MAD as it gets!
• Magnetism: “attracts” or welcomes all sources of data,
regardless of structure, values, etc.
• Agility: adaptive; remains in sync with rapid data
evolution and modification
• Depth: more than just typical analytics; supports
complex operations like statistical analysis
and machine learning
6. MADDER
• Data-lifecycle awareness: do more than just queries;
optimize the movement, storage, and processing of big data
• Elasticity: dynamically adjust resource usage
to match user requirements
• Robustness: provide storage and querying services
even in the event of some failures
7. Tuning Challenges
• Heavy use of programming languages for
MapReduce programs
• Data loaded/accessed as opaque files
• Large space of tuning choices
• Elasticity is wonderful, but hard to achieve
• Terabyte-scale data cycles.
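To give a feel for the "large space of tuning choices", here is a minimal sketch that enumerates every combination of a handful of settings. The parameter names are classic Hadoop 0.20-era properties; the candidate values are illustrative assumptions, not recommendations, and the helper `enumerate_configs` is hypothetical.

```python
from itertools import product

# Candidate values per parameter (values are made up for illustration).
candidate_values = {
    "io.sort.mb": [50, 100, 200, 400],
    "io.sort.factor": [10, 50, 100],
    "mapred.reduce.tasks": [1, 8, 32, 128],
    "mapred.compress.map.output": [False, True],
}


def enumerate_configs(space):
    """Yield one dict per point in the Cartesian product of the candidate values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(enumerate_configs(candidate_values))
print(len(configs))  # 4 * 3 * 4 * 2 combinations
```

Even four parameters with a few values each give 96 configurations; every extra parameter multiplies that count, which is why trying each setting by hand does not scale.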
9. Starfish’s Core Approach to Tuning
• Profiler: collects concise summaries of execution
• What-if Engine: estimates impact of hypothetical changes
on execution
• Optimizers: search through space of tuning choices
Job, workflow, workload, data layout, cluster
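The profile, what-if, optimize loop can be sketched as a toy. Everything here is an assumption made for illustration: the `JobProfile` fields, the cost formula in `what_if`, and the exhaustive search in `optimize` are stand-ins, not Starfish's actual models or interfaces.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class JobProfile:
    """Toy stand-in for a profiler summary of one job run (all fields hypothetical)."""
    input_mb: float          # total input size
    map_output_ratio: float  # MB of map output per MB of input
    map_s_per_mb: float      # map-side cost, seconds per input MB
    reduce_s_per_mb: float   # reduce-side cost, seconds per shuffled MB
    reduce_slots: int        # reduce tasks the cluster can run at once


def what_if(profile, config):
    """Estimate run time in seconds for the profiled job under a hypothetical config."""
    map_time = profile.input_mb * profile.map_s_per_mb
    shuffle_mb = profile.input_mb * profile.map_output_ratio
    reducers = config["reduce_tasks"]
    waves = math.ceil(reducers / profile.reduce_slots)  # rounds of reduce tasks
    reduce_time = waves * (shuffle_mb / reducers) * profile.reduce_s_per_mb
    return map_time + reduce_time + 2.0 * waves         # 2 s fixed overhead per wave


def optimize(profile, space):
    """Search the whole space and return the config with the lowest estimate."""
    names = list(space)
    candidates = (dict(zip(names, values))
                  for values in product(*(space[n] for n in names)))
    return min(candidates, key=lambda c: what_if(profile, c))


profile = JobProfile(input_mb=1000, map_output_ratio=0.5,
                     map_s_per_mb=0.01, reduce_s_per_mb=0.1, reduce_slots=8)
best = optimize(profile, {"reduce_tasks": [1, 4, 8, 16, 64]})
print(best, what_if(profile, best))
```

The point of the split is that the job runs once to produce a profile; after that, each candidate setting is evaluated by the estimator rather than by rerunning the job.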
10. THE STARFISH PHILOSOPHY
• Goal: A high-performance MAD system
• Build on Hadoop’s strengths
• How can users get good performance
automatically?
12. VISUALIZE WITH STARFISH
• See how MapReduce apps are working
• Understand bottlenecks in Hadoop
• Find misconfigured Hadoop parameters
• Learn to develop MapReduce apps
13. OPTIMIZE WITH STARFISH
• Tune Hadoop easily
• Find optimal parameter settings for
MapReduce applications
14. STRATEGIZE WITH STARFISH
• Make intelligent resource allocation choices for
Hadoop.
• Find suitable instances for workloads.
• Meet time and cost budgets with ease.
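As a rough illustration of choosing resources under time and cost budgets, the sketch below picks the cheapest cluster size that meets both limits. The scaling model (perfect speed-up plus a fixed 0.1 hour overhead), the prices, and the function `plan_cluster` are all assumptions for the example, not how Starfish estimates anything.

```python
def plan_cluster(work_node_hours, price_per_node_hour, deadline_hours,
                 budget, max_nodes=64):
    """Return (nodes, hours, cost) for the cheapest feasible size, or None."""
    feasible = []
    for nodes in range(1, max_nodes + 1):
        # Toy model: work divides evenly across nodes, plus fixed startup time.
        hours = work_node_hours / nodes + 0.1
        cost = nodes * hours * price_per_node_hour
        if hours <= deadline_hours and cost <= budget:
            feasible.append((cost, nodes, hours))
    if not feasible:
        return None
    cost, nodes, hours = min(feasible)  # cheapest first, ties broken by fewer nodes
    return nodes, hours, cost


plan = plan_cluster(work_node_hours=40, price_per_node_hour=0.5,
                    deadline_hours=2.0, budget=30.0)
print(plan)
```

Under this toy model, adding nodes shortens the run but adds a little cost per node, so the cheapest feasible plan is the smallest cluster that still meets the deadline.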
16. Contd…
• First step: collect the profiling data from your
Hadoop cluster.
• Second step: import the profiling data into the profile
store.
• Third step: fire up the graphical or command-line
interface to invoke the visualize, optimize, and strategize
features.
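The three steps above can be mimicked with a toy in-memory store. This is only a sketch of the data flow: the `ProfileStore` class, the JSON fields, and the job names are hypothetical and do not reflect Starfish's real profile format or interfaces.

```python
import json
from collections import defaultdict


class ProfileStore:
    """Toy in-memory profile store keyed by job id (hypothetical)."""

    def __init__(self):
        self._profiles = defaultdict(list)

    def import_profile(self, raw_json):
        """Parse one collected profile and file it under its job id."""
        record = json.loads(raw_json)
        self._profiles[record["job_id"]].append(record)
        return record["job_id"]

    def latest(self, job_id):
        """Return the most recently imported profile for a job."""
        return self._profiles[job_id][-1]


# Step 1: profiling data as it might arrive from a cluster (made-up fields).
collected = [
    json.dumps({"job_id": "wordcount", "map_tasks": 40, "runtime_s": 310}),
    json.dumps({"job_id": "wordcount", "map_tasks": 40, "runtime_s": 295}),
    json.dumps({"job_id": "join", "map_tasks": 120, "runtime_s": 1240}),
]

# Step 2: import everything into the profile store.
store = ProfileStore()
for raw in collected:
    store.import_profile(raw)

# Step 3: a front end would read from the store to visualize or optimize.
print(store.latest("wordcount")["runtime_s"])
```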
17. CONCLUSION
Hadoop is now a viable competitor to existing
systems for big data analytics.
Starfish fills a different void by enabling Hadoop
users and applications to get good performance
automatically throughout the data lifecycle in analytics.
18. REFERENCES
• Herodotou, Herodotos, et al. "Starfish: A self-tuning
system for big data analytics." Proc. of the Fifth CIDR
Conf. 2011.
• Dong, Fei. Extending Starfish to Support the Growing
Hadoop Ecosystem. Diss. Duke University, 2012.
• Herodotou, Herodotos, Fei Dong, and Shivnath Babu.
"MapReduce programming and cost-based
optimization? Crossing this chasm with Starfish."
Proceedings of the VLDB Endowment 4.12 (2011).
• http://www.cs.duke.edu/starfish/
• http://www.youtube.com/watch?v=Upxe2dzE1uk
Editor's Notes
Profiler
Collect summaries of jobs
Collect information on a task basis
What-if Engine
Answers questions after the Profiler is run
Optimizers
Enumerate & Search through decision space to satisfy the requirements.