Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, Shivnath Babu
Duke University

Outline
• Why Starfish?
• What should Starfish be able to do?
• What can Starfish do so far?

We are in the Era of Big Data
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• Facebook has 36 PB of user data + 80-90 TB/day (6/2010)
• CERN’s LHC: 15 PB a year (any day now)
• LSST: 6-10 PB a year (~2015)
From http://www.umiacs.umd.edu/~jimmylin/

Who are the “Big Data” Practitioners?
• Data analysts
  – Report generation, data mining, ad optimization, …
• Computational scientists
  – Computational biology, economics, journalism, …
• Statisticians and machine-learning researchers
• Systems researchers, developers, and testers
  – Distributed systems, networking, security, …
• You!

Practitioners want a MAD System
• Magnetic system
  – Users want to get fresh new data into the system quickly
  – Data may be of multiple formats, with missing fields, etc.
• Agile system and analytics
  – Change (data, workload, needs) is constant, make it easy
  – Complex data gathering & processing pipelines (real-time)
• Deep analytics
  – Sophisticated aggregation/statistical analysis
  – Users want to use interfaces they are familiar with or the best available: SQL, MapReduce, Java, Python, R, …

Hadoop is as MAD as it gets!
• Magnetic:
  – Load data into HDFS as files
  – Load first, ask questions later
• Agile:
  – Hadoop is extremely malleable: pluggable data formats, storage engines/filesystems, scheduler, instrumentation, …
  – Not just a querying tool: supports the end-to-end data pipeline
  – Built for elastic computing: fine-grained scheduler, highly fault tolerant, dynamic node addition and dropping
• Deep:
  – Well integrated with programming languages
  – MapReduce is a powerful programming model, plus other interfaces (Pig Latin, HiveQL, JAQL) on top

MAD + Good Performance
• Users want good performance, without having to understand and tune system internals
  – Performance is multidimensional: time, cost, scalability
  – Learn from the troubled history of database tuning
• Tuning a MAD system is highly challenging
  – Data is opaque until it is accessed
  – Data loaded/accessed as files (vs. organized DB stores)
  – MapReduce programs pose different challenges than SQL
      • Simpler in some ways, more complex in others
      • Heavy use of programming languages (e.g., Java/Python)
  – Elasticity is wonderful, but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking)
  – Terabyte-scale data cycles

The Starfish Philosophy
• Goal: A high-performance MAD system
• Build on Hadoop’s strengths
  – Hadoop is MAD & has a rapidly growing user base
• How can users get good performance automatically?
  – Without having to understand & tune system internals
  – Recall: Perf. is multidimensional (time, cost, scalability)

Starfish: Self-Tuning System
• Our goal: Provide good performance automatically
• NOT our goal: Improve Hadoop’s peak performance
[Figure: software stack, top to bottom. Clients: Java Client, Pig, Hive, Oozie, Elastic MR, …. Analytics System: Starfish and Hadoop. Hadoop: MapReduce Execution Engine over Distributed File System.]

Outline
• Why Starfish?
• What should Starfish be able to do?
• What can Starfish do so far?

Lifecycle of a MapReduce Job
[Figure: a program made up of a map function and a reduce function. Run this program as a MapReduce job.]

Lifecycle of a MapReduce Job
[Figure: job timeline. Input splits feed map wave 1 and map wave 2, followed by reduce wave 1 and reduce wave 2.]
How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?

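The waves in the timeline above follow from simple arithmetic once a few settings are fixed. The sketch below is a simplification and not Hadoop's actual logic: it assumes one map task per input split and a fixed number of task slots per node, whereas the real split count also depends on the input format, block size, and file boundaries. The function names and the example cluster are illustrative assumptions, not taken from the talk.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def map_task_waves(input_bytes, split_bytes, nodes, map_slots_per_node):
    """Return (number of map tasks, number of map waves).

    Assumes one map task per input split and that each wave fills all
    available map slots before the next wave starts.
    """
    num_tasks = ceil_div(input_bytes, split_bytes)
    total_slots = nodes * map_slots_per_node
    return num_tasks, ceil_div(num_tasks, total_slots)


def reduce_task_waves(num_reduce_tasks, nodes, reduce_slots_per_node):
    """Return the number of reduce waves for a chosen reduce task count."""
    return ceil_div(num_reduce_tasks, nodes * reduce_slots_per_node)


MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical job: 5 GB of input, 64 MB splits, 8 nodes with 5 map slots each.
tasks, waves = map_task_waves(5 * GB, 64 * MB, nodes=8, map_slots_per_node=5)
print(tasks, waves)  # 80 2
```

Changing the split size or the slot count changes the number of waves, which is one reason these settings interact with running time.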
Job Configuration Parameters
• 190+ parameters in Hadoop
• Set manually or defaults are used
  – Rules-of-thumb
MapReduce Job Tuning in Hadoop
    2-dim Projection of a 13-dim Surface
Challenges faced by Practitioners
• Joe Public can now provision a 100-node Hadoop cluster
  in minutes. Joe may need answers to:
   – How many reduce tasks to use in MapReduce job J for getting
     the best perf. on my 8-node production cluster?
   – My current cluster needs more than 6 hours to process 1 day’s
     worth of data. Want to reduce that to under 3 hours.
       • Are the MapReduce job workflows running optimally?
       • How many and what type of Amazon EC2 nodes to use?
[Figure: example workload over three datasets held in Amazon S3 storage: Users (username, age, ipaddr), GeoInfo (ipaddr, region), and Clicks (username, url, value). The datasets are copied out of S3 and flow through six branches labeled I to VI. Operations shown include: partition by age into <20, ≤25, ≤35, >35; joins; filters (value > 0, age > 35, url is “Sports” type); and counts (users per age <20, users per region with age > 25, clicks per region and age, clicks per url type, clicks per age). S3 storage is shown at both the top and the bottom of the diagram.]
Performance Vs. Price Tradeoff
[Figure: two bar charts over node type on Amazon EC2 (m1.small, m1.large, m1.xlarge), each with series for 2, 4, and 6 nodes. Left chart: execution time (sec), axis from 0 to 12000. Right chart: cost ($), axis from $0.00 to $6.00.]

Outline
• Why Starfish?
• What should Starfish be able to do?
• What can Starfish do so far?

Starfish Architecture
• Workload-level tuning: Workload Optimizer, Elastisizer
• Workflow-level tuning: Workflow-aware Optimizer, What-if Engine
• Job-level tuning: Just-in-Time Optimizer, Profiler, Sampler
• Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.

Job Configuration Parameters
• Over 190 parameters
• Many affect performance in complex ways
• Impact depends on Job, Data, and Cluster properties

Current Approaches
• Rules of thumb
  – mapred.reduce.tasks = 0.9 * number_of_reduce_slots
  – io.sort.record.percent = 16 / (16 + average_record_size)
• Rules of thumb may not suffice

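As a worked example, the two rules of thumb quoted on this slide can be evaluated directly. This is a minimal sketch: the function name, the rounding of the reduce task count down to a whole number, and the assumption that the average record size is measured in bytes are ours, not stated on the slide.

```python
def rule_of_thumb_settings(reduce_slots, avg_record_size_bytes):
    """Apply the two rules of thumb quoted on the slide.

    reduce_slots: total number of reduce slots in the cluster.
    avg_record_size_bytes: average record size (assumed to be in bytes).
    """
    return {
        # 0.9 * slots, rounded down to a whole task count (rounding is an assumption).
        "mapred.reduce.tasks": max(1, (9 * reduce_slots) // 10),
        "io.sort.record.percent": 16 / (16 + avg_record_size_bytes),
    }


# Hypothetical cluster with 20 reduce slots and 84-byte records.
settings = rule_of_thumb_settings(reduce_slots=20, avg_record_size_bytes=84)
print(settings)
```

Both values depend only on the slot count and the record size, so the rules cannot react to other job, data, or cluster properties.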
Just-in-Time Job Optimization
• Just-in-Time Optimizer
  – Searches through the space of parameter settings
• Profiler
  – Collects information about MapReduce job executions
• Sampler
  – Collects statistics about input, intermediate, and output key-value spaces of MapReduce jobs
• What-if Engine
  – Uses mix of simulation and model-based estimation
• Code is ready for release! Demo after the talk

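The division of labor between an optimizer that searches parameter settings and a what-if engine that estimates their cost can be sketched as follows. This is not Starfish's algorithm: the slides do not say which search strategy is used, so plain random search stands in for it, and `toy_what_if` is a made-up cost formula standing in for the real What-if Engine. The parameter names are Hadoop settings; their ranges are illustrative assumptions.

```python
import random

# Illustrative search space (ranges are assumptions, not recommendations).
SPACE = {
    "mapred.reduce.tasks": range(1, 65),
    "io.sort.mb": range(50, 401, 50),
    "io.sort.record.percent": [0.05, 0.10, 0.15, 0.20, 0.30],
}


def toy_what_if(config):
    """Stand-in cost model: returns a made-up runtime estimate in seconds."""
    r = config["mapred.reduce.tasks"]
    return (600 / r + 4 * r
            + 20000 / config["io.sort.mb"]
            + 100 * abs(config["io.sort.record.percent"] - 0.15))


def search(what_if, space, trials=200, seed=0):
    """Random search: sample settings and keep the cheapest estimate."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(trials):
        config = {name: rng.choice(list(values)) for name, values in space.items()}
        cost = what_if(config)  # estimate only; the job is never actually run
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


best_config, best_cost = search(toy_what_if, SPACE)
print(best_config, round(best_cost, 1))
```

The point of the structure is that each candidate setting is evaluated by the estimator rather than by running the job, so many settings can be compared cheaply.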
Job Profiler
• Dynamic instrumentation
  – Monitors phases of MapReduce job execution
• Benefits
  – Zero overhead when turned off
  – Works with unmodified MapReduce programs
• Used to construct a job profile
  – Concise representation of job execution
  – Allows for in-depth analysis of job behavior

Insights from Job Profiles
WordCount A
• Few, large spills
• Combiner gave high data reduction
• Combiner made Mappers CPU bound
WordCount B
• Many, small spills
• Combiner gave smaller data reduction
• Better resource utilization in Mappers

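To make the comparison concrete, here is a minimal sketch of how spill behavior and combiner data reduction could be derived from a few profile counters. The `MapSideProfile` class, its fields, and all numbers are hypothetical: they are chosen only to mirror the qualitative contrast on the slide and are not measurements or the actual Starfish profile format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapSideProfile:
    """Hypothetical, much-reduced job profile for the map side of a job."""
    num_spills: int
    spilled_bytes: int
    combine_input_records: int
    combine_output_records: int

    @property
    def avg_spill_bytes(self) -> float:
        return self.spilled_bytes / self.num_spills

    @property
    def combiner_reduction(self) -> float:
        """Fraction of records removed by the combiner (0 = none, 1 = all)."""
        return 1 - self.combine_output_records / self.combine_input_records


# Made-up numbers, for illustration only.
wordcount_a = MapSideProfile(num_spills=3, spilled_bytes=300_000_000,
                             combine_input_records=10_000_000,
                             combine_output_records=500_000)
wordcount_b = MapSideProfile(num_spills=60, spilled_bytes=300_000_000,
                             combine_input_records=10_000_000,
                             combine_output_records=4_000_000)

for name, p in (("A", wordcount_a), ("B", wordcount_b)):
    print(f"WordCount {name}: {p.num_spills} spills, "
          f"{p.avg_spill_bytes / 1e6:.0f} MB per spill, "
          f"combiner removed {p.combiner_reduction:.0%} of records")
```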
Estimates from the What-if Engine
[Figure: two surface plots side by side, the true surface and the surface estimated by the What-if Engine.]

MapReduce Job Workflows
• Producer-Consumer relationships among jobs
• Data layout crucial for later jobs in the workflow
  – Avoid unbalanced data layouts
  – Make effective use of parallelism

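One simple way to quantify an unbalanced layout is the ratio of the largest partition to the mean partition size. The sketch below is an illustration, not a metric taken from the talk: it assumes a consuming job processes one partition per task, so the largest partition bounds how long that job's slowest task runs. The function name and partition sizes are hypothetical.

```python
def layout_skew(partition_bytes):
    """Ratio of the largest partition to the mean size (1.0 means balanced)."""
    mean = sum(partition_bytes) / len(partition_bytes)
    return max(partition_bytes) / mean


# Two hypothetical layouts holding the same total amount of data.
balanced = [100, 100, 100, 100]
unbalanced = [10, 10, 10, 370]

print(layout_skew(balanced))
print(layout_skew(unbalanced))
```

Under the one-partition-per-task assumption, a ratio well above 1.0 suggests that most tasks of the consuming job would finish early and sit idle while one task works through the oversized partition.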
Workflow-Aware Optimizer
• Goal: Optimize overall performance of workflow
  – Select best data layout + job parameters
• Overall approach is same as job-level optimizer, but larger space of options
• We hope to support Amazon Elastic MapReduce in the near future (summer project for a Duke undergrad)

Elastisizer: Hadoop Provisioning
• Goal: Make provisioning decisions based on workload requirements (e.g., completion time, cost)
[Figure: two bar charts over node type on Amazon EC2 (m1.small, m1.large, m1.xlarge), each with series for 2, 4, and 6 nodes. Left chart: execution time (sec), axis from 0 to 12000. Right chart: cost ($), axis from $0.00 to $6.00.]

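A provisioning decision of this kind can be sketched as picking the cheapest cluster option whose estimated running time meets a deadline. Everything specific here is an assumption for illustration: the `Option` type, the runtime estimates, and the hourly prices are made up (they are not actual EC2 prices or results from the talk), and cost is modeled as nodes × billed hours × hourly price with partial hours rounded up.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Option(NamedTuple):
    node_type: str
    nodes: int
    est_runtime_sec: float      # estimated running time (hypothetical)
    price_per_node_hour: float  # placeholder price, not a real EC2 rate


def cost(option: Option) -> float:
    """Cost in dollars, assuming each started hour is billed in full."""
    billed_hours = math.ceil(option.est_runtime_sec / 3600)
    return option.nodes * billed_hours * option.price_per_node_hour


def cheapest_within(options: Sequence[Option], deadline_sec: float) -> Optional[Option]:
    """Return the cheapest option that meets the deadline, or None."""
    feasible = [o for o in options if o.est_runtime_sec <= deadline_sec]
    return min(feasible, key=cost) if feasible else None


options = [
    Option("m1.small", 2, 11000, 0.10),
    Option("m1.large", 4, 3000, 0.40),
    Option("m1.xlarge", 6, 1500, 0.80),
]

choice = cheapest_within(options, deadline_sec=3600)
print(choice.node_type, choice.nodes, round(cost(choice), 2))
```

Loosening the deadline can change the answer: with these placeholder numbers, a slower but cheaper cluster becomes feasible once the deadline exceeds its estimated running time.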
Optimizing Hadoop Workloads
• Data-flow sharing
• Materialization
• Reorganization
[Figure: software stack, top to bottom. Clients: Java Client, Pig, Hive, Oozie, Elastic MR, …. Analytics System: Starfish and Hadoop. Hadoop: MapReduce Execution Engine over Distributed File System.]

Starfish: Self-Tuning System
Focus simultaneously on
• Different workload granularities
  – Workload
  – Workflows
  – Jobs (procedural & declarative)
• Across various decision points
  – Provisioning
  – Optimization
  – Scheduling
  – Data layout
We welcome your collaboration!

More Related Content

What's hot

Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataRichard McDougall
 
Performance Management in ‘Big Data’ Applications
Performance Management in ‘Big Data’ ApplicationsPerformance Management in ‘Big Data’ Applications
Performance Management in ‘Big Data’ ApplicationsMichael Kopp
 
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHortonworks
 
Windows Azure Uzerinden Alinabilen Hizmetler
Windows Azure Uzerinden Alinabilen HizmetlerWindows Azure Uzerinden Alinabilen Hizmetler
Windows Azure Uzerinden Alinabilen HizmetlerMustafa
 
Achieving Predictability with Agile - Doing Scrum in a complex multi-discipli...
Achieving Predictability with Agile - Doing Scrum in a complex multi-discipli...Achieving Predictability with Agile - Doing Scrum in a complex multi-discipli...
Achieving Predictability with Agile - Doing Scrum in a complex multi-discipli...AgileSparks
 

What's hot (8)

Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big Data
 
Performance Management in ‘Big Data’ Applications
Performance Management in ‘Big Data’ ApplicationsPerformance Management in ‘Big Data’ Applications
Performance Management in ‘Big Data’ Applications
 
Netmagic Cloud Computing Services
Netmagic Cloud Computing ServicesNetmagic Cloud Computing Services
Netmagic Cloud Computing Services
 
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
 
10c introduction
10c introduction10c introduction
10c introduction
 
Windows Azure Uzerinden Alinabilen Hizmetler
Windows Azure Uzerinden Alinabilen HizmetlerWindows Azure Uzerinden Alinabilen Hizmetler
Windows Azure Uzerinden Alinabilen Hizmetler
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
Achieving Predictability with Agile - Doing Scrum in a complex multi-discipli...
Achieving Predictability with Agile - Doing Scrum in a complex multi-discipli...Achieving Predictability with Agile - Doing Scrum in a complex multi-discipli...
Achieving Predictability with Agile - Doing Scrum in a complex multi-discipli...
 

Viewers also liked

Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Shivkumar Babshetty
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveDataWorks Summit
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache TezGal Vinograd
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveDataWorks Summit
 
Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataDataWorks Summit
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationYahoo Developer Network
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performanceDataWorks Summit
 

Viewers also liked (19)

Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache Tez
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 
Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 

Similar to Duke Researchers Develop Starfish System for Automatic Big Data Optimization

DAT101 Understanding AWS Database Options - AWS re: Invent 2012
DAT101 Understanding AWS Database Options - AWS re: Invent 2012DAT101 Understanding AWS Database Options - AWS re: Invent 2012
DAT101 Understanding AWS Database Options - AWS re: Invent 2012Amazon Web Services
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSAmazon Web Services
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the DatabusAmy W. Tang
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftLee Stott
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Nathan Bijnens
 
The Yin and Yang of Software
The Yin and Yang of SoftwareThe Yin and Yang of Software
The Yin and Yang of Softwareelliando dias
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseHBaseCon
 
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...Nagios
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascadingDataiku
 
AWS Presentation at JasperWorld APAC
AWS Presentation at JasperWorld APACAWS Presentation at JasperWorld APAC
AWS Presentation at JasperWorld APACAmazon Web Services
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
TIBCO Advanced Analytics Meetup (TAAM) - June 2015
TIBCO Advanced Analytics Meetup (TAAM) - June 2015TIBCO Advanced Analytics Meetup (TAAM) - June 2015
TIBCO Advanced Analytics Meetup (TAAM) - June 2015Bipin Singh
 
PostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolPostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolEDB
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADXRiccardo Zamana
 
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...Julien SIMON
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataDenny Lee
 

Similar to Duke Researchers Develop Starfish System for Automatic Big Data Optimization (20)

DAT101 Understanding AWS Database Options - AWS re: Invent 2012
DAT101 Understanding AWS Database Options - AWS re: Invent 2012DAT101 Understanding AWS Database Options - AWS re: Invent 2012
DAT101 Understanding AWS Database Options - AWS re: Invent 2012
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWS
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the Databus
 
The Cloud Changing the Game
The Cloud Changing the GameThe Cloud Changing the Game
The Cloud Changing the Game
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop Microsoft
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013
 
The Yin and Yang of Software
The Yin and Yang of SoftwareThe Yin and Yang of Software
The Yin and Yang of Software
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascading
 
AWS Presentation at JasperWorld APAC
AWS Presentation at JasperWorld APACAWS Presentation at JasperWorld APAC
AWS Presentation at JasperWorld APAC
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Yoda fifth elephant
Yoda fifth elephantYoda fifth elephant
Yoda fifth elephant
 
TIBCO Advanced Analytics Meetup (TAAM) - June 2015
TIBCO Advanced Analytics Meetup (TAAM) - June 2015TIBCO Advanced Analytics Meetup (TAAM) - June 2015
TIBCO Advanced Analytics Meetup (TAAM) - June 2015
 
PostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolPostgreSQL as a Strategic Tool
PostgreSQL as a Strategic Tool
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADX
 
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big Data
 

Duke Researchers Develop Starfish System for Automatic Big Data Optimization

  • 1. Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, Shivnath Babu Duke University
  • 2. Outline
      • Why Starfish?
      • What should Starfish be able to do?
      • What can Starfish do so far?
  • 3. We are in the Era of Big Data
      • Google processes 20 PB a day (2008)
      • Wayback Machine has 3 PB + 100 TB/month (3/2009)
      • eBay has 6.5 PB of user data + 50 TB/day (5/2009)
      • Facebook has 36 PB of user data + 80-90 TB/day (6/2010)
      • CERN’s LHC: 15 PB a year (any day now)
      • LSST: 6-10 PB a year (~2015)
      From http://www.umiacs.umd.edu/~jimmylin/
  • 4. Who are the “Big Data” Practitioners?
      • Data analysts
          – Report generation, data mining, ad optimization, …
      • Computational scientists
          – Computational biology, economics, journalism, …
      • Statisticians and machine-learning researchers
      • Systems researchers, developers, and testers
          – Distributed systems, networking, security, …
      • You!
  • 5. Practitioners want a MAD System
      • Magnetic system
          – Users want to get fresh new data into the system quickly
          – Data may be of multiple formats, with missing fields, etc.
      • Agile system and analytics
          – Change (data, workload, needs) is constant, make it easy
          – Complex data gathering & processing pipelines (real-time)
      • Deep analytics
          – Sophisticated aggregation/statistical analysis
          – Users want to use interfaces they are familiar with or the best available: SQL, MapReduce, Java, Python, R, …
  • 6. Hadoop is as MAD as it gets!
      • Magnetic:
          – Load data into HDFS as files
          – Load first, ask questions later
      • Agile:
          – Hadoop is extremely malleable: pluggable data formats, storage engines/filesystems, scheduler, instrumentation, …
          – Not just a querying tool: supports the end-to-end data pipeline
          – Built for elastic computing: fine-grained scheduler, highly fault tolerant, dynamic node addition and dropping
      • Deep:
          – Well integrated with programming languages
          – MapReduce is a powerful programming model, plus other interfaces (Pig Latin, HiveQL, JAQL) on top
  • 7. MAD + Good Performance
      • Users want good performance, without having to understand and tune system internals
      • Performance is multidimensional: time, cost, scalability
      • Learn from the troubled history of database tuning
      • Tuning a MAD system is highly challenging
      • Data is opaque until it is accessed
      • Data loaded/accessed as files (vs. organized DB stores)
      • MapReduce programs pose different challenges than SQL
      • Simpler in some ways, more complex in others
      • Heavy use of programming languages (e.g., Java/Python)
      • Elasticity is wonderful, but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking)
      • Terabyte-scale data cycles
  • 8. The Starfish Philosophy
      • Goal: A high-performance MAD system
      • Build on Hadoop’s strengths
      • Hadoop is MAD & has a rapidly growing user base
      • How can users get good performance automatically?
      • Without having to understand & tune system internals
      • Recall: Perf. is multidimensional (time, cost, scalability)
  • 9. Starfish: Self-Tuning System
      • Our goal: Provide good performance automatically
      • NOT our goal: Improve Hadoop’s peak performance
      [Figure: system stack diagram with the labels Java Client, Pig, Hive, Oozie, Elastic MR, …; Analytics System; Starfish; Hadoop; MapReduce Execution Engine; Distributed File System]
  • 10. Outline
      • Why Starfish?
      • What should Starfish be able to do?
      • What can Starfish do so far?
  • 11. Lifecycle of a MapReduce Job
      [Figure: program with callouts “Map function” and “Reduce function”]
      Run this program as a MapReduce job
  • 12. Lifecycle of a MapReduce Job
      [Figure: program with callouts “Map function” and “Reduce function”]
      Run this program as a MapReduce job
  • 13. Lifecycle of a MapReduce Job
      [Figure: timeline showing input splits, map waves 1 and 2, and reduce waves 1 and 2]
      How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
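The "waves" on slide 13 arise when a job has more tasks than the cluster has task slots, so the tasks run in successive rounds. A minimal sketch of that arithmetic follows; the function name `num_waves` is hypothetical, and the model assumes every slot runs exactly one task per wave, which ignores stragglers and speculative execution.

```python
import math


def num_waves(num_tasks: int, num_slots: int) -> int:
    """Estimate how many waves are needed to run num_tasks on num_slots.

    Simplifying assumption (not from the slides): each slot runs exactly
    one task per wave, and all tasks in a wave finish together.
    """
    if num_tasks < 0:
        raise ValueError("num_tasks must be non-negative")
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    return math.ceil(num_tasks / num_slots)


if __name__ == "__main__":
    # 20 map tasks on 10 map slots run in two waves, as in the figure.
    print(num_waves(20, 10))
```

Under this model, adding one task beyond a multiple of the slot count costs a whole extra wave, which is one reason task counts matter for completion time.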
  • 14. Job Configuration Parameters
      • 190+ parameters in Hadoop
      • Set manually or defaults are used
          – Rules-of-thumb
  • 15. MapReduce Job Tuning in Hadoop
      [Figure: 2-dim projection of a 13-dim surface]
  • 16. Challenges faced by Practitioners
      • Joe Public can now provision a 100-node Hadoop cluster in minutes. Joe may need answers to:
          – How many reduce tasks to use in MapReduce job J for getting the best perf. on my 8-node production cluster?
          – My current cluster needs more than 6 hours to process 1 day’s worth of data. Want to reduce that to under 3 hours.
      • Are the MapReduce job workflows running optimally?
      • How many and what type of Amazon EC2 nodes to use?
  • 17. [Figure: example workflow of MapReduce jobs. Inputs in Amazon S3 storage: Users (username, age, ipaddr), GeoInfo (ipaddr, region), Clicks (username, url, value). Operators shown: Copy; Filter value > 0; Partition by age into <20, ≤25, ≤35, >35; Join; Count users per age <20; Filter age >35; Filter url is “Sports” type; Count users per region with age > 25; Count clicks per url type; Count clicks per region, age; Count clicks per age. Results I to VI are written to S3 storage.]
  • 18. Performance vs. Price Tradeoff
      [Figure: two bar charts for clusters of 2, 4, and 6 nodes. X-axis in both: node type on Amazon EC2 (m1.small, m1.large, m1.xlarge). Y-axes: execution time (sec) and cost ($).]
  • 19. Outline
      • Why Starfish?
      • What should Starfish be able to do?
      • What can Starfish do so far?
  • 20. Starfish Architecture
      • Workload-level tuning: Workload Optimizer, Elastisizer
      • Workflow-level tuning: Workflow-aware Optimizer, What-if Engine
      • Job-level tuning: Just-in-Time Optimizer, Profiler, Sampler
      • Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.
  • 21. Starfish Architecture
      • Workload-level tuning: Workload Optimizer, Elastisizer
      • Workflow-level tuning: Workflow-aware Optimizer, What-if Engine
      • Job-level tuning: Just-in-Time Optimizer, Profiler, Sampler
      • Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.
  • 22. Job Configuration Parameters
      • Over 190 parameters
      • Many affect performance in complex ways
      • Impact depends on Job, Data, and Cluster properties
  • 23. Current Approaches
      • Rules of thumb
          – mapred.reduce.tasks = 0.9 * number_of_reduce_slots
          – io.sort.record.percent = 16 / (16 + average_record_size)
      • Rules of thumb may not suffice
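The two rules of thumb on slide 23 are plain arithmetic, so a short sketch may help make them concrete. The function names below are hypothetical, rounding the reduce-task count down (with a floor of one task) is an assumption, and `average_record_size` is assumed to be measured in bytes; the slide does not state either detail.

```python
def reduce_tasks_rule(number_of_reduce_slots: int) -> int:
    """mapred.reduce.tasks = 0.9 * number_of_reduce_slots.

    Assumption: round down, but never return fewer than one task.
    Integer arithmetic avoids floating-point surprises near whole numbers.
    """
    if number_of_reduce_slots <= 0:
        raise ValueError("number_of_reduce_slots must be positive")
    return max(1, (9 * number_of_reduce_slots) // 10)


def sort_record_percent_rule(average_record_size: float) -> float:
    """io.sort.record.percent = 16 / (16 + average_record_size)."""
    if average_record_size < 0:
        raise ValueError("average_record_size must be non-negative")
    return 16 / (16 + average_record_size)


if __name__ == "__main__":
    # Illustrative inputs only; they are not measurements from the talk.
    print(reduce_tasks_rule(20))
    print(sort_record_percent_rule(84.0))
```

Both rules look only at one cluster or data property each, which is consistent with the slide's point that they may not suffice when the impact depends on job, data, and cluster properties together.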
  • 24. Just-in-Time Job Optimization
      • Just-in-Time Optimizer
          – Searches through the space of parameter settings
      • Profiler
          – Collects information about MapReduce job executions
      • Sampler
          – Collects statistics about input, intermediate, and output key-value spaces of MapReduce jobs
      • What-if Engine
          – Uses mix of simulation and model-based estimation
      Code is ready for release! Demo after the talk
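Slide 24 describes an optimizer that searches the space of parameter settings and a What-if Engine that estimates the outcome of a setting. The slides do not say which search strategy Starfish uses, so the sketch below substitutes a plain exhaustive grid search. The names `grid_search` and `toy_cost`, the candidate values, and the cost formula are all hypothetical stand-ins, not Starfish's API or model.

```python
import itertools
from typing import Callable, Dict, List, Optional, Tuple

Setting = Dict[str, float]


def grid_search(
    space: Dict[str, List[float]],
    estimate_cost: Callable[[Setting], float],
) -> Tuple[Optional[Setting], float]:
    """Try every combination in `space` and keep the cheapest estimate.

    `estimate_cost` plays the role of a what-if estimator: it predicts the
    cost of a setting without running the job.
    """
    names = sorted(space)
    best: Optional[Setting] = None
    best_cost = float("inf")
    for values in itertools.product(*(space[name] for name in names)):
        setting = dict(zip(names, values))
        cost = estimate_cost(setting)
        if cost < best_cost:
            best, best_cost = setting, cost
    return best, best_cost


def toy_cost(setting: Setting) -> float:
    """Made-up cost surface used only to exercise the search."""
    return (abs(setting["mapred.reduce.tasks"] - 18)
            + 100 * abs(setting["io.sort.record.percent"] - 0.15))


if __name__ == "__main__":
    space = {
        "mapred.reduce.tasks": [4, 9, 18, 36],
        "io.sort.record.percent": [0.05, 0.15, 0.25],
    }
    print(grid_search(space, toy_cost))
```

Exhaustive search grows multiplicatively with each added parameter, so with the 190+ parameters mentioned earlier in the deck a real optimizer would need a smarter strategy than this one.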
  • 25. Job Profiler
      • Dynamic instrumentation
          – Monitors phases of MapReduce job execution
      • Benefits
          – Zero overhead when turned off
          – Works with unmodified MapReduce programs
      • Used to construct a job profile
          – Concise representation of job execution
          – Allows for in-depth analysis of job behavior
  • 26. Insights from Job Profiles
      • WordCount A
          – Few, large spills
          – Combiner gave high data reduction
          – Combiner made Mappers CPU bound
      • WordCount B
          – Many, small spills
          – Combiner gave smaller data reduction
          – Better resource utilization in Mappers
  • 27. Estimates from the What-if Engine
      [Figure: true surface compared with the surface estimated by the What-if Engine]
  • 28. Starfish Architecture
      • Workload-level tuning: Workload Optimizer, Elastisizer
      • Workflow-level tuning: Workflow-aware Optimizer, What-if Engine
      • Job-level tuning: Just-in-Time Optimizer, Profiler, Sampler
      • Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.
  • 29. MapReduce Job Workflows
      • Producer-Consumer relationships among jobs
      • Data layout crucial for later jobs in the workflow
      • Avoid unbalanced data layouts
      • Make effective use of parallelism
  • 30. Workflow-Aware Optimizer
      • Goal: Optimize overall performance of workflow
      • Select best data layout + job parameters
      • Overall approach is same as job-level optimizer, but larger space of options
      • We hope to support Amazon Elastic MapReduce in the near future (summer project for Duke undergrad)
  • 31. Starfish Architecture
      • Workload-level tuning: Workload Optimizer, Elastisizer
      • Workflow-level tuning: Workflow-aware Optimizer, What-if Engine
      • Job-level tuning: Just-in-Time Optimizer, Profiler, Sampler
      • Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.
  • 32. Elastisizer – Hadoop Provisioning
      • Goal: Make provisioning decisions based on workload requirements (e.g., completion time, cost)
      [Figure: two bar charts for clusters of 2, 4, and 6 nodes. X-axis in both: node type on Amazon EC2 (m1.small, m1.large, m1.xlarge). Y-axes: execution time (sec) and cost ($).]
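The provisioning goal on slide 32 can be read as a constrained choice: among candidate clusters, pick the cheapest one whose estimated completion time meets the requirement. The sketch below illustrates that reading only. The `Candidate` class, the function name, and every number in the example are hypothetical; they are not values read from the charts, and the slides do not describe how the Elastisizer actually decides.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One cluster option with estimated time and cost for a workload."""
    node_type: str
    num_nodes: int
    est_time_sec: float
    est_cost_usd: float


def cheapest_within_deadline(
    candidates: Iterable[Candidate], deadline_sec: float
) -> Optional[Candidate]:
    """Return the lowest-cost candidate that meets the deadline, or None."""
    feasible = [c for c in candidates if c.est_time_sec <= deadline_sec]
    return min(feasible, key=lambda c: c.est_cost_usd, default=None)


if __name__ == "__main__":
    # Invented estimates for illustration; not data from the talk.
    options = [
        Candidate("m1.small", 6, 9000.0, 1.50),
        Candidate("m1.large", 4, 4000.0, 2.40),
        Candidate("m1.xlarge", 2, 3500.0, 3.20),
    ]
    print(cheapest_within_deadline(options, deadline_sec=5000.0))
```

Swapping the roles of the two fields (fastest option within a budget) covers the other requirement the slide names, cost.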
  • 33. Optimizing Hadoop Workloads
      • Data-flow sharing
      • Materialization
      • Reorganization
      [Figure: system stack diagram with the labels Java Client, Pig, Hive, Oozie, Elastic MR, …; Analytics System; Starfish; Hadoop; MapReduce Execution Engine; Distributed File System]
  • 34. Starfish: Self-Tuning System
      Focus simultaneously on
      • Different workload granularities
          – Workload
          – Workflows
          – Jobs (procedural & declarative)
      • Across various decision points
          – Provisioning
          – Optimization
          – Scheduling
          – Data layout
      We welcome your collaboration!