Hadoop Overview & Architecture

Milind Bhandarkar
Chief Scientist, Machine Learning Platforms,
Greenplum, A Division of EMC
(Twitter: @techmilind)
About Me	

•    http://www.linkedin.com/in/milindb	


•    Founding member of Hadoop team at Yahoo! [2005-2010]	


•    Contributor to Apache Hadoop since v0.1	


•    Built and led Grid Solutions Team at Yahoo! [2007-2010]	


•    Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)	


•    Center for Development of Advanced Computing (C-DAC), National Center
     for Supercomputing Applications (NCSA), Center for Simulation of Advanced
     Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn,
     and EMC-Greenplum
Agenda	

• Motivation	

• Hadoop	

 • Map-Reduce	

 • Distributed File System	

 • Hadoop Architecture	

 • Next Generation MapReduce	

• Q & A	

Hadoop At Scale
      (Some Statistics)	

• 40,000 + machines in 20+ clusters	

• Largest cluster is 4,000 machines	

• 170 Petabytes of storage	

• 1000+ users	

• 1,000,000+ jobs/month	

BEHIND
EVERY CLICK
Hadoop Workflow
Who Uses Hadoop ?
Why Hadoop ?	



Big Datasets
(Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
Cost Per Gigabyte                             
(http://www.mkomo.com/cost-per-gigabyte)
Storage Trends
(Graph by Adam Leventhal, ACM Queue, Dec 2009)
Motivating Examples	



Yahoo! Search Assist
Search Assist	

• Insight: Related concepts appear close
  together in text corpus	

• Input: Web pages	

 • 1 Billion Pages, 10K bytes each	

 • 10 TB of input data	

• Output: List(word, List(related words))	

Search Assist	

// Input: List(URL, Text)
foreach URL in Input :
    Tokens = Tokenize(Text(URL));
    foreach word in Tokens :
        Insert (word, Next(word, Tokens)) in Pairs;
        Insert (word, Previous(word, Tokens)) in Pairs;
// Result: Pairs = List (word, RelatedWord)

Group Pairs by word;
// Result: GroupedPairs = List (word, List(RelatedWords))

foreach word in GroupedPairs :
    Count RelatedWords in GroupedPairs;
// Result: CountedPairs = List (word, List(RelatedWords, count))

foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    Choose Top 5 Pairs;
// Result: List (word, Top5(RelatedWords))
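The pseudocode above can be sketched on a single machine in Python. This is only an illustration: the whitespace tokenizer, the lowercasing, and the name `related_words` are assumptions, not part of the original slides.

```python
from collections import Counter, defaultdict

def related_words(pages, top_n=5):
    """pages: iterable of (url, text). Returns {word: [most frequent neighbours]}."""
    counts = defaultdict(Counter)
    for _url, text in pages:
        tokens = text.lower().split()  # naive tokenizer (assumption)
        for i, word in enumerate(tokens):
            if i + 1 < len(tokens):
                counts[word][tokens[i + 1]] += 1  # Next(word, Tokens)
            if i > 0:
                counts[word][tokens[i - 1]] += 1  # Previous(word, Tokens)
    # Sort each word's neighbours by count and keep the top N.
    return {w: [r for r, _ in c.most_common(top_n)] for w, c in counts.items()}

pages = [("u1", "hadoop mapreduce hadoop mapreduce hadoop dfs")]
top = related_words(pages)
```

Here `top["hadoop"]` lists `mapreduce` ahead of `dfs`, since `mapreduce` neighbours `hadoop` more often.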

People You May Know
People You May Know	

• Insight: You might also know Joe Smith if a
  lot of folks you know, know Joe Smith	

 • if you don't know Joe Smith already

• Numbers:	

 • 100 MM users	

 • Average connections per user is 100	

People You May Know	

	
	
// Input: List(UserName, List(Connections))

foreach u in UserList : // 100 MM
    foreach x in Connections(u) : // 100
        foreach y in Connections(x) : // 100
            if (y not in Connections(u)) :
                Count(u, y)++; // 1 Trillion Iterations
    Sort (u,y) in descending order of Count(u,y);
    Choose Top 3 y;
    Store (u, {y0, y1, y2}) for serving;
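A minimal in-memory sketch of the same nested loops, assuming the graph fits in a Python dict of sets. The function name and the extra `y != u` check (so that a user is never suggested to themselves) are additions for illustration and do not appear on the slide.

```python
from collections import Counter

def people_you_may_know(connections, top_n=3):
    """connections: {user: set of users}. Returns {user: [suggested users]}."""
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:
            for y in connections.get(x, ()):
                # Skip u itself (added check) and people u already knows.
                if y != u and y not in friends:
                    counts[y] += 1
        suggestions[u] = [y for y, _ in counts.most_common(top_n)]
    return suggestions

graph = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"}, "d": {"b", "c"}}
pymk = people_you_may_know(graph)
```

In this toy graph, `a` and `d` share two connections, so each is suggested to the other.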
	
	

Performance	

• 101 Random accesses for each user	

 • Assume 1 ms per random access	

 • 100 ms per user	

• 100 MM users	

 • 100 days on a single machine	
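The estimate can be checked with a few lines of arithmetic. The reading of "101 random accesses" as one lookup for the user plus one per connection is an assumption; the slide rounds the per-user cost to 100 ms and the total to 100 days.

```python
USERS = 100_000_000
ACCESSES_PER_USER = 101        # assumed: 1 lookup for the user + 100 for connections
SECONDS_PER_ACCESS = 0.001     # 1 ms per random access

seconds_per_user = ACCESSES_PER_USER * SECONDS_PER_ACCESS
total_days = USERS * seconds_per_user / 86_400   # 86,400 seconds per day
print(f"{seconds_per_user * 1000:.0f} ms per user, {total_days:.0f} days total")
```

Without rounding, the total comes to roughly 117 days on a single machine, the same order of magnitude as the slide's figure.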

MapReduce Paradigm	



Map & Reduce

• Primitives in Lisp (& other functional languages), 1970s

• Google paper, 2004

 • http://labs.google.com/papers/mapreduce.html



Map	


Output_List = Map (Input_List)	




Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) =
(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)




Reduce	


Output_Element = Reduce (Input_List)	




Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
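The two primitives map directly onto Python's built-in `map` and `functools.reduce`; the snippet below reproduces the Square and Sum examples.

```python
from functools import reduce
from operator import add

numbers = range(1, 11)

# Map: apply a function to every element independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the whole list into a single element.
total = reduce(add, squares)
```

`squares` holds the ten squares and `total` is 385, matching the figures above.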




Parallelism	

• Map is inherently parallel	

 • Each list element processed independently	

• Reduce is inherently sequential	

 • Unless processing multiple lists	

• Grouping to produce multiple lists	

Search Assist Map	

// Input: http://hadoop.apache.org  	

	
Pairs = Tokenize_And_Pair ( Text ( Input ) )	




	
	
// Example	
	
	



Search Assist Reduce	

// Input: GroupedList (word, GroupedList(words))	
	
CountedPairs = CountOccurrences (word, RelatedWords)	




Output = {	
(hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming,
4) (hadoop, mapreduce, 9) ...	
}	



Issues with Large Data	


• Map Parallelism: Chunking input data	

• Reduce Parallelism: Grouping related data	

• Dealing with failures & load imbalance



Apache Hadoop	


• January 2006: Subproject of Lucene	

• January 2008: Top-level Apache project	

• Stable Version: 1.0.3	

• Latest Version: 2.0.0 (Alpha)	


Apache Hadoop	

• Reliable, Performant Distributed file system	

• MapReduce Programming framework	

• Ecosystem: HBase, Hive, Pig, Howl, Oozie,
  Zookeeper, Chukwa, Mahout, Cascading,
  Scribe, Cassandra, Hypertable, Voldemort,
  Azkaban, Sqoop, Flume, Avro ...	



Problem: Bandwidth to Data

• Scan 100TB Datasets on 1000 node cluster	

 • Remote storage @ 10MB/s = 165 mins	

 • Local storage @ 50-200MB/s = 33-8 mins	

• Moving computation is more efficient than
  moving data	

  • Need visibility into data placement	
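The scan times follow from dividing the dataset evenly across nodes. A small sketch, assuming decimal units (1 TB = 10^6 MB) and a perfectly even split; `scan_minutes` is a name made up for this example.

```python
def scan_minutes(dataset_tb, nodes, mb_per_sec):
    """Minutes for every node to scan its equal share of the dataset."""
    mb_per_node = dataset_tb * 1_000_000 / nodes   # decimal units (assumption)
    return mb_per_node / mb_per_sec / 60

remote = scan_minutes(100, 1000, 10)       # remote storage at 10 MB/s
local_slow = scan_minutes(100, 1000, 50)   # local disk at 50 MB/s
local_fast = scan_minutes(100, 1000, 200)  # local disk at 200 MB/s
```

Each node scans 100 GB, which gives about 167, 33, and 8 minutes respectively, in line with the rounded numbers on the slide.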

Problem: Scaling Reliably	

• Failure is not an option, it's a rule!

 • 1000 nodes, MTBF < 1 day

 • 4000 disks, 8000 cores, 25 switches, 1000
    NICs, 2000 DIMMs (16TB RAM)

• Need fault tolerant store with reasonable
  availability guarantees	

  • Handle hardware faults transparently	

Hadoop Goals	

• Scalable: Petabytes (10^15 bytes) of data on
  thousands of nodes

• Economical: Commodity components only	

• Reliable	

 • Engineering reliability into every application
    is expensive	



Hadoop MapReduce	



Think MapReduce	


• Record = (Key, Value)	

• Key : Comparable, Serializable	

• Value: Serializable	

• Input, Map, Shuffle, Reduce, Output	


Seems Familiar ?	

	
	
	
	
	
cat /var/log/auth.log* | \
grep "session opened" | cut -d' ' -f10 | \
sort | \
uniq -c > \
~/userlist
	
	
	
	
	
	

Map	


• Input: (Key1, Value1)

• Output: List(Key2, Value2)

• Projections, Filtering, Transformation	



Shuffle	


• Input: List(Key2, Value2)

• Output

 • Sort(Partition(List(Key2, List(Value2))))

• Provided by Hadoop	


Reduce	


• Input: List(Key2, List(Value2))

• Output: List(Key3, Value3)

• Aggregation	
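The Map, Shuffle, and Reduce signatures can be tied together in a toy single-process driver. This is a sketch, not Hadoop's implementation: `run_mapreduce` and the sample log lines are invented for illustration, and partitioning by a hash of the key imitates what Hadoop's default partitioner does.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer, num_partitions=2):
    # Map: (Key1, Value1) -> List(Key2, Value2)
    mapped = [kv for key, value in records for kv in mapper(key, value)]
    # Shuffle: partition by key, then group values per key
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for k2, v2 in mapped:
        partitions[hash(k2) % num_partitions][k2].append(v2)
    # Reduce: (Key2, List(Value2)) -> List(Key3, Value3), keys sorted per partition
    output = []
    for part in partitions:
        for k2 in sorted(part):
            output.extend(reducer(k2, part[k2]))
    return output

def login_mapper(offset, line):
    # Filtering + projection: keep only logins and emit the user name.
    if "session opened" in line:
        yield line.split()[-1], 1

def login_reducer(user, counts):
    # Aggregation: total logins per user.
    yield user, sum(counts)

log = [(0, "session opened for alice"), (1, "session closed for alice"),
       (2, "session opened for bob"), (3, "session opened for alice")]
logins = sorted(run_mapreduce(log, login_mapper, login_reducer))
```

The result is sorted at the end because output order across partitions depends on the hash values, just as separate reducers in a real job write separate output files.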



Hadoop Streaming	

• Hadoop is written in Java	

 • Java MapReduce code is native 	

• What about Non-Java Programmers ?	

 • Perl, Python, Shell, R	

 • grep, sed, awk, uniq as Mappers/Reducers	

• Text Input and Output	

Hadoop Streaming	

• Thin Java wrapper for Map & Reduce Tasks

• Forks actual Mapper & Reducer

• IPC via stdin, stdout, stderr

• Key.toString() \t Value.toString() \n

• Slower than Java programs	

 • Allows for quick prototyping / debugging	

Hadoop Streaming	

	
	
$ bin/hadoop jar hadoop-streaming.jar \
      -input in-files -output out-dir \
      -mapper mapper.sh -reducer reducer.sh

# mapper.sh
sed -e 's/ /\n/g' | grep .

# reducer.sh
uniq -c | awk '{print $2 "\t" $1}'
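The same word count can be written as Python functions that follow the streaming convention of tab-separated key and value, one record per line. This is a sketch under assumptions: unlike the shell mapper above, this mapper emits an explicit count of 1, and the function names are illustrative. In a real streaming job each function would read `sys.stdin` and print its results.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word; input must arrive sorted by key, as after the shuffle."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation: sorted() stands in for Hadoop's shuffle and sort.
mapped = sorted(mapper(["to be or not", "to be"]))
counts = list(reducer(mapped))
```

The reducer relies on all records for a key being adjacent, which is why the simulation sorts the mapper output first.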
	
	

Hadoop Distributed File System (HDFS)



HDFS	

• Data is organized into files and directories	

• Files are divided into uniform sized blocks
  (default 128MB) and distributed across
  cluster nodes	

• HDFS exposes block placement so that
  computation can be migrated to data	



HDFS	

• Blocks are replicated (default 3) to handle
  hardware failure	

• Replication for performance and fault
  tolerance (Rack-Aware placement)	

• HDFS keeps checksums of data for
  corruption detection and recovery	
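A rough sketch of what block size and replication imply for one file, assuming binary units and the defaults quoted on these slides. `hdfs_footprint` is a hypothetical helper, not an HDFS API.

```python
import math

def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = math.ceil(file_bytes / block_bytes)   # the last block may be partial
    return blocks, blocks * replication, file_bytes * replication

# Example: a 40 GB file (binary gigabytes assumed).
blocks, replicas, raw_bytes = hdfs_footprint(40 * 1024**3)
```

A 40 GB file splits into 320 blocks, stored as 960 block replicas that occupy 120 GB of raw disk across the cluster.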



HDFS	


• Master-Worker Architecture	

• Single NameNode	

• Many (Thousands) DataNodes	



HDFS Master (NameNode)

• Manages filesystem namespace	

• File metadata (i.e. inode)

• Mapping inode to list of blocks + locations

• Authorization & Authentication

• Checkpoint & journal namespace changes

Namenode	

• Mapping of datanode to list of blocks	

• Monitor datanode health	

• Replicate missing blocks	

• Keeps ALL namespace in memory	

• 60M objects (File/Block) in 16GB	
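The last figure implies a per-object memory budget, worked out below. The extrapolation to a file count assumes one namespace object per file plus one per block, which is a simplification made for this example.

```python
HEAP_BYTES = 16 * 1024**3      # 16 GB heap, binary units assumed
OBJECTS = 60_000_000           # files + blocks held in memory

bytes_per_object = HEAP_BYTES / OBJECTS

def max_files(objects, blocks_per_file):
    """Files that fit if each file costs 1 file object + its block objects."""
    return objects // (1 + blocks_per_file)

single_block_files = max_files(OBJECTS, 1)
```

That is roughly 286 bytes per object, and about 30 million single-block files under this simple model, which is why many small files put pressure on the NameNode.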

Datanodes	

• Handle block storage on multiple volumes &
  block integrity

• Clients access the blocks directly from data
  nodes	

• Periodically send heartbeats and block
  reports to Namenode	

• Blocks are stored as the underlying OS's files

HDFS Architecture
Example: Unigrams	


• Input: Huge text corpus	

 • Wikipedia Articles (40GB uncompressed)	

• Output: List of words sorted in descending
  order of frequency
Unigrams	


$ cat ~/wikipedia.txt | \
sed -e 's/ /\n/g' | grep . | \
sort | \
uniq -c > \
~/frequencies.txt

# The two plain `cat` stages below are identity steps.
$ cat ~/frequencies.txt | \
cat | \
sort -n -k1,1 -r | \
cat > \
~/unigrams.txt
MR for Unigrams	


mapper (filename, file-contents):	
    	for each word in file-contents:	
    	    	emit (word, 1)	
	
reducer (word, values):	
    	sum = 0	
    	for each value in values:	
    	    	sum = sum + value	
    	emit (word, sum)
MR for Unigrams	



mapper (word, frequency):	
    	emit (frequency, word)	
	
reducer (frequency, words):	
    	for each word in words:	
    	    	emit (word, frequency)
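Both jobs can be simulated in memory by chaining a small helper. This is a sketch: `run_job` is a stand-in for the framework, and `reverse=True` plays the role of a descending key comparator, which is an assumption because the pseudocode does not say how the descending order is obtained.

```python
from collections import defaultdict

def run_job(records, mapper, reducer, reverse=False):
    """Map every record, group values by key, reduce keys in sorted order."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in sorted(groups, reverse=reverse):
        out.extend(reducer(k, groups[k]))
    return out

def count_mapper(filename, contents):
    for word in contents.split():
        yield word, 1

def count_reducer(word, values):
    yield word, sum(values)

def swap_mapper(word, frequency):
    yield frequency, word

def swap_reducer(frequency, words):
    for word in words:
        yield word, frequency

corpus = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
frequencies = run_job(corpus, count_mapper, count_reducer)
unigrams = run_job(frequencies, swap_mapper, swap_reducer, reverse=True)
```

The first job counts words; the second uses the frequency as the key so that the framework's sort produces the final ordering.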
Unigrams: Java Mapper	

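import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Emits (word, 1) for every whitespace-separated token in the input line.
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {

        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            Text word = new Text(itr.nextToken());
            output.collect(word, new IntWritable(1));
        }
    }
}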
Unigrams: Java Reducer	


// Sums the counts collected for each word.
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {

        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Unigrams: Driver	


public void run(String inputPath, String outputPath) throws Exception
{
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // Declare the job's output types; the framework cannot infer them.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));
    JobClient.runJob(conf);
}
Configuration	

•  Unified Mechanism for	

  •  Configuring Daemons	

  •  Runtime environment for Jobs/Tasks	

•  Defaults: *-default.xml	

•  Site-Specific: *-site.xml	

•  final parameters
Example	

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>head.server.node.com:9001</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://head.server.node.com:9000</value>
    </property>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
        <final>true</final>
    </property>
    ....
</configuration>
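The layering of defaults, site settings, and final parameters can be modelled as follows. This is a simplified sketch, not Hadoop's `Configuration` class: `load_layer`, the three XML snippets, and their values are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def load_layer(xml_text, merged, finals):
    """Apply one configuration file on top of the values loaded so far."""
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name in finals:
            continue  # an earlier layer marked this parameter final
        merged[name] = prop.findtext("value")
        if prop.findtext("final", default="false").strip() == "true":
            finals.add(name)
    return merged

def prop(name, value, final=False):
    flag = "<final>true</final>" if final else ""
    return f"<property><name>{name}</name><value>{value}</value>{flag}</property>"

# Illustrative layers: defaults, then site-specific, then a user's job settings.
DEFAULTS = f"<configuration>{prop('mapred.child.java.opts', '-Xmx200m')}{prop('mapred.reduce.tasks', '1')}</configuration>"
SITE = f"<configuration>{prop('mapred.child.java.opts', '-Xmx512m', final=True)}</configuration>"
JOB = f"<configuration>{prop('mapred.child.java.opts', '-Xmx2g')}{prop('mapred.reduce.tasks', '8')}</configuration>"

merged, finals = {}, set()
for layer in (DEFAULTS, SITE, JOB):
    load_layer(layer, merged, finals)
```

The job's attempt to raise the heap size is ignored because the site file marked it final, while the ordinary `mapred.reduce.tasks` override goes through.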
Running Hadoop Jobs
Running a Job	

[milindb@gateway ~]$ hadoop jar \
$HADOOP_HOME/hadoop-examples.jar wordcount \
/data/newsarchive/20080923 /tmp/newsout
input.FileInputFormat: Total input paths to process : 4
mapred.JobClient: Running job: job_200904270516_5709	
mapred.JobClient: map 0% reduce 0%	
mapred.JobClient: map 3% reduce 0%	
mapred.JobClient: map 7% reduce 0%	
....	
mapred.JobClient: map 100% reduce 21%	
mapred.JobClient: map 100% reduce 31%	
mapred.JobClient: map 100% reduce 33%	
mapred.JobClient: map 100% reduce 66%	
mapred.JobClient: map 100% reduce 100%	
mapred.JobClient: Job complete: job_200904270516_5709
Running a Job	


mapred.JobClient: Counters: 18	
mapred.JobClient:   Job Counters	
mapred.JobClient:     Launched reduce tasks=1	
mapred.JobClient:     Rack-local map tasks=10	
mapred.JobClient:     Launched map tasks=25	
mapred.JobClient:     Data-local map tasks=1	
mapred.JobClient:   FileSystemCounters	
mapred.JobClient:     FILE_BYTES_READ=491145085	
mapred.JobClient:     HDFS_BYTES_READ=3068106537	
mapred.JobClient:     FILE_BYTES_WRITTEN=724733409	
mapred.JobClient:     HDFS_BYTES_WRITTEN=377464307
Running a Job	


mapred.JobClient:   Map-Reduce Framework	
mapred.JobClient:     Combine output records=73828180	
mapred.JobClient:     Map input records=36079096	
mapred.JobClient:     Reduce shuffle bytes=233587524	
mapred.JobClient:     Spilled Records=78177976	
mapred.JobClient:     Map output bytes=4278663275	
mapred.JobClient:     Combine input records=371084796	
mapred.JobClient:     Map output records=313041519	
mapred.JobClient:     Reduce input records=15784903
JobTracker WebUI	

	
	
	
	
	
	
	
JobTracker Status
Jobs Status
Job Details
Job Counters
Job Progress
All Tasks
Task Details
Task Counters
Task Logs
MapReduce Dataflow
MapReduce
Job Submission
Initialization
Scheduling
Execution
Map Task
Sort Buffer
Reduce Task
Next Generation MapReduce



MapReduce Today	

    (Courtesy: Arun Murthy, Hortonworks)
Why ?	

• Scalability Limitations today	

 • Maximum cluster size: 4000 nodes	

 • Maximum Concurrent tasks: 40,000	

• Job Tracker SPOF	

• Fixed map and reduce containers (slots)	

 • Punishes pleasantly parallel apps	

Why ? (contd)	

• MapReduce is not suitable for every
  application	

• Fine-Grained Iterative applications	

 • HaLoop: Hadoop in a Loop	

• Message passing applications	

 • Graph Processing	

Requirements	


• Need scalable cluster resources manager	

• Separate scheduling from resource
  management	

• Multi-Lingual Communication Protocols	


Bottom Line	

• @techmilind #mrng (MapReduce, Next
  Gen) is in reality, #rmng (Resource Manager,
  Next Gen)	

• Expect different programming paradigms to
  be implemented	

 • Including MPI (soon)	

Architecture	

  (Courtesy: Arun Murthy, Hortonworks)
The New World	

•  Resource Manager	

  •  Allocates resources (containers) to applications	

•  Node Manager	

  •  Manages containers on nodes	

•  Application Master	

  •  Specific to paradigm e.g. MapReduce application master,
      MPI application master etc	



Container	


• In current terminology: A Task Slot	

• Slice of the node's hardware resources

• # of cores, virtual memory, disk size, disk and
  network bandwidth etc.

  • Currently, only memory usage is sliced	
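To make the idea of memory-only containers concrete, here is a toy model. It is not the actual resource manager API or scheduling policy: the class names, the first-fit placement, and the sample node sizes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_mb: int                     # only memory is tracked, as on the slide
    containers: list = field(default_factory=list)

class ToyResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, memory_mb):
        """Grant a container on the first node with enough free memory, else None."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append((app, memory_mb))
                return node.name
        return None

rm = ToyResourceManager([Node("n1", 4096), Node("n2", 2048)])
first = rm.allocate("mr-job", 3072)    # fits on n1
second = rm.allocate("mpi-job", 2048)  # n1 has 1024 MB left, so n2
third = rm.allocate("mr-job", 2048)    # no node has room
```

Unlike fixed map and reduce slots, a container here is just a memory reservation, so the same pool serves a MapReduce job and an MPI job alike.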



Mais conteúdo relacionado

Mais procurados

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Simplilearn
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Simplilearn
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 

Mais procurados (20)

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Cloudera Hadoop Distribution
Cloudera Hadoop DistributionCloudera Hadoop Distribution
Cloudera Hadoop Distribution
 

Destaque

Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...DataWorks Summit
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview Hortonworks
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data miningharithavijay94
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityDataWorks Summit
 
Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Peter Wood
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersDataWorks Summit
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXAbhishek Mallick
 

Destaque (20)

Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data mining
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOX
 

Semelhante a Hadoop Overview & Architecture

Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindEMC
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataDhanashri Yadav
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveSharjeel Imtiaz
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at myliferesponseteam
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalMichael Rainey
 

Hadoop Overview & Architecture

  • 1. Hadoop Overview & Architecture Milind Bhandarkar Chief Scientist, Machine Learning Platforms, Greenplum, A Division of EMC (Twitter: @techmilind)
  • 2. About Me •  http://www.linkedin.com/in/milindb •  Founding member of Hadoop team at Yahoo! [2005-2010] •  Contributor to Apache Hadoop since v0.1 •  Built and led Grid Solutions Team at Yahoo! [2007-2010] •  Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) •  Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and EMC-Greenplum
  • 3. Agenda • Motivation • Hadoop • Map-Reduce • Distributed File System • Hadoop Architecture • Next Generation MapReduce • Q & A
  • 4. Hadoop At Scale (Some Statistics) • 40,000+ machines in 20+ clusters • Largest cluster is 4,000 machines • 170 Petabytes of storage • 1000+ users • 1,000,000+ jobs/month
  • 9. Big Datasets (Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
  • 10. Cost Per Gigabyte (http://www.mkomo.com/cost-per-gigabyte)
  • 11. Storage Trends (Graph by Adam Leventhal, ACM Queue, Dec 2009)
  • 14. Search Assist • Insight: Related concepts appear close together in text corpus • Input: Web pages • 1 Billion Pages, 10K bytes each • 10 TB of input data • Output: List(word, List(related words))
  • 15. Search Assist
        // Input: List(URL, Text)
        foreach URL in Input :
            Tokens = Tokenize(Text(URL));
            foreach word in Tokens :
                Insert (word, Next(word, Tokens)) in Pairs;
                Insert (word, Previous(word, Tokens)) in Pairs;
        // Result: Pairs = List (word, RelatedWord)
        Group Pairs by word;
        // Result: List (word, List(RelatedWords))
        foreach word in Pairs :
            Count RelatedWords in GroupedPairs;
        // Result: List (word, List(RelatedWords, count))
        foreach word in CountedPairs :
            Sort Pairs(word, *) descending by count;
            choose Top 5 Pairs;
        // Result: List (word, Top5(RelatedWords))
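The pseudocode above can be sketched on a single machine as follows. This is a minimal illustration, not the production Search Assist pipeline: the function name `related_words`, the whitespace tokenizer, and the lowercasing step are assumptions made for the example.

```python
from collections import Counter, defaultdict


def related_words(pages, top_n=5):
    """pages: iterable of (url, text). Returns {word: [most frequent neighbours]}."""
    counts = defaultdict(Counter)
    for _url, text in pages:
        tokens = text.lower().split()      # assumption: naive whitespace tokenizer
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1         # pair (word, Next(word))
            counts[nxt][prev] += 1         # pair (word, Previous(word))
    # Sort each word's neighbours by descending count and keep the top N.
    return {word: [w for w, _ in c.most_common(top_n)]
            for word, c in counts.items()}
```

At 10 TB of input the `counts` table no longer fits on one machine, which is the motivation for splitting the work into the Map and Reduce steps shown later.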
  • 17. People You May Know • Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith • if you don't know Joe Smith already • Numbers: • 100 MM users • Average connections per user is 100
  • 18. People You May Know
        // Input: List(UserName, List(Connections))
        foreach u in UserList :                    // 100 MM
            foreach x in Connections(u) :          // 100
                foreach y in Connections(x) :      // 100
                    if (y not in Connections(u)) :
                        Count(u, y)++;             // 1 Trillion Iterations
            Sort (u,y) in descending order of Count(u,y);
            Choose Top 3 y;
            Store (u, {y0, y1, y2}) for serving;
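A small in-memory version of the loop above, as a sketch only. The name `people_you_may_know` and the dictionary-of-sets input are assumptions; the `y != u` check is also an addition that the slide's pseudocode does not spell out, so that a user is never suggested to themselves.

```python
from collections import Counter


def people_you_may_know(connections, top_n=3):
    """connections: {user: set of connected users}. Returns {user: [suggestions]}."""
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:                          # each direct connection
            for y in connections.get(x, ()):       # each connection of that connection
                if y != u and y not in friends:    # skip self and existing connections
                    counts[y] += 1                 # Count(u, y)++
        suggestions[u] = [y for y, _ in counts.most_common(top_n)]
    return suggestions
```

The inner lookup `connections.get(x, ())` is the random access that dominates the cost estimate on the next slide.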
  • 19. Performance • 101 Random accesses for each user • Assume 1 ms per random access • 100 ms per user • 100 MM users • 100 days on a single machine
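The estimate can be checked with a few lines of arithmetic. The constants come from the slide; the function name is hypothetical. Without the slide's rounding (101 ms down to 100 ms per user), the result is closer to 117 days than 100.

```python
ACCESSES_PER_USER = 101       # the user's own list plus 100 connection lists
SECONDS_PER_ACCESS = 0.001    # 1 ms per random access (slide's assumption)
USERS = 100_000_000           # 100 MM users
SECONDS_PER_DAY = 86_400


def single_machine_days(users=USERS, accesses=ACCESSES_PER_USER,
                        seconds_per_access=SECONDS_PER_ACCESS):
    """Total sequential run time in days for one machine."""
    return users * accesses * seconds_per_access / SECONDS_PER_DAY
```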
  • 21. Map Reduce • Primitives in Lisp (& other functional languages), 1970s • Google Paper, 2004 • http://labs.google.com/papers/mapreduce.html
  • 22. Map • Output_List = Map (Input_List) • Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
  • 23. Reduce • Output_Element = Reduce (Input_List) • Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
  • 24. Parallelism • Map is inherently parallel • Each list element processed independently • Reduce is inherently sequential • Unless processing multiple lists • Grouping to produce multiple lists
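The two primitives, and the grouping step that lets Reduce run in parallel, can be shown with Python's built-in functional tools. This is an illustrative sketch; `grouped_sums` is a hypothetical helper name.

```python
from functools import reduce
from itertools import groupby
from operator import add

numbers = range(1, 11)
squares = list(map(lambda x: x * x, numbers))   # Map: each element is independent
total = reduce(add, squares)                    # Reduce: one sequential fold


def grouped_sums(pairs):
    """Group (key, value) pairs by key, then reduce each group independently."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {key: reduce(add, (v for _, v in group))
            for key, group in groupby(ordered, key=lambda kv: kv[0])}
```

Each group produced by `grouped_sums` could be folded on a different machine, which is how grouping turns a sequential Reduce into many parallel ones.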
  • 25. Search Assist Map
        // Input: http://hadoop.apache.org
        Pairs = Tokenize_And_Pair ( Text ( Input ) )
        // Example
  • 26. Search Assist Reduce
        // Input: GroupedList (word, GroupedList(words))
        CountedPairs = CountOccurrences (word, RelatedWords)
        Output = {
            (hadoop, apache, 7)
            (hadoop, DFS, 3)
            (hadoop, streaming, 4)
            (hadoop, mapreduce, 9)
            ...
        }
  • 27. Issues with Large Data • Map Parallelism: Chunking input data • Reduce Parallelism: Grouping related data • Dealing with failures & load imbalance
  • 29. Apache Hadoop • January 2006: Subproject of Lucene • January 2008: Top-level Apache project • Stable Version: 1.0.3 • Latest Version: 2.0.0 (Alpha)
  • 30. Apache Hadoop • Reliable, Performant Distributed file system • MapReduce Programming framework • Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...
  • 31. Problem: Bandwidth to Data • Scan 100TB Datasets on 1000 node cluster • Remote storage @ 10MB/s = 165 mins • Local storage @ 50-200MB/s = 33-8 mins • Moving computation is more efficient than moving data • Need visibility into data placement
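The scan times above follow from simple arithmetic, sketched here under two assumptions that the slide leaves implicit: every node scans an equal 1/1000 share of the data in parallel, and the quoted rates are per node in decimal units. The function name is hypothetical.

```python
def scan_minutes(dataset_tb, nodes, mb_per_second):
    """Minutes for every node to scan its equal share of the dataset in parallel."""
    per_node_mb = dataset_tb * 1_000_000 / nodes   # decimal units: 1 TB = 10^6 MB
    return per_node_mb / mb_per_second / 60
```

With 100 TB on 1000 nodes this gives roughly 167 minutes at 10 MB/s (the slide rounds to 165), about 33 minutes at 50 MB/s, and about 8 minutes at 200 MB/s.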
  • 32. Problem: Scaling Reliably • Failure is not an option, it's a rule! • 1000 nodes, MTBF 1 day • 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMs (16TB RAM) • Need fault tolerant store with reasonable availability guarantees • Handle hardware faults transparently
  • 33. Hadoop Goals • Scalable: Petabytes (10^15 Bytes) of data on thousands of nodes • Economical: Commodity components only • Reliable • Engineering reliability into every application is expensive
  • 35. Think MapReduce • Record = (Key, Value) • Key : Comparable, Serializable • Value: Serializable • Input, Map, Shuffle, Reduce, Output
  • 36. Seems Familiar ?
        cat /var/log/auth.log* | \
          grep "session opened" | cut -d' ' -f10 | \
          sort | \
          uniq -c > ~/userlist
  • 37. Map • Input: (Key1, Value1) • Output: List(Key2, Value2) • Projections, Filtering, Transformation
  • 38. Shuffle • Input: List(Key2, Value2) • Output • Sort(Partition(List(Key2, List(Value2)))) • Provided by Hadoop
  • 39. Reduce • Input: List(Key2, List(Value2)) • Output: List(Key3, Value3) • Aggregation
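The three phases can be simulated in memory to make the type signatures concrete. This is a teaching sketch, not Hadoop's implementation: `run_mapreduce`, `login_mapper`, and `count_reducer` are hypothetical names, partitioning by `hash(key)` is an assumption of this example, and the sample log format is invented for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer, num_partitions=2):
    """records: iterable of (key1, value1). Returns a list of (key3, value3)."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):            # Map
            index = hash(key2) % num_partitions               # Shuffle: partition
            partitions[index][key2].append(value2)            # Shuffle: group values
    output = []
    for partition in partitions:
        for key2 in sorted(partition):                        # Shuffle: sort keys
            output.extend(reducer(key2, partition[key2]))     # Reduce
    return output


def login_mapper(_offset, line):
    """Filter 'session opened' lines and project the user name (last field)."""
    if "session opened" in line:
        yield (line.split()[-1], 1)


def count_reducer(user, ones):
    yield (user, sum(ones))
```

Running `run_mapreduce(enumerate(lines), login_mapper, count_reducer)` mirrors the shell pipeline on the "Seems Familiar ?" slide: `grep` and `cut` are the map, `sort` is the shuffle, and `uniq -c` is the reduce.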
  • 40. Hadoop Streaming • Hadoop is written in Java • Java MapReduce code is native • What about Non-Java Programmers ? • Perl, Python, Shell, R • grep, sed, awk, uniq as Mappers/Reducers • Text Input and Output
  • 41. Hadoop Streaming • Thin Java wrapper for Map & Reduce Tasks • Forks actual Mapper & Reducer • IPC via stdin, stdout, stderr • Key.toString() \t Value.toString() \n • Slower than Java programs • Allows for quick prototyping / debugging
  • 42. Hadoop Streaming
        $ bin/hadoop jar hadoop-streaming.jar \
            -input in-files -output out-dir \
            -mapper mapper.sh -reducer reducer.sh

        # mapper.sh
        sed -e 's/ /\n/g' | grep .

        # reducer.sh
        uniq -c | awk '{print $2 "\t" $1}'
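Since Streaming only requires programs that read lines on stdin and write tab-separated key/value lines on stdout, the same word count can be written in Python. The sketch below is an assumption-laden illustration: the single-file layout with a `map` or `reduce` argument and the function names are choices made for this example, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum counts per word. Assumes the input arrives sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as `wc.py`, it can be tried locally with `cat input.txt | python3 wc.py map | sort | python3 wc.py reduce` before being passed to `-mapper` and `-reducer`.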
  • 43. Hadoop Distributed File System (HDFS)
  • 44. HDFS • Data is organized into files and directories • Files are divided into uniform sized blocks (default 128MB) and distributed across cluster nodes • HDFS exposes block placement so that computation can be migrated to data
  • 45. HDFS • Blocks are replicated (default 3) to handle hardware failure • Replication for performance and fault tolerance (Rack-Aware placement) • HDFS keeps checksums of data for corruption detection and recovery
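The block and replication defaults translate directly into a storage footprint. The sketch below uses the slide's defaults (128 MB blocks, replication 3); the function name and the returned dictionary are hypothetical, and it assumes the last block occupies only as much space as the data it holds.

```python
def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Block count, replica count, and raw bytes consumed for one file."""
    blocks = -(-file_bytes // block_bytes)       # integer ceiling division
    return {
        "blocks": blocks,
        "replicas": blocks * replication,
        "raw_bytes": file_bytes * replication,   # last block is not padded
    }
```

A 1 GiB file therefore becomes 8 blocks and 24 block replicas spread over the cluster, consuming 3 GiB of raw disk.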
  • 47. HDFS Master (NameNode) • Manages filesystem namespace • File metadata (i.e. "inode") • Mapping inode to list of blocks + locations • Authorization & Authentication • Checkpoint & journal namespace changes
  • 48. Namenode • Mapping of datanode to list of blocks • Monitor datanode health • Replicate missing blocks • Keeps ALL namespace in memory • 60M objects (File/Block) in 16GB
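The "60M objects in 16GB" figure implies a rough per-object memory cost, which is handy for sizing a Namenode heap. This is back-of-the-envelope arithmetic only: both function names are hypothetical, and it assumes heap use scales linearly with object count.

```python
def bytes_per_object(heap_gb=16, objects=60_000_000):
    """Average heap bytes per namespace object (file or block)."""
    return heap_gb * 1024**3 / objects


def heap_gb_for(objects, per_object_bytes=None):
    """Heap needed for a given object count, assuming linear scaling."""
    if per_object_bytes is None:
        per_object_bytes = bytes_per_object()
    return objects * per_object_bytes / 1024**3
```

The slide's numbers work out to a little under 300 bytes per file or block object, which is why many small files strain the Namenode long before disk capacity runs out.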
  • 49. Datanodes • Handle block storage on multiple volumes & block integrity • Clients access the blocks directly from data nodes • Periodically send heartbeats and block reports to Namenode • Blocks are stored as underlying OS's files
  • 51. Example: Unigrams • Input: Huge text corpus • Wikipedia Articles (40GB uncompressed) • Output: List of words sorted in descending order of frequency
  • 52. Unigrams
        $ cat ~/wikipedia.txt | \
            sed -e 's/ /\n/g' | grep . | sort | \
            uniq -c > ~/frequencies.txt

        $ cat ~/frequencies.txt | \
            sort -n -k1,1 -r > ~/unigrams.txt
  • 53. MR for Unigrams
        mapper (filename, file-contents):
            for each word in file-contents:
                emit (word, 1)

        reducer (word, values):
            sum = 0
            for each value in values:
                sum = sum + value
            emit (word, sum)
  • 54. MR for Unigrams
        mapper (word, frequency):
            emit (frequency, word)

        reducer (frequency, words):
            for each word in words:
                emit (word, frequency)
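The two jobs above chain together: the first counts words, the second swaps key and value so that the sort on keys orders by frequency. A minimal single-process sketch follows; `run_job` and the `descending` flag are hypothetical stand-ins for the framework and for a descending key comparator, and sorting words within a frequency is an added choice to make the output deterministic.

```python
from collections import defaultdict


def run_job(records, mapper, reducer, descending=False):
    """Tiny stand-in for one MapReduce job with a single reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for key2, value2 in mapper(key, value):
            groups[key2].append(value2)
    output = []
    for key2 in sorted(groups, reverse=descending):   # sort by key
        output.extend(reducer(key2, groups[key2]))
    return output


def count_mapper(_filename, contents):
    return ((word, 1) for word in contents.split())


def count_reducer(word, values):
    yield (word, sum(values))


def swap_mapper(word, frequency):
    yield (frequency, word)


def swap_reducer(frequency, words):
    for word in sorted(words):
        yield (word, frequency)


def unigrams(files):
    """files: iterable of (filename, contents). Returns (word, count), most frequent first."""
    frequencies = run_job(files, count_mapper, count_reducer)
    return run_job(frequencies, swap_mapper, swap_reducer, descending=True)
```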
  • 55. Unigrams: Java Mapper
        public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

          // Emits (word, 1) for every token in the input line.
          public void map(LongWritable key, Text value,
              OutputCollector<Text, IntWritable> output,
              Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
              Text word = new Text(itr.nextToken());
              output.collect(word, new IntWritable(1));
            }
          }
        }
  • 56. Unigrams: Java Reducer
        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

          // Sums the counts collected for one word.
          public void reduce(Text key, Iterator<IntWritable> values,
              OutputCollector<Text, IntWritable> output,
              Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
  • 57. Unigrams: Driver
        public void run(String inputPath, String outputPath) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setMapperClass(MapClass.class);
          conf.setReducerClass(Reduce.class);
          // Output types must be declared; they are not inferred from the generics.
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(conf, new Path(inputPath));
          FileOutputFormat.setOutputPath(conf, new Path(outputPath));
          JobClient.runJob(conf);
        }
  • 58. Configuration •  Unified Mechanism for •  Configuring Daemons •  Runtime environment for Jobs/Tasks •  Defaults: *-default.xml •  Site-Specific: *-site.xml •  final parameters
  • 59. Example
        <configuration>
          <property>
            <name>mapred.job.tracker</name>
            <value>head.server.node.com:9001</value>
          </property>
          <property>
            <name>fs.default.name</name>
            <value>hdfs://head.server.node.com:9000</value>
          </property>
          <property>
            <name>mapred.child.java.opts</name>
            <value>-Xmx512m</value>
            <final>true</final>
          </property>
          ....
        </configuration>
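The layering of defaults, site-specific files, and final parameters can be modelled in a few lines. This is a sketch of the override rule only, not Hadoop's `Configuration` class: the `load_config` name and the `{name: (value, is_final)}` representation are assumptions, and the sample values are illustrative.

```python
def load_config(*resources):
    """Merge resources in load order.

    Each resource is {name: (value, is_final)}. A later resource overrides an
    earlier one unless the name was already marked final.
    """
    merged, final_names = {}, set()
    for resource in resources:
        for name, (value, is_final) in resource.items():
            if name in final_names:
                continue                   # final values cannot be overridden
            merged[name] = value
            if is_final:
                final_names.add(name)
    return merged
```

Loading defaults, then the site file, then a job's own settings means a job can change `fs.default.name` but not the `mapred.child.java.opts` value that the site file marked final.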
  • 61. Running a Job
        [milindb@gateway ~]$ hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /data/newsarchive/20080923 /tmp/newsout
        input.FileInputFormat: Total input paths to process : 4
        mapred.JobClient: Running job: job_200904270516_5709
        mapred.JobClient:  map 0% reduce 0%
        mapred.JobClient:  map 3% reduce 0%
        mapred.JobClient:  map 7% reduce 0%
        ....
        mapred.JobClient:  map 100% reduce 21%
        mapred.JobClient:  map 100% reduce 31%
        mapred.JobClient:  map 100% reduce 33%
        mapred.JobClient:  map 100% reduce 66%
        mapred.JobClient:  map 100% reduce 100%
        mapred.JobClient: Job complete: job_200904270516_5709
  • 62. Running a Job
        mapred.JobClient: Counters: 18
        mapred.JobClient:   Job Counters
        mapred.JobClient:     Launched reduce tasks=1
        mapred.JobClient:     Rack-local map tasks=10
        mapred.JobClient:     Launched map tasks=25
        mapred.JobClient:     Data-local map tasks=1
        mapred.JobClient:   FileSystemCounters
        mapred.JobClient:     FILE_BYTES_READ=491145085
        mapred.JobClient:     HDFS_BYTES_READ=3068106537
        mapred.JobClient:     FILE_BYTES_WRITTEN=724733409
        mapred.JobClient:     HDFS_BYTES_WRITTEN=377464307
  • 63. Running a Job
        mapred.JobClient:   Map-Reduce Framework
        mapred.JobClient:     Combine output records=73828180
        mapred.JobClient:     Map input records=36079096
        mapred.JobClient:     Reduce shuffle bytes=233587524
        mapred.JobClient:     Spilled Records=78177976
        mapred.JobClient:     Map output bytes=4278663275
        mapred.JobClient:     Combine input records=371084796
        mapred.JobClient:     Map output records=313041519
        mapred.JobClient:     Reduce input records=15784903
  • 83. Next Generation MapReduce
  • 84. MapReduce Today (Courtesy: Arun Murthy, Hortonworks)
  • 85. Why ? • Scalability Limitations today • Maximum cluster size: 4000 nodes • Maximum Concurrent tasks: 40,000 • Job Tracker SPOF • Fixed map and reduce containers (slots) • Punishes pleasantly parallel apps
  • 86. Why ? (contd) • MapReduce is not suitable for every application • Fine-Grained Iterative applications • HaLoop: Hadoop in a Loop • Message passing applications • Graph Processing
  • 87. Requirements • Need scalable cluster resources manager • Separate scheduling from resource management • Multi-Lingual Communication Protocols
  • 88. Bottom Line • @techmilind #mrng (MapReduce, Next Gen) is in reality, #rmng (Resource Manager, Next Gen) • Expect different programming paradigms to be implemented • Including MPI (soon)
  • 89. Architecture (Courtesy: Arun Murthy, Hortonworks)
  • 90. The New World • Resource Manager • Allocates resources (containers) to applications • Node Manager • Manages containers on nodes • Application Master • Specific to paradigm, e.g. MapReduce application master, MPI application master, etc.
  • 91. Container • In current terminology: A Task Slot • Slice of the node's hardware resources • # of cores, virtual memory, disk size, disk and network bandwidth, etc. • Currently, only memory usage is sliced
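To make the idea of a memory-sliced container concrete, here is a toy allocator. It is purely illustrative: the `Node` class, the `allocate` function, and the first-fit policy are assumptions of this sketch and do not describe how the actual Resource Manager schedules.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


def allocate(nodes, app, memory_mb):
    """First-fit: grant a container on the first node with enough free memory."""
    for node in nodes:
        if node.free_mb >= memory_mb:
            node.free_mb -= memory_mb
            node.containers.append((app, memory_mb))
            return node.name
    return None   # no capacity; a real scheduler would queue the request
```

Unlike fixed map and reduce slots, a request here can ask for any amount of memory, so a MapReduce application master and an MPI application master can share the same nodes.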