Basics of Big Data Analytics & Hadoop
Ambuj Kumar
Ambuj_kumar@aol.com
http://ambuj4bigdata.blogspot.in
http://ambujworld.wordpress.com
Agenda
• Big Data
  • Concepts overview
• Analytics
  • Concepts overview
• Hadoop
  • Concepts overview
• HDFS
  • Concepts overview
  • Data Flow - Read & Write Operation
• MapReduce
  • Concepts overview
  • WordCount Program
• Use Cases
• Landscape
• Hadoop Features & Summary
What is Big Data?
Big data is data which is too large, complex and dynamic for any conventional data tools to capture,
store, manage and analyze.
Challenges of Big Data
1. Storage (~ petabytes)
2. Processing (in a timely manner)
3. Variety of data (structured, semi-structured, unstructured)
4. Cost
Big Data Analytics
Big data analytics is the process of examining large amounts of
data of a variety of types (big data) to uncover hidden patterns,
unknown correlations and other useful information.
Big Data Analytics Solutions
There are many different Big Data Analytics solutions on the market.
• Tableau – visualization tools
• SAS – statistical computing
• IBM and Oracle – a range of tools for Big Data analysis
• Revolution – statistical computing
• R – open-source tool for statistical computing
What is Hadoop?
• Open-source data storage and processing API
• Massively scalable, automatically parallelizable
• Based on work from Google
  • GFS + MapReduce + BigTable
• Current distributions based on open-source and vendor work
  • Apache Hadoop
  • Cloudera – CDH4
  • Hortonworks
  • MapR
  • AWS
  • Windows Azure HDInsight
Why Use Hadoop?
• Cheaper: scales to petabytes or more
• Faster: parallel data processing
• Better: suited for particular types of Big Data problems
Hadoop History
In 2008, Hadoop became an Apache Top-Level Project.
Comparing: RDBMS vs. Hadoop
                    | Traditional RDBMS         | Hadoop / MapReduce
Data Size           | Gigabytes (Terabytes)     | Petabytes (Exabytes)
Access              | Interactive and Batch     | Batch – NOT Interactive
Updates             | Read / Write many times   | Write once, Read many times
Structure           | Static Schema             | Dynamic Schema
Integrity           | High (ACID)               | Low
Scaling             | Nonlinear                 | Linear
Query Response Time | Can be near immediate     | Has latency (due to batch processing)
Where is Hadoop used?
Industry      | Use Cases
Technology    | Search; People you may know; Movie recommendations
Banks         | Fraud detection; Regulatory; Risk management
Media, Retail | Marketing analytics; Customer service; Product recommendations
Manufacturing | Preventive maintenance
Companies Using Hadoop
• Search: Yahoo, Amazon, Zvents
• Log Processing: Facebook, Yahoo, ContextWeb, Joost, Last.fm
• Recommendation Systems: Facebook, LinkedIn
• Data Warehouse: Facebook, AOL
• Video & Image Analysis: New York Times, Eyealike
Almost in every domain!
Hadoop is a set of Apache frameworks and more…
• Data storage (HDFS)
  • Runs on commodity hardware (usually Linux)
  • Horizontally scalable
• Processing (MapReduce)
  • Parallelized (scalable) processing
  • Fault tolerant
• Other Tools / Frameworks
  • Data Access
    • HBase, Hive, Pig, Mahout
  • Tools
    • Hue, Sqoop
  • Monitoring
    • Greenplum, Cloudera
Hadoop Core - HDFS
MapReduce API
Data Access
Tools & Libraries
Monitoring & Alerting
Core parts of Hadoop distribution
HDFS Storage
• Redundant (3 copies)
• For large files – large blocks
• 64 or 128 MB / block
• Can scale to 1000s of nodes
MapReduce API
• Batch (Job) processing
• Distributed and localized to clusters (Map)
• Auto-parallelizable for huge amounts of data
• Fault-tolerant (auto retries)
• Adds high availability and more
Other Libraries
• Pig
• Hive
• HBase
• Others
Hadoop Cluster HDFS (Physical)
[Diagram: storage layer with one Name Node, a Secondary Name Node, and Data Nodes 1–3]
One Name Node
• Contains a web site to view cluster information
• Hadoop v2 uses multiple Name Nodes for HA
Many Data Nodes
• 3 copies of each block by default
Work with data in HDFS
• Using common Linux-style shell commands
• Block size is 64 or 128 MB
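The block size and replication figures above translate into simple arithmetic. The following is a minimal sketch in plain Java (no Hadoop dependency); the class name `HdfsBlockMath`, its methods, and the 1 GB example file are assumptions made for illustration, not part of any Hadoop API.

```java
// Sketch only: illustrates the block/replica arithmetic from the slide.
// HdfsBlockMath and its methods are hypothetical names, not Hadoop classes.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    /** Number of blocks a file occupies: the ceiling of fileBytes / blockBytes. */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Raw bytes used across the cluster when every block is stored `replication` times. */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 1024 * MB;   // hypothetical 1 GB file
        long block = 128 * MB;   // one of the block sizes named on the slide
        System.out.println("blocks = " + blockCount(file, block));
        System.out.println("raw MB with 3 copies = " + rawBytes(file, 3) / MB);
    }
}
```

With these assumed numbers, a 1 GB file splits into 8 blocks of 128 MB and consumes 3 GB of raw disk at the default of 3 copies.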
MapReduce Job – Logical View
Hadoop Ecosystem
Common Hadoop Distributions
Open Source
• Apache
Commercial
• Cloudera
• Hortonworks
• MapR
• AWS MapReduce
• Microsoft HDInsight
HDFS: Architecture
• Master: NameNode
• Slaves: a set of DataNodes
HDFS Layers
[Diagram: two layers. Namespace (NS): held by the NameNode. Block Storage: block management on the NameNode plus storage on the DataNodes.]
HDFS : Basic Features
Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
HDFS Write (1/2)
[Diagram: Client, Name Node, and Data Nodes A–D; the file is split into blocks A1–A4]
1. Client contacts NameNode to write data
2. NameNode says write it to these nodes
3. Client sequentially writes blocks to DataNode
HDFS Write (2/2)
[Diagram: Client, Name Node, and Data Nodes A–D; each of the blocks A1–A4 ends up stored on three of the four Data Nodes]
DataNodes replicate data blocks, orchestrated by the NameNode
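The write flow above (the client asks where to write, then each block ends up on several DataNodes) can be mimicked with a small in-memory model. This is a sketch under stated assumptions: the class `ReplicaPlacementSketch` and its round-robin policy are hypothetical and chosen only for simplicity; real HDFS uses its own placement policy, which this does not reproduce.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a toy stand-in for the NameNode's "write it to these nodes" answer.
// The round-robin policy is an assumption for illustration, not the HDFS policy.
final class ReplicaPlacementSketch {

    /** Assigns every block to `replication` distinct nodes, starting one node later per block. */
    static Map<String, List<String>> place(List<String> blocks, List<String> nodes, int replication) {
        if (replication < 1 || replication > nodes.size()) {
            throw new IllegalArgumentException("replication must be between 1 and the node count");
        }
        Map<String, List<String>> placement = new LinkedHashMap<>();
        for (int b = 0; b < blocks.size(); b++) {
            List<String> targets = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                targets.add(nodes.get((b + r) % nodes.size()));
            }
            placement.put(blocks.get(b), targets);
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, List<String>> placement =
                place(List.of("A1", "A2", "A3", "A4"), List.of("A", "B", "C", "D"), 3);
        placement.forEach((block, targets) -> System.out.println(block + " -> " + targets));
    }
}
```

As on the slide, four blocks with three copies each fill four nodes with three blocks apiece, so losing any single node still leaves two copies of every block.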
HDFS Read
[Diagram: Client, Name Node, and Data Nodes A–D holding the replicated blocks A1–A4]
1. Client contacts NameNode to read data
2. NameNode says you can find it here
3. Client sequentially reads blocks from DataNode
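The read steps above can be sketched the same way: look up where each block lives, then read the blocks in order from a node that is available. The class `HdfsReadSketch`, the `planRead` method, and the "first live replica" rule are assumptions for illustration; they are not the HDFS client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: models "NameNode says you can find it here" as a block -> nodes map.
// Names and the first-live-replica rule are hypothetical, not Hadoop behavior.
final class HdfsReadSketch {

    /** For each block, in order, picks the first listed replica that sits on a live node. */
    static List<String> planRead(List<String> blocks,
                                 Map<String, List<String>> locations,
                                 Set<String> liveNodes) {
        List<String> plan = new ArrayList<>();
        for (String block : blocks) {
            String chosen = null;
            for (String node : locations.getOrDefault(block, List.of())) {
                if (liveNodes.contains(node)) {
                    chosen = node;
                    break;
                }
            }
            if (chosen == null) {
                throw new IllegalStateException("no live replica for block " + block);
            }
            plan.add(block + "@" + chosen);
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, List<String>> locations = Map.of(
                "A1", List.of("A", "B", "C"),
                "A2", List.of("B", "C", "D"));
        // Node A is assumed to be down, so A1 is served by its next replica.
        System.out.println(planRead(List.of("A1", "A2"), locations, Set.of("B", "C", "D")));
    }
}
```

Because every block has several replicas, the read still succeeds when one DataNode is unavailable, which is the fault tolerance the replication slide describes.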
HA (High Availability) for NameNode
[Diagram: an active NameNode and a standby NameNode, both connected to the DataNodes]
Active NameNode
• Performs the normal NameNode operations
Standby NameNode
• Maintains a copy of the NameNode's data
• Ready to become the active NameNode
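The active/standby arrangement can be pictured with a toy model in which the standby records every change the active node makes, so it can take over without losing state. Everything here (`NameNodeHaSketch`, `applyEdit`, `failover`, the node names) is hypothetical and greatly simplified; it does not describe how Hadoop actually shares state or elects the active node.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a two-node active/standby pair kept in sync by applying each edit to both.
// All names are hypothetical; real NameNode HA is considerably more involved.
final class NameNodeHaSketch {
    enum Role { ACTIVE, STANDBY }

    static final class NameNode {
        final String name;
        Role role;
        final List<String> namespaceEdits = new ArrayList<>();

        NameNode(String name, Role role) {
            this.name = name;
            this.role = role;
        }
    }

    private final NameNode first = new NameNode("nn1", Role.ACTIVE);
    private final NameNode second = new NameNode("nn2", Role.STANDBY);

    NameNode active() {
        return first.role == Role.ACTIVE ? first : second;
    }

    NameNode standby() {
        return first.role == Role.STANDBY ? first : second;
    }

    /** The active node applies the change; the standby records the same edit to stay current. */
    void applyEdit(String edit) {
        active().namespaceEdits.add(edit);
        standby().namespaceEdits.add(edit);
    }

    /** Swaps the two roles, as would happen if the active NameNode failed. */
    void failover() {
        NameNode oldActive = active();
        NameNode oldStandby = standby();
        oldActive.role = Role.STANDBY;
        oldStandby.role = Role.ACTIVE;
    }

    public static void main(String[] args) {
        NameNodeHaSketch cluster = new NameNodeHaSketch();
        cluster.applyEdit("mkdir /data");
        cluster.failover();
        System.out.println(cluster.active().name + " now active with edits "
                + cluster.active().namespaceEdits);
    }
}
```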
MapReduce
• A MapReduce job consists of two kinds of tasks
  • Map tasks
  • Reduce tasks
• Blocks of data distributed across several machines are processed by map tasks in parallel
• Results are aggregated in the reducer
• Works only on KEY/VALUE pairs
MapReduce: Word Count
Can we do word count in parallel?
Input (one line per map task):
Deer Bear River
Car Car River
Deer Car Bear
Map output:
Deer 1, Bear 1, River 1
Car 1, Car 1, River 1
Deer 1, Car 1, Bear 1
Reduce output:
Bear 2
Car 3
Deer 2
River 2
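Before looking at a Hadoop program, the same word count can be traced in plain Java to see the map, shuffle, and reduce phases separately. This is a single-process sketch with no Hadoop dependency; the class `LocalWordCount` and its method names are assumptions for illustration, and the per-line `map` calls simply stand in for map tasks that a cluster would run on different machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: an in-memory imitation of map -> shuffle -> reduce for word count.
final class LocalWordCount {

    /** Map phase: emits (word, 1) for every whitespace-separated token in one line. */
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1L));
            }
        }
        return out;
    }

    /** Shuffle phase: groups all values by key; the TreeMap keeps the keys sorted. */
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sums the values collected for one key. */
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Long> wordCount(List<String> lines) {
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line)); // each line is independent, so these calls could run in parallel
        }
        Map<String, Long> result = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("Deer Bear River", "Car Car River", "Deer Car Bear")));
        // prints {Bear=2, Car=3, Deer=2, River=2}
    }
}
```

The map step never looks at more than one line and the reduce step never looks at more than one key, which is what lets a framework spread both across many machines.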
MapReduce: Word Count Program
Data Flow in a MapReduce Program in Hadoop
Mapper Class
package ambuj.com.wc;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every whitespace-separated token in each input line.
public class WordCountMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    // Reused for every record to avoid allocating new objects per token.
    private final static LongWritable one = new LongWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable inputKey, Text inputVal, Context context)
            throws IOException, InterruptedException {
        // inputKey is the byte offset of the line; inputVal is the line itself.
        String line = inputVal.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Reducer Class
package ambuj.com.wc;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives each word with all of its counts and writes (word, total).
public class WordCountReducer extends
        Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> listOfValues,
            Context context) throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable val : listOfValues) {
            sum = sum + val.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
Driver Class
package ambuj.com.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Configures and submits the WordCount job: args[0] is the input path, args[1] the output path.
public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCountDriver <input path> <output path>");
            return 2;
        }
        // Use the configuration ToolRunner prepared, so generic options such as -D take effect.
        // Job.getInstance replaces the deprecated Job constructor (Hadoop 2.x and later).
        Job job = Job.getInstance(getConf(), "WordCount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Report job failure through the return code instead of always returning 0.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Propagate the job's status as the process exit code.
        System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
    }
}
A view of Hadoop
[Diagram: a client submits a job to the master, which runs the Job Tracker (MapReduce) and the Name Node (HDFS). Each slave runs a Data Node, which holds HDFS blocks, and a Task Tracker, which runs several tasks.]
Use Cases
• Utilities want to predict power consumption
• Banks and insurance companies want to understand risk
• Fraud detection
• Marketing departments want to understand customers
• Recommendations
• Location-based ad targeting
• Threat analysis
Big Data Landscape
Hadoop Features & Summary
A distributed framework for processing and storing data, generally on commodity hardware. Completely open source and written in Java.
• Store anything
  • Unstructured or semi-structured data
• Storage capacity
  • Scales linearly; cost is not exponential
• Data locality and processing your way
  • Code moves to the data
  • In MapReduce you specify the actual steps for processing the data and drive the output
• Streaming access: process data in any language
• Failure and fault tolerance
  • Detects failures and heals itself
  • Reliable: data is replicated and failed tasks are rerun, so there is no need to maintain a separate backup of the data
• Cost effective: Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines
The Hadoop framework transparently provides applications with reliability, adaptation, and data motion.
Primarily used for batch processing, not real-time or transactional user applications.
Thank You

From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 

Basics of Big Data Analytics & Hadoop

  • 1. Basics of Big Data Analytics & Hadoop Ambuj Kumar Ambuj_kumar@aol.com http://ambuj4bigdata.blogspot.in http://ambujworld.wordpress.com
  • 2. Agenda: Big Data (concepts overview); Analytics (concepts overview); Hadoop (concepts overview); HDFS (concepts overview; data flow: read and write operations); MapReduce (concepts overview; WordCount program); Use Cases; Landscape; Hadoop Features & Summary
  • 3. What is Big Data? Big data is data which is too large, complex and dynamic for any conventional data tools to capture, store, manage and analyze.
  • 4. Challenges of Big Data: (1) Storage (~petabytes); (2) Processing (in a timely manner); (3) Variety of data (structured, semi-structured, unstructured); (4) Cost
  • 5. Big Data Analytics: Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information. Big Data Analytics Solutions: There are many different big data analytics solutions on the market: Tableau (visualization tools); SAS (statistical computing); IBM and Oracle (each has a range of tools for big data analysis); Revolution (statistical computing); R (open-source tool for statistical computing)
  • 6. What is Hadoop? Open-source data storage and processing API. Massively scalable, automatically parallelizable. Based on work from Google: GFS + MapReduce + BigTable. Current distributions based on open source and vendor work: Apache Hadoop, Cloudera (CDH4), Hortonworks, MapR, AWS, Windows Azure HDInsight
  • 7. Why Use Hadoop? Cheaper: scales to petabytes or more. Faster: parallel data processing. Better: suited for particular types of big data problems
  • 8. Hadoop History: In 2008, Hadoop became an Apache top-level project
  • 9. Comparing: RDBMS vs. Hadoop
        Aspect              | Traditional RDBMS       | Hadoop / MapReduce
        Data Size           | Gigabytes (Terabytes)   | Petabytes (Exabytes)
        Access              | Interactive and batch   | Batch, NOT interactive
        Updates             | Read / write many times | Write once, read many times
        Structure           | Static schema           | Dynamic schema
        Integrity           | High (ACID)             | Low
        Scaling             | Nonlinear               | Linear
        Query Response Time | Can be near immediate   | Has latency (due to batch processing)
  • 10. Where is Hadoop used?
        Industry      | Use Cases
        Technology    | Search; People you may know; Movie recommendations
        Banks         | Fraud detection; Regulatory; Risk management
        Media, Retail | Marketing analytics; Customer service; Product recommendations
        Manufacturing | Preventive maintenance
  • 11. Companies Using Hadoop. Search: Yahoo, Amazon, Zvents. Log processing: Facebook, Yahoo, ContextWeb, Joost, Last.fm. Recommendation systems: Facebook, LinkedIn. Data warehouse: Facebook, AOL. Video and image analysis: New York Times, Eyealike. Almost every domain!
  • 12. Hadoop is a set of Apache frameworks and more… Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable. Processing (MapReduce): parallelized (scalable) processing; fault tolerant. Other tools / frameworks: data access (HBase, Hive, Pig, Mahout); tools (Hue, Sqoop); monitoring (Greenplum, Cloudera). Diagram layers: Hadoop Core - HDFS; MapReduce API; Data Access; Tools & Libraries; Monitoring & Alerting
  • 13. Core parts of a Hadoop distribution. HDFS storage: redundant (3 copies); for large files, large blocks (64 or 128 MB per block); can scale to 1000s of nodes. MapReduce API: batch (job) processing; distributed and localized to clusters (Map); auto-parallelizable for huge amounts of data; fault-tolerant (auto retries); adds high availability and more. Other libraries: Pig, Hive, HBase, others
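
To make the block-size and replication figures on the slide above concrete, here is a small arithmetic sketch in plain Java with no Hadoop dependency. The class and method names are hypothetical, the 128 MB block size and replication factor of 3 come from the slide, and the 1 GB file size is an assumed example.

```java
// Illustrative arithmetic only; not part of the Hadoop API.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster when every block is replicated.
    static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 1024 * MB;   // assumed example: a 1 GB file
        long blockSize = 128 * MB;   // block size from the slide
        int replication = 3;         // default replication from the slide

        long blocks = blockCount(fileSize, blockSize);
        System.out.println("blocks = " + blocks);
        System.out.println("block replicas = " + blocks * replication);
        System.out.println("raw storage (MB) = " + rawStorageBytes(fileSize, replication) / MB);
    }
}
```

Under these assumptions the file splits into 8 blocks, stored as 24 block replicas that occupy 3072 MB of raw disk across the cluster.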
  • 14. Hadoop Cluster: HDFS (physical) storage. Diagram: Name Node, Secondary Name Node, Data Node 1, Data Node 2, Data Node 3. One Name Node: contains a web site to view cluster information; Hadoop V2 uses multiple Name Nodes for HA. Many Data Nodes: 3 copies of each block by default. Work with data in HDFS: using common Linux shell commands; block size is 64 or 128 MB
  • 15. MapReduce Job: Logical View
  • 17. Common Hadoop Distributions. Open source: Apache. Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight
  • 18. HDFS: Architecture. Master: NameNode. Slave: a bunch of DataNodes. HDFS layers: Name Space (NS, on the NameNode) and Block Storage (block management on the NameNode, storage on the DataNodes). [Diagram: one NameNode and several DataNodes]
  • 19. HDFS: Basic Features. Highly fault-tolerant; high throughput; suitable for applications with large data sets; streaming access to file system data; can be built out of commodity hardware
  • 20. HDFS Write (1/2): (1) The client contacts the NameNode to write data. (2) The NameNode says to write it to these nodes. (3) The client sequentially writes blocks to the DataNodes. [Diagram: Client, Name Node, Data Nodes A to D; blocks A1 to A4]
  • 21. HDFS Write (2/2): DataNodes replicate data blocks, orchestrated by the NameNode. [Diagram: Client, Name Node, Data Nodes A to D holding replicated copies of blocks A1 to A4]
  • 22. HDFS Read: (1) The client contacts the NameNode to read data. (2) The NameNode says where the data can be found. (3) The client sequentially reads blocks from the DataNodes. [Diagram: Client, Name Node, Data Nodes A to D holding replicated copies of blocks A1 to A4]
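
The write and read steps on the three slides above can be sketched as a toy model in plain Java. This is not the real HDFS implementation: the class `ToyNameNode` and its methods are hypothetical, and the round-robin placement is an assumed simplification (real HDFS uses its own placement policy). It only illustrates that the NameNode holds the block-to-DataNode map while clients move the data themselves.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the NameNode's block map; not the real HDFS implementation.
final class ToyNameNode {
    private final List<String> dataNodes;
    private final int replication;
    private final Map<String, List<String>> blockLocations = new LinkedHashMap<>();
    private int next = 0;

    ToyNameNode(List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("replication exceeds number of DataNodes");
        }
        this.dataNodes = new ArrayList<>(dataNodes);
        this.replication = replication;
    }

    // Write path, steps 1-2: the client asks where to put a block and the
    // NameNode answers with target DataNodes (simple round-robin here).
    List<String> allocateBlock(String blockId) {
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            targets.add(dataNodes.get((next + i) % dataNodes.size()));
        }
        next = (next + 1) % dataNodes.size();
        blockLocations.put(blockId, targets);
        return targets;
    }

    // Read path, steps 1-2: the client asks which DataNodes hold a block.
    List<String> locate(String blockId) {
        List<String> nodes = blockLocations.get(blockId);
        if (nodes == null) {
            throw new IllegalArgumentException("unknown block " + blockId);
        }
        return nodes;
    }

    public static void main(String[] args) {
        ToyNameNode nameNode = new ToyNameNode(List.of("A", "B", "C", "D"), 3);
        for (String block : List.of("A1", "A2", "A3", "A4")) {
            System.out.println(block + " -> " + nameNode.allocateBlock(block));
        }
        System.out.println("read A2 from " + nameNode.locate("A2"));
    }
}
```

With four DataNodes and a replication factor of 3, each of the four blocks ends up on three nodes, and a reader of block A2 is pointed at any of the nodes that hold it.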
  • 23. HA (High Availability) for NameNode. Active NameNode: performs normal NameNode operations. Standby NameNode: maintains the NameNode's data; ready to become the active NameNode. [Diagram: active and standby NameNodes above a set of DataNodes]
  • 24. MapReduce: A MapReduce job consists of two tasks: a map task and a reduce task. Blocks of data distributed across several machines are processed by map tasks in parallel. Results are aggregated in the reducer. Works only on key/value pairs
  • 25. MapReduce: Word Count. Can we do word count in parallel? Input: "Deer Bear River", "Car Car River", "Deer Car Bear". Map output: (Deer, 1), (Bear, 1), (River, 1); (Car, 1), (Car, 1), (River, 1); (Deer, 1), (Car, 1), (Bear, 1). Reduce output: Bear 2, Car 3, Deer 2, River 2
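
The word-count example on the slide above can be imitated in memory to show how the map, shuffle and reduce phases fit together. This is a sketch in plain Java with no Hadoop dependency; the class `LocalWordCount` and its method names are hypothetical, and everything runs in one JVM, whereas Hadoop would run the map calls on different nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the map, shuffle and reduce phases; no Hadoop involved.
final class LocalWordCount {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1L));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values by key. TreeMap keeps the keys sorted.
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs all three phases over the input splits.
    static Map<String, Long> wordCount(List<String> splits) {
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(map(split)); // in Hadoop, each split could run on a different node
        }
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : shuffle(mapped).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("Deer Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(wordCount(splits)); // prints {Bear=2, Car=3, Deer=2, River=2}
    }
}
```

Because each `map` call depends only on its own line, the three input lines could be processed independently, which is what makes the count parallelizable.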
  • 27. Data Flow in a MapReduce Program in Hadoop
  • 28. Mapper Class
        package ambuj.com.wc;

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Emits (word, 1) for every whitespace-separated token of each input line.
        public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable one = new LongWritable(1);
            private final Text word = new Text();

            @Override
            public void map(LongWritable inputKey, Text inputVal, Context context)
                    throws IOException, InterruptedException {
                String line = inputVal.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                }
            }
        }
  • 29. Reducer Class
        package ambuj.com.wc;

        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // Sums the counts emitted by the mappers for each word.
        public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            public void reduce(Text key, Iterable<LongWritable> listOfValues, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable val : listOfValues) {
                    sum = sum + val.get();
                }
                context.write(key, new LongWritable(sum));
            }
        }
  • 30. Driver Class
        package ambuj.com.wc;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;

        public class WordCountDriver extends Configured implements Tool {
            @Override
            public int run(String[] args) throws Exception {
                // Use the configuration prepared by ToolRunner so that generic
                // command-line options (for example -D key=value) take effect.
                Configuration conf = getConf();
                // Deprecated in Hadoop 2+, where Job.getInstance(conf, "WordCount") is preferred.
                Job job = new Job(conf, "WordCount");
                job.setJarByClass(WordCountDriver.class);
                job.setMapperClass(WordCountMapper.class);
                job.setReducerClass(WordCountReducer.class);
                job.setInputFormatClass(TextInputFormat.class);
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(LongWritable.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(LongWritable.class);
                // args[0] is the input path; args[1] is the output path.
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                // Report job failure through a non-zero exit code.
                return job.waitForCompletion(true) ? 0 : 1;
            }

            public static void main(String[] args) throws Exception {
                System.exit(ToolRunner.run(new WordCountDriver(), args));
            }
        }
  • 31. A view of Hadoop. [Diagram: a client submits a job to the master, which runs the Job Tracker (MapReduce) and the Name Node (HDFS); each slave runs a Task Tracker with its tasks (MapReduce) and a Data Node with its blocks (HDFS)]
  • 32. Use Cases: utilities want to predict power consumption; banks and insurance companies want to understand risk; fraud detection; marketing departments want to understand customers; recommendations; location-based ad targeting; threat analysis
  • 34. Hadoop Features & Summary: A distributed framework for processing and storing data, generally on commodity hardware. Completely open source and written in Java. Store anything: unstructured or semi-structured data. Storage capacity: scales linearly; cost is not exponential. Data locality and processing your way: code moves to the data; in MapReduce you specify the actual steps for processing the data and drive the output. Stream access: process data in any language. Failure and fault tolerance: detects failure and heals itself; reliable, data is replicated, failed tasks are rerun, no need to maintain a backup of the data. Cost effective: Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines. The Hadoop framework transparently provides applications with reliability, adaptability and data motion. Primarily used for batch processing, not real-time or transactional user applications.
  • 35. References - Hadoop: Hadoop: The Definitive Guide, Third Edition, by Tom White; http://hadoop.apache.org; http://www.cloudera.com; http://ambuj4bigdata.blogspot.com; http://ambujworld.wordpress.com