Big Data Analysis With
RHadoop
David Chiu (Yu-Wei, Chiu)
@ML/DM Monday
2014/03/17
About Me
 Co-Founder of NumerInfo
 Ex-Trend Micro Engineer
 ywchiu-tw.appspot.com
R + Hadoop
http://cdn.overdope.com/wp-content/uploads/2014/03/dragon-ball-z-x-hmn-alns-fusion-1.jpg
 Scaling R
Hadoop enables R to do parallel computing
 No need to learn a new language
Learning to use Java takes time
Why Use RHadoop
RHadoop Architecture
 rmr2: talks to MapReduce through the Hadoop Streaming API
 rhdfs: talks to HDFS
 rhbase: talks to HBase through the HBase Thrift gateway
(All three packages are loaded and driven from R.)
 Enables developers to write mappers and reducers in
any scripting language (R, Python, Perl, …)
 Mapper, reducer, and optional combiner processes are
written to read from standard input and write to
standard output
 A streaming job carries the extra overhead of starting a
scripting VM/interpreter for each task
Streaming vs. Native Java
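To make the stdin/stdout protocol concrete, here is a rough sketch (not from the deck) of what a bare streaming mapper looks like in R without rmr2; the word-count logic is only an illustration:
#!/usr/bin/env Rscript
# Bare Hadoop Streaming mapper: read lines from stdin, emit "word<TAB>1" on stdout.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  for (w in unlist(strsplit(line, "[[:space:]]+"))) {
    if (nzchar(w)) cat(sprintf("%s\t1\n", w))
  }
}
close(con)
rmr2 hides exactly this plumbing behind the mapreduce() function shown next.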
 Writing MapReduce Using R
 mapreduce function
mapreduce(input, output, map, reduce, …)
 Changelog
rmr 3.0.0 (2014/02/10): 10X faster than rmr 2.3.0
rmr 2.3.0 (2013/10/07): support plyrmr
rmr2
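A minimal sketch of the call shape, assuming rmr2 is loaded and HADOOP_CMD/HADOOP_STREAMING are already set; the paths and the toy map/reduce below are illustrative assumptions, not from the deck:
library(rmr2)
job <- mapreduce(
  input        = "/user/cloudera/some/input",   # an HDFS path, or a big.data object from to.dfs()
  output       = NULL,                          # NULL lets rmr2 pick a temporary output path
  input.format = "text",
  map          = function(k, v) keyval(v, 1),
  reduce       = function(k, vv) keyval(k, sum(vv)))
result <- from.dfs(job)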
 Access HDFS From R
 Exchange data between R data frames and HDFS
rhdfs
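A hedged usage sketch (the file names are made up; it assumes the same HADOOP_CMD setup shown later in the installation slides):
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/cloudera/demo")
hdfs.put("local_data.csv", "/user/cloudera/demo")            # local file -> HDFS
hdfs.ls("/user/cloudera/demo")                                # confirm it landed
hdfs.get("/user/cloudera/demo/local_data.csv", "copy.csv")   # HDFS -> local file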
 Exchange data between R and HBase
 Uses the HBase Thrift API
rhbase
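A very small sketch; treat the details as assumptions, and note it presumes the HBase Thrift server is already running on its default port:
library(rhbase)
hb.init()            # connect to the HBase Thrift gateway (localhost:9090 by default)
hb.list.tables()     # list the tables visible through Thrift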
 Perform common data manipulation operations,
as found in plyr and reshape2
 It provides a familiar plyr-like interface while
hiding many of the mapreduce details
 plyr: Tools for splitting, applying and
combining data
NEW! plyrmr
RHadoop
Installation
 R and related packages should be installed on
each task node of the cluster
 A Hadoop cluster: CDH3 or later, or Apache Hadoop
1.0.2 or later, but limited to MRv1, not MRv2.
MRv2 is supported from Apache Hadoop 2.2.0 or
HDP 2
Prerequisites
Getting Ready (Cloudera VM)
 Download
http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
 This VM runs
CentOS 6.2
CDH4.4
R 3.0.1
Java 1.6.0_32
CDH 4.4
Get RHadoop
 https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
Installing rmr2 dependencies
 Make sure the packages are installed system-wide
$ sudo R
> install.packages(c("codetools", "R", "Rcpp",
"RJSONIO", "bitops", "digest", "functional", "stringr",
"plyr", "reshape2", "rJava", "caTools"))
Install rmr2
$ wget --no-check-certificate
https://raw.github.com/RevolutionAnalytics/rmr2/3.0.0/build/rmr2_3.0.0.tar.gz
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Installing…
 http://cran.r-project.org/src/contrib/Archive/Rcpp/
Downgrade Rcpp
$ wget --no-check-certificate http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.11.0.tar.gz
$ sudo R CMD INSTALL Rcpp_0.11.0.tar.gz
Install Rcpp_0.11.0
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Install rmr2 again
Install RHDFS
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz
Enable hdfs
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
> library(rmr2)
> library(rhdfs)
> hdfs.init()
Javareconf error
$ sudo R CMD javareconf
javareconf with correct JAVA_HOME
$ echo $JAVA_HOME
$ sudo JAVA_HOME=/usr/java/jdk1.6.0_32 R CMD javareconf
MapReduce
With RHadoop
MapReduce
 mapreduce(input, output, map, reduce)
 Like sapply, lapply, tapply within R
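A hedged sketch (not on the slide) of how the keyval()/reduce pattern plays the role tapply() plays in plain R: group values by key, then aggregate each group.
library(rmr2)
x   <- c(1, 2, 3, 4, 5, 6)
grp <- c("a", "b", "a", "b", "a", "b")
tapply(x, grp, sum)                       # in-memory R: a = 9, b = 12

dfs.in <- to.dfs(keyval(grp, x))          # the same data as key/value pairs in HDFS
out <- mapreduce(input  = dfs.in,
                 map    = function(k, v) keyval(k, v),
                 reduce = function(k, v) keyval(k, sum(v)))
from.dfs(out)                             # same sums, computed as a MapReduce job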
Hello World – For Hadoop
http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
Move File Into HDFS
# Put data into hdfs
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/cloudera/wordcount/data")
hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data")
$ hadoop fs -mkdir /user/cloudera/wordcount/data
$ hadoop fs -put wc_input.txt /user/cloudera/wordcount/data
Wordcount Mapper
map <- function(k,lines) {
words.list <- strsplit(lines, '\\s')
words <- unlist(words.list)
return( keyval(words, 1) )
}
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
#Mapper
Wordcount Reducer
reduce <- function(word, counts) {
keyval(word, sum(counts))
}
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
#Reducer
Call Wordcount
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
wordcount <- function (input, output=NULL) {
mapreduce(input=input, output=output,
input.format="text", map=map, reduce=reduce)
}
out <- wordcount(hdfs.data, hdfs.out)
Read data from HDFS
results <- from.dfs(out)
results$key[order(results$val, decreasing = TRUE)][1:10]
$ hadoop fs -cat /user/cloudera/wordcount/out/part-00000 | sort -k 2 -nr | head -n 10
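A small follow-up sketch (not on the slide): putting the key/value result into a data frame makes the top ten easier to read.
wc <- data.frame(word = results$key, count = results$val, stringsAsFactors = FALSE)
head(wc[order(wc$count, decreasing = TRUE), ], 10)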
MapReduce Benchmark
> a.time <- proc.time()
> small.ints2=1:100000
> result.normal = sapply(small.ints2, function(x) x^2)
> proc.time() - a.time
> b.time <- proc.time()
> small.ints= to.dfs(1:100000)
> result = mapreduce(input = small.ints, map = function(k,v)
cbind(v,v^2))
> proc.time() - b.time
sapply
Elapsed 0.982 seconds
mapreduce
Elapsed 102.755 seconds
 HDFS stores your files as data chunks distributed across
multiple datanodes
 M/R runs multiple programs, called mappers, on each of
the data chunks or blocks. The (key, value) output of
these mappers is then compiled into the result by the
reducers.
 It takes time for the mappers and reducers to be spawned
across this distributed system.
Hadoop Latency
It is not possible to apply R's built-in machine learning
methods directly to a MapReduce program
kcluster <- kmeans(mydata, 4, iter.max = 10)
Kmeans Clustering
kmeans =
  function(points, ncenters, iterations = 10, distfun = NULL) {
    if (is.null(distfun))
      distfun = function(a, b) norm(as.matrix(a - b), type = 'F')
    newCenters =
      kmeans.iter(
        points,
        distfun,
        ncenters = ncenters)
    # iteratively choosing new centers
    for (i in 1:iterations) {
      newCenters = kmeans.iter(points, distfun,
                               centers = newCenters)
    }
    newCenters
  }
Kmeans in MapReduce Style
kmeans.iter =
  function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
    from.dfs(
      mapreduce(input = points,
                map =
                  if (is.null(centers)) {          # assign each point to a random center
                    function(k, v) keyval(sample(1:ncenters, 1), v)
                  } else {
                    function(k, v) {               # find the center of minimum distance
                      distances = apply(centers, 1, function(c) distfun(c, v))
                      keyval(centers[which.min(distances), ], v)
                    }
                  },
                reduce = function(k, vv) keyval(NULL,
                                                apply(do.call(rbind, vv), 2, mean))),
      to.data.frame = T)
  }
Kmeans in MapReduce Style
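A hedged usage sketch, not from the deck: toy 2-D points are pushed to HDFS with to.dfs() and fed to the driver above. The data, seed, and cluster count are all assumptions, and it presumes the rmr version these slides target.
library(rmr2)
set.seed(42)
pts <- do.call(rbind,
               lapply(1:4, function(i) matrix(rnorm(50, mean = i * 5), ncol = 2)))
dfs.pts <- to.dfs(pts)                                     # distribute the points
centers <- kmeans(dfs.pts, ncenters = 4, iterations = 5)   # the kmeans() defined above
centers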
One More Thing…
plyrmr
 Perform common data manipulation operations,
as found in plyr and reshape2
 It provides a familiar plyr-like interface while
hiding many of the mapreduce details
 plyr: Tools for splitting, applying and
combining data
NEW! plyrmr
Installing plyrmr dependencies
$ sudo yum install libxml2-devel
$ sudo yum install curl-devel
$ sudo R
> install.packages(c("RCurl", "httr"), dependencies = TRUE)
> install.packages("devtools", dependencies = TRUE)
> library(devtools)
> install_github("pryr", "hadley")
> install.packages(c("R.methodsS3", "hydroPSO"), dependencies = TRUE)
$ wget --no-check-certificate
https://raw.github.com/RevolutionAnalytics/plyrmr/master/build/plyrmr_0.1.0.tar.gz
$ sudo R CMD INSTALL plyrmr_0.1.0.tar.gz
Installing plyrmr
> data(mtcars)
> head(mtcars)
> transform(mtcars, carb.per.cyl = carb/cyl)
> library(plyrmr)
> output(input(mtcars), "/tmp/mtcars")
> as.data.frame(transform(input("/tmp/mtcars"),
carb.per.cyl = carb/cyl))
> output(transform(input("/tmp/mtcars"), carb.per.cyl =
carb/cyl), "/tmp/mtcars.out")
Transform in plyrmr
 where(
select(
mtcars,
carb.per.cyl = carb/cyl,
.replace = FALSE),
carb.per.cyl >= 1)
select and where
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
 as.data.frame(
select(
group(
input("/tmp/mtcars"),
cyl),
mean.mpg = mean(mpg)))
Group by
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
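A hedged combination of the pieces above, under the assumption that where(), group(), and select() compose in a single pipeline the same way they do individually; mean.hp is just an illustrative name:
as.data.frame(
  select(
    group(
      where(
        input("/tmp/mtcars"),
        mpg > 20),
      cyl),
    mean.hp = mean(hp)))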
Reference
 https://github.com/RevolutionAnalytics/RHadoop/wiki
 http://www.slideshare.net/RevolutionAnalytics/rhadoop-r-meets-hadoop
 http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
 Website
ywchiu-tw.appspot.com
 Email
david@numerinfo.com
tr.ywchiu@gmail.com
 Company
numerinfo.com
Contact