Pig, Hive, Flink, Kafka, Zeppelin... if you are now wondering whether someone just tried to offend you or whether those are just Pokemon names, then this talk is for you! Big Data is everywhere, and new tools for it are released almost at the speed of new JavaScript frameworks. In this entry-level presentation we will walk through the challenges Big Data presents, reflect on how big "big" really is, and introduce the currently most popular (and mostly open source) tools. We'll try to spark interest in Big Data by showing application areas and throwing out ideas you can later dive into.
13. Batch Processing
● Processes data from one or more
sources over a longer period of
time (e.g. a day or a month)
● Sources: databases, Apache Parquet, ...
● Not real-time
● Can take hours or more
14. Apache Hadoop
● Based on Google paper
● First release in 2006!
● Map -> (shuffle) -> Reduce
● Was the beginning of many projects
MapReduce
15. public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Hadoop Wordcount - part I
Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
16. public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Hadoop Wordcount - part II
17. public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Hadoop Wordcount - part III
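To see what the framework is doing behind the scenes, here is a minimal plain-Java sketch of the same map -> (shuffle) -> reduce flow, with no Hadoop dependency. The class and method names are made up for illustration; Hadoop additionally distributes each phase across a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every token in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                pairs.add(Map.entry(itr.nextToken(), 1));
            }
        }
        return pairs;
    }

    // "Shuffle" + "Reduce": group the pairs by key and sum the values per key.
    static Map<String, Integer> shuffleAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("hello world", "hello hadoop");
        Map<String, Integer> result = shuffleAndReduce(map(input));
        System.out.println(result); // {hadoop=1, hello=2, world=1}
    }
}
```

In the real job, the shuffle moves all pairs with the same key to the same reducer node; the grouping by key in `shuffleAndReduce` stands in for that step.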
20. Apache Flink
● DAGs with iterations
● Batch & native streaming
● FlinkML
● Table API & SQL (Beta)
● Gelly - graph analytics
● FlinkCEP - detect patterns in
data streams
● Compatible with Apache
Hadoop and Apache Storm APIs
● API: Scala, Java, Python*
21. Stream processing
● Near real-time/real time
● Processing (usually) does not
end
● Sources: files, Apache Kafka,
sockets, Akka actors, Twitter,
RabbitMQ, etc.
● Event time vs processing time
● Windows - fixed, sliding, session
● Watermarks
● State
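The window and event-time ideas above can be illustrated without any streaming framework. Below is a hedged plain-Java sketch (the names are invented, not a real streaming API) that assigns timestamped events to fixed (tumbling) windows and tracks a very naive watermark, i.e. the highest event time seen so far:

```java
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowSketch {

    // Start of the fixed (tumbling) window an event-time timestamp falls into.
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - (eventTimeMillis % windowSizeMillis);
    }

    public static void main(String[] args) {
        long size = 10_000; // 10-second tumbling windows
        // (eventTime, value) pairs; event time may differ from arrival order.
        long[][] events = { {1_000, 1}, {9_999, 1}, {12_000, 1}, {25_000, 1}, {3_000, 1} };

        Map<Long, Long> countsPerWindow = new TreeMap<>();
        long watermark = Long.MIN_VALUE;
        for (long[] e : events) {
            // Naive watermark: assume no event older than the newest one will arrive.
            watermark = Math.max(watermark, e[0]);
            // The per-window running sum is the "state" the engine must keep.
            countsPerWindow.merge(windowStart(e[0], size), e[1], Long::sum);
        }
        System.out.println(countsPerWindow); // {0=3, 10000=1, 20000=1}
        System.out.println(watermark);       // 25000
    }
}
```

Note that the last event (event time 3_000) arrives after the watermark has already passed 25_000; real engines use watermarks to decide when a window is complete and how to treat such late events.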
32. Conclusions
● There is a lot of it!
● https://pixelastic.github.io/pokemonorbigdata/
● If you want to learn, start with
SMACK stack (Spark, Mesos,
Akka, Cassandra, Kafka)