Pig, Hive, Flink, Kafka, Zeppelin... if you are now wondering whether someone just tried to offend you or whether those are just Pokemon names, then this talk is for you! Big Data is everywhere, and new tools for it are released almost at the speed of new JavaScript frameworks. In this entry-level presentation we will walk through the challenges Big Data presents, reflect on how big "big" really is, and introduce the currently most popular (and mostly open source) tools. We'll try to spark interest in Big Data by showing application areas and throwing out ideas you can later dive into.
13. Batch Processing
● Processes data from one or more
sources over a longer period of
time (e.g. a day or a month)
● Sources: databases, Apache Parquet, ...
● Not real-time
● Can take hours or more
14. Apache Hadoop
● Based on Google paper
● First release in 2006!
● Map -> (shuffle) -> Reduce
● Was the beginning of many projects
MapReduce
15. public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Hadoop Wordcount - part I
Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
16. public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Hadoop Wordcount - part II
17. public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Hadoop Wordcount - part III
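To see what the framework is doing behind the scenes, here is a minimal plain-Java sketch of the same map -> (shuffle) -> reduce flow, with no Hadoop dependency. The class and method names are made up for illustration; Hadoop additionally distributes each phase across a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every token in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                pairs.add(Map.entry(itr.nextToken(), 1));
            }
        }
        return pairs;
    }

    // "Shuffle" + "Reduce": group the pairs by key and sum the values per key.
    static Map<String, Integer> shuffleAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("hello world", "hello hadoop");
        Map<String, Integer> result = shuffleAndReduce(map(input));
        System.out.println(result); // {hadoop=1, hello=2, world=1}
    }
}
```

In the real job, the shuffle moves all pairs with the same key to the same reducer node; the grouping by key in `shuffleAndReduce` stands in for that step.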
20. Apache Flink
● DAGs with iterations
● Batch & native streaming
● FlinkML
● Table API & SQL (Beta)
● Gelly - graph analytics
● FlinkCEP - detect patterns in
data streams
● Compatible with Apache
Hadoop and Apache Storm APIs
● API: Scala, Java, Python*
21. Stream processing
● Near real-time/real time
● Processing (usually) does not
end
● Sources: files, Apache Kafka,
sockets, Akka actors, Twitter,
RabbitMQ, etc.
● Event time vs processing time
● Windows - fixed, sliding, session
● Watermarks
● State
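The window and event-time ideas above can be illustrated without any streaming framework. Below is a hedged plain-Java sketch (the names are invented, not a real streaming API) that assigns timestamped events to fixed (tumbling) windows and tracks a very naive watermark, i.e. the highest event time seen so far:

```java
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowSketch {

    // Start of the fixed (tumbling) window an event-time timestamp falls into.
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - (eventTimeMillis % windowSizeMillis);
    }

    public static void main(String[] args) {
        long size = 10_000; // 10-second tumbling windows
        // (eventTime, value) pairs; event time may differ from arrival order.
        long[][] events = { {1_000, 1}, {9_999, 1}, {12_000, 1}, {25_000, 1}, {3_000, 1} };

        Map<Long, Long> countsPerWindow = new TreeMap<>();
        long watermark = Long.MIN_VALUE;
        for (long[] e : events) {
            // Naive watermark: assume no event older than the newest one will arrive.
            watermark = Math.max(watermark, e[0]);
            // The per-window running sum is the "state" the engine must keep.
            countsPerWindow.merge(windowStart(e[0], size), e[1], Long::sum);
        }
        System.out.println(countsPerWindow); // {0=3, 10000=1, 20000=1}
        System.out.println(watermark);       // 25000
    }
}
```

Note that the last event (event time 3_000) arrives after the watermark has already passed 25_000; real engines use watermarks to decide when a window is complete and how to treat such late events.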
32. Conclusions
● There is a lot of it!
● https://pixelastic.github.io/pokemonorbigdata/
● If you want to learn, start with
SMACK stack (Spark, Mesos,
Akka, Cassandra, Kafka)