SlideShare uma empresa Scribd logo
1 de 34
Baixar para ler offline
Small intro to Big Data
Michał Matłoka @mmatloka
Outline
1. What is Big Data?
2. Storing
3. Batch & Streams processing
4. Resource Managers
5. Machine Learning
6. Analysis & Visualization
7. Other
What is Big Data?
● Volume
● Velocity
● Variety
How big is Big?
ThoughtWorks: Big Data envy
Storing
CAP theorem
(Brewer’s theorem)
In distributed system you can only
have two of three guarantees:
● Consistency
● Availability
● Partition Tolerance
Relational scaling
(horizontal)
Example limitations:
● Max 48 nodes
● Read-only nodes
● Cross-shard joins…
● Auto-increments
● Distributed transactions,
possible, but…
It can work!
You don’t always need ACID
BASE might be enough
NoSQL
(Not only SQL)
● Key-value (Redis, Dynamo, ...)
● Column (Cassandra, HBase, ...)
● Document (MongoDB, … )
● Graph (Neo4J, …
● Multi-model (OrientDB, …)
Apple - 115k Cassandra nodes with
over 10PB of data!
Source: http://blog.nahurst.com/visual-guide-to-nosql-systems
Batch Processing
● Processes data from 1 or more
sources from bigger period of
time (e.g. day, month)
● Source: db, Apache Parquet, ...
● Not real-time
● Can take hours or more
Apache Hadoop
● Based on Google paper
● First release in 2006!
● Map -> (shuffle) -> Reduce
● Was the beginning of many
projectsMapReduce
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
Hadoop Wordcount - part I
Source:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-
core/MapReduceTutorial.html
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
Hadoop Wordcount - part II
Source:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-
core/MapReduceTutorial.html
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Hadoop Wordcount - part III
Source:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-
core/MapReduceTutorial.html
Apache Spark
RDD (Resilient Distributed
Dataset)
DAG (Directed acyclic graph)
● RDD - map, filter, count etc
● Spark SQL
● MLib
● GraphX
● Spark Streaming
● API: Scala, Java, Python, R*
val textFile = sc.textFile("hdfs://...")
val counts = textFile
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark Wordcount
Source: http://spark.apache.org/examples.html
Apache Flink
DAGs with iterations
● Batch & native streaming
● FlinkML
● Table API & SQL (Beta)
● Gelly - graph analytics
● FlinkCEP - detect patterns in
data streams
● Compatible with Apache
Hadoop and Apache Storm APIs
● API: Scala, Java, Python*
Stream
processing
● Near real-time/real time
● Processing (usually) does not
end
● Source: files, Apache Kafka,
Socket, Akka Actors, Twitter,
RabbitMQ etc
● Event time vs processing time
● Windows - fixed, sliding, session
● Watermarks
● State
Stream
processing
● Native/micro-batch
● Latency
● Throughput
● Delivery guarantees
● Resources managers
● API - compositional/declarative
● Maturity
Differences
Stream processing
● Apache Storm
● Apache Storm Trident
● Alibaba JStorm
● Twitter Heron
● Apache Spark Streaming
● Apache Flink
● Apache Beam
● Apache Kafka Streams
● Apache Samza
● Apache Gearpump
● Apache Apex
● Apache Ignite Streaming
● Apache S4
● ...
Resource
management
● Apache YARN
● Apache Mesos (1.0.1!)
● Apache Slider - deploy existing
apps on YARN
● Apache Myriad - YARN on
Mesos
● DC/OS
Source:
https://docs.mesosphere.com/wp-content/uploads/2016/04/dashb
oard-ee-600x395@2x.gif
Analysis
SQL Engines & Querying
● Apache Hive
● Apache Pig
● Apache HAWQ
● Apache Impala
● Apache Phoenix
● Apache Spark SQL
● Apache Drill
● Facebook Presto
● ...
Machine Learning
● Apache Mahout
● Apache Samoa
● Spark MLib
● FlinkML
● H2O
● TensorFlow
Notebooks
● IPython
● Jupyter
● Spark Notebook
● Apache Zeppelin
Source: https://zeppelin.apache.org/assets/themes/zeppelin/img/notebook.png
Hadoop-related
● Apache Sqoop
● Apache Flume
● Apache Oozie
● Hue
● Apache HDFS
● Apache Ambari
● Apache Knox
● Apache ZooKeeper
Awesome Big Data
https://github.com/onurakpolat
/awesome-bigdata
Conclusions
● There is a lot of it!
● https://pixelastic.github.io/pokem
onorbigdata/
● If you want to learn, start with
SMACK stack (Spark, Mesos,
Akka, Cassandra, Kafka)
Articles & references
● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-t
echnologies/
● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-
frameworks-part-1
● https://dcos.io/
● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
● http://spark.apache.org/
● https://flink.apache.org/
● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of-
big-data-analytics-frameworks
● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing
● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Thank you, Q&A?
@mmatloka
http://www.slideshare.net/softwaremill
https://softwaremill.com/blog/

Mais conteúdo relacionado

Mais procurados

Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDBMongoDB
 
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...Athens Big Data
 
Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3Regunath B
 
Apache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java MeetupApache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java MeetupPatrick Deenen
 
Solr Payloads for Ranking Data
Solr Payloads for Ranking Data Solr Payloads for Ranking Data
Solr Payloads for Ranking Data BloomReach
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015NoSQLmatters
 
Webinar: When to Use MongoDB
Webinar: When to Use MongoDBWebinar: When to Use MongoDB
Webinar: When to Use MongoDBMongoDB
 
Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters Evan Casey
 
Building an API layer for C* at Coursera
Building an API layer for C* at CourseraBuilding an API layer for C* at Coursera
Building an API layer for C* at CourseraDaniel Jin Hao Chia
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillMapR Technologies
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDBMongoDB
 
Performance comparison: Multi-Model vs. MongoDB and Neo4j
Performance comparison: Multi-Model vs. MongoDB and Neo4jPerformance comparison: Multi-Model vs. MongoDB and Neo4j
Performance comparison: Multi-Model vs. MongoDB and Neo4jArangoDB Database
 
An introduction to Nosql
An introduction to NosqlAn introduction to Nosql
An introduction to Nosqlgreprep
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...NoSQLmatters
 
Nosql databases for the .net developer
Nosql databases for the .net developerNosql databases for the .net developer
Nosql databases for the .net developerJesus Rodriguez
 
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidJuly 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidYahoo Developer Network
 

Mais procurados (20)

Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
 
Taming NoSQL with Spring Data
Taming NoSQL with Spring DataTaming NoSQL with Spring Data
Taming NoSQL with Spring Data
 
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
 
Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3
 
Apache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java MeetupApache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java Meetup
 
Solr Payloads for Ranking Data
Solr Payloads for Ranking Data Solr Payloads for Ranking Data
Solr Payloads for Ranking Data
 
Scalding @ Coursera
Scalding @ CourseraScalding @ Coursera
Scalding @ Coursera
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
 
Webinar: When to Use MongoDB
Webinar: When to Use MongoDBWebinar: When to Use MongoDB
Webinar: When to Use MongoDB
 
Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters Machine Learning with Apache Spark - HackNY Masters
Machine Learning with Apache Spark - HackNY Masters
 
Building an API layer for C* at Coursera
Building an API layer for C* at CourseraBuilding an API layer for C* at Coursera
Building an API layer for C* at Coursera
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
Performance comparison: Multi-Model vs. MongoDB and Neo4j
Performance comparison: Multi-Model vs. MongoDB and Neo4jPerformance comparison: Multi-Model vs. MongoDB and Neo4j
Performance comparison: Multi-Model vs. MongoDB and Neo4j
 
An introduction to Nosql
An introduction to NosqlAn introduction to Nosql
An introduction to Nosql
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
 
Nosql databases for the .net developer
Nosql databases for the .net developerNosql databases for the .net developer
Nosql databases for the .net developer
 
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidJuly 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
 
Pandas
PandasPandas
Pandas
 

Destaque

Do you think you're doing microservice architecture? What about infrastructur...
Do you think you're doing microservice architecture? What about infrastructur...Do you think you're doing microservice architecture? What about infrastructur...
Do you think you're doing microservice architecture? What about infrastructur...Marcin Grzejszczak
 
JDD 2016 - Tomasz Gagor, Pawel Torbus - A Needle In A Logstack
JDD 2016 - Tomasz Gagor, Pawel Torbus - A Needle In A LogstackJDD 2016 - Tomasz Gagor, Pawel Torbus - A Needle In A Logstack
JDD 2016 - Tomasz Gagor, Pawel Torbus - A Needle In A LogstackPROIDEA
 
PLNOG 18 - Michał Sajdak - IoT hacking w praktyce
PLNOG 18 - Michał Sajdak - IoT hacking w praktycePLNOG 18 - Michał Sajdak - IoT hacking w praktyce
PLNOG 18 - Michał Sajdak - IoT hacking w praktycePROIDEA
 
4Developers: Mateusz Stasch- Domain Events - czyli jak radzić sobie z rzeczyw...
4Developers: Mateusz Stasch- Domain Events - czyli jak radzić sobie z rzeczyw...4Developers: Mateusz Stasch- Domain Events - czyli jak radzić sobie z rzeczyw...
4Developers: Mateusz Stasch- Domain Events - czyli jak radzić sobie z rzeczyw...PROIDEA
 
JDD 2016 - Wojciech Oczkowski - Testowanie Wydajnosci Za Pomoca Narzedzia JMH
JDD 2016 - Wojciech Oczkowski - Testowanie Wydajnosci Za Pomoca Narzedzia JMHJDD 2016 - Wojciech Oczkowski - Testowanie Wydajnosci Za Pomoca Narzedzia JMH
JDD 2016 - Wojciech Oczkowski - Testowanie Wydajnosci Za Pomoca Narzedzia JMHPROIDEA
 
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course PROIDEA
 
PLNOG 18 - Alan Kuczewski - "Mały może więcej" - Rozwiązanie wysokiej dostępn...
PLNOG 18 - Alan Kuczewski - "Mały może więcej" - Rozwiązanie wysokiej dostępn...PLNOG 18 - Alan Kuczewski - "Mały może więcej" - Rozwiązanie wysokiej dostępn...
PLNOG 18 - Alan Kuczewski - "Mały może więcej" - Rozwiązanie wysokiej dostępn...PROIDEA
 
PLNOG 18 - Marcin Kuczera- ONT idealny
PLNOG 18 - Marcin Kuczera- ONT idealny PLNOG 18 - Marcin Kuczera- ONT idealny
PLNOG 18 - Marcin Kuczera- ONT idealny PROIDEA
 
4Developers: Aplikacja od SaaSa do IdaaSa
4Developers: Aplikacja od SaaSa do IdaaSa4Developers: Aplikacja od SaaSa do IdaaSa
4Developers: Aplikacja od SaaSa do IdaaSaTomek Onyszko
 

Destaque (9)

Do you think you're doing microservice architecture? What about infrastructur...
Do you think you're doing microservice architecture? What about infrastructur...Do you think you're doing microservice architecture? What about infrastructur...
Do you think you're doing microservice architecture? What about infrastructur...
 
JDD 2016 - Tomasz Gagor, Pawel Torbus - A Needle In A Logstack
JDD 2016 - Tomasz Gagor, Pawel Torbus - A Needle In A LogstackJDD 2016 - Tomasz Gagor, Pawel Torbus - A Needle In A Logstack
JDD 2016 - Tomasz Gagor, Pawel Torbus - A Needle In A Logstack
 
PLNOG 18 - Michał Sajdak - IoT hacking w praktyce
PLNOG 18 - Michał Sajdak - IoT hacking w praktycePLNOG 18 - Michał Sajdak - IoT hacking w praktyce
PLNOG 18 - Michał Sajdak - IoT hacking w praktyce
 
4Developers: Mateusz Stasch- Domain Events - czyli jak radzić sobie z rzeczyw...
4Developers: Mateusz Stasch- Domain Events - czyli jak radzić sobie z rzeczyw...4Developers: Mateusz Stasch- Domain Events - czyli jak radzić sobie z rzeczyw...
4Developers: Mateusz Stasch- Domain Events - czyli jak radzić sobie z rzeczyw...
 
JDD 2016 - Wojciech Oczkowski - Testowanie Wydajnosci Za Pomoca Narzedzia JMH
JDD 2016 - Wojciech Oczkowski - Testowanie Wydajnosci Za Pomoca Narzedzia JMHJDD 2016 - Wojciech Oczkowski - Testowanie Wydajnosci Za Pomoca Narzedzia JMH
JDD 2016 - Wojciech Oczkowski - Testowanie Wydajnosci Za Pomoca Narzedzia JMH
 
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
 
PLNOG 18 - Alan Kuczewski - "Mały może więcej" - Rozwiązanie wysokiej dostępn...
PLNOG 18 - Alan Kuczewski - "Mały może więcej" - Rozwiązanie wysokiej dostępn...PLNOG 18 - Alan Kuczewski - "Mały może więcej" - Rozwiązanie wysokiej dostępn...
PLNOG 18 - Alan Kuczewski - "Mały może więcej" - Rozwiązanie wysokiej dostępn...
 
PLNOG 18 - Marcin Kuczera- ONT idealny
PLNOG 18 - Marcin Kuczera- ONT idealny PLNOG 18 - Marcin Kuczera- ONT idealny
PLNOG 18 - Marcin Kuczera- ONT idealny
 
4Developers: Aplikacja od SaaSa do IdaaSa
4Developers: Aplikacja od SaaSa do IdaaSa4Developers: Aplikacja od SaaSa do IdaaSa
4Developers: Aplikacja od SaaSa do IdaaSa
 

Semelhante a JDD 2016 - Michal Matloka - Small Intro To Big Data

Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...Alexey Zinoviev
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"Giivee The
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaJose Mº Muñoz
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupRadu Chilom
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 

Semelhante a JDD 2016 - Michal Matloka - Small Intro To Big Data (20)

Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest Córdoba
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
In-memory database
In-memory databaseIn-memory database
In-memory database
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 

Último

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Último (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

JDD 2016 - Michal Matloka - Small Intro To Big Data

  • 1. Small intro to Big Data Michał Matłoka @mmatloka
  • 2. Outline 1. What is Big Data? 2. Storing 3. Batch & Streams processing 4. Resource Managers 5. Machine Learning 6. Analysis & Visualization 7. Other
  • 3. What is Big Data? ● Volume ● Velocity ● Variety
  • 4. How big is Big?
  • 7. CAP theorem (Brewer’s theorem) In distributed system you can only have two of three guarantees: ● Consistency ● Availability ● Partition Tolerance
  • 8. Relational scaling (horizontal) Example limitations: ● Max 48 nodes ● Read-only nodes ● Cross-shard joins… ● Auto-increments ● Distributed transactions, possible, but… It can work!
  • 9. You don’t always need ACID
  • 10. BASE might be enough
  • 11. NoSQL (Not only SQL) ● Key-value (Redis, Dynamo, ...) ● Column (Cassandra, HBase, ...) ● Document (MongoDB, … ) ● Graph (Neo4J, … ● Multi-model (OrientDB, …) Apple - 115k Cassandra nodes with over 10PB of data!
  • 13. Batch Processing ● Processes data from 1 or more sources from bigger period of time (e.g. day, month) ● Source: db, Apache Parquet, ... ● Not real-time ● Can take hours or more
  • 14. Apache Hadoop ● Based on Google paper ● First release in 2006! ● Map -> (shuffle) -> Reduce ● Was the beginning of many projectsMapReduce
  • 15. public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } Hadoop Wordcount - part I Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client- core/MapReduceTutorial.html
  • 16. public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } Hadoop Wordcount - part II Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client- core/MapReduceTutorial.html
  • 17. public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } Hadoop Wordcount - part III Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client- core/MapReduceTutorial.html
  • 18. Apache Spark RDD (Resilient Distributed Dataset) DAG (Directed acyclic graph) ● RDD - map, filter, count etc ● Spark SQL ● MLib ● GraphX ● Spark Streaming ● API: Scala, Java, Python, R*
  • 19. val textFile = sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...") Spark Wordcount Source: http://spark.apache.org/examples.html
  • 20. Apache Flink DAGs with iterations ● Batch & native streaming ● FlinkML ● Table API & SQL (Beta) ● Gelly - graph analytics ● FlinkCEP - detect patterns in data streams ● Compatible with Apache Hadoop and Apache Storm APIs ● API: Scala, Java, Python*
  • 21. Stream processing ● Near real-time/real time ● Processing (usually) does not end ● Source: files, Apache Kafka, Socket, Akka Actors, Twitter, RabbitMQ etc ● Event time vs processing time ● Windows - fixed, sliding, session ● Watermarks ● State
  • 22. Stream processing ● Native/micro-batch ● Latency ● Throughput ● Delivery guarantees ● Resources managers ● API - compositional/declarative ● Maturity Differences
  • 23. Stream processing ● Apache Storm ● Apache Storm Trident ● Alibaba JStorm ● Twitter Heron ● Apache Spark Streaming ● Apache Flink ● Apache Beam ● Apache Kafka Streams ● Apache Samza ● Apache Gearpump ● Apache Apex ● Apache Ignite Streaming ● Apache S4 ● ...
  • 24. Resource management ● Apache YARN ● Apache Mesos (1.0.1!) ● Apache Slider - deploy existing apps on YARN ● Apache Myriad - YARN on Mesos ● DC/OS
  • 26. Analysis SQL Engines & Querying ● Apache Hive ● Apache Pig ● Apache HAWQ ● Apache Impala ● Apache Phoenix ● Apache Spark SQL ● Apache Drill ● Facebook Presto ● ...
  • 27. Machine Learning ● Apache Mahout ● Apache Samoa ● Spark MLib ● FlinkML ● H2O ● TensorFlow
  • 28. Notebooks ● IPython ● Jupyter ● Spark Notebook ● Apache Zeppelin
  • 30. Hadoop-related ● Apache Sqoop ● Apache Flume ● Apache Oozie ● Hue ● Apache HDFS ● Apache Ambari ● Apache Knox ● Apache ZooKeeper
  • 32. Conclusions ● There is a lot of it! ● https://pixelastic.github.io/pokem onorbigdata/ ● If you want to learn, start with SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka)
  • 33. Articles & references ● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-t echnologies/ ● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing- frameworks-part-1 ● https://dcos.io/ ● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn ● http://spark.apache.org/ ● https://flink.apache.org/ ● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of- big-data-analytics-frameworks ● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing ● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102