Jay Kreps, open source visionary and co-founder of Confluent and several open source projects, will be visiting LA. I have asked him to come present at our group. He will present his vision and answer questions about Kafka and other projects.
Bio:
Jay is the co-founder and CEO of Confluent, a company built around real-time data streams and the open source messaging system Apache Kafka. He is the original author of several open source projects, including Apache Kafka, Apache Samza, Voldemort, and Azkaban.
5. Problems
• Data coverage
• Many source systems
◦ Relational DBs
◦ Log files
◦ Metrics
◦ Messaging systems
• Many data formats
• Constant change
◦ New schemas
◦ New data sources
21. Kafka: A Modern Distributed System for Streams
• Scalability of a filesystem
◦ Hundreds of MB/sec/server throughput
◦ Many TB per server
• Guarantees of a database
◦ Messages strictly ordered
◦ All data persistent
• Distributed by default
◦ Replication
◦ Partitioning model
• Producers, consumers, and brokers are all fault tolerant and horizontally scalable (a topic-creation sketch follows below)
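To make the partitioning and replication model concrete, here is a minimal sketch of topic creation using the Java AdminClient from today's Kafka releases (this admin API postdates the talk); the topic name and sizing are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 12 partitions spread writes across brokers;
            // replication factor 3 keeps each message on three servers.
            NewTopic topic = new NewTopic("page-views", 12, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```

Ordering is per partition: records with the same key are routed to the same partition, where they are strictly ordered, which is how the database-style guarantee above is delivered.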
31. Usage at LinkedIn
• Everything in the company is a real-time stream
• > 1.2 trillion messages written per day
• > 3.4 trillion messages read per day
• ~ 1 PB of stream data
• Tens of thousands of producer processes
• Backbone for data stores
◦ Search
◦ Social Graph
◦ Newsfeed
◦ Primary storage (in progress)
• Basis for stream processing
33. • Mission: Make this a practical reality everywhere
• Product: Confluent Platform
◦ Apache Kafka
◦ Schemas and metadata management (see the sketch after this list)
◦ Connectors for common systems
◦ Monitor data flow end-to-end
◦ Stream processing integration
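As one example of the schema-management piece, the Schema Registry in the Confluent Platform exposes a REST API for registering schemas. A minimal sketch using Java's built-in HTTP client, assuming a registry on the default localhost:8081 and a hypothetical page-views topic:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterSchema {
    public static void main(String[] args) throws Exception {
        // A tiny Avro schema for page-view events (illustrative only).
        String avro = "{\"type\":\"record\",\"name\":\"PageView\","
                + "\"fields\":[{\"name\":\"user\",\"type\":\"string\"}]}";
        // Wrap the Avro schema as a JSON string for the registry's request body.
        String body = "{\"schema\": \"" + avro.replace("\"", "\\\"") + "\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/subjects/page-views-value/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"id":1}
    }
}
```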
Year = 2009:
This story takes place at LinkedIn.
I was on the infra team.
Previously had built a distributed key-value store
Was leading the Hadoop adoption effort.
Had done initial prototypes and had some valuable things
Wanted a data lake/hub
Various ad hoc load jobs
Manual parsing of each new data source into the target schema
14% of data
Not the kind of team I wanted
Data grandfathering
14% of data in Hadoop
N new non-Hadoop engineers imply O(N) Hadoop engineers
Got interested in things like schemas, metadata, dataflow, etc
Real-time, mostly lossless; couldn't go back in time
Batch, mostly lossless; high latency
Why CSV dumps?
Batch aggregation
Lossy, high-latency, only went to data warehouse/Hadoop
Splunk
Low-throughput, lossy, no real scalability story
No central system
No integration with the batch world
Reality was about 100x more complex
300 services
~100 databases
Multi-datacenter
Trawling: load into Oracle, search, etc
Publish data from Hadoop to a search index
Run a SQL query to find the biggest latency bottleneck
Run a SQL query to find common error patterns
Low latency monitoring of database changes or user activity
Incorporate popularity in real-time display and relevance algorithms
Products that incorporate user activity
Not tenable: new systems and data were being created faster than we could integrate them
Not all problems solvable with infrastructure
Extract the common pattern…
Had Hadoop—great that can be our warehouse/archive.
But what about real-time data and real-time processing? All the messed up stuff?
Files are really just a stream of updates stored together
This problem is solved…don’t reinvent the wheel!
Not suitable for ETL
Not suitable for high throughput
Reinvent the wheel!
Initial time estimate: 3 months
30 second intro to Kafka
Producers, consumers, topics
Like a messaging system
Different: Commit log
Like a file that goes on forever
Stolen from distributed database internals
Key abstraction for systems, real-time processing, data integration
Formalization of a stream
Very fast publish-subscribe mechanism
Unifies batch and real-time processing
Built like a modern distributed system
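A minimal sketch of that publish-subscribe view in the current Java client (the client API has evolved since this talk); the topic, group, and messages are invented for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ThirtySecondIntro {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Producer: append a keyed record to the "user-activity" topic (the log).
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("user-activity", "user42", "clicked /jobs"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "intro-demo");
        c.put("auto.offset.reset", "earliest"); // start from the beginning of the log
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Consumer: read the log forward from its own offset.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singleton("user-activity"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records)
                System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
        }
    }
}
```

Because the log is persistent, a consumer's position is just an offset; a new consumer group can start at offset zero and replay history, which is exactly the "file that goes on forever" view above.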
Let’s dive into the elements of this platform
One of the big things this enables is stream processing…let’s dive into that.
Truviso
The first thing you want when you have real-time data streams is real-time transformations
1790: the first US census
Collected data by horse and wagon
Counted 3,929,214 people; cost ~$44k
Horses and wagons are a high-latency, batch channel
Networks => stream processing
Service: One input = one output
Batch job: All inputs = all outputs
Stream computing: any window = output for that window
REST, RPC, etc
Batch processing => stream processing with a window of 1 day
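Kafka Streams, which shipped after this talk was given, makes the "batch job = stream job with a one-day window" recasting concrete. A sketch, assuming a hypothetical page-views topic keyed by user:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DailyCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "daily-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // A nightly "page views per user" batch job, recast as a stream job
        // with a 1-day window: the count is maintained continuously.
        builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(1)))
               .count();

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The same result the batch job would produce once a day is now updated as each window fills, with no separate nightly pipeline.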
Like a Unix pipe: streams data between programs
Distributed, fault-tolerant, elastically scalable
A modern version of Unix pipes
Streams + Processing = Stream Processing
HDFS analogy
Most stream processing systems use Kafka
Some require it (rollback recovery)
Stream Data Platform
Trace data flow
Everything that happens in the company is a real-time stream
Describe adoption cycle
There's only one thing you can do if you think the world needs to change and you live in Silicon Valley: quit your job and do it.