Netflix Data Pipeline
with Kafka
Allen Wang & Steven Wu
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
What is Netflix?
Netflix is a logging company
that occasionally streams video
Numbers
● 400 billion events per day
● 8 million events & 17 GB per second during
peak
● hundreds of event types
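As a rough sanity check on these figures, the sketch below derives the average event rate, the peak-to-average ratio, and the mean event size at peak. It uses only the numbers on the slide and assumes 1 GB = 10^9 bytes; the variable names are illustrative.

```python
# Back-of-the-envelope figures derived only from the numbers on the slide.
EVENTS_PER_DAY = 400e9        # 400 billion events per day
PEAK_EVENTS_PER_SEC = 8e6     # 8 million events per second during peak
PEAK_BYTES_PER_SEC = 17e9     # 17 GB per second during peak (assumes 1 GB = 10^9 bytes)

SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if the daily volume were spread evenly over the day.
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# How much higher the peak is than that average.
peak_to_avg = PEAK_EVENTS_PER_SEC / avg_events_per_sec

# Mean size of one event at peak.
avg_event_bytes_at_peak = PEAK_BYTES_PER_SEC / PEAK_EVENTS_PER_SEC

print(f"average rate: {avg_events_per_sec / 1e6:.2f} million events/s")
print(f"peak / average: {peak_to_avg:.2f}x")
print(f"mean event size at peak: {avg_event_bytes_at_peak:.0f} bytes")
```

Under these assumptions the average comes out at roughly 4.6 million events per second, and the mean event at peak is about 2 KB.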
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
Mission of Data Pipeline
Publish, Collect, Aggregate, Move Data @
Cloud Scale
In the old days ...
[Diagram components: S3, EMR, Event Producer]
Nowadays ...
Existing Data Pipeline
[Diagram components: S3, Router, Druid, EMR, Event Producer, Stream Consumers]
Into the Future ...
New Data Pipeline
[Diagram components: S3, Router, Druid, EMR, Event Producer, Stream Consumers, Fronting Kafka, Consumer Kafka]
Serving Consumers off Different Clusters
[Diagram components: S3, Router, Druid, EMR, Event Producer, Stream Consumers, Fronting Kafka, Consumer Kafka]
Split Fronting Kafka Clusters
● Low-priority (error log, request trace, etc.)
o 2 copies, 1-2 hour retention
● Medium-priority (majority)
o 2 copies, 4 hour retention
● High-priority (streaming activities etc.)
o 3 copies, 12-24 hour retention
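The three tiers above can be written down as a small lookup table. The sketch below is illustrative only: the class, the event-type names in `EVENT_PRIORITY`, and the ingest rate used in the usage line are hypothetical. Only the copy counts and retention ranges come from the slide. Unlisted event types default to medium priority, following the slide's note that medium is the majority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterTier:
    name: str
    replicas: int            # "copies" on the slide
    retention_hours: tuple   # (min, max) retention in hours


# Values taken from the slide.
TIERS = {
    "low": ClusterTier("low", 2, (1, 2)),
    "medium": ClusterTier("medium", 2, (4, 4)),
    "high": ClusterTier("high", 3, (12, 24)),
}

# Hypothetical event-type names, for illustration only.
EVENT_PRIORITY = {
    "error_log": "low",
    "request_trace": "low",
    "streaming_activity": "high",
}


def tier_for(event_type: str) -> ClusterTier:
    """Return the tier for an event type, defaulting to medium."""
    return TIERS[EVENT_PRIORITY.get(event_type, "medium")]


def max_retained_bytes(tier: ClusterTier, ingest_bytes_per_sec: float) -> float:
    """Upper bound on disk needed: ingest rate x longest retention x copies."""
    return ingest_bytes_per_sec * tier.retention_hours[1] * 3600 * tier.replicas


# Usage with a made-up ingest rate of 1 MB/s into the high-priority tier.
print(max_retained_bytes(tier_for("streaming_activity"), 1e6))
```

The storage helper shows why retention and copy count are set per tier: both multiply directly into the disk each cluster needs.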
Producer Resilience
● Kafka outage should never disrupt existing
instances from serving business purpose
● Kafka outage should never prevent new
instances from starting up
● After the Kafka cluster is restored, event
producing should resume automatically
Fail but Never Block
● block.on.buffer.full=false
● Handle potential blocking of the first metadata
request
● Periodically check whether KafkaProducer
was opened successfully
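The "fail but never block" idea can be illustrated without any Kafka dependency. The sketch below is not the Kafka producer's implementation. It is a hypothetical bounded buffer (`NonBlockingEventBuffer` is an invented name) showing the behavior that `block.on.buffer.full=false` aims for: when the buffer is full, the caller gets an immediate failure and the event is counted as dropped, rather than the application thread stalling.

```python
import queue


class NonBlockingEventBuffer:
    """Bounded buffer that rejects events when full instead of blocking."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # number of events rejected because the buffer was full

    def offer(self, event) -> bool:
        """Try to enqueue an event; return False immediately if full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_events: int) -> list:
        """Remove up to max_events events, e.g. for a background sender."""
        batch = []
        while len(batch) < max_events:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


buf = NonBlockingEventBuffer(capacity=2)
for e in ("a", "b", "c"):
    buf.offer(e)          # the third offer fails fast instead of blocking
print(buf.dropped, buf.drain(10))
```

Once a sender drains the buffer again, new events are accepted without any restart, which matches the goal that producing resumes automatically after an outage.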
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
What Does It Take to Run In Cloud
● Support elasticity
● Respond to scaling events
● Resilience to failures
o Favors architecture without single point of failure
o Retries, smart routing, fallback ...
Kafka in AWS - How do we make it
happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Netflix Kafka Container
[Diagram: the Kafka JVM contains Kafka, metric reporting, the health check service, and bootstrap]
Bootstrap
● Broker ID assignment
o Instances obtain sequential numeric IDs using Curator’s locks recipe
persisted in ZK
o Clean up entries for terminated instances and reuse their IDs
o Same ID upon restart
● Bootstrap Kafka properties from Archaius
o Files
o System properties/Environment variables
o Persisted properties service
● Service registration
o Register with Eureka for internal service discovery
o Register with AWS Route53 DNS service
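The broker ID rules above (sequential IDs, reuse of IDs from terminated instances, same ID on restart) can be sketched in memory. This is not the Curator-based implementation. A local `threading.Lock` stands in for the distributed lock, a dict stands in for the state persisted in ZooKeeper, and the class name and instance IDs are hypothetical.

```python
import threading


class BrokerIdRegistry:
    """In-memory stand-in for broker ID assignment under a lock."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for a distributed lock
        self._ids = {}                 # instance_id -> broker_id (stands in for ZK state)

    def acquire(self, instance_id: str, live_instances: set) -> int:
        with self._lock:
            # Clean up entries for instances that are no longer alive.
            dead = [i for i in self._ids
                    if i not in live_instances and i != instance_id]
            for i in dead:
                del self._ids[i]

            # Same ID upon restart.
            if instance_id in self._ids:
                return self._ids[instance_id]

            # Otherwise take the lowest free ID, which reuses freed IDs.
            used = set(self._ids.values())
            broker_id = 0
            while broker_id in used:
                broker_id += 1
            self._ids[instance_id] = broker_id
            return broker_id


registry = BrokerIdRegistry()
print(registry.acquire("i-a", {"i-a"}))          # first instance
print(registry.acquire("i-b", {"i-a", "i-b"}))   # second instance
print(registry.acquire("i-c", {"i-b", "i-c"}))   # i-a is gone, its ID is reused
```

Taking the lowest free ID keeps the ID space compact as instances come and go.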
Metric Reporting
● We use Servo and Atlas from NetflixOSS
[Diagram components: Kafka, MetricReporter (Yammer → Servo adaptor), JMX, Atlas Service]
Kafka Atlas Dashboard
Health check service
● Use Curator to periodically read ZooKeeper
data to find signs of unhealthiness
● Export metrics to Servo/Atlas
● Expose the service via embedded Jetty
Kafka in AWS - How do we make it
happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
ZooKeeper
● Dedicated 5 node cluster for our data
pipeline services
● EIP based
● SSD instance
Auditor
● Highly configurable producers and
consumers with their own set of topics and
metadata in messages
● Built as a service deployable on single or
multiple instances
● Runs as producer, consumer or both
● Supports replay of preconfigured set of
messages
Auditor
● Broker monitoring (Heartbeating)
Auditor
● Broker performance testing
o Produce tens of thousands of messages per second on a
single instance
o Run as consumers to test consumer impact
Kafka admin UI
● Still searching …
● Currently trying out KafkaManager
Kafka in AWS - How do we make it
happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Challenges
● ZooKeeper client issues
● Cluster scaling
● Producer/consumer/broker tuning
ZooKeeper Client
● Challenges
o Broker/consumer cannot survive ZooKeeper cluster
rolling push due to caching of private IP
o Temporary DNS lookup failure at new session
initialization kills future communication
ZooKeeper Client
● Solutions
o Created our internal fork of Apache ZooKeeper
client
o Periodically refresh private IP resolution
o Fallback to last good private IP resolution upon DNS
lookup failure
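The two fixes above (refresh the resolution periodically, fall back to the last good result on failure) can be sketched as a small wrapper. This is not the forked ZooKeeper client. `CachingResolver` is a hypothetical name, and the resolver function is injectable so the behavior can be exercised without real DNS.

```python
import socket


class CachingResolver:
    """Resolve hostnames, falling back to the last good answer on failure."""

    def __init__(self, resolve=socket.gethostbyname):
        self._resolve = resolve
        self._last_good = {}  # hostname -> last successfully resolved IP

    def lookup(self, hostname: str) -> str:
        try:
            ip = self._resolve(hostname)
        except OSError:
            # socket.gaierror is a subclass of OSError.
            if hostname in self._last_good:
                return self._last_good[hostname]
            raise  # nothing cached, so the failure must surface
        self._last_good[hostname] = ip
        return ip


# Usage with a fake resolver; the hostname is made up.
records = {"zk1.internal.example": "10.0.0.1"}


def fake_resolve(host):
    if host not in records:
        raise OSError("lookup failed")
    return records[host]


resolver = CachingResolver(fake_resolve)
print(resolver.lookup("zk1.internal.example"))   # fresh answer
del records["zk1.internal.example"]              # simulate a DNS outage
print(resolver.lookup("zk1.internal.example"))   # last good answer
```

Calling `lookup` on a timer gives the periodic refresh: a node that comes back with a new private IP is picked up on the next successful resolution.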
Scaling
● Provisioned for peak traffic
o … and we have regional fail-over
Strategy #1 Add Partitions to New
Brokers
● Caveats
o Most of our topics do not use keyed messages
o Number of partitions is still small
o Requires the high-level consumer
Strategy #1 Add Partitions to New
Brokers
● Challenge: existing admin tools do not
support atomically adding partitions and
assigning them to new brokers
Strategy #1 Add Partitions to New
Brokers
● Solution: created our own tool to do it in
one ZooKeeper change and repeat for all or
selected topics
● Reduced the time to scale up from a few
hours to a few minutes
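The tool itself is not shown in the slides. The sketch below only illustrates the planning step under stated assumptions: given a topic's current partition-to-replica map, it computes one combined assignment in which every new partition is placed on the new brokers, so the result could be applied as a single change. The function name, the round-robin placement, and the broker IDs are hypothetical, and writing the result to ZooKeeper is deliberately left out.

```python
def add_partitions_on_new_brokers(assignment, new_brokers,
                                  extra_partitions, replication_factor):
    """Return a new partition -> replica-list map with added partitions.

    Existing partitions are left untouched; each new partition gets its
    replicas from new_brokers in round-robin order.
    """
    if replication_factor > len(new_brokers):
        raise ValueError("not enough new brokers for the replication factor")

    updated = {p: list(replicas) for p, replicas in assignment.items()}
    next_partition = max(assignment, default=-1) + 1

    for i in range(extra_partitions):
        replicas = [new_brokers[(i + r) % len(new_brokers)]
                    for r in range(replication_factor)]
        updated[next_partition + i] = replicas
    return updated


# Two existing partitions on brokers 1 and 2; add three partitions on 3, 4, 5.
current = {0: [1, 2], 1: [2, 1]}
print(add_partitions_on_new_brokers(current, [3, 4, 5], 3, 2))
```

Because existing partitions are not moved, no data is copied between brokers. This fits the caveats above: it only helps when messages are not keyed, since added partitions change how keys map to partitions.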
Strategy #2 Move Partitions
● Should work without preconditions, but ...
● Huge increase of network I/O affecting
incoming traffic
● A much longer process than adding
partitions
● Sometimes confusing error messages
● Would work if pace of replication can be
controlled
Scale down strategy
● There is none
● Look for more support to automatically move
all partitions from a set of brokers to a
different set
Client tuning
● Producer
o Batching is important to reduce CPU and network
I/O on brokers
o Stick to one partition for a while when producing
non-keyed messages
o “linger.ms” works well with sticky partitioner
● Consumer
o With a huge number of consumers, set a proper
fetch.wait.max.ms to reduce polling traffic on brokers
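The "stick to one partition for a while" idea can be sketched as follows. This is not the partitioner used in the talk and does not use the Kafka client API. `StickyPartitioner`, its parameters, and the injectable clock are hypothetical names for illustration. The point is that consecutive non-keyed messages land in the same partition, so they can be batched into one request.

```python
import random
import time


class StickyPartitioner:
    """Pick a random partition and keep it for stick_seconds."""

    def __init__(self, num_partitions, stick_seconds,
                 clock=time.monotonic, rng=random):
        self._num_partitions = num_partitions
        self._stick_seconds = stick_seconds
        self._clock = clock      # injectable for testing
        self._rng = rng          # anything with randrange()
        self._current = None
        self._chosen_at = 0.0

    def partition(self) -> int:
        now = self._clock()
        expired = now - self._chosen_at >= self._stick_seconds
        if self._current is None or expired:
            # Time to move on: choose a new partition at random.
            self._current = self._rng.randrange(self._num_partitions)
            self._chosen_at = now
        return self._current


partitioner = StickyPartitioner(num_partitions=8, stick_seconds=1.0)
# All calls within the same one-second window return the same partition.
print({partitioner.partition() for _ in range(100)})
```

A purely random partitioner spreads a burst of messages over all partitions, which leaves each per-partition batch nearly empty. Staying on one partition lets the linger interval fill a batch.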
Effect of batching
partitioner                    batched records per request    broker CPU util [1]
random without lingering       1.25                           75%
sticky without lingering       2.0                            50%
sticky with 100ms lingering    15                             33%
[1] 10 MB & 10K msgs / second per broker, 1KB per message
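To see why batch size moves broker CPU, the sketch below converts the table's batch sizes into an approximate produce-request rate per broker. It assumes that every one of the 10K messages per second arrives in requests of the reported average batch size. That is a simplification for illustration, not a measurement from the talk.

```python
MSGS_PER_SEC_PER_BROKER = 10_000  # from footnote [1]

# partitioner -> (batched records per request, broker CPU util %), from the table
measurements = {
    "random without lingering": (1.25, 75),
    "sticky without lingering": (2.0, 50),
    "sticky with 100ms lingering": (15, 33),
}

# Approximate requests per second each broker has to handle.
requests_per_sec = {
    name: MSGS_PER_SEC_PER_BROKER / records
    for name, (records, _cpu) in measurements.items()
}

for name, rate in requests_per_sec.items():
    cpu = measurements[name][1]
    print(f"{name}: ~{rate:.0f} requests/s, {cpu}% CPU")
```

Under this assumption the request rate falls from about 8,000 per second to under 700 per second, while the message volume stays the same.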
Broker tuning
● Use G1 collector
● Use large page cache and memory
● Increase max file descriptor if you have
thousands of producers or consumers
Kafka in AWS - How do we make it
happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Roadmap
● Work with Kafka community on rack/zone
aware replica assignment
● Failure resilience testing
o Chaos Monkey
o Chaos Gorilla
● Contribute to open source
o Kafka
o Schlep -- our messaging library including SQS and
Kafka support
o Auditor
Thank you!
http://netflix.github.io/
http://techblog.netflix.com/
@NetflixOSS
@allenxwang
@stevenzwu

More Related Content

What's hot

Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache KafkaPaul Brebner
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introductionchrislusf
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka StreamsGuozhang Wang
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Big Data Redis Mongodb Dynamodb Sharding
Big Data Redis Mongodb Dynamodb ShardingBig Data Redis Mongodb Dynamodb Sharding
Big Data Redis Mongodb Dynamodb ShardingAraf Karsh Hamid
 
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingThe Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingKai Wähner
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 

What's hot (20)

Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache Kafka
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Big Data Redis Mongodb Dynamodb Sharding
Big Data Redis Mongodb Dynamodb ShardingBig Data Redis Mongodb Dynamodb Sharding
Big Data Redis Mongodb Dynamodb Sharding
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingThe Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 

Similar to Netflix Data Pipeline With Kafka

Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaRicardo Bravo
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Monal Daxini
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with KinesisMark Harrison
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecPeter Bakas
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpJosé Román Martín Gil
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineMonal Daxini
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemonsaspyker
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016Monal Daxini
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...javier ramirez
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingApache Apex
 
Kafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notificationsKafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notificationsSérgio Nunes
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterHostedbyConfluent
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITOpenStack
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data AnalyticsAnkur Bansal
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudHostedbyConfluent
 
Ultimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on KubernetesUltimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on Kuberneteskloia
 

Similar to Netflix Data Pipeline With Kafka (20)

Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with Kinesis
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
kafka
kafkakafka
kafka
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
Kafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notificationsKafka used at scale to deliver real-time notifications
Kafka used at scale to deliver real-time notifications
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent Cloud
 
Ultimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on KubernetesUltimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on Kubernetes
 

Recently uploaded

Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 

Recently uploaded (20)

Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...

Netflix Data Pipeline With Kafka

  • 1. Netflix Data Pipeline with Kafka Allen Wang & Steven Wu
  • 2. Agenda ● Introduction ● Evolution of Netflix data pipeline ● How do we use Kafka
  • 4. Netflix is a logging company
  • 6. Numbers ● 400 billion events per day ● 8 million events & 17 GB per second during peak ● hundreds of event types
  • 7. Agenda ● Introduction ● Evolution of Netflix data pipeline ● How do we use Kafka
  • 8. Mission of Data Pipeline Publish, Collect, Aggregate, Move Data @ Cloud Scale
  • 9. In the old days ...
  • 13. Into the Future ...
  • 15. Serving Consumers off Diff Clusters S3 Router Druid EMR Event Producer Stream Consumers Fronting Kafka Consumer Kafka
  • 16. Split Fronting Kafka Clusters ● Low-priority (error log, request trace, etc.) o 2 copies, 1-2 hour retention ● Medium-priority (majority) o 2 copies, 4 hour retention ● High-priority (streaming activities etc.) o 3 copies, 12-24 hour retention
  • 17. Producer Resilience ● A Kafka outage should never disrupt existing instances from serving their business purpose ● A Kafka outage should never prevent new instances from starting up ● After the Kafka cluster is restored, event producing should resume automatically
  • 18. Fail but Never Block ● block.on.buffer.full=false ● Handle potential blocking of the first metadata request ● Periodically check whether KafkaProducer was opened successfully
  • 19. Agenda ● Introduction ● Evolution of Netflix data pipeline ● How do we use Kafka
  • 20. What Does It Take to Run In Cloud ● Support elasticity ● Respond to scaling events ● Resilience to failures o Favors architecture without single point of failure o Retries, smart routing, fallback ...
  • 21. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 22. Netflix Kafka Container Kafka Metric reporting Health check service Bootstrap Kafka JVM
  • 23. Bootstrap ● Broker ID assignment o Instances obtain sequential numeric IDs using Curator’s locks recipe persisted in ZK o Clean up entries for terminated instances and reuse their IDs o Same ID upon restart ● Bootstrap Kafka properties from Archaius o Files o System properties/Environment variables o Persisted properties service ● Service registration o Register with Eureka for internal service discovery o Register with AWS Route53 DNS service
  • 24. Metric Reporting ● We use Servo and Atlas from NetflixOSS Kafka MetricReporter (Yammer → Servo adaptor) JMX Atlas Service
  • 26. Health check service ● Use Curator to periodically read ZooKeeper data to find signs of unhealthiness ● Export metrics to Servo/Atlas ● Expose the service via embedded Jetty
  • 27. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 28. ZooKeeper ● Dedicated 5 node cluster for our data pipeline services ● EIP based ● SSD instance
  • 29. Auditor ● Highly configurable producers and consumers with their own set of topics and metadata in messages ● Built as a service deployable on single or multiple instances ● Runs as producer, consumer or both ● Supports replay of preconfigured set of messages
  • 31. Auditor ● Broker performance testing o Produce tens of thousands of messages per second on a single instance o Run as consumers to test consumer impact
  • 32. Kafka admin UI ● Still searching … ● Currently trying out KafkaManager
  • 34. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 35. Challenges ● ZooKeeper client issues ● Cluster scaling ● Producer/consumer/broker tuning
  • 36. ZooKeeper Client ● Challenges o Broker/consumer cannot survive ZooKeeper cluster rolling push due to caching of private IP o Temporary DNS lookup failure at new session initialization kills future communication
  • 37. ZooKeeper Client ● Solutions o Created our internal fork of Apache ZooKeeper client o Periodically refresh private IP resolution o Fallback to last good private IP resolution upon DNS lookup failure
  • 38. Scaling ● Provisioned for peak traffic o … and we have regional fail-over
  • 39. Strategy #1 Add Partitions to New Brokers ● Caveats o Most of our topics do not use keyed messages o Number of partitions is still small o Requires the high-level consumer
  • 40. Strategy #1 Add Partitions to New Brokers ● Challenges: existing admin tools do not support atomically adding partitions and assigning them to new brokers
  • 41. Strategy #1 Add Partitions to new brokers ● Solutions: created our own tool to do it in one ZooKeeper change and repeat for all or selected topics ● Reduced the time to scale up from a few hours to a few minutes
  • 42. Strategy #2 Move Partitions ● Should work without precondition, but ... ● Huge increase of network I/O affecting incoming traffic ● A much longer process than adding partitions ● Sometimes confusing error messages ● Would work if pace of replication can be controlled
  • 43. Scale down strategy ● There is none ● Look for more support to automatically move all partitions from a set of brokers to a different set
  • 44. Client tuning ● Producer o Batching is important to reduce CPU and network I/O on brokers o Stick to one partition for a while when producing non-keyed messages o “linger.ms” works well with sticky partitioner ● Consumer o With a huge number of consumers, set a proper fetch.wait.max.ms to reduce polling traffic on brokers
  • 45. Effect of batching
      partitioner                 | batched records per request | broker CPU util [1]
      random without lingering    | 1.25                        | 75%
      sticky without lingering    | 2.0                         | 50%
      sticky with 100ms lingering | 15                          | 33%
      [1] 10 MB & 10K msgs / second per broker, 1KB per message
  • 46. Broker tuning ● Use G1 collector ● Use large page cache and memory ● Increase the max file descriptor limit if you have thousands of producers or consumers
  • 47. Kafka in AWS - How do we make it happen ● Inside our Kafka JVM ● Services supporting Kafka ● Challenges/Solutions ● Our roadmap
  • 48. Road map ● Work with Kafka community on rack/zone aware replica assignment ● Failure resilience testing o Chaos Monkey o Chaos Gorilla ● Contribute to open source o Kafka o Schlep -- our messaging library including SQS and Kafka support o Auditor
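
The "Fail but Never Block" rule on slide 18 (block.on.buffer.full=false) can be illustrated with a bounded buffer that rejects events when full instead of making the caller wait. This is a minimal Python sketch of the idea only, not the Kafka producer's implementation; the class `NonBlockingEventBuffer` and its methods are hypothetical names introduced for illustration.

```python
import queue


class NonBlockingEventBuffer:
    """Bounded buffer that rejects events when full instead of blocking the caller.

    Hypothetical illustration of "fail but never block"; not Kafka code.
    """

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # number of events rejected because the buffer was full

    def offer(self, event):
        """Try to enqueue an event; return False immediately if the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_events):
        """Remove and return up to max_events buffered events, oldest first."""
        events = []
        while len(events) < max_events:
            try:
                events.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return events
```

The application thread only ever calls `offer`, so a full buffer during an outage costs it a dropped event and a counter increment rather than a stalled request.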
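
The broker ID rules on slide 23 (sequential IDs, reuse of IDs from terminated instances, same ID upon restart) can be sketched as follows. This is an assumption-laden Python illustration: an in-memory dict stands in for the state persisted in ZK, a `threading.Lock` stands in for Curator's distributed lock, and "pick the lowest unused ID" is an assumed policy the slides do not spell out. `BrokerIdRegistry` is a hypothetical name.

```python
import threading


class BrokerIdRegistry:
    """In-memory stand-in for a ZooKeeper-persisted instance -> broker ID mapping."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for a distributed lock
        self._ids = {}  # instance name -> broker ID

    def acquire(self, instance):
        """Return the instance's existing ID, or assign the lowest unused ID."""
        with self._lock:
            if instance in self._ids:
                return self._ids[instance]  # same ID upon restart
            used = set(self._ids.values())
            broker_id = 0
            while broker_id in used:
                broker_id += 1
            self._ids[instance] = broker_id
            return broker_id

    def cleanup(self, live_instances):
        """Drop entries for terminated instances so their IDs can be reused."""
        with self._lock:
            live = set(live_instances)
            for instance in list(self._ids):
                if instance not in live:
                    del self._ids[instance]
```

Holding the lock across the read-then-write in `acquire` is what prevents two instances that start at the same moment from claiming the same ID.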
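
The ZooKeeper client fixes on slide 37 (periodically refresh the IP resolution, fall back to the last good resolution when a DNS lookup fails) follow a pattern that can be sketched independently of ZooKeeper. The Python below is a hypothetical illustration, not the forked client: `CachingResolver`, its 60-second default refresh interval, and the injectable `lookup` and `clock` parameters are all assumptions made for the example.

```python
import socket
import time


class CachingResolver:
    """Resolve hostnames, refreshing periodically and falling back to the last good IP."""

    def __init__(self, lookup=socket.gethostbyname, refresh_seconds=60.0,
                 clock=time.monotonic):
        self._lookup = lookup
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._cache = {}  # hostname -> (ip, time of last successful lookup)

    def resolve(self, hostname):
        cached = self._cache.get(hostname)
        now = self._clock()
        if cached is not None and now - cached[1] < self._refresh_seconds:
            return cached[0]  # cached answer is still fresh
        try:
            ip = self._lookup(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            if cached is None:
                raise  # nothing to fall back to
            return cached[0]  # last good resolution; lookup is retried on the next call
        self._cache[hostname] = (ip, now)
        return ip
```

Because a failed lookup does not update the cached timestamp, the resolver keeps retrying DNS on later calls while still handing out the last known address in the meantime.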
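
Strategy #1 on slides 39 to 41 adds partitions and places them on the new brokers in a single change. The slides do not describe the tool's internals, so the Python below only sketches one plausible way to compute such an assignment: new partition numbers continue after the existing ones, and replicas are spread round-robin over the new brokers. The function name, its parameters, and the round-robin policy are assumptions.

```python
def assign_new_partitions(existing_count, add_count, new_brokers, replication_factor):
    """Map each newly added partition number to a replica list drawn from new brokers.

    existing_count: number of partitions the topic already has
    add_count: number of partitions to add
    new_brokers: broker IDs that should host the added partitions
    replication_factor: replicas per partition (must not exceed len(new_brokers))
    """
    if replication_factor > len(new_brokers):
        raise ValueError("replication factor exceeds number of new brokers")
    assignment = {}
    for i in range(add_count):
        partition = existing_count + i
        # Rotate the starting broker so leaders and replicas spread evenly.
        assignment[partition] = [
            new_brokers[(i + r) % len(new_brokers)]
            for r in range(replication_factor)
        ]
    return assignment
```

For example, adding three partitions to a four-partition topic on brokers 10, 11 and 12 with two copies yields partitions 4, 5 and 6, each with distinct replicas on the new brokers only. Existing partitions are untouched, which is why this approach avoids the replication traffic described for Strategy #2.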
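
The producer advice on slide 44 ("stick to one partition for a while" for non-keyed messages) can be sketched as a partitioner that picks a random partition and keeps returning it until an interval expires. This is a hypothetical Python illustration, not Netflix's partitioner or Kafka's API; `StickyPartitioner`, the one-second default interval, and the injectable `clock` and `rng` parameters are assumptions.

```python
import random
import time


class StickyPartitioner:
    """Choose a partition for non-keyed messages and keep it for a fixed interval."""

    def __init__(self, num_partitions, stick_seconds=1.0, clock=time.monotonic, rng=None):
        if num_partitions < 1:
            raise ValueError("num_partitions must be at least 1")
        self._num_partitions = num_partitions
        self._stick_seconds = stick_seconds
        self._clock = clock
        self._rng = rng or random.Random()
        self._current = None  # partition currently being used
        self._chosen_at = None  # clock reading when it was chosen

    def partition(self):
        """Return the current partition, choosing a new random one after the interval."""
        now = self._clock()
        if self._current is None or now - self._chosen_at >= self._stick_seconds:
            self._current = self._rng.randrange(self._num_partitions)
            self._chosen_at = now
        return self._current
```

Consecutive records from one producer then land on the same partition, so they can share a request, which is consistent with the larger batches slide 45 reports for the sticky rows. As a rough reading of that slide's footnote, and assuming all 10K msgs / second reach a broker in requests of the average size shown, 1.25 records per request corresponds to about 8,000 requests per second and 15 records per request to about 667.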