Netflix recently changed its data pipeline architecture to use Kafka as the gateway for data collection across all applications, processing hundreds of billions of messages daily. This session discusses the motivation for moving to Kafka, the architecture, and the improvements we have added to make Kafka work in AWS. We also share lessons learned and future plans.
#kafkasummit @allenxwang
The State of Kafka at Netflix
● 700 billion unique events ingested / day
● 1 trillion unique events / day at peak of last holiday season
● 1+ trillion events processed every day
● 11 million events ingested / sec @ peak
● 24 GB / sec @ peak
● 1.3 PB / day
Use Cases
● Keystone - unified event publishing, collection, and routing for batch and stream processing
○ 85% of the Kafka data volume
● Ad-hoc messaging
○ 15% of the Kafka data volume
● Characteristics
○ Non-transactional
○ Message delivery failure does not affect user experience
Design Principles
● Best-effort delivery
● First priority is availability of client applications
● Allow minor message drop from producers
○ 99.99% delivery SLA
● Non-keyed messages
● Transparent and dynamic traffic routing for producers
Key Configurations
● acks = 1
○ Reduce the chance that the producer buffer gets full
● block.on.buffer.full = false
○ Do not block the client application for sending events
● unclean.leader.election.enable = true
○ Maximize availability for producers
○ Consumers may lose data or get duplicates or both
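As a sketch, the settings above map onto producer-side and broker-side property files roughly as follows. Note that block.on.buffer.full belongs to the pre-0.9 Java producer; later clients deprecated it in favor of max.block.ms:

```properties
# Producer configuration (availability-first)
acks=1
block.on.buffer.full=false

# Broker configuration (server.properties)
unclean.leader.election.enable=true
```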
Cascading Effect of Outliers
[Diagram: a broker with a networking problem suffers disk reads that cause slow responses; replication across the cluster slows; producer buffers are exhausted and messages are dropped.]
Deployment Strategy
● Prefer multiple small clusters
○ Largest cluster has fewer than 200 brokers
● Limit the total number of partitions per cluster to 10,000
● Strive for even distribution of replicas
● Have a dedicated ZooKeeper cluster for each Kafka cluster
Deployment Configuration

                           Fronting Kafka clusters   Consumer Kafka clusters
Number of clusters         24                        12
Total number of instances  3,000+                    900+
Instance type              d2.xl                     i2.2xl
Replication factor         2                         2
Retention period           8 to 24 hours             2 to 4 hours
Broker ID Management
● Use a persistent ZooKeeper node
● Increment the broker ID using the Curator locking recipe
● Check the AWS Auto Scaling Group for broker ID reuse
Rack Aware Replica Assignment
● All of our clusters span three AWS availability zones (racks)
● Distribute replicas of the same partition to different AWS availability zones
● We contributed back
○ KIP-36: Rack aware replica assignment
○ Apache Kafka GitHub pull request #132
○ Part of the 0.10 release
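The zone-spreading rule can be sketched as a small assignment function. This is an illustration of the idea, not the actual KIP-36 implementation; the names (assign_replicas, brokers_by_zone) are assumptions:

```python
from itertools import cycle

def assign_replicas(partitions, brokers_by_zone, replication_factor):
    """Assign each partition's replicas to brokers in distinct zones.

    brokers_by_zone: dict mapping zone name -> list of broker ids.
    Returns {partition: [broker, ...]} with every replica in a different zone.
    """
    zones = sorted(brokers_by_zone)
    assert replication_factor <= len(zones), "need one zone per replica"
    # Round-robin over brokers within each zone so load stays even.
    zone_cycles = {z: cycle(brokers_by_zone[z]) for z in zones}
    assignment = {}
    for p in range(partitions):
        # Rotate the starting zone per partition to spread leaders across zones.
        chosen = [zones[(p + i) % len(zones)] for i in range(replication_factor)]
        assignment[p] = [next(zone_cycles[z]) for z in chosen]
    return assignment
```

With replication factor 2 and three zones, losing a whole availability zone still leaves one replica of every partition alive.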
Scaling Strategy
● We overprovision for daily and failover traffic
● Scale up for organic traffic growth
● Methodologies
○ Adding partitions
○ Partition reassignment
Adding Partitions To New Brokers
● Fast way to expand capacity
● Prerequisite
○ No keyed messages
● Caveat
○ TopicCommand may add partitions to existing brokers
○ We created our own tool to guarantee that partitions are added only to new brokers
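A minimal sketch of what such a tool computes, under the assumption that the replica pool is restricted to the new brokers only (the function name and shape are illustrative, not our actual tool):

```python
def add_partitions_to_new_brokers(existing_count, new_count, new_brokers, rf=2):
    """Place only the newly created partitions on the newly added brokers.

    Unlike TopicCommand, which may spread new partitions over existing
    brokers, this restricts the replica pool to new_brokers. Safe only
    for non-keyed messages, since adding partitions changes key hashing.
    """
    assert rf <= len(new_brokers), "need at least rf new brokers"
    assignment = {}
    for i in range(new_count):
        partition_id = existing_count + i
        # rf consecutive brokers from the new-broker list, wrapping around
        assignment[partition_id] = [
            new_brokers[(i + j) % len(new_brokers)] for j in range(rf)
        ]
    return assignment
```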
Partition Reassignment
● The good news
○ Generally applicable to all situations
● The bad news
○ Time consuming
○ Huge replication traffic that affects producers and consumers
● What we do
○ Created a tool to divide reassignments into small batches to limit replication traffic
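The batching idea itself is simple; a minimal sketch (the move-plan shape is an assumption, not the format our tool actually uses):

```python
def batch_reassignments(moves, batch_size):
    """Split a full reassignment plan into small batches.

    moves: list of (topic, partition, new_replicas) tuples.
    Executing one batch at a time, and waiting for it to complete before
    starting the next, caps the replication traffic that competes with
    producers and consumers for network and disk bandwidth.
    """
    return [moves[i:i + batch_size] for i in range(0, len(moves), batch_size)]
```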
Solution - Failover
● Take advantage of cloud elasticity
● Cold standby Kafka cluster with minimal initial capacity, ready to scale up
● Different ZooKeeper cluster with no state
● Replication factor = 1