Netflix recently changed its data pipeline architecture to use Kafka as the gateway for data collection across all applications, processing hundreds of billions of messages daily. This session discusses the motivation for moving to Kafka, the architecture, and the improvements we have added to make Kafka work in AWS. We also share lessons learned and future plans.
#kafkasummit @allenxwang
The State of Kafka at Netflix
● 700 billion unique events ingested / day
● 1 trillion unique events / day at peak of last holiday season
● 1+ trillion events processed every day
● 11 million events ingested / sec @ peak
● 24 GB / sec @ peak
● 1.3 PB / day
Use Cases
● Keystone - unified event publishing, collection, and routing for batch and stream processing
○ 85% of the Kafka data volume
● Ad-hoc messaging
○ 15% of the Kafka data volume
● Characteristics
○ Non-transactional
○ Message delivery failure does not affect user experience
Design Principles
● Best-effort delivery
● First priority is availability of client applications
● Allow minor message drop from producers
○ 99.99% delivery SLA
● Non-keyed messages
● Transparent and dynamic traffic routing for producers
Key Configurations
● acks = 1
○ Reduce the chance that the producer buffer gets full
● block.on.buffer.full = false
○ Do not block the client application for sending events
● unclean.leader.election.enable = true
○ Maximize availability for producers
○ Consumers may lose data or get duplicates or both
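As a sketch, the settings above map onto producer-side and broker-side property files roughly as follows. Note that block.on.buffer.full belongs to the pre-0.9 Java producer; later clients deprecated it in favor of max.block.ms:

```properties
# Producer configuration (availability-first)
acks=1
block.on.buffer.full=false

# Broker configuration (server.properties)
unclean.leader.election.enable=true
```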
Cascading Effect of Outliers
[Diagram: a broker with a networking problem suffers disk reads that cause slow responses; replication across the cluster slows; producer buffers are exhausted and messages are dropped.]
Deployment Strategy
● Prefer multiple small clusters
○ Largest cluster has fewer than 200 brokers
● Limit the total number of partitions per cluster to 10,000
● Strive for even distribution of replicas
● Have a dedicated ZooKeeper cluster for each Kafka cluster
Deployment Configuration

                           Fronting Kafka clusters   Consumer Kafka clusters
Number of clusters         24                        12
Total number of instances  3,000+                    900+
Instance type              d2.xl                     i2.2xl
Replication factor         2                         2
Retention period           8 to 24 hours             2 to 4 hours
Broker ID Management
● Use a persistent ZooKeeper node
● Increment the broker ID using the Curator locking recipe
● Check the AWS Auto Scaling Group for broker ID reuse
Rack Aware Replica Assignment
● All of our clusters span three AWS availability zones (racks)
● Distribute replicas of the same partition to different AWS availability zones
● We contributed back
○ KIP-36: Rack aware replica assignment
○ Apache Kafka GitHub pull request #132
○ Part of the 0.10 release
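The zone-spreading rule can be sketched as a small assignment function. This is an illustration of the idea, not the actual KIP-36 implementation; the names (assign_replicas, brokers_by_zone) are assumptions:

```python
from itertools import cycle

def assign_replicas(partitions, brokers_by_zone, replication_factor):
    """Assign each partition's replicas to brokers in distinct zones.

    brokers_by_zone: dict mapping zone name -> list of broker ids.
    Returns {partition: [broker, ...]} with every replica in a different zone.
    """
    zones = sorted(brokers_by_zone)
    assert replication_factor <= len(zones), "need one zone per replica"
    # Round-robin over brokers within each zone so load stays even.
    zone_cycles = {z: cycle(brokers_by_zone[z]) for z in zones}
    assignment = {}
    for p in range(partitions):
        # Rotate the starting zone per partition to spread leaders across zones.
        chosen = [zones[(p + i) % len(zones)] for i in range(replication_factor)]
        assignment[p] = [next(zone_cycles[z]) for z in chosen]
    return assignment
```

With replication factor 2 and three zones, losing a whole availability zone still leaves one replica of every partition alive.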
Scaling Strategy
● We overprovision for daily and failover traffic
● Scale up for organic traffic growth
● Methodologies
○ Adding partitions
○ Partition reassignment
Adding Partitions To New Brokers
● Fast way to expand capacity
● Prerequisite
○ No keyed messages
● Caveat
○ TopicCommand may add partitions to existing brokers
○ We created our own tool to guarantee that partitions are added only to new brokers
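A minimal sketch of what such a tool computes, under the assumption that the replica pool is restricted to the new brokers only (the function name and shape are illustrative, not our actual tool):

```python
def add_partitions_to_new_brokers(existing_count, new_count, new_brokers, rf=2):
    """Place only the newly created partitions on the newly added brokers.

    Unlike TopicCommand, which may spread new partitions over existing
    brokers, this restricts the replica pool to new_brokers. Safe only
    for non-keyed messages, since adding partitions changes key hashing.
    """
    assert rf <= len(new_brokers), "need at least rf new brokers"
    assignment = {}
    for i in range(new_count):
        partition_id = existing_count + i
        # rf consecutive brokers from the new-broker list, wrapping around
        assignment[partition_id] = [
            new_brokers[(i + j) % len(new_brokers)] for j in range(rf)
        ]
    return assignment
```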
Partition Reassignment
● The good news
○ Generally applicable to all situations
● The bad news
○ Time consuming
○ Huge replication traffic that affects producers and consumers
● What we do
○ Created a tool to divide reassignments into small batches to limit replication traffic
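The batching idea itself is simple; a minimal sketch (the move-plan shape is an assumption, not the format our tool actually uses):

```python
def batch_reassignments(moves, batch_size):
    """Split a full reassignment plan into small batches.

    moves: list of (topic, partition, new_replicas) tuples.
    Executing one batch at a time, and waiting for it to complete before
    starting the next, caps the replication traffic that competes with
    producers and consumers for network and disk bandwidth.
    """
    return [moves[i:i + batch_size] for i in range(0, len(moves), batch_size)]
```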
Solution - Failover
● Take advantage of cloud elasticity
● Cold standby Kafka cluster with minimal initial capacity, ready to scale up
● Different ZooKeeper cluster with no state
● Replication factor = 1