URP? Excuse You!
Todd Palino
Senior Staff Engineer, Site Reliability
LinkedIn
What This Talk Is Not
• What is Kafka
• Encyclopedia of Monitoring
• Automation
Why Talk About Monitoring?
Messages per Day at LinkedIn
What is Monitoring (not)?
Monitoring is not Alerting
• Collect everything
• Alert on nothing
• Events are better than metrics
• Tests are better than alerts
• Sleep is best in life
Service Level Objectives
• What’s an SLA?
• Availability
• Latency
• Customer Guarantees
Key Kafka Metrics
The Three Metrics You Need to Know
• URP: Partitions that are not fully replicated within the cluster
• Request Handlers: The overall utilization of an Apache Kafka broker
• Request Timing: How long requests are taking, in which stage of processing
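For readers who want to find these three metrics in a JMX browser, the sketch below lists MBean names published by Apache Kafka brokers. The slides do not name specific MBeans, so pairing each heading with a sensor is an assumption, and the `KEY_METRICS` dictionary and `mbean_for` helper are hypothetical names used only for illustration.

```python
# MBean names come from Apache Kafka's broker metrics. Pairing them with the
# three slide headings is an assumption; the slides do not name MBeans.
KEY_METRICS = {
    "URP": "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
    "Request Handlers": (
        "kafka.server:type=KafkaRequestHandlerPool,"
        "name=RequestHandlerAvgIdlePercent"
    ),
    # Request timing is reported per request type, so the name is a template.
    "Request Timing": (
        "kafka.network:type=RequestMetrics,name=TotalTimeMs,request={request}"
    ),
}


def mbean_for(metric, request="Produce"):
    """Return the MBean name for one of the three headings (hypothetical helper)."""
    return KEY_METRICS[metric].format(request=request)


if __name__ == "__main__":
    for heading in KEY_METRICS:
        print(heading, "->", mbean_for(heading))
```

The request type is a parameter because the timing sensors exist separately for each kind of request, for example `Produce` or `FetchConsumer`.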
Under-Replicated Partitions
• Highly discussed
• Overall cluster health
• Replication is a consumer and producer
Under-Replicated Partitions
EXAMPLE: FAILED BROKER
Under-Replicated Partitions
EXAMPLE: CONSUMER PROBLEMS
Under-Replicated Partitions
EXAMPLE: PRODUCER PROBLEMS
Under-Replicated Partitions
• Overrated
• Doesn’t map to SLO
• Often not actionable
• Collect, but don’t alert
Everybody In The Pool
• Specialized thread pools
• Clients deal with network and request pools
• Request handlers do most of the work
Request Handlers
• Decode and validate
• Perform task
• Wait for other brokers
• Assemble response
Request Handler Problems
CPU Time
• Anything that causes Kafka to expend CPU cycles
• Includes problems related to failing disks (IO wait)
• SSL and compression work both can use a lot of CPU
Timeout
• Most often due to failing to process controller requests
• Intra-cluster requests tend to be bound by partition counts
• Rapidly starves the pool of threads
Deadlock
• Should always be a code bug
• Usually looks exactly like a timeout problem
• Rare, but hard to identify
Request Handler Problems
EXAMPLE: TIMEOUT OR DEADLOCK
Brokers Don’t Do Compression
Brokers Shouldn’t Do Compression
Up Conversion
• Kafka brokers are running a new version
• Message format has been set to the new version
• Clients haven’t upgraded
Down Conversion
• Kafka brokers are running a new version
• Message format is set to an older version due to clients
• Producer clients update to new version
Request Timing
• Total – Request handling, end to end
• Request Queue – Waiting to process
• Local – Work local to the broker
• Remote – Waiting for other brokers
• Response Queue – Waiting to send
• Response Send – Send to client
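The stage breakdown above is useful mainly for seeing where a slow request spends its time. As a minimal sketch, assuming the per-stage timings have already been collected in milliseconds, the hypothetical helper below picks the stage that dominates. It uses the sum of the five stages as the denominator instead of the broker's own Total, which is a simplification, and the stage keys and sample numbers are invented for illustration.

```python
# Stage keys are hypothetical labels for the five stages listed on the slide.
STAGES = ("request_queue", "local", "remote", "response_queue", "response_send")


def dominant_stage(timings_ms):
    """Return (stage, share) for the stage with the largest time.

    `share` is that stage's fraction of the summed stage times, which is a
    simplification of the broker's reported Total time.
    """
    missing = [s for s in STAGES if s not in timings_ms]
    if missing:
        raise ValueError(f"missing stages: {missing}")
    total = sum(timings_ms[s] for s in STAGES)
    stage = max(STAGES, key=lambda s: timings_ms[s])
    share = timings_ms[stage] / total if total else 0.0
    return stage, share


if __name__ == "__main__":
    # Invented sample: most of the time is spent waiting for other brokers.
    sample = {
        "request_queue": 1.0,
        "local": 2.0,
        "remote": 15.0,
        "response_queue": 1.0,
        "response_send": 1.0,
    }
    print(dominant_stage(sample))
```

With this sample the remote stage accounts for three quarters of the time, which points at replication and not at work local to the broker.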
Request Timing
EXAMPLE: PRODUCE TOTAL TIME
Request Timing
EXAMPLE: PRODUCE LOCAL TIME
Request Timing
EXAMPLE: PRODUCE REMOTE TIME
Thank you?
What’s Missing?
Availability Monitoring
• SLO, part 2
• Measured externally
• Client focused
• github.com/linkedin/kafka-monitor
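To make "measured externally, client focused" concrete, here is a minimal sketch of turning probe results into an availability figure and comparing it with an objective. It does not use or reflect kafka-monitor's actual interface. The function names and the 99.9% target are assumptions chosen for illustration.

```python
def availability(successes, attempts):
    """Fraction of external probe requests that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return successes / attempts


def meets_slo(successes, attempts, target=0.999):
    """Compare measured availability with a target (0.999 is a hypothetical SLO)."""
    return availability(successes, attempts) >= target


if __name__ == "__main__":
    # Invented probe counts: 998 of 1000 produce/consume round trips succeeded.
    print(availability(998, 1000), meets_slo(998, 1000))
```

Because the counts come from a client outside the cluster, the result reflects what users experience and not what the brokers report about themselves.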
Operating System And Hardware Metrics
• What do they mean?
• What application is causing it?
• Don’t alert unless:
• 100% clear signal
• 100% clear response
Capacity Planning
• Plan in advance
• Multi-factor
• Don’t alert for capacity
Capacity Metrics
• Request Handler Idle Ratio
• Disk Utilization
• Partition Count
• Network Utilization
Wrapping Up
If You Remember Nothing Else…
• Define your service level objectives
• Monitor your service level objectives
• Metrics that cover many problems are noisy
• Buy Kafka: The Definitive Guide
Getting (and Giving) Help
LinkedIn Open Source
• Kafka Monitor
• https://github.com/linkedin/kafka-monitor
• Burrow
• https://github.com/linkedin/Burrow
• Cruise Control
• https://github.com/linkedin/cruise-control
• kafka-tools
• https://github.com/linkedin/kafka-tools
Get Involved
• Community
• users@kafka.apache.org
• dev@kafka.apache.org
• Bugs and Work:
• https://issues.apache.org/jira/projects/KAFKA
Thank you

URP? Excuse You! The Three Kafka Metrics You Need to Know

Editor's Notes

  1. Let me start off by telling you what we’re not talking about today. I won’t be going into the basics of what Kafka is – I assume that if you’re attending Kafka Summit, you have an idea of what it does and how it works. Regardless, you’re going to get some good data here on monitoring, even if you have very limited Kafka knowledge. However, this also won’t be an encyclopedic look at monitoring. I’m going to discuss a few key sets of metrics, and how to use them. But I won’t even be covering all the Kafka metrics you should look at, never mind all that exist. I encourage you to spin up a JMX tool of choice and explore what’s exposed for sensors in Kafka. I also encourage you to share with the class, whether in posts, talks, or tweets, any gems that you have for your own monitoring. I’m also not going to talk about automation, even as it relates to handling alerts. There are many fine talks out there about automating responses and runbooks, and we could spend hours talking about just that.
  2. So why am I here today talking about monitoring? There are lots of topics that could be covered, especially in an ecosystem as large as Kafka. And I could always deliver yet another “here’s how we do it at LinkedIn” talk. However, today I’m choosing to share a look at where we’re moving right now. I recently wrote a post for DevOps.com about a term we use, “Code Yellow”. This is one of our tools for dealing with an application, or a team, in crisis. Typically this is due to something like communication problems, or a large amount of tech debt. Since I recently wrote this post, and you all know that I work on Kafka, you can probably guess that I’m currently in this state. In our case, it’s due to somewhat unexpected growth.
  3. LinkedIn started using Kafka back in 2010, before it was open sourced. In September of 2015, we announced that we had hit a milestone, at one trillion messages a day produced into our Kafka clusters. Last year, at Kafka Summit in San Francisco, I noted that we had passed two trillion messages a day. At the beginning of the year, we clocked in at three trillion. And now, we’re over five trillion messages a day. That hockey stick at the end is the current source of my long days and sleepless nights. Top this off with the fact that our monitoring is currently very noisy, partly due to scale problems around this growth, and partly because we alert on many things that are not providing clear signals. We’re currently overhauling our monitoring as a result of this.
  4. So why do we have such noisy alerting? We’ve forgotten that monitoring and alerting are not the same thing.
  5. Today, we're going to be talking about monitoring, not alerting. What is the difference, you ask? In our case, monitoring refers to all the data we have available to us from Kafka and our underlying systems, from high level metrics like partition counts down to the most minute sensor that is available. Alerting, on the other hand, we will use to refer to the metrics that are used to tell us about an imminent problem. They're the metrics that wake us up at night. These should be carefully chosen, and they should be clear signals that demand an immediate response 100% of the time. Another thing to keep in mind is that events are almost always superior to metrics when alerting. We know this, right? Kafka is all about events. And yet we still have measurements that are rates where they should be discrete counts of events. We normally can’t work with individual events, like a failed request, at scale. But we do want to know the actual number of failed requests, and not a requests per second metric where we miss data due to time windows. We also need to make sure that we’re testing the code before we deploy it. My team has fallen prey to reactive alerting: we find a new problem, like a socket leak, and we add a new alert for file handles in use so we can catch it before it goes critical. The bug gets fixed, but we keep the alert, just in case we run into it again. It would be much better for everyone if we added a release test that checks for the general case of increased file handle usage, and dropped the alert on the live systems. Alerting should always be aimed at maximizing the amount of sleep that your operations team gets. That means as few alerts as possible to keep everything running, and automating as much as possible.
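To make the point about discrete counts versus rates a little more concrete, here is a minimal Python sketch. It is not from the talk: it assumes a hypothetical monotonically increasing failed-request counter that is scraped periodically, and it assumes that a drop in the counter means the process restarted and the counter began again from zero.

```python
from typing import Iterable, List


def failures_per_interval(scrapes: Iterable[int]) -> List[int]:
    """Turn successive readings of a cumulative counter into per-interval event counts.

    Assumption (not from the talk): if a reading is lower than the previous one,
    the process restarted and the counter began again at zero, so the new
    reading itself is the number of events since the restart.
    """
    counts: List[int] = []
    previous = None
    for current in scrapes:
        if previous is not None:
            if current >= previous:
                counts.append(current - previous)
            else:
                counts.append(current)  # counter reset
        previous = current
    return counts


def total_failures(scrapes: Iterable[int]) -> int:
    """Exact number of failures observed across all scrape intervals."""
    return sum(failures_per_interval(scrapes))
```

Because the counter is cumulative, no failure is lost between scrapes, which is the property a sampled requests-per-second value does not give you.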
  6. When we're talking about alerting, the most important thing to watch is the metrics related to your service level objectives, or SLOs. Just as a note, an SLO and an SLA are not the same thing. A service level agreement is a contract: it's basically an SLO with teeth - a penalty. The SLO is the level of service that we're promising to our customers. For Kafka, this is typically going to be that the system will be available, and it will perform at a certain level for produce and consume requests. We'll cover what metrics to use for this in a bit. In addition to these, your SLOs are whatever you’re guaranteeing to your customers. This may include a minimum amount of retention. If you’re working to GDPR, or another privacy standard, you may specify a maximum amount of time that data will be retained for (here’s a hint, that’s not necessarily the retention in time that you set for the topic).
  7. I've talked at length about the under-replicated partition count metric. I dedicated a significant number of pages in a book you may have seen to how to respond to any non-zero value. At its heart, this number tells you that the replication within the cluster is having problems.
  8. A stable count on all but one broker tells you that that broker is not working. It's either down, or replication has not started.
  9. A variable count on a single broker tells you that that broker is having a problem servicing consume requests.
  10. A variable count on multiple brokers indicates a more overall problem. In this case, you'll need to enumerate the partitions that are falling behind (using the CLI tools) and see if there is a common thread, such as a single broker that is having problems replicating from multiple cluster members.
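The three patterns in the examples above can be sketched as a small classifier. This is only an illustration under stated assumptions, not tooling from the talk: the function name, the labels it returns, and the input shape (a mapping from a broker name to its recent URP samples, with a down broker showing zeros or no samples) are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def classify_urp(samples: Dict[str, List[int]]) -> Tuple[str, Optional[str]]:
    """Classify per-broker URP samples into one of the patterns from the talk.

    Returns a (label, broker) pair; broker is None when no single broker
    stands out. The labels are hypothetical names chosen for this sketch.
    """
    # Brokers that reported a non-zero URP count at any point in the window.
    nonzero = {b: s for b, s in samples.items() if any(v > 0 for v in s)}
    if not nonzero:
        return ("healthy", None)

    # Brokers whose count never changed during the window.
    stable = {b for b, s in nonzero.items() if len(set(s)) == 1}

    # Stable count on all but one broker: the odd one out is not working.
    if len(nonzero) == len(samples) - 1 and stable == set(nonzero):
        silent = next(b for b in samples if b not in nonzero)
        return ("broker_down", silent)

    # Variable count on a single broker: it struggles to serve consume requests.
    if len(nonzero) == 1:
        (broker,) = nonzero
        if broker not in stable:
            return ("broker_slow_fetch", broker)

    # Anything else: enumerate the lagging partitions and look for a common thread.
    return ("cluster_wide", None)
```

For anything that lands in the last branch, the next step is still the manual one described above: list the partitions that are behind with the CLI tools and look for a shared broker.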
  11. But the most important thing to know about the URP metric is that it is overrated for alerting. That's right, I said it. I don't like getting paged for this metric. But why, you ask? If it illustrates so many problems, why wouldn't I want to get alerts for it? The problem is that it doesn't tell me that I'm breaching my SLO, and whatever problem it's telling me about is often not immediately actionable. More often than not, this metric tells me about two problems. The first is that a broker is down. I can detect that with a much clearer signal, however, by health checking the application. The other problem is that the cluster is operating over its capacity. I don't want to be paged for that either because capacity is a proactive monitoring problem, not a reactive problem. We'll talk about that more in a few slides. Still, you should be collecting this metric, and you might want to consider generating warnings for it. It does illustrate a risky situation, because we depend on replication in the cluster for redundancy. When it's not zero, you have a problem that needs some attention.
  12. As with most applications, Kafka has thread pools to do work. There are several different ones - network handlers, request handlers, log compaction, recovery (which are also used for handling log segments at startup and shutdown). When we’re talking about client traffic, the network and request handlers are the ones that do all the work, and the request handlers are far more important. This is because the network handlers just take care of the network connection, including reading and writing bytes on the wire.
  13. The request handler does everything else for the client: it decodes and validates the protocol, handles produce and consume work, and assembles the response to send back. It even performs all of the broker internal work, responding to controller requests. This means that if you want a single indicator of how busy the broker is, you couldn’t ask for a much better measure than the utilization of the request handlers. But as with under-replicated partitions, there are a lot of different problems that could be indicated here.
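As a rough sketch of how request handler utilization is usually derived: the broker exposes an average idle ratio for the request handler pool (as far as I recall, under the MBean `kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent`; verify the name against your Kafka version). The Python below assumes that value has already been collected and is a fraction between 0 and 1. The helper names are hypothetical.

```python
from typing import Dict, Tuple

# MBean name as recalled from the Kafka monitoring documentation; treat it as
# an assumption and check it against the broker version you run.
IDLE_RATIO_MBEAN = "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"


def handler_utilization(avg_idle_ratio: float) -> float:
    """Utilization is simply the complement of the idle ratio (both in 0..1)."""
    if not 0.0 <= avg_idle_ratio <= 1.0:
        raise ValueError("idle ratio is expected to be a fraction between 0 and 1")
    return 1.0 - avg_idle_ratio


def busiest_broker(idle_by_broker: Dict[str, float]) -> Tuple[str, float]:
    """Return the broker with the highest request handler utilization."""
    broker = min(idle_by_broker, key=lambda b: idle_by_broker[b])
    return broker, handler_utilization(idle_by_broker[broker])
```

For example, an idle ratio of 0.25 corresponds to a utilization of 0.75.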
  14. CPU: Slow disk performance, often due to a failing drive, is a particular problem for produce requests. As the request handler will have to take more time when writing to disk, it will manifest as higher utilization. Timeouts and deadlocks look very similar. Timeouts: All of the request handler threads are getting tied up. We most often see this when the broker is starting up, and it is failing to process requests from the controller within the controller socket timeout. Deadlock: But if that doesn’t solve it, you may have hit a deadlock condition in handling requests. We’ve seen this recently with some shutdown code, but it was related to the authorizer we were using and not Kafka directly.
  15. Here are the produce TotalTime graphs for a broker that is working perfectly well. (Include 50th, 99th, and 999th). If the broker is running well, why is there such a discrepancy? The reason is that the amount of time required for a produce request varies widely depending on the content of the request.
  16. Timeouts most often happen when controller requests are not processed within the controller socket timeout. What happens is that the controller sends the request, it times out, and then the controller sends the request again. You’ll see this especially when the broker is starting up, and the controller is trying to send it the state of the world with leader and ISR requests. Deadlocks look almost identical, but they’re much more rare. We’ve seen them recently during shutdown, but that was caused by an issue in the authorizer module that we use, and not something that was endemic to Kafka itself. However, they’re almost always code issues. This makes them pretty tricky to debug.
  17. Wait, the Kafka brokers don’t compress data anymore! We got rid of that with the bump to message format 1, and relative offsets in the produced batches. Right? Yeah, that’s what I thought, too. Turns out that there are a couple cases, which are not as rare as you might think, that will result in the broker having to rewrite the incoming message batches.
  18. Another common culprit for the request handlers being overutilized, even at a low traffic volume, is compression. This happens when the client versions do not match the message format on disk. The (config name) is settable via a broker configuration, and controls how messages are written to disk. In an ideal world, the producer client version matches this configuration, such that the producer is sending the same message format. If the producer is an older version, the broker will have to up-convert the messages, and if the producer is using a higher message format version the broker will need to down-convert. Both of these situations mean the broker will be forced to recompress the message batch before writing it to disk (this also happens if your brokers are still using message format zero). This is an expensive operation, and should be avoided. It’s also worth noting that you can set the message format on disk as a per-topic override. You will want to be very careful if you feel the need to do this, as it means the logs on disk are inconsistent, and you could easily have compression you’re not expecting.
  19. If you have slow request processing due to issues like this, you’re also going to have latency issues. Which gets us into the third set of metrics. For each protocol request type, Kafka provides a set of timing metrics. These describe the amount of time that the request spends in various states while being processed:
      • Total Time: the overall total time to process a request, from when it is received to when it is complete.
      • Request Queue Time: how long the request sits in queue before being picked up by a request handler for processing.
      • Local Time: the amount of local processing time required for the request. This can include a number of things, such as disk write time for produce requests.
      • Remote Time: the amount of time that the request waits on non-local steps. This includes acknowledgements from followers for produce requests.
      • Response Queue Time: how long the response for the request sits in queue before being sent to the client.
      • Response Send Time: how long it takes to send the response to the client. This only covers getting it into the send buffers locally, not network time.
      In addition to the time metrics, there is also a rate metric that gives you the number of requests of a particular type per second. The time metrics are provided as percentiles, and as such you can choose from 50th, 75th, 99th, and 99.9th percentiles, as well as an average and maximum value over the course of the running process. Request latency is typically going to be the first of your SLO measurements. Which means that you will probably want to be monitoring these metrics and possibly alerting off them. The problem comes in as you try to pick which attributes to monitor, and what the baseline values are.
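For reference, these timing stages are exposed over JMX per request type. The sketch below builds the object names as I recall them from the Kafka monitoring documentation (`kafka.network:type=RequestMetrics,name=...,request=...`); treat the exact names as an assumption and confirm them with a JMX browser against your broker version, since newer releases add tags to some of these beans. The helper functions are hypothetical.

```python
from typing import Dict

# Stage metric names as recalled from the Kafka docs (assumption; verify locally).
TIMING_STAGES = [
    "TotalTimeMs",
    "RequestQueueTimeMs",
    "LocalTimeMs",
    "RemoteTimeMs",
    "ResponseQueueTimeMs",
    "ResponseSendTimeMs",
]


def timing_mbeans(request_type: str) -> Dict[str, str]:
    """Map each timing stage to its JMX object name for one request type."""
    return {
        stage: f"kafka.network:type=RequestMetrics,name={stage},request={request_type}"
        for stage in TIMING_STAGES
    }


def rate_mbean(request_type: str) -> str:
    """JMX object name for the requests-per-second rate of one request type."""
    return f"kafka.network:type=RequestMetrics,name=RequestsPerSec,request={request_type}"


if __name__ == "__main__":
    for stage, name in timing_mbeans("Produce").items():
        print(f"{stage:>20}  {name}")
    print(rate_mbean("Produce"))
```

Each timing bean then carries the percentile, mean, and max attributes mentioned above.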
  20. Here are the produce TotalTime graphs, 50th percentile and 99.9th percentile, for a broker that is working perfectly well. It may be hard to see, but the scale of the first graph is in single digits, and the scale of the second is in thousands. If the broker is running well, why is there such a discrepancy? The reason is that the amount of time required for a produce request varies widely depending on the content of the request.
  21. Let’s consider the local time. Again, these are the 50th percentile and the 99.9th percentile, and the first graph goes from zero to one, while the second graph is again in the thousands. What would impact the amount of time required to process the produce requests locally? In this case, most of our produce requests are really small - small batches, single topic - but some of them are very large. The bigger the produce request, the more time it takes to write the data to disk.
  22. How about the remote time for the same produce requests? Yet again, these are the 50th and 99.9th percentile graphs, with the first one being from zero to two, and the second being in the thousands. The average value is small, but the 99.9th percentile is multiple orders of magnitude higher. The most common cause here is that most of our requests are being produced with the required acknowledgements being set to 1, while some are requesting all acknowledgements. That easily drives up the amount of time spent in the remote step. This isn’t to say that you can’t use these metrics effectively for alerting. It just means that you need to define your SLOs appropriately. Stating simply that produce requests will be handled in 20ms or less may not be reasonable, but specifying that value for the average produce request may be fine.
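A small worked example shows why the same 20ms target can be reasonable for the average and unreasonable for the tail. The numbers are invented for illustration (they are not LinkedIn data): 9,980 small requests at 2ms and 20 large or acks=all requests at 3,000ms. The percentile uses the simple nearest-rank method.

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Invented workload: mostly tiny produce requests plus a few very slow ones.
latencies_ms = [2.0] * 9980 + [3000.0] * 20
SLO_MS = 20.0

avg = mean(latencies_ms)                # 7.996 ms
p50 = percentile(latencies_ms, 50)      # 2.0 ms
p999 = percentile(latencies_ms, 99.9)   # 3000.0 ms

if __name__ == "__main__":
    print(f"mean={avg:.3f}ms meets SLO: {avg <= SLO_MS}")
    print(f"p50={p50}ms meets SLO: {p50 <= SLO_MS}")
    print(f"p99.9={p999}ms meets SLO: {p999 <= SLO_MS}")
```

With this mix, the mean and median comfortably meet the 20ms objective while the 99.9th percentile misses it by two orders of magnitude, even though the broker is healthy.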
  23. OK, so we’ve covered our three metrics, and we’ve still got X minutes left in this talk. I could sit here and just stare at my phone for the rest of the time. Or …
  24. We could talk about what’s missing, since we only covered a very small slice of monitoring for Kafka.
  25. The other side of your service level objectives is probably going to be the availability of Kafka to handle requests. But as with any system, you can’t truly measure the availability of a Kafka cluster from the brokers themselves. There are many factors that go into availability, including whether or not the network is working. Looking at the broker itself may tell you that everything’s fine, meanwhile none of your clients can connect. For monitoring availability, you need to use something external to the Kafka cluster to look at it from the client’s point of view. This is why LinkedIn created, and open sourced, kafka-monitor (https://github.com/linkedin/kafka-monitor). This runs a producer and a consumer for each cluster, and verifies that both requests work properly. It can ensure that there is at least one partition on each broker in the cluster, so you check the entire cluster. It also provides latency metrics for requests, so you have an objective view of the request timings we were just talking about.
  26. So what should we do about lower level OS and hardware metrics? Well, let me ask you this. I have a Kafka cluster that’s running at 95% CPU, what do I do? Well, if it’s serving requests properly and within the SLO, I go get a cup of coffee. I might need to look at it, but it’s not a crisis. Most metrics, OS or otherwise, are a great recipe for creating lots of alert noise that is not actionable. CPU and memory usage could be high due to other applications, and in most cases relate to overall capacity and not to the application’s performance or current state of functionality. You should definitely collect them so that you can go back and debug problems later. If you’re thinking about setting up an alert you need to ask yourself two things: Is this always actionable when the alert goes off? Is the action 100% clear? If the answer to either of these is something along the lines of “Yes, but…” you need to stop and rethink what you’re trying to accomplish. But, Todd! I need to monitor things like disk usage, don’t I? Yes, of course we do, but this falls under the heading of capacity planning.
  27. My Kafka environment, like many of yours, is shared between many different applications. You may even have some of the tech debt that we have, where you have little control over when someone starts using it for a new service. This means that we should be keeping an eye on the capacity of the system, and preemptively adding more. Preemptively is the key word here. You want to deploy new brokers before you’ve hit 100% capacity, which means that you need to order them earlier than that.
  28. I am no magician, contrary to the perception that many have of my ability to solve problems. It does me no good to get an alarm in the middle of the night that we’re approaching saturation, as I can’t magically make new hardware appear. And if I already have the hardware, it should have been added to the clusters so that I never hit a crisis point.
  29. The metrics that I’m mostly interested in for judging capacity are:
      • Request handler pool idle ratio
      • Disk utilization
      • Partition count
      • Network utilization
      You should be trending these metrics over time, and reviewing them on a regular basis. You may want to have some sort of alert once capacity is approaching a point where you need to get more, but that should be an email, or even better, an automatic work ticket in your system of choice. Additionally, make sure you’re making use of features like quotas and retention of messages by size so that you can minimize any surprises.
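One way to turn "trending these metrics over time" into a ticket rather than a page is to fit a straight line to recent daily samples and project when a threshold will be crossed. The Python below is a minimal sketch under that assumption. Linear growth is a simplification (the hockey stick earlier in the talk is a good reminder of that), and the function name and the 80% threshold in the usage note are hypothetical.

```python
from typing import List, Optional


def days_until_threshold(daily_values: List[float], threshold: float) -> Optional[float]:
    """Project how many days after the last sample a least-squares trend line reaches threshold.

    Returns None when there are too few samples or the trend is flat or falling,
    and 0.0 when the trend line is already at or above the threshold.
    """
    n = len(daily_values)
    if n < 2:
        return None
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(daily_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_values))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (n - 1))
```

For example, disk utilization samples of 50%, 51%, ..., 59% over ten days project to reach an 80% threshold about 21 days after the last sample, which is the lead time you would compare against how long it takes to order and deploy new brokers.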
  30. If you take nothing else away from today’s talk, leave with this. First, you must define what your service level objectives are for Kafka within your organization. Even if you’re running at a small scale, and with a limited number of customers. Even if you’re the only customer of your cluster. Make it clear what the expectations are, and hold to them. Next, once you have those SLOs, that is what you need to be monitoring. David Henke, who led Engineering and Operations at LinkedIn for many years, would often say “What gets measured, gets fixed.” If you do not monitor your SLOs, then they do not really count. But beware of metrics that inform you of many different problems. They are typically noisy, and they often make it difficult to determine what the underlying problem is. They are attractive, because each is a single number that says “something is wrong”, but they will drive you crazy in the end. And lastly, buy yourself a copy of Kafka: The Definitive Guide. In fact, you should buy two or three. Because reasons.