SlideShare uma empresa Scribd logo
1 de 38
Baixar para ler offline
An Introduction to Prometheus
Time Series Denver - May 30, 2018
Introduction
● CTO & Co-Founder - FreshTracks.io - A CA Accelerator Incubation
○ “Simplifying Kubernetes Visibility”
● bob@freshtracks.io
● @bob_cotton
● Father, Fly Fisher & Avid Homebrewer
Agenda
● What is a Cloud Native Application?
● Cloud Native Application Challenges
● The 5 Pillars of Monitoring
● An Introduction to Prometheus
● What FreshTracks Provides
What is a Cloud Native Application?
Cloud Native Application
● Follows 12 Factor Application Practices
● Packaged into containers
● Follows a micro-service architecture
● Managed by a Container Orchestration
○ Kubernetes, Docker Swarm, Mesos
● Usually deployed on dynamic
infrastructure
○ VMWare
○ Cloud providers
● Application lifecycle allows for
○ Auto-provisioning
○ Auto-scaling
○ Auto-redundancy
Cloud Native Applications Challenges
Cloud Native Challenges
● Containers are ephemeral
○ Scheduled on any node in the cluster
○ Move Frequently on restarts and deployments
● Kubernetes needs to be monitored
● Kubernetes brings additional complexities
○ Resource Quotas
○ Pod and Cluster Scaling
● Challenges traditional tools
5 Pillars of Monitoring
The 5 Pillars of Monitoring
Metrics and
Alerting Log Analytics
Distributed
Tracing
Application
Performance
Monitoring
Real User
Monitoring
Enter Prometheus
Prometheus
● Started in 2012 at SoundCloud by ex-Google Engineers
○ Open Sourced in 2015
● Patterned after “BorgMon” - Google’s Container monitoring system
● Second project accepted into the CNCF after Kubernetes
● Adoption surge is tracking Kubernetes
○ 63% of teams using Kubernetes use Prometheus
Prometheus Major Features
● Label/value based time series data model
● “Pull based” metrics collection
● Service discovery mechanism
● Simple metrics format with a rich set of “exporters”
● Extremely high-performance TSDB
● Extensive query language - PromQL
● Alert Manager
● Easily installable from Helm
○ Single, statically linked binary
● Open Source Grafana used for visualization
Time Series Data Model
<identifier> → [(t0, v0), (t1, v1), (t2, v2) …]
Identifier is a collection of label/value pairs
Time stored as int64 - Millis since the epoch
Values stored as float64
Efficient storage on disk -- 1.3 bytes/sample
Label/Value Based Data Model
● Graphite/StatsD
○ apache.192-168-5-1.home.200.http_request_total
○ apache.192-168-5-1.home.500.http_request_total
○ apache.192-168-5-1.about.200.http_request_total
● Prometheus
○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/home”, status=”200”}
○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/home”, status=”500”}
○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/about”, status=”200”}
● Selecting Series
○ *.*.home.200.*.http_requests_total
○ http_requests_total{status=”200”, path=”/home”}
Client Data Model
● Counters
○ Always go up or get reset to 0
● Gauge
○ Tracks a real value e.g. temperature
● Histogram and Summary
○ Used for percentiles
Prometheus Service Discovery and Target Scrape
Prometheus
K8s API Server
TSDB
Kublet
(cAdvisor)
node-exporter
kube_state_metrics
App containers
other exporters
node_exporter
App containers
Kublet
(cAdvisor)
Service Discovery
Prometheus Exposition Format and Exporters
● The Prometheus exposition format - Text over http. Simple, human readable
● Supported by Sysdig and the TICK collector
○ Efforts to make it a standard
● Close to 100 exporters for various technologies
● The jmx_exporter can cover any Java/JMX application
● https://prometheus.io/docs/instrumenting/exporters/
Official Exporters:
● node_exporter
● jmx_exporter
● snmp_exporter
● haproxy_exporter
● cloudwatch_exporter
● collectd_exporter
● mysql_exporter
● memcached_exporter
Querying Series with PromQL
● PromQL is a functional query language. Nothing like SQL
rate(http_requests_total[5m])
select job, instance, path, status
rate(value, 5m)
FROM http_requests_total;
Querying Series with PromQL
Calculate a ratio of website hits to failures:
sum(rate(http_requests_total{status=”500”}[5m])) by (path) /
sum(rate(http_requests_total[5m])) by (path)
{path=”/home”} 0.014
{path=”/about”} 0.027
Graphing
Dashboards with Grafana
@bob_cotton@bob_cotton
Labels, Re-Label and Recording Rules
Oh My...
Label/Value Based Data Model
● Graphite/StatsD
○ apache.192-168-5-1.home.200.http_request_total
○ apache.192-168-5-1.home.500.http_request_total
○ apache.192-168-5-1.about.200.http_request_total
● Prometheus
○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/home”, status=”200”}
○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/home”, status=”500”}
○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/about”, status=”200”}
● Selecting Series
○ *.*.home.200.*.http_requests_total
○ http_requests_total{status=”200”, path=”/home”}
@bob_cotton
Kubernetes Labels
● Kubernetes gives us labels on all the things
● Our scrape targets live in the context of the K8s labels
○ This comes from service discovery
● We want to enhance the scraped metric labels with K8s labels
● This is why we need relabel rules in Prometheus
@bob_cotton
K8s API Server
TSDB
Scrape Target
Service Discovery
Prometheus
0="{__address__ 300.196.17.41}"
1="{__meta_kubernetes_namespace default}"
2="{__meta_kubernetes_pod_annotation_freshtracks_io_data_sidecar true}"
3="{__meta_kubernetes_pod_annotation_freshtracks_io_path /metrics2}"
4="{__meta_kubernetes_pod_annotation_kubernetes_io_created_by "kind":"SerializedReference"?}"
5="{__meta_kubernetes_pod_annotation_kubernetes_io_limit_ranger LimitRanger plugin set: cpu
request for container prometheus-configmap-reload; cpu request for container data-sidecar}"
6="{__meta_kubernetes_pod_annotation_prometheus_io_port 8077}"
7="{__meta_kubernetes_pod_annotation_prometheus_io_scrape false}"
8="{__meta_kubernetes_pod_container_name prometheus-configmap-reload}"
9="{__meta_kubernetes_pod_host_ip 172.20.42.119}"
10="{__meta_kubernetes_pod_ip 100.96.17.41}"
11="{__meta_kubernetes_pod_label_freshtracks_io_cluster bowl.freshtracks.io}"
12="{__meta_kubernetes_pod_label_pod_template_hash 1636686694}"
13="{__meta_kubernetes_pod_label_run data-sidecar}"
14="{__meta_kubernetes_pod_name data-sidecar-1636686694-83crm}"
15="{__meta_kubernetes_pod_node_name ip-xx-xxx-xx-xxx.us-west-2.compute.internal}"
16="{__meta_kubernetes_pod_ready false}"
17="{__metrics_path__ /metrics}"
18="{__scheme__ http}"
19="{job ftio-data-sidecar-calc}"
<relabel_config>
{__address__ 300.196.17.41:8077}
{__scheme__ http}
{__metrics_path__ /metrics}
{job ftio-data-sidecar-calc}
{kubernetes_namespace default}
{container_name prometheus-configmap-reload}
http_requests_total{region=”us-east”,
az=”us-east-1”, instance_type=”m2.xlarge”,
instance=”i-3582k8”, hostname=”host1”} = 5439
http_requests_total{region=”us-east”,
az=”us-east-1”,
instance_type=”m2.xlarge”,
instance=”i-3582k8”,
hostname=”host1”,
instance=”300.196.17.41:8077”,
job=”ftio-data-sidecar-calc”,
kubernetes_namespace=”default”,
container_name=”prometheus-configmap-reload”,
} = 5439
<metric_relabel_config>
Recording Rules - Derivative Series
● New series can be generated by querying existing series and storing them
path:request_failures_per_requests:ratio_rate5m =
sum(rate(http_requests_total{status=”500”}[5m])) by (path) 
sum(rate(http_requests_total[5m])) by (path)
High Availability
Prometheus
Prometheus
Federation
Prometheus
Prometheus
Prometheus
Prometheus
Prometheus
Prometheus
Prometheus
Prometheus
Subset of Metrics
Long Term Storage and External Integrations
Prometheus
remote_write
● AppOptics: write
● Chronix: write
● Cortex: read and write
● CrateDB: read and write
● Elasticsearch: write
● Gnocchi: write
● Graphite: write
● InfluxDB: read and write
● OpenTSDB: write
● PostgreSQL/TimescaleD
B: read and write
● SignalFx: write
remote_read
Alerting
Alert Definition
ALERT <alert name>
EXPR <expression>
[ FOR <duration> ]
[ LABELS <label set> ]
[ ANNOTATIONS <labelset> ]
ALERT: IngesterCrowding
EXPR: count by(ft_cluster, node)
(cortex_ingester_ingested_samples_total) > 1
FOR: 30m
LABELS: severity: critical
ANNOTATIONS:
description:
https://github.com/Fresh-Tracks/gke-configs/blob/master
/docs/alerts.md#ingestercrowding
summary: Node {{ $labels.node }} is hosting {{ $value
}} ingester pods
Alert Manager
● Deduplication
● Grouping
● Routing
● Suppression
Alert Manager
Prometheus
Prometheus
Alert Manager
Alert Manager
PagerDuty
VictorOps
Slack
FreshTracks.io
Simplifying Kubernetes Visibility
Filling the Gaps
● A small Kubernetes cluster generate > 500K unique samples
○ Which metrics are important?
● Performance of any one container is easy
○ How is the whole microservice behaving? Node? Cluster?
● Prometheus has no anomaly detection
● Dashboard creation is tedious, even if you know what to watch
● How is my service behaving in the context of the cluster?
○ How do node/container/application metrics correlate to each other?
Kubernetes Hierarchy Visibility
Namespace
Workload
Pod
Container
(Workload can be a deployment,
replicaSet, statefulSet,
daemonSet or similar)
Demo
Thanks!
We’re Hiring!

Mais conteúdo relacionado

Mais procurados

Let's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and ElasticsearchLet's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and ElasticsearchInfluxData
 
PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo...
PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo...PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo...
PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo...PGConf APAC
 
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...InfluxData
 
Scaling Up Logging and Metrics
Scaling Up Logging and MetricsScaling Up Logging and Metrics
Scaling Up Logging and MetricsRicardo Lourenço
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...Altinity Ltd
 
Streaming millions of Contact Center interactions in (near) real-time with Pu...
Streaming millions of Contact Center interactions in (near) real-time with Pu...Streaming millions of Contact Center interactions in (near) real-time with Pu...
Streaming millions of Contact Center interactions in (near) real-time with Pu...Frank Kelly
 
Inside the InfluxDB storage engine
Inside the InfluxDB storage engineInside the InfluxDB storage engine
Inside the InfluxDB storage engineInfluxData
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Waysmalltown
 
Anatomy of an action
Anatomy of an actionAnatomy of an action
Anatomy of an actionGordon Chung
 
2016 08-30 Kubernetes talk for Waterloo DevOps
2016 08-30 Kubernetes talk for Waterloo DevOps2016 08-30 Kubernetes talk for Waterloo DevOps
2016 08-30 Kubernetes talk for Waterloo DevOpscraigbox
 
Managing Container Clusters in OpenStack Native Way
Managing Container Clusters in OpenStack Native WayManaging Container Clusters in OpenStack Native Way
Managing Container Clusters in OpenStack Native WayQiming Teng
 
Taking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes ClusterTaking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes ClusterChristopher Bradford
 
InfluxDB & Kubernetes
InfluxDB & KubernetesInfluxDB & Kubernetes
InfluxDB & KubernetesInfluxData
 
OpenStack Nova Scheduler
OpenStack Nova Scheduler OpenStack Nova Scheduler
OpenStack Nova Scheduler Peeyush Gupta
 
Scaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudScaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudChangshu Liu
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNBelmiro Moreira
 
Our Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.doOur Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.doMetehan Çetinkaya
 
Mixing Metrics and Logs with Grafana + Influx by David Kaltschmidt, Director ...
Mixing Metrics and Logs with Grafana + Influx by David Kaltschmidt, Director ...Mixing Metrics and Logs with Grafana + Influx by David Kaltschmidt, Director ...
Mixing Metrics and Logs with Grafana + Influx by David Kaltschmidt, Director ...InfluxData
 
Advanced kapacitor
Advanced kapacitorAdvanced kapacitor
Advanced kapacitorInfluxData
 
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | BonitooInfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | BonitooInfluxData
 

Mais procurados (20)

Let's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and ElasticsearchLet's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and Elasticsearch
 
PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo...
PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo...PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo...
PGConf APAC 2018 - Managing replication clusters with repmgr, Barman and PgBo...
 
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
 
Scaling Up Logging and Metrics
Scaling Up Logging and MetricsScaling Up Logging and Metrics
Scaling Up Logging and Metrics
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Streaming millions of Contact Center interactions in (near) real-time with Pu...
Streaming millions of Contact Center interactions in (near) real-time with Pu...Streaming millions of Contact Center interactions in (near) real-time with Pu...
Streaming millions of Contact Center interactions in (near) real-time with Pu...
 
Inside the InfluxDB storage engine
Inside the InfluxDB storage engineInside the InfluxDB storage engine
Inside the InfluxDB storage engine
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
 
Anatomy of an action
Anatomy of an actionAnatomy of an action
Anatomy of an action
 
2016 08-30 Kubernetes talk for Waterloo DevOps
2016 08-30 Kubernetes talk for Waterloo DevOps2016 08-30 Kubernetes talk for Waterloo DevOps
2016 08-30 Kubernetes talk for Waterloo DevOps
 
Managing Container Clusters in OpenStack Native Way
Managing Container Clusters in OpenStack Native WayManaging Container Clusters in OpenStack Native Way
Managing Container Clusters in OpenStack Native Way
 
Taking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes ClusterTaking Your Database Beyond the Border of a Single Kubernetes Cluster
Taking Your Database Beyond the Border of a Single Kubernetes Cluster
 
InfluxDB & Kubernetes
InfluxDB & KubernetesInfluxDB & Kubernetes
InfluxDB & Kubernetes
 
OpenStack Nova Scheduler
OpenStack Nova Scheduler OpenStack Nova Scheduler
OpenStack Nova Scheduler
 
Scaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudScaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in Cloud
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERN
 
Our Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.doOur Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.do
 
Mixing Metrics and Logs with Grafana + Influx by David Kaltschmidt, Director ...
Mixing Metrics and Logs with Grafana + Influx by David Kaltschmidt, Director ...Mixing Metrics and Logs with Grafana + Influx by David Kaltschmidt, Director ...
Mixing Metrics and Logs with Grafana + Influx by David Kaltschmidt, Director ...
 
Advanced kapacitor
Advanced kapacitorAdvanced kapacitor
Advanced kapacitor
 
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | BonitooInfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
 

Semelhante a Time series denver an introduction to prometheus

Elasticsearch on Kubernetes
Elasticsearch on KubernetesElasticsearch on Kubernetes
Elasticsearch on KubernetesJoerg Henning
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKevin Lynch
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusTobias Schmidt
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
OSMC 2021 | pg_stat_monitor: A cool extension for better database (PostgreSQL...
OSMC 2021 | pg_stat_monitor: A cool extension for better database (PostgreSQL...OSMC 2021 | pg_stat_monitor: A cool extension for better database (PostgreSQL...
OSMC 2021 | pg_stat_monitor: A cool extension for better database (PostgreSQL...NETWAYS
 
DevOps Braga #15: Agentless monitoring with icinga and prometheus
DevOps Braga #15: Agentless monitoring with icinga and prometheusDevOps Braga #15: Agentless monitoring with icinga and prometheus
DevOps Braga #15: Agentless monitoring with icinga and prometheusDevOps Braga
 
Introduction to Container Storage Interface (CSI)
Introduction to Container Storage Interface (CSI)Introduction to Container Storage Interface (CSI)
Introduction to Container Storage Interface (CSI)Idan Atias
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowTatiana Al-Chueyr
 
Taking Cloud to Extremes: Scaled-down, Highly Available, and Mission-critical...
Taking Cloud to Extremes: Scaled-down, Highly Available, and Mission-critical...Taking Cloud to Extremes: Scaled-down, Highly Available, and Mission-critical...
Taking Cloud to Extremes: Scaled-down, Highly Available, and Mission-critical...Altoros
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performanceEngine Yard
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Sharma Podila
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesWeaveworks
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Operator Lifecycle Management
Operator Lifecycle ManagementOperator Lifecycle Management
Operator Lifecycle ManagementDoKC
 
Operator Lifecycle Management
Operator Lifecycle ManagementOperator Lifecycle Management
Operator Lifecycle ManagementDoKC
 
ML-Based SQL Query Resource Usage Prediction
ML-Based SQL Query Resource Usage PredictionML-Based SQL Query Resource Usage Prediction
ML-Based SQL Query Resource Usage PredictionAlluxio, Inc.
 
Macy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightMacy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightDataStax Academy
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kevin Lynch
 

Semelhante a Time series denver an introduction to prometheus (20)

Elasticsearch on Kubernetes
Elasticsearch on KubernetesElasticsearch on Kubernetes
Elasticsearch on Kubernetes
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the Datacenter
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
OSMC 2021 | pg_stat_monitor: A cool extension for better database (PostgreSQL...
OSMC 2021 | pg_stat_monitor: A cool extension for better database (PostgreSQL...OSMC 2021 | pg_stat_monitor: A cool extension for better database (PostgreSQL...
OSMC 2021 | pg_stat_monitor: A cool extension for better database (PostgreSQL...
 
DevOps Braga #15: Agentless monitoring with icinga and prometheus
DevOps Braga #15: Agentless monitoring with icinga and prometheusDevOps Braga #15: Agentless monitoring with icinga and prometheus
DevOps Braga #15: Agentless monitoring with icinga and prometheus
 
Introduction to istio
Introduction to istioIntroduction to istio
Introduction to istio
 
Introduction to Container Storage Interface (CSI)
Introduction to Container Storage Interface (CSI)Introduction to Container Storage Interface (CSI)
Introduction to Container Storage Interface (CSI)
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
Taking Cloud to Extremes: Scaled-down, Highly Available, and Mission-critical...
Taking Cloud to Extremes: Scaled-down, Highly Available, and Mission-critical...Taking Cloud to Extremes: Scaled-down, Highly Available, and Mission-critical...
Taking Cloud to Extremes: Scaled-down, Highly Available, and Mission-critical...
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slides
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Operator Lifecycle Management
Operator Lifecycle ManagementOperator Lifecycle Management
Operator Lifecycle Management
 
Operator Lifecycle Management
Operator Lifecycle ManagementOperator Lifecycle Management
Operator Lifecycle Management
 
ML-Based SQL Query Resource Usage Prediction
ML-Based SQL Query Resource Usage PredictionML-Based SQL Query Resource Usage Prediction
ML-Based SQL Query Resource Usage Prediction
 
Macy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightMacy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-Flight
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 

Último

Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 

Último (20)

Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 

Time series denver an introduction to prometheus

  • 1. An Introduction to Prometheus Time Series Denver - May 30, 2018
  • 2. Introduction ● CTO & Co-Founder - FreshTracks.io - A CA Accelerator Incubation ○ “Simplifying Kubernetes Visibility” ● bob@freshtracks.io ● @bob_cotton ● Father, Fly Fisher & Avid Homebrewer
  • 3. Agenda ● What is a Cloud Native Application? ● Cloud Native Application Challenges ● The 5 Pillars of Monitoring ● An Introduction to Prometheus ● What FreshTracks Provides
  • 4. What is a Cloud Native Application?
  • 5. Cloud Native Application ● Follows 12 Factor Application Practices ● Packaged into containers ● Follows a micro-service architecture ● Managed by a Container Orchestration ○ Kubernetes, Docker Swarm, Mesos ● Usually deployed on dynamic infrastructure ○ VMWare ○ Cloud providers ● Application lifecycle allows for ○ Auto-provisioning ○ Auto-scaling ○ Auto-redundancy
  • 7. Cloud Native Challenges ● Containers are ephemeral ○ Scheduled on any node in the cluster ○ Move Frequently on restarts and deployments ● Kubernetes needs to be monitored ● Kubernetes brings additional complexities ○ Resource Quotas ○ Pod and Cluster Scaling ● Challenges traditional tools
  • 8. 5 Pillars of Monitoring
  • 9. The 5 Pillars of Monitoring Metrics and Alerting Log Analytics Distributed Tracing Application Performance Monitoring Real User Monitoring
  • 11. Prometheus ● Started in 2012 at SoundCloud by ex-Google Engineers ○ Open Sourced in 2015 ● Patterned after “BorgMon” - Google’s Container monitoring system ● Second project accepted into the CNCF after Kubernetes ● Adoption surge is tracking Kubernetes ○ 63% of teams using Kubernetes use Prometheus
  • 12. Prometheus Major Features ● Label/value based time series data model ● “Pull based” metrics collection ● Service discovery mechanism ● Simple metrics format with a rich set of “exporters” ● Extremely high-performance TSDB ● Extensive query language - PromQL ● Alert Manager ● Easily installable from Helm ○ Single, statically linked binary ● Open Source Grafana used for visualization
  • 13. Time Series Data Model <identifier> → [(t0, v0), (t1, v1), (t2, v2) …] Identifier is a collection of label/value pairs Time stored as int64 - Millis since the epoch Values stored as float64 Efficient storage on disk -- 1.3 bytes/sample
  • 14. Label/Value Based Data Model ● Graphite/StatsD ○ apache.192-168-5-1.home.200.http_request_total ○ apache.192-168-5-1.home.500.http_request_total ○ apache.192-168-5-1.about.200.http_request_total ● Prometheus ○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/home”, status=”200”} ○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/home”, status=”500”} ○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/about”, status=”200”} ● Selecting Series ○ *.*.home.200.*.http_requests_total ○ http_requests_total{status=”200”, path=”/home”}
  • 15. Client Data Model ● Counters ○ Always go up or get reset to 0 ● Gauge ○ Tracks a real value e.g. temperature ● Histogram and Summary ○ Used for percentiles
  • 16. Prometheus Service Discovery and Target Scrape Prometheus K8s API Server TSDB Kublet (cAdvisor) node-exporter kube_state_metrics App containers other exporters node_exporter App containers Kublet (cAdvisor) Service Discovery
  • 17. Prometheus Exposition Format and Exporters ● The Prometheus exposition format - Text over http. Simple, human readable ● Supported by Sysdig and the TICK collector ○ Efforts to make it a standard ● Close to 100 exporters for various technologies ● The jmx_exporter can cover any Java/JMX application ● https://prometheus.io/docs/instrumenting/exporters/ Official Exporters: ● node_exporter ● jmx_exporter ● snmp_exporter ● haproxy_exporter ● cloudwatch_exporter ● collectd_exporter ● mysql_exporter ● memcached_exporter
  • 18. Querying Series with PromQL ● PromQL is a functional query language. Nothing like SQL rate(http_requests_total[5m]) select job, instance, path, status rate(value, 5m) FROM http_requests_total;
  • 19. Querying Series with PromQL Calculate a ratio of website hits to failures: sum(rate(http_requests_total{status=”500”}[5m])) by (path) / sum(rate(http_requests_total[5m])) by (path) {path=”/home”} 0.014 {path=”/about”} 0.027
  • 23. Label/Value Based Data Model ● Graphite/StatsD ○ apache.192-168-5-1.home.200.http_request_total ○ apache.192-168-5-1.home.500.http_request_total ○ apache.192-168-5-1.about.200.http_request_total ● Prometheus ○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/home”, status=”200”} ○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/home”, status=”500”} ○ http_request_total{job=”apache”, instance=”192.168.5.1”, path=”/about”, status=”200”} ● Selecting Series ○ *.*.home.200.*.http_requests_total ○ http_requests_total{status=”200”, path=”/home”}
  • 24. @bob_cotton Kubernetes Labels ● Kubernetes gives us labels on all the things ● Our scrape targets live in the context of the K8s labels ○ This comes from service discovery ● We want to enhance the scraped metric labels with K8s labels ● This is why we need relabel rules in Prometheus
  • 25. @bob_cotton K8s API Server TSDB Scrape Target Service Discovery Prometheus 0="{__address__ 300.196.17.41}" 1="{__meta_kubernetes_namespace default}" 2="{__meta_kubernetes_pod_annotation_freshtracks_io_data_sidecar true}" 3="{__meta_kubernetes_pod_annotation_freshtracks_io_path /metrics2}" 4="{__meta_kubernetes_pod_annotation_kubernetes_io_created_by "kind":"SerializedReference"?}" 5="{__meta_kubernetes_pod_annotation_kubernetes_io_limit_ranger LimitRanger plugin set: cpu request for container prometheus-configmap-reload; cpu request for container data-sidecar}" 6="{__meta_kubernetes_pod_annotation_prometheus_io_port 8077}" 7="{__meta_kubernetes_pod_annotation_prometheus_io_scrape false}" 8="{__meta_kubernetes_pod_container_name prometheus-configmap-reload}" 9="{__meta_kubernetes_pod_host_ip 172.20.42.119}" 10="{__meta_kubernetes_pod_ip 100.96.17.41}" 11="{__meta_kubernetes_pod_label_freshtracks_io_cluster bowl.freshtracks.io}" 12="{__meta_kubernetes_pod_label_pod_template_hash 1636686694}" 13="{__meta_kubernetes_pod_label_run data-sidecar}" 14="{__meta_kubernetes_pod_name data-sidecar-1636686694-83crm}" 15="{__meta_kubernetes_pod_node_name ip-xx-xxx-xx-xxx.us-west-2.compute.internal}" 16="{__meta_kubernetes_pod_ready false}" 17="{__metrics_path__ /metrics}" 18="{__scheme__ http}" 19="{job ftio-data-sidecar-calc}" <relabel_config> {__address__ 300.196.17.41:8077} {__scheme__ http} {__metrics_path__ /metrics} {job ftio-data-sidecar-calc} {kubernetes_namespace default} {container_name prometheus-configmap-reload} http_requests_total{region=”us-east”, az=”us-east-1”, instance_type=”m2.xlarge”, instance=”i-3582k8”, hostname=”host1”} = 5439 http_requests_total{region=”us-east”, az=”us-east-1”, instance_type=”m2.xlarge”, instance=”i-3582k8”, hostname=”host1”, instance=”300.196.17.41:8077”, job=”ftio-data-sidecar-calc”, kubernetes_namespace=”default”, container_name=”prometheus-configmap-reload”, } = 5439 <metric_relabel_config>
  • 26. Recording Rules - Derivative Series ● New series can be generated by querying existing series and storing them path:request_failures_per_requests:ratio_rate5m = sum(rate(http_requests_total{status=”500”}[5m])) by (path) sum(rate(http_requests_total[5m])) by (path)
  • 29. Long Term Storage and External Integrations Prometheus remote_write ● AppOptics: write ● Chronix: write ● Cortex: read and write ● CrateDB: read and write ● Elasticsearch: write ● Gnocchi: write ● Graphite: write ● InfluxDB: read and write ● OpenTSDB: write ● PostgreSQL/TimescaleD B: read and write ● SignalFx: write remote_read
  • 31. Alert Definition ALERT <alert name> EXPR <expression> [ FOR <duration> ] [ LABELS <label set> ] [ ANNOTATIONS <labelset> ] ALERT: IngesterCrowding EXPR: count by(ft_cluster, node) (cortex_ingester_ingested_samples_total) > 1 FOR: 30m LABELS: severity: critical ANNOTATIONS: description: https://github.com/Fresh-Tracks/gke-configs/blob/master /docs/alerts.md#ingestercrowding summary: Node {{ $labels.node }} is hosting {{ $value }} ingester pods
  • 32. Alert Manager ● Deduplication ● Grouping ● Routing ● Suppression
  • 33. Alert Manager Prometheus Prometheus Alert Manager Alert Manager PagerDuty VictorOps Slack
  • 35. Filling the Gaps ● A small Kubernetes cluster generate > 500K unique samples ○ Which metrics are important? ● Performance of any one container is easy ○ How is the whole microservice behaving? Node? Cluster? ● Prometheus has no anomaly detection ● Dashboard creation is tedious, even if you know what to watch ● How is my service behaving in the context of the cluster? ○ How do node/container/application metrics correlate to each other?
  • 36. Kubernetes Hierarchy Visibility Namespace Workload Pod Container (Workload can be a deployment, replicaSet, statefulSet, daemonSet or similar)
  • 37. Demo