This presentation was given by Art van Scheppingen at Percona Live 2017 in Santa Clara, CA, and covers what you need to know to monitor MongoDB effectively.
MongoDB Monitoring - Become a MongoDB DBA
1. Copyright 2017 Severalnines AB
MongoDB Monitoring
Art van Scheppingen
Senior Support Engineer, Severalnines
Become a MongoDB DBA - Monitoring Essentials
● Monitoring and trending
● Why do we collect data?
● What metrics to collect from MongoDB?
● Key MongoDB metrics in depth
● Available MongoDB monitoring tools
● How to monitor MongoDB using ClusterControl
Agenda
There is only one person who can land a plane without instruments
● Monitoring system (e.g. Nagios)
○ Checks if services are healthy
○ Sends pages
● Trending system (e.g. Cacti, Graphite, Prometheus)
○ Collects metrics
○ Generates graphs
Monitoring vs Trending
● Do more than just opening a connection
○ Measure true status of nodes and cluster
○ Test read/write
○ Open essential databases and collections
○ Keep an eye on the replication lag
■ Increase oplog size?
○ Check the full topology
Monitoring: Availability
● Trending
○ Plot trends of key (performance) metrics
○ Create timelines of metrics
○ Correlate various metrics
○ Find problems before they arise
○ Pre-emptive problem management
● Trending tools
○ Granularity of sampling
○ More datapoints = better
Trending: why do we need trends?
● Periodical (daily/weekly) healthchecks
● Insight into all aspects of the database operations
● Post mortem and proactive monitoring
● Capacity planning
Why do we collect data?
● Healthchecks are a pain
● You want to see aggregated data
● You want to be able to drill down to a particular host
● You want to see the most important data first and dig in later on
Healthchecks
● Ability to dig into past data
● Sampling granularity of 5 seconds or less (hardware-dependent)
● Fine-grained sampling lets you catch an issue as it evolves, with no need to wait 5 minutes for a graph to refresh
Post mortem and proactive monitoring
● Graphs based on MongoDB status metrics
● Overall status and per-node graphs
● Ability to get time-shifted graphs, useful for comparing workload changes over time
Insight into internals, capacity planning
● Quite similar to other database systems
○ Host metrics
○ Operational metrics
○ Storage engine metrics
○ Replication metrics
○ Shard metrics
Type of metrics to collect
● Similar to most other databases
● Understand the utilization of the machine
● Capacity planning
● Determine the type of an issue
○ I/O related?
○ CPU related?
○ Network related?
Host metrics: what for?
● CPU utilization (should I add more nodes to the cluster?)
● Network utilization (am I running out of bandwidth?)
● Ping (how badly does latency affect my MongoDB cluster?)
● Disk throughput and IOPS (am I within my hardware limits?)
● Disk space (do I have to plan for larger disks?)
● Memory utilization (do I suffer from a memory leak?)
Host metrics: what to look for?
● Similar to most other databases
● Throughput of the cluster
● Relate throughput to cluster performance
● Determine the type of an issue
○ Request spikes?
○ Write amplification related?
○ Queueing?
Operational metrics
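Throughput counters such as those in db.serverStatus().opcounters are cumulative since startup, so a trending system has to turn two samples into a rate before request spikes become visible. The following is a minimal sketch of that step; the helper name opcounterRates and the sample numbers are illustrative assumptions, not part of the original slides.

```javascript
// Hypothetical helper: convert two cumulative opcounters samples into ops/sec.
// The sample objects mimic the shape of db.serverStatus().opcounters.
function opcounterRates(previous, current, intervalSeconds) {
  if (intervalSeconds <= 0) {
    throw new Error("intervalSeconds must be positive");
  }
  const rates = {};
  for (const op of Object.keys(current)) {
    const delta = current[op] - (previous[op] || 0);
    // A negative delta means the counter was reset (e.g. mongod restarted).
    rates[op] = delta < 0 ? null : delta / intervalSeconds;
  }
  return rates;
}

// Illustrative numbers only, not taken from a real server.
const sampleA = { insert: 1000, query: 5000, update: 200, delete: 10 };
const sampleB = { insert: 1600, query: 8000, update: 260, delete: 10 };
console.log(opcounterRates(sampleA, sampleB, 60));
```

Returning null on a counter reset avoids plotting a large negative spike after a restart.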
● Storage engine specific
○ MMAPv1
○ WiredTiger
○ MongoRocks
● Insight into how the engine performs
● Internal congestion
Storage engine metrics
● Throughput of the replication
● Durability of the oplog
● Replication lag
● Cluster replication acknowledgement
○ Quorum based
○ At least one secondary needs to acknowledge
Replication metrics
● Shard chunks and balancing
○ Chunks per shard
○ Disk usage
● Non-sharded collections
○ Sharding has to be enabled on collection level
○ Non-sharded collections get a primary shard assigned
○ Once the primary shard is full, no writes can happen
● Connection pool (mongos)
○ All queries will be sent to the primary in a shard
○ Range queries will block connections of the connection pool
Sharding related metrics
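The chunks-per-shard metric can be reduced to a single number that is easy to trend and alert on. The sketch below assumes the chunk counts have already been collected (for example by grouping the config.chunks collection by its shard field); the helper name chunkBalance and the counts are hypothetical.

```javascript
// Hypothetical helper: summarize how evenly chunks are spread over shards.
// Input maps shard name -> chunk count.
function chunkBalance(chunksPerShard) {
  const entries = Object.entries(chunksPerShard);
  if (entries.length === 0) {
    throw new Error("no shards given");
  }
  let most = entries[0];
  let least = entries[0];
  let total = 0;
  for (const entry of entries) {
    total += entry[1];
    if (entry[1] > most[1]) most = entry;
    if (entry[1] < least[1]) least = entry;
  }
  return {
    total: total,
    mostLoaded: most[0],
    leastLoaded: least[0],
    spread: most[1] - least[1],
  };
}

// Illustrative counts only.
console.log(chunkBalance({ shard0: 120, shard1: 118, shard2: 64 }));
```

A spread that keeps growing over time suggests the balancer is not keeping up or is disabled.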
● Oplog: a special collection containing all transactions
○ Limited in size (configurable)
○ Eviction of transactions (FIFO)
○ Comparable to a ringbuffer
● Used for replication
○ Secondaries copy transactions from the oplog on other nodes
○ Full data sync necessary once the last executed transaction has been evicted
● Replication window
○ Time between first and last transaction in the oplog
○ Time that allows your secondary to be offline before performing a full sync
Oplog
From the ClusterControl advisor:
function getReplicationWindow(host) {
    var replwindow = {};
    replwindow['newset'] = false;
    // Fetch the first and last record from the oplog and take its timestamp
    var res = host.executeMongoQuery("local", '{find: "oplog.rs", sort: { $natural: 1}, limit: 1}');
    replwindow['first'] = res["result"]["cursor"]["firstBatch"][0]["ts"]["$timestamp"]["t"];
    // A freshly initiated replica set starts with an "initiating set" entry
    if (res["result"]["cursor"]["firstBatch"][0]["o"]["msg"] == "initiating set") {
        replwindow['newset'] = true;
    }
    res = host.executeMongoQuery("local", '{find: "oplog.rs", sort: { $natural: -1}, limit: 1}');
    replwindow['last'] = res["result"]["cursor"]["firstBatch"][0]["ts"]["$timestamp"]["t"];
    // Window in seconds between the oldest and newest oplog entry
    replwindow['replwindow'] = replwindow['last'] - replwindow['first'];
    return replwindow;
}
Oplog: replication window
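Once the window is known, it still has to be judged against how long a secondary may realistically be offline. The sketch below takes an object shaped like the advisor's return value and classifies it; the helper name assessReplicationWindow and the 24-hour minimum are assumptions chosen for illustration, not values from ClusterControl.

```javascript
// Hypothetical follow-up to getReplicationWindow(): classify the window.
// "replwindow" is in seconds because oplog timestamps count seconds.
function assessReplicationWindow(replwindow, minimumHours) {
  const hours = replwindow.replwindow / 3600;
  if (replwindow.newset) {
    // A freshly initiated set has a short window by definition; skip alerting.
    return { hours: hours, status: "new-set" };
  }
  return { hours: hours, status: hours < minimumHours ? "too-short" : "ok" };
}

// Illustrative values only: first and last oplog entries 6 hours apart.
const exampleWindow = { newset: false, first: 1490000000, last: 1490021600, replwindow: 21600 };
console.log(assessReplicationWindow(exampleWindow, 24));
```

A window flagged as too short is one signal that a larger oplog may be worth considering.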
● CPU, IO or lock related
● Outcome:
○ Secondary not used by Mongo client drivers
○ Puts larger strain on other secondaries
○ Less likely to be elected during a failover
■ If it is elected, the result could be disastrous
○ Lagging behind too far could cause a full sync
Replication lag
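Replication lag can be derived by comparing the last applied operation time of each secondary with that of the primary. The sketch below works on a plain object shaped like the output of rs.status() (members carrying name, stateStr and optimeDate); the helper name replicationLag and the host names are hypothetical.

```javascript
// Hypothetical helper: compute per-secondary lag in seconds from an object
// shaped like the output of rs.status().
function replicationLag(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    return null; // no primary: lag cannot be measured against anything
  }
  const lag = {};
  for (const member of status.members) {
    if (member.stateStr === "SECONDARY") {
      // Subtracting Date objects yields milliseconds.
      lag[member.name] = (primary.optimeDate - member.optimeDate) / 1000;
    }
  }
  return lag;
}

// Illustrative replica set status, not taken from a real cluster.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2017-04-25T10:00:30Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2017-04-25T10:00:28Z") },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2017-04-25T09:58:30Z") },
  ],
};
console.log(replicationLag(sampleStatus));
```

On an idle primary the measured lag can look larger than it really is, so it is worth trending the value rather than alerting on a single sample.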
● Like any other database: availability
● Client drivers may support connection pooling
○ Multiple non-blocking queries can use the same connection
○ Spawns new connections when the pool drops below its threshold of free connections
● Increase of connections
○ Locking issues
○ Application request bursts
Connections
From the MongoDB CLI
mongo_replica_0:PRIMARY> db.serverStatus().connections
{ "current" : 25, "available" : 794, "totalCreated" : NumberLong(122418) }
From any mongo client
mongo_replica_0:PRIMARY> db.runCommand( { serverStatus: 1 } ).connections
{ "current" : 25, "available" : 794, "totalCreated" : NumberLong(122418) }
Connections
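The raw counts are easier to alert on once they are expressed as a share of the connection limit. A minimal sketch, assuming an object shaped like db.serverStatus().connections; the helper name connectionUtilization is hypothetical.

```javascript
// Hypothetical helper: fraction of the connection limit currently in use.
// current + available approximates the configured connection limit.
function connectionUtilization(connections) {
  const limit = connections.current + connections.available;
  return limit === 0 ? 0 : connections.current / limit;
}

// Values from the sample output above.
const used = connectionUtilization({ current: 25, available: 794 });
console.log((used * 100).toFixed(1) + "% of connections in use");
```

Trending this ratio alongside totalCreated helps distinguish a steady pool from an application that keeps opening new connections.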
● Atomicity on document level
○ WiredTiger and MongoRocks
● No “real” transactions
● Write data with the $isolated operator
○ Similar to READ UNCOMMITTED in MySQL (dirty reads in ANSI SQL)
○ No rollback
○ Does not work on shards
Transactions
● Optimistic concurrency control
○ If two write operations conflict, the transaction will be paused and retried
● Document level locking
● Tickets (threads)
○ Read
○ Write
Locks (WiredTiger)
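Ticket exhaustion is a common sign of internal congestion, so the share of read and write tickets in use is worth trending. The sketch below assumes an object shaped like db.serverStatus().wiredTiger.concurrentTransactions as reported by MongoDB releases of this era; the helper name ticketUsage and the numbers are illustrative.

```javascript
// Hypothetical helper: share of WiredTiger tickets in use per ticket type.
function ticketUsage(concurrentTransactions) {
  const usage = {};
  for (const kind of ["read", "write"]) {
    const t = concurrentTransactions[kind];
    usage[kind] = t.totalTickets === 0 ? 0 : t.out / t.totalTickets;
  }
  return usage;
}

// Illustrative numbers only.
const tickets = {
  read: { out: 32, available: 96, totalTickets: 128 },
  write: { out: 128, available: 0, totalTickets: 128 },
};
console.log(ticketUsage(tickets));
```

A write usage that stays at 1 means every write ticket is taken and new writes are queueing.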
● MongoDB uses three tiers of cache
○ Filesystem
○ Active memory
○ Storage engine (WiredTiger / MongoRocks)
● Page faults
○ Cache miss
● Evictions
Cache
From the MongoDB CLI
mongo_replica_0:PRIMARY> db.serverStatus().extra_info.page_faults
37912924
mongo_replica_0:PRIMARY> db.serverStatus().wiredTiger.cache
{
"bytes currently in the cache" : 887889617,
"modified pages evicted" : 561514,
"tracked dirty pages in the cache" : 626,
"unmodified pages evicted" : 15823118
}
Page faults and cache usage
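Both the page fault count and the eviction counters are cumulative, so they are most useful as rates or ratios. The sketch below derives two such numbers from values shaped like the output above; the helper names and the earlier page fault sample are hypothetical.

```javascript
// Hypothetical helper: page faults per second between two samples of
// db.serverStatus().extra_info.page_faults.
function pageFaultRate(previousFaults, currentFaults, intervalSeconds) {
  return (currentFaults - previousFaults) / intervalSeconds;
}

// Hypothetical helper: share of evicted pages that were modified (dirty),
// from an object shaped like db.serverStatus().wiredTiger.cache.
function modifiedEvictionShare(cache) {
  const modified = cache["modified pages evicted"];
  const unmodified = cache["unmodified pages evicted"];
  const total = modified + unmodified;
  return total === 0 ? 0 : modified / total;
}

// Cache values from the sample output above; the earlier page fault sample
// (37912324, taken 60 seconds before) is made up for illustration.
const cacheSample = {
  "modified pages evicted": 561514,
  "unmodified pages evicted": 15823118,
};
console.log(pageFaultRate(37912324, 37912924, 60));
console.log(modifiedEvictionShare(cacheSample));
```

A rising share of modified pages among evictions suggests the cache is under write pressure.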
● Shards make write scaling transparent
● Sharding can be solved with two methods:
○ Hash key distribution (limited)
○ Shard lookup table
● MongoDB uses a combination of hash key distribution and shard lookup table
○ Hash key (or range key) distribution gets divided into chunks (ranges)
○ The chunk metadata gets stored in the config server
● The config server holds the most important data in a MongoDB sharded cluster!
● The shard router is the second most important component
● Shards can get out of balance
○ Non-sharded collections
○ Heavy / large writes on a single chunk
○ Auto balancing by the primary of the config server (3.4) or by mongos (3.2 and earlier)
Shard metrics
● Open Source
○ Nagios
○ Zabbix
● Subscription based
○ MongoDB Cloud Manager
○ VividCortex
○ ClusterControl
Alerting solutions
● Nagios-MongoDB
○ https://github.com/mzupan/nagios-plugin-mongodb/
○ Performs some very important checks
■ Replication lag
■ Lock time percentage
■ Index miss ratio
Nagios
● MongoDB Zabbix monitoring plugin
○ https://github.com/nightw/mikoomi-zabbix-mongodb-monitoring
○ All the necessary metrics and more
■ Entries in oplog
○ Pre-canned triggers
Zabbix
● Percona MongoDB Monitoring Templates
○ https://www.percona.com/doc/percona-monitoring-plugins/1.1/cacti/mongodb-templates.html
Cacti
● PMM
○ https://www.percona.com/doc/percona-monitoring-and-management/
○ Open Source Monitoring & Management framework
○ Can deploy, manage and monitor MySQL & MongoDB
○ Uses Prometheus and Grafana
Percona Monitoring & Management sessions:
● MySQL Monitoring with Percona Monitoring and Management, Tue 11:30 - 12:20 in Ballroom E
● Hipster MySQL Monitoring: Serving a deconstructed PMM, Tue 11:30 - 12:20 in Ballroom H
● Monitoring production environment with Percona Monitoring and Management (PMM), Thu 3:00 - 3:50 in room 209
Orchestration systems: Percona Monitoring & Management
● Blog series: Become a MongoDB DBA
○ http://severalnines.com/blog-categories/mongodb
● Webinar series: Become a MongoDB DBA
○ http://severalnines.com/upcoming-webinars
● Visit our website for more resources!
○ http://www.severalnines.com
● Stop by our booth in the exhibit hall
● Other sessions by Severalnines at Percona Live 2017
○ MySQL Load Balancers - MaxScale, ProxySQL, HAProxy, MySQL Router & nginx - a close up look, Wed 11:10am - 12:00pm in Ballroom D
○ MySQL (NDB) Cluster Best Practices (Die Hard VIII), Wed 3:30pm - 4:20pm in Room 210
Additional resources