Knowing Is Half the Battle

One of the recurring struggles in stabilizing a platform is an overabundance of data: metrics generated by every system, surfaced on dashboards, and feeding noise into our monitoring and alerts. Too many times we have chased red herring after red herring while trying to decipher what exactly the data is telling us. There is no guarantee that an on-call engineer or administrator will know where to look, or which queries will yield meaningful insight into the current state of the system.

Even if you already have experience setting up monitoring and alerting for your distributed systems or microservices, I highly suggest you keep reading. I will do my best to suggest some best practices for these topics and hope they help you bridge the gap between Development and Ops.

We need to go back to the basics of simple monitoring. I’ll go over some solid signals to monitor that should help identify and isolate problems during an incident. The goal is a reduction in erroneous on-call pages, more time spent identifying and troubleshooting the root cause, and lower MTTD (Mean Time to Detection) and MTTR (Mean Time to Resolution).

Golden Signals

One of the core themes I will reference throughout this post is Google’s “Four Golden Signals” from its SRE book. These signals are latency, traffic, errors, and saturation. I highly recommend reading up on them and keeping them in the back of your mind, as we will use them when going over what to monitor for each type of service.

As stated previously, one of the main problems we face is an overabundance of data. We do not do a great job of aggregating or isolating the important “four signals” for our services. Great aggregation is magical: systems instantly appear simpler, and extra application data is isolated to where it belongs. Having a good system for aggregating and classifying your services reduces the amount of thinking required to troubleshoot.

Most systems can be classified into four categories: HTTP/RPC, Queue Processing, Stream Processing, and Job Processing. Viewed through Google’s “Golden Signals”, each type of service has its own set of signals that gives a good overview of the system’s health without much raw data to analyze. For each category below, I list its signals and follow the list with a short sketch of how you might capture them.

HTTP/RPC

  • (Traffic) Request rate — how much traffic is the service dealing with, measured in queries per second.
  • (Errors) Error rate — what portion of incoming requests is failing, measured as a fraction of total traffic.
  • (Latency) Duration — how long are we taking to process each request, measured as a percentile on a histogram, usually the 99th percentile.
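
To make these concrete, here is a minimal sketch of emitting all three signals from a Python handler using the prometheus_client library. The metric names and the handle_request wrapper are illustrative assumptions, not part of any particular framework.

    import time
    from prometheus_client import Counter, Histogram, start_http_server

    # Hypothetical metric names -- adjust to your own naming conventions.
    REQUESTS = Counter("http_requests_total", "Total requests received")
    ERRORS = Counter("http_request_errors_total", "Requests that failed")
    DURATION = Histogram("http_request_duration_seconds", "Request processing time")

    def handle_request(request, handler):
        """Wrap a handler so traffic, errors, and latency are all recorded."""
        REQUESTS.inc()                                  # (Traffic) request rate
        start = time.monotonic()
        try:
            return handler(request)
        except Exception:
            ERRORS.inc()                                # (Errors) error rate = ERRORS / REQUESTS
            raise
        finally:
            DURATION.observe(time.monotonic() - start)  # (Latency) duration

    start_http_server(8000)  # expose /metrics; your real serving loop keeps the process alive

The 99th percentile duration is then computed at query time from the histogram buckets (for example with PromQL’s histogram_quantile), and the error rate is simply the ratio of the two counters over a time window.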

Queue Processor (SQS, ZeroMQ, RabbitMQ, etc)

  • (Latency/Traffic) Lag — how much time passed between the message ingress time and when we start processing it, measured in milliseconds.
  • (Traffic/Saturation) Net queue size — how many messages were written to the queue minus how many messages were removed from the queue in a given time frame.
  • (Errors) Error rate — the number of messages we failed processing divided by the number of messages we tried to process. Measured as a percentage.
  • (Latency) Duration — see HTTP/RPC Duration.
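
The same pattern applies to a queue consumer. A minimal sketch, assuming the broker gives us each message’s enqueue timestamp (SQS exposes this as the SentTimestamp attribute; with other brokers you can attach it as a header on the producer side). The metric and function names are again my own.

    import time
    from prometheus_client import Counter, Histogram

    LAG = Histogram("queue_message_lag_seconds", "Enqueue-to-processing delay")
    PROCESSED = Counter("queue_messages_processed_total", "Messages we attempted to process")
    FAILED = Counter("queue_messages_failed_total", "Messages that failed processing")
    DURATION = Histogram("queue_processing_duration_seconds", "Per-message processing time")

    def consume(message, enqueued_at, handler):
        """Record lag, errors, and duration around a single message.

        enqueued_at: Unix timestamp (seconds) at which the message hit the queue.
        """
        LAG.observe(time.time() - enqueued_at)          # (Latency/Traffic) lag
        PROCESSED.inc()
        start = time.monotonic()
        try:
            handler(message)
        except Exception:
            FAILED.inc()                                # (Errors) error rate = FAILED / PROCESSED
            raise
        finally:
            DURATION.observe(time.monotonic() - start)  # (Latency) duration

Net queue size is usually easier to sample from the broker itself (for example SQS’s ApproximateNumberOfMessages attribute) than to derive from the consumer.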

Stream Processor (Batch File, Kafka, Kinesis)

  • (Latency) Lag per partition per ingress topic — Kafka, for example, takes topics as input, then splits each topic into one or more partitions. Lag is the difference between a partition’s newest offset and the consumer group’s committed position in that partition, measured in messages.
  • (Errors) Errors — see queue processor error rate.
  • (Traffic) Throughput — a sum of how many messages the processor emits in a given time frame, for a specific topic.
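
To make the lag calculation concrete: per-partition lag is the partition’s newest (log-end) offset minus the consumer group’s committed offset for that partition. A small sketch, where the two offset maps are assumed to come from your Kafka client (the broker’s end offsets and the group’s committed offsets):

    from typing import Dict

    def partition_lag(end_offsets: Dict[int, int],
                      committed_offsets: Dict[int, int]) -> Dict[int, int]:
        """Lag per partition, in messages: newest offset minus the
        consumer group's committed offset for that partition."""
        return {p: end_offsets[p] - committed_offsets.get(p, 0)
                for p in end_offsets}

    # Example with made-up offsets for a three-partition topic:
    print(partition_lag({0: 1500, 1: 1480, 2: 1510},
                        {0: 1500, 1: 1200, 2: 1505}))   # {0: 0, 1: 280, 2: 5}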

Scheduled Job (cron, K8s, etc.)

  • (Errors) Failed jobs — a count of how many jobs failed to run to completion.
  • (Traffic/Saturation) Missed executions — by nature, scheduled jobs depend on a scheduler being available to invoke them. We monitor missed executions by comparing actual execution attempts against the number of runs the schedule says should have happened.
  • (Latency) Duration — see other duration signals.
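
Missed executions boil down to comparing how many runs should have started in a window with how many actually did. A rough sketch, assuming a fixed-interval schedule and a list of actual run start times pulled from your scheduler’s history (cron expressions would need a proper parser to enumerate expected fire times):

    from datetime import datetime, timedelta
    from typing import List

    def missed_executions(expected_interval: timedelta,
                          window_start: datetime,
                          window_end: datetime,
                          actual_starts: List[datetime]) -> int:
        """Count scheduled runs that never happened inside the window."""
        expected = int((window_end - window_start) / expected_interval)
        actual = sum(1 for t in actual_starts if window_start <= t < window_end)
        return max(expected - actual, 0)

For example, a job scheduled every five minutes that only started ten times in the last hour would report two missed executions.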

These key signals should be kept in mind when designing alerts or monitoring around the different services you work on every day. They do not have to be the only signals you use to monitor the health of your system, but they are a good starting point, a bare minimum that helps clear out unwanted red herrings and unnecessary data points.

Platform Metrics

Up until now, we have focused solely on application metrics and what good “golden signals” look like for each type of service. However, you’ll notice one signal is underrepresented: saturation. While saturation is easy to measure for some services, for others it cannot be measured from application data alone. That is where platform metrics come in.

These are your typical platform metrics (a quick sketch of sampling them follows the list):

  • CPU
  • Memory
  • Disk
    • Free Space
    • I/O
  • Network
    • I/O (throughput)
    • Latency/loss
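
If you need to collect these yourself, here is a small sketch using Python’s psutil library and Prometheus gauges; the metric names are my own, and in practice a node exporter or your cloud provider’s agent usually does this for you.

    import psutil
    from prometheus_client import Gauge

    CPU = Gauge("node_cpu_percent", "CPU utilization percent")
    MEM = Gauge("node_memory_percent", "Memory utilization percent")
    DISK = Gauge("node_disk_percent", "Disk usage percent for /")
    NET_SENT = Gauge("node_net_bytes_sent", "Total bytes sent")
    NET_RECV = Gauge("node_net_bytes_recv", "Total bytes received")

    def sample_platform_metrics() -> None:
        """Take one sample of CPU, memory, disk, and network counters."""
        CPU.set(psutil.cpu_percent(interval=1))
        MEM.set(psutil.virtual_memory().percent)
        DISK.set(psutil.disk_usage("/").percent)
        net = psutil.net_io_counters()
        NET_SENT.set(net.bytes_sent)
        NET_RECV.set(net.bytes_recv)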

Platform metrics alone are the biggest culprit behind red herrings and noisy alerts. Thresholds are often set too high or too low, and dashboard resolution (the time span shown) is often set too long or too short to reveal the correct picture. It is easy to get rabbit-holed into a CPU spike or a memory dip, but they tend to lead in the wrong direction without corroborating evidence from logs or application data. Platform metrics are, at best, circumstantial.

Keep. It. Simple. Silly.

My favorite part of the Golden Signals section is “As Simple as Possible, No Simpler”.

It highlights just how easy it is for a monitoring system to become excessively complex even with the right intentions, and I believe it captures exactly what we should strive for as a whole when it comes to monitoring our systems and services.

A monitoring system can be fragile, complicated, and a maintenance burden. It ties up engineering teams, keeping them from working on new products and services, and it contributes to our ever-growing mountain of tech debt. But we can change that. Start back at the basics and keep things simple and organized. It is the only way we will fully understand the services we create and keep our systems understandable.

Please feel free to leave feedback or questions!

Gen