Grafana Cloud Metrics (The 4 Pillars Of Observability)

A Recap on Grafana's Capabilities

In this article we will be taking a look at metrics in Grafana: what they are and what you can do with them. To understand what metrics can do in Grafana Cloud, it helps to have an overview of what Grafana does and where metrics fit into it. This has been covered in a previous blog I have written on Grafana Cloud, which can be found here!

As a quick recap, Grafana Cloud is an end-to-end visualisation and observability tool. It covers everything from collecting and connecting to data, through processing and storing it, all the way to visualising it and alerting on it. Grafana Alloy lets you collect Metrics, Logs, Traces and Profiles (The 4 Pillars of Observability) and store them in Grafana's own scalable and highly available storage solutions, hosted and managed by Grafana Cloud. The data can then be queried from these stores (and many other compatible data sources) to populate dashboards and alerts.

When used to its full potential, Grafana lets you identify problems when they occur in a distributed system and then investigate why the problem has occurred - in other words, create an observable system.

Storing Metrics with Mimir

When Grafana Alloy collects metrics, they will usually be in the form of Prometheus metrics. Prometheus is a popular time series database for metrics, but it has some drawbacks: it doesn’t scale well and it isn’t built for long term retention. By default it only keeps metrics for 15 days, which isn’t ideal.

Grafana saw Prometheus and realised they needed better: something that could scale up to handle metrics for enterprise level observability. Not content to wait for someone else to come along and fix their problem, they developed Mimir. Mimir is a storage solution for Prometheus metrics that is built for high availability, scalability and long term storage in the cloud.

Taking advantage of the cloud for storage means a few things for users of Mimir: you don’t have to worry about scaling your storage infrastructure, and you can minimise costs by taking advantage of cloud specific features. An example of this is AWS S3 lifecycle rules, which can move objects into cheaper storage classes once they reach a certain age, reducing cost.

Dashboarding with Metrics

Queries

Probably not a completely alien concept, a query is what lets you get relevant information (numbers, user information, etc.) from a data source and give it to your visualisation. From CPU metrics to sales metrics, any time you want to get something from a data source you use a query. But how do queries work with the many, many data sources Grafana can use? It’s simple: you write the query in whatever language the data source asks for. MySQL database? Use SQL. Prometheus? Use PromQL. Splunk data source? Use SPL.

Grafana lets you go beyond a single query though, as each visualisation can have multiple queries. If your visualisation game is even more ambitious, then you can even have each of these queries use a different data source, all within a single visualisation.

Range and Instant Queries

There are 2 types of query you can create: range and instant. If you want to display data at a single point in time, for example in stats, gauges and pie charts, you'll want to use an instant query.

The more interesting range query is designed to show data over time. Instead of just grabbing the data as it is now it will evaluate the query at regular intervals over a time range and return data points over that same range.
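To make the difference concrete, here is a minimal sketch in plain Python of the timestamps each query type would be evaluated at. It is not Grafana code: the function names and the 15 minute step are assumptions made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def range_query_times(start: datetime, end: datetime, step: timedelta) -> list[datetime]:
    """Timestamps a range query is evaluated at: start, start + step, ... up to end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += step
    return times


def instant_query_time(at: datetime) -> datetime:
    """An instant query is evaluated once, at a single point in time."""
    return at


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)

range_times = range_query_times(start, end, timedelta(minutes=15))
print(len(range_times))         # 5 evaluation points: 00:00, 00:15, 00:30, 00:45, 01:00
print(instant_query_time(end))  # a single evaluation point
```

The range query produces one data point per step, which is what gives a graph its line; the instant query produces exactly one.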

In the example query below, when set to range, the query will look over the time range, find the maximum value of a metric for every day, and then add those maximum values together.
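The same arithmetic can be sketched in plain Python. This is not the query itself, just an illustration of "maximum per day, then add them together" on invented sample data; the function name is hypothetical.

```python
from datetime import datetime


def sum_of_daily_maxima(samples: list[tuple[datetime, float]]) -> float:
    """Find the maximum value for each calendar day, then add the maxima together."""
    daily_max: dict = {}
    for timestamp, value in samples:
        day = timestamp.date()
        if day not in daily_max or value > daily_max[day]:
            daily_max[day] = value
    return sum(daily_max.values())


# Invented samples covering three days
samples = [
    (datetime(2024, 1, 1, 9), 3.0),
    (datetime(2024, 1, 1, 15), 7.0),  # maximum for 1 Jan
    (datetime(2024, 1, 2, 9), 4.0),   # maximum for 2 Jan
    (datetime(2024, 1, 2, 15), 2.0),
    (datetime(2024, 1, 3, 9), 5.0),   # maximum for 3 Jan
]

total = sum_of_daily_maxima(samples)
print(total)  # 7.0 + 4.0 + 5.0 = 16.0
```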

Time Range Picker

The existence of range queries implies a way to select the range a query is executed over. Enter the aptly named “time range picker”: this fine piece of UI lets you select the time range your queries are executed over, with preset ranges and the ability to enter your own.

The time range picker is not just for range queries though, it works with instant queries too. A range query works over the whole range, from a set start time to a set end time (e.g. from a week ago to yesterday), but an instant query only looks at the "to" time. For example, if I set the time range from a week ago to today, an instant query will look at today, but if I set the time range from a month ago to yesterday, the instant query will look at yesterday.
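Here is a small sketch of that behaviour. It assumes relative times written in the `now-7d` style the picker accepts, but only supports a simplified subset of that syntax, and the function names are hypothetical rather than anything Grafana exposes.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified subset of relative time units; the real picker accepts more.
_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def resolve(expr: str, now: datetime) -> datetime:
    """Turn 'now' or 'now-<n><unit>' into an absolute time."""
    if expr == "now":
        return now
    match = re.fullmatch(r"now-(\d+)([mhdw])", expr)
    if match is None:
        raise ValueError(f"unsupported time expression: {expr!r}")
    return now - timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def evaluation_window(time_from: str, time_to: str, now: datetime, instant: bool):
    """Range queries use the whole window; instant queries only use the 'to' time."""
    start, end = resolve(time_from, now), resolve(time_to, now)
    if start > end:
        raise ValueError("'from' must not be later than 'to'")
    return (end, end) if instant else (start, end)


now = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
week_range = evaluation_window("now-7d", "now", now, instant=False)
week_instant = evaluation_window("now-7d", "now", now, instant=True)
month_instant = evaluation_window("now-30d", "now-1d", now, instant=True)
print(week_instant)   # both ends are "now"
print(month_instant)  # both ends are one day before "now"
```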

Thresholds

You’ve queried the data you need, your visualisation has the numbers you want on it, but what do the numbers actually mean? An example scenario: you own a shop and have a sales dashboard set up to tell you how much of each item you have sold.

Opening your dashboard, you see a visualisation telling you how many pints of milk your shop has sold today: “5”. Is this good or bad? What does it tell us about the milk drinking habits of the locals on this particular day? I have no idea, because I have not included any context in my visualisation. This is where thresholds come in.

This can be done quite easily: when you create a visualisation there is an option to create thresholds and associate a colour with each of them. When a number goes above one of these thresholds, you can configure the visualisation to change colour or add an indication that this threshold has been reached. Using this method does have its limitations though.
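As a rough sketch of how static thresholds behave, the Python below picks a colour for a value from a fixed list of steps. The step values and colours are invented, and treating a value equal to a threshold as having reached it is an assumption of this sketch.

```python
def colour_for(value: float, steps: list[tuple[float, str]], base_colour: str = "red") -> str:
    """Return the colour of the highest threshold the value has reached."""
    colour = base_colour  # used when the value is below every threshold
    for threshold, step_colour in sorted(steps):
        if value >= threshold:
            colour = step_colour
    return colour


# Invented static thresholds for pints of milk sold in a day
milk_steps = [(10, "orange"), (20, "green")]

print(colour_for(5, milk_steps))   # red
print(colour_for(12, milk_steps))  # orange
print(colour_for(25, milk_steps))  # green
```

Because the steps are fixed, 5 pints is always red here, whatever day of the week it is.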

For example, say on a Saturday selling 5 pints of milk is bad but on a Wednesday it’s amazing. The threshold is static, so it has to be manually updated each day to represent this. What we really need is the ability to decide thresholds dynamically. If we could take a query’s results and make them the thresholds, that would be really convenient, wouldn’t it?

A transformation taking the result of a query and turning it into a threshold

A particularly astute observer may have noticed that between this paragraph and the last is an image, and above that image is a caption. Reading the caption may stimulate such thoughts as “hmm, a transformation that can take a query’s results and turn them into a threshold, that seems like a great way to create dynamic thresholds”. If one did happen to have this thought, they would be completely correct!

Transformations will be a focus in a future blog on one of the other pillars of observability but in short: after data is queried but before it gets put on a visualisation you can apply a transformation that will do something to your data. This can range from simple formatting and filtering values to extracting new fields with regex.

One particular transformation, “Config from query results”, is the only one that concerns us for the moment. It lets us take a query and set a configuration variable (like a threshold) to the result of that query. Using this we can display information (like pints of milk purchased on a particular date) and have the colours on the visualisation change based on changing values (like the expected number of purchases on a certain day of the week).
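The idea can be sketched in plain Python. This is not how the transformation is implemented; it only illustrates thresholds being derived from a second "query result" instead of being typed in by hand. The expected sales figures, the half-of-target orange step and all names are invented for the example.

```python
from datetime import date

# Hypothetical result of a second query: expected pints sold, Monday to Sunday
expected_by_weekday = [8, 8, 3, 8, 10, 20, 12]


def thresholds_for(day: date, expected: list[float]) -> list[tuple[float, str]]:
    """Build thresholds from the expected sales for that day of the week."""
    target = expected[day.weekday()]  # weekday(): Monday is 0, Sunday is 6
    return [(0.5 * target, "orange"), (target, "green")]


def colour_for(value: float, steps: list[tuple[float, str]], base_colour: str = "red") -> str:
    """Return the colour of the highest threshold the value has reached."""
    colour = base_colour
    for threshold, step_colour in sorted(steps):
        if value >= threshold:
            colour = step_colour
    return colour


wednesday = date(2024, 1, 3)
saturday = date(2024, 1, 6)

print(colour_for(5, thresholds_for(wednesday, expected_by_weekday)))  # green
print(colour_for(5, thresholds_for(saturday, expected_by_weekday)))   # red
```

The same 5 pints now reads as a good Wednesday and a bad Saturday without anyone editing the thresholds.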

Overrides

At times in Grafana your visualisations can get complex: graphs with multiple queries, a gauge displaying multiple important KPIs, giant heatmaps showing incredibly exciting business critical information. Normally you configure your whole visualisation with the same settings: the min/max displayable value, the thresholds, the naming format of different series. Occasionally this won’t do. For example, one of your gauges won’t have the same thresholds as the others in a panel, or one of your graphs has given a series the wrong name.

In cases such as these the override introduces itself. When one or two of your queries need to be treated differently to the others, you can override the default settings for just those queries.
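Conceptually, an override is a set of properties layered on top of the defaults for any series it matches. The sketch below shows that layering in Python; the dictionary keys are a simplified, hypothetical shape for illustration and are not Grafana's actual panel JSON.

```python
from copy import deepcopy

# Settings applied to every series in the panel (invented values)
defaults = {"min": 0, "max": 100, "unit": "percent", "thresholds": [(80, "red")]}

# Each override names the series it applies to and the properties it replaces
overrides = [
    {"match_name": "disk_usage", "properties": {"thresholds": [(90, "red")]}},
    {"match_name": "cpu_0", "properties": {"display_name": "CPU core 0"}},
]


def effective_config(series_name: str, defaults: dict, overrides: list[dict]) -> dict:
    """Start from the defaults, then apply every override matching this series."""
    config = deepcopy(defaults)
    for override in overrides:
        if override["match_name"] == series_name:
            config.update(deepcopy(override["properties"]))
    return config


print(effective_config("memory_usage", defaults, overrides))  # unchanged defaults
print(effective_config("disk_usage", defaults, overrides))    # thresholds replaced
```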

Visualisations

There are many, many different visualisations in Grafana, but in this blog I will go over 2 you might use with a metric: the gauge and the time series graph (other visualisations are available). I have chosen these because they are both able to take advantage of some of the features discussed in this blog. The gauge is a good example of where one might use an instant query, the time series graph a range query. Both make good use of thresholds to give context to data, and on the time series graph there is space aplenty to show off the results of multiple queries.

The Gauge

Displaying the thresholds around a value is the unique selling point of the gauge, so it is best for when you are tracking something within a set of limits or percentages. A good example of where you might use a gauge is current CPU utilisation when you are monitoring a server. A single number is displayed on the stat visualisation. As with most visualisations in Grafana, if you give the gauge multiple queries or values to display it can handle it… by creating more gauges. In a single visualisation panel it can display a row or column of gauges, which makes it a lot less hassle to display data as you don’t need to create and configure a separate visualisation panel every time you want to display something new.

Time Series Graph

When you think about visualising metrics, the time series graph is probably the first thing that comes to mind. By definition it lets you display a metric over a period of time, which is useful for comparing metrics and seeing trends in them.

As the time series visualisation plots a metric over time, it follows that it requires a range query. You can use an instant query if you want, but you will end up with a graph with one point on it for each metric series you are trying to display (not very useful).

Visualisations in Grafana that work over time have this range decided by the time range picker, but some visualisations have a neat trick built in. Say you are looking at your observability dashboard and you see a big jump in the number of errors in your time series graph. You want to investigate around the time the spike happened, so what do you do? You could go to the time range picker and enter a time range around when the issue occurred, but this is a clumsy, time consuming solution that requires manually entering dates or relative times, and on the whole it is a bit of a pain. Instead, visualisations like the time series graph let you click and drag on the visualisation to set the time range picker to the time range selected on the graph!

Doing this will change the time range for the entire dashboard, so you can find where a problem is occurring in one visualisation and easily jump to this time range and start investigating in others.
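Under the hood, a click and drag like this comes down to mapping pixel positions on the graph back to times. The following is only a sketch of that mapping, assuming a linear time axis; the function and parameter names are hypothetical and this is not Grafana's implementation.

```python
from datetime import datetime, timezone


def drag_to_time_range(x_start_px: float, x_end_px: float, width_px: float,
                       time_from: datetime, time_to: datetime):
    """Convert a horizontal drag across the graph into a new (from, to) time range."""
    if width_px <= 0:
        raise ValueError("width_px must be positive")
    # Dragging right-to-left should select the same range as left-to-right
    left, right = sorted((x_start_px, x_end_px))
    # Clamp the selection to the plotted area
    left = max(0.0, min(left, width_px))
    right = max(0.0, min(right, width_px))
    span = time_to - time_from
    return (time_from + span * (left / width_px),
            time_from + span * (right / width_px))


# The graph is 800 px wide and currently shows 00:00 to 08:00
current_from = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
current_to = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)

new_from, new_to = drag_to_time_range(300, 200, 800, current_from, current_to)
print(new_from, new_to)  # 02:00 to 03:00 on the same day
```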

One last thing I am going to focus on with the time series graph is the legend. Each of the metric series displayed in the visualisation has its own colour, and the legend is what tells you which colour belongs to which series. That’s not all it can do though. Not content just to say “the yellow line is the amount of available memory”, the legend can be set in the configuration panel to display statistics about each series, like the max or average value, in a table format.
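The statistics in such a legend table are simple reductions over each series. Here is a small Python sketch of the kind of calculation involved, using invented series names and values; the choice of min, max, mean and last is just an example set.

```python
from statistics import mean


def legend_rows(series: dict[str, list[float]]) -> list[dict]:
    """Compute one row of summary statistics per series, as a legend table might show."""
    rows = []
    for name, values in series.items():
        rows.append({
            "series": name,
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "last": values[-1],
        })
    return rows


# Invented data for two series on the same graph
series = {
    "available_memory_gb": [4.0, 3.5, 2.0, 2.5],
    "cpu_percent": [20.0, 60.0, 40.0],
}

rows = legend_rows(series)
for row in rows:
    print(row)
```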

How Metrics Fit into Grafana and Observability

So far this blog has served to explain what metrics are and how you can dashboard with them, but now I’m going to focus on how metrics fit into the wider topic of observability. Of course you can just use metrics on their own; Grafana can be used purely for visualisation if you aren’t concerned with observability. Take the example from earlier, a shop owner with a dashboard showing how each product is selling: a perfectly good use case for Grafana.

In observability though, metrics play a key role: they tell you how things are doing, quantifying the state of the system. If your server's CPU has been maxed out for the last 2 hours, its metrics will tell you. If your website's latency is spiking, metrics. Metrics are the foundations of observability but there is another key component I haven’t mentioned that is essential for creating an observable system.

Alerts

Alerting is a simple concept, but Grafana does a lot to make alerts as easy to use as possible, with its Incident Response Management features like on call schedules and integrations with 3rd party services like Slack and Teams. This means you can get your alert to whoever needs it, on whatever platform they need it on.
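To give a feel for the kind of condition an alert might check, here is a sketch based on the CPU example from earlier: has the CPU been maxed out for the last 2 hours? The 95% threshold, the sample data and the function name are all assumptions for illustration, not Grafana's alerting API.

```python
from datetime import datetime, timedelta


def cpu_maxed_out(samples: list[tuple[datetime, float]], now: datetime,
                  window: timedelta = timedelta(hours=2),
                  threshold: float = 95.0) -> bool:
    """True if there are samples in the window and every one is at or above the threshold."""
    recent = [value for timestamp, value in samples if now - window <= timestamp <= now]
    return bool(recent) and all(value >= threshold for value in recent)


now = datetime(2024, 1, 1, 12, 0)

# Invented samples: one every 30 minutes over the last 2 hours, all at 99%
busy = [(now - timedelta(minutes=30 * i), 99.0) for i in range(5)]
# The same samples plus one dip to 40% inside the window
dipped = busy + [(now - timedelta(minutes=45), 40.0)]

print(cpu_maxed_out(busy, now))    # True
print(cpu_maxed_out(dipped, now))  # False
```

Requiring the condition to hold across the whole window, rather than at a single moment, is what stops one brief spike from waking up whoever is on call.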

Of course you can’t do it all with Metrics. You can do a lot, but sometimes you will need more. This is where the other 3 Pillars of Observability come in: Logs, Traces and Profiles. In future blogs I am going to be looking at each of these in detail and discussing how they fit into Grafana and help achieve Observability.

More Resources:

AI, Observability, and Operational Resilience with Shaun Cooney at Splunk — The Somerford Podcast ME

Journey of a Digital Leader: Transformation, Resilience and Philanthropy — The Somerford Podcast, S6E8

Interested in Grafana Cloud Metrics?

For more information on Grafana's solutions, please get in touch!
