
How API burn rate alerts are calculated in Red Hat OpenShift

Burn rates help you predict whether your error rate is high enough to exhaust your error budget and exceed your service-level objective.

One of the most cryptic alerts in the default Red Hat OpenShift 4 installation is the KubeAPI burn rate. For example, if you open up two critical alerts, you will see:

sum(apiserver_request:burnrate6h) > (6 * 0.01) and sum(apiserver_request:burnrate30m) > (6 * 0.01)
sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01)

Aside from being able to ascertain that one uses a 6-hour and a 30-minute window and the other uses a 1-hour and a 5-minute window, these statements are confusing. What does (6 * 0.01) represent? Why is it different in the second alert, (14.4 * 0.01)? Are these numbers chosen at random? What do they indicate? How do I know what this alert is telling me?


Before I talk about burn rate, I'll take a few steps back and talk about error budgets.

Error budgets

To understand burn rate, you need a solid understanding of error budgets—both why they exist and how they relate to this style of alert. What follows is a very condensed explanation of error budgets.

Every organization will have failures. This is practically unavoidable in today's fast-paced, always-on world. Understanding that, an organization needs to determine the business objectives of any service they are relying on. With this in hand, the technical people can take those objectives and turn them into some level of availability requirement.

The following shows the allowed downtime for various service-level objective (SLO) thresholds on a yearly basis (assuming a 365-day year):

  • 99%: 3 days, 15 hours, 36 minutes per year
  • 99.9%: 8 hours, 45 minutes, 36 seconds per year
  • 99.99%: 52 minutes, 33.6 seconds per year
  • 99.999%: 5 minutes, 15.4 seconds per year
  • 99.9999%: 31.5 seconds per year
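The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming a 365-day year; calculators that use 365.25 days or calendar months will differ slightly.

```python
# Allowed downtime (error budget) implied by an SLO over a period.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (365-day year assumed)

def allowed_downtime(slo_percent, period_seconds=SECONDS_PER_YEAR):
    """Seconds of downtime an SLO permits over the given period."""
    return period_seconds * (100 - slo_percent) / 100

for slo in (99, 99.9, 99.99, 99.999, 99.9999):
    seconds = allowed_downtime(slo)
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    print(f"{slo}%: {int(days)}d {int(hours)}h {int(minutes)}m {secs:.1f}s")
```

Passing a different `period_seconds` (for example, 30 days' worth) gives you the monthly budget instead.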

As you can see from the above, once you get past 99.99%, there is virtually no room for error. Even at 99.99%, that's only approximately 4 minutes and 23 seconds of downtime per month. A 99.999% availability target is both unrealistic and undesirable for the vast majority of organizations.

At any rate, after the organization has chosen an appropriate availability goal, the site reliability engineering (SRE) team now has some "wiggle room" to make improvements that could potentially incur downtime.

Say you have decided on a very reasonable 99.9% availability. This means that in a given month, changes made in the environment can cause up to about 43 minutes of lost availability in total. This is enough time to test somewhat risky changes without limiting their usefulness too much. Once the 43 minutes have been used, the environment should go into a "stability" state where no new impacting changes are made until the error budget recovers.


What are burn rates?

First, I should define what burn rate means. When dealing with service-level (SLI/SLO/SLA) style monitoring, by necessity this is done using thresholds (sometimes known as high-water marks). For an important metric that indicates a degradation in user-perceived experience, an SRE needs to be able to take action before the designated threshold is reached. Ideally, you don't want to use all 43 minutes in one improvement. Burn rate is an extrapolation of how quickly you will use your entire available error budget. In plain language, it says "based on the past X hours, if the errors continue at this rate, you will use all your error budget in Y more hours."
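That extrapolation is simple arithmetic. Here is a minimal sketch (the 30-day period and 14.4 burn rate are example values, not fixed properties of any alert):

```python
# At a constant burn rate, the error budget lasts (SLO period / burn rate).
# A burn rate of exactly 1 consumes the budget over the full SLO period.

def hours_to_exhaustion(slo_period_hours, burn_rate):
    return slo_period_hours / burn_rate

# A 30-day period is 720 hours. At a burn rate of 14.4, the entire
# month's error budget is gone in just over two days.
print(hours_to_exhaustion(720, 14.4))  # → 50.0
```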

Knowing in general what a burn rate calculation is supposed to do is only mildly useful. For maximum utility, an SRE needs to be able to configure the alert so that it fires soon enough for them to react. It's sort of like putting a tripwire around your camp: the objective is to make noise early enough that you have time to react to the intrusion.

The table below lays out some common values that organizations use when deciding when to send an alert. Every value except the burn rate is chosen by the SRE: you are free to change the SLO period, the percentage of the error budget consumed before alerting, and so on.

Burn rate table

| Error budget consumed | SLO period | Alert window | SLO period percentage (using hours) | Burn rate |
|-----------------------|------------|--------------|-------------------------------------|-----------|
| 2%                    | 30d        | 1h           | 1/720 * 100                         | 14.4      |
| 5%                    | 30d        | 6h           | 6/720 * 100                         | 6         |
| 10%                   | 30d        | 3d           | 72/720 * 100                        | 1         |
| 2%                    | 14d        | 15m          | 0.25/336 * 100                      | 26.88     |
| 5%                    | 14d        | 2h           | 2/336 * 100                         | 8.4       |
| 10%                   | 14d        | 1.5d         | 36/336 * 100                        | 0.933     |

To calculate the burn rate, use the following equation, in which e stands for percentage of error budget consumed, and t stands for percentage of time period elapsed.

BurnRate = e / t

Plugging in the numbers looks like this:

BurnRate = 2 / ((1/720) * 100)

Therefore, the burn rate equals 14.4.
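The same formula reproduces every row of the burn rate table. A minimal sketch:

```python
# Burn rate = (% of error budget consumed) / (% of SLO period elapsed).

def burn_rate(budget_consumed_pct, window_hours, period_hours):
    period_elapsed_pct = window_hours / period_hours * 100
    return budget_consumed_pct / period_elapsed_pct

# (budget consumed %, alert window h, SLO period h, expected burn rate)
rows = [
    (2, 1, 720, 14.4),
    (5, 6, 720, 6.0),
    (10, 72, 720, 1.0),
    (2, 0.25, 336, 26.88),
    (5, 2, 336, 8.4),
]
for consumed, window, period, expected in rows:
    assert abs(burn_rate(consumed, window, period) - expected) < 1e-9
```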

Examining the alert

"OK," I hear you saying, "but what does this have to do with the alert that was mentioned at the beginning of this article?" To save you some scrolling, here is one of the statements:

sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01)

For now I'll ignore the and sum(...) half of the alert expression. By referring to the burn rate table above, you know the alert assumes a 30-day SLO period, uses an alert window (also known as a lookback window) of 1 hour, and triggers after 2% of the error budget has been consumed. In plain language, this alert will fire if 2% (or more) of the error budget for the month has been used in the last hour. All of this is derived from the burn rate of 14.4.

However, what is the second number in the expression? You can see (14.4 * 0.01), but what does the 0.01 tell you? Why doesn't the alert simply use the precomputed result of 14.4 * 0.01 instead of the expression? The answer is that leaving the 0.01 in place gives the SRE valuable information. The expression can be rewritten like this:

AlertTriggersWhen = BurnRate * ((100 - SLO) / 100)

In effect, 0.01 is (100 - SLO) / 100—the 1% error ratio allowed by the 99% SLO assumed in this alert. Restating in plain language: "With an SLO of 99%, this alert will fire if 2% of the month's error budget has been used in the last hour."
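Written as code, the threshold calculation looks like this (a sketch; the SLO is expressed as a percentage here):

```python
# Threshold at which the alert fires: the burn rate multiplied by the
# error ratio the SLO allows. For a 99% SLO, 1% of requests may fail,
# which is where the 0.01 in the alert expression comes from.

def alert_threshold(burn_rate, slo_percent):
    return burn_rate * (100 - slo_percent) / 100

# 14.4 * 0.01 = 0.144: the 1h-window alert fires when the measured
# error ratio exceeds this value.
threshold = alert_threshold(14.4, 99)
```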

Excellent, an SRE can now decipher the first half of the alert. So what's with the and sum(...) part of the alert? Well, remember that this is an extrapolation. The alert is attempting to fire so that you have time to react and fix the problem before the error budget is used up.

To that end, it's possible that a massive spike lasting only a few seconds could trigger an alert during the lookback window. To ascertain whether the issue was a "blip" or an ongoing problem, a second condition must also be true before the alert fires. In this case, the primary lookback window is 1 hour, meaning that if it sees a significant spike in the last hour, the first half of the expression evaluates as true.

The second, shorter window is used as a sanity check to ensure that the burn rate calculation has held true for a shorter window of time. In effect, it is meant to avoid "alert flapping."
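The two-window logic can be sketched as a simple boolean check (the rate values below are made up for illustration):

```python
# The alert fires only when BOTH lookback windows exceed the same
# threshold, filtering out spikes that have already subsided.

def should_alert(rate_1h, rate_5m, burn_rate=14.4, slo_percent=99.0):
    threshold = burn_rate * (100 - slo_percent) / 100  # 0.144 here
    return rate_1h > threshold and rate_5m > threshold

print(should_alert(rate_1h=0.20, rate_5m=0.30))  # ongoing problem: True
print(should_alert(rate_1h=0.20, rate_5m=0.01))  # spike already over: False
```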

Wrapping up

This has been a very shallow dive into error budgets and how they tie into alerting. If the world of error budgets is new to you, this alert will probably not help a great deal. It will help identify when you have a spike of negative activity in the API, and that has some utility. However, to make the most of this alert, and potentially customize it, I recommend doing a deeper dive into error budgets and then creating thresholds that are meaningful based on your business objectives. The numbers in the burn rate alert represent how quickly errors are consuming the error budget within a given SLO window, relative to the availability targets set out by the organization.

These alerts are deemed critical by Red Hat engineering and thus are included with the default installation of OpenShift 4. It's important to understand what the default alerts are telling you so that you can plan your actions accordingly to preserve the health of the cluster and ultimately the happiness of your users.

Steve Ovens

Steve is a dedicated IT professional and Linux advocate. Prior to joining Red Hat, he spent several years in the financial, automotive, and movie industries.
