We can do better than percentile latencies

Years ago, I used aver­age laten­cy on every dash­board and every alarm. That is, until I woke up to the prob­lems of aver­age laten­cies along with every­body else in the indus­try:

  • When the dataset is small it can be eas­i­ly skewed by a small num­ber of out­liers.
  • When the dataset is large it can hide important details, such as the fact that 10% of your users are experiencing slow responses!
  • It is just a statistical value, and on its own it’s almost meaningless. Until we plot the latency distribution we won’t actually understand how our users are experiencing our system.

Nowadays, the leading practice is to use 95th or 99th percentile latencies (often referred to as tail latency) instead. These percentile latencies tell us the worst response time that 95% or 99% of users are getting. They generally align with our SLOs or SLAs, and give us a meaningful target to work towards, e.g. “99% of requests should complete in 1s or less”.

Using per­centile laten­cies is a big improve­ment on aver­age laten­cies. But over the years I have expe­ri­enced a num­ber of pain points with them, and I think we can do bet­ter.

The problems with percentile latencies

The biggest problem with using percentile latencies is not actually with percentile latencies themselves, but with the way they are implemented by almost every single vendor out there.

Percentile latencies are “averaged”

Because it takes a lot of storage and data processing power to ingest all the raw data, most vendors generate the percentile latency values at the agent level. This means that by the time latency data is ingested, it has lost all granularity and comes in as summaries: mean, min, max, and some predefined percentiles. To show you the final 99th percentile latency, the vendor would (by default) average the 99th percentile latencies that have been ingested.

You can’t average percentiles. It doesn’t make any sense! This whole practice gives you a meaningless statistical value, and it’s in no way the true 99th percentile latency of your service. Averaging the percentiles inherits all the same problems with averages that percentile latencies were supposed to address!

I have seen 99th percentile latencies differ by an order of magnitude depending on how I choose to aggregate them. Seriously, how am I supposed to trust this number when choosing the max over the average can produce a 10x difference? You might as well stick a randomly generated number on the dashboard. It’s almost as meaningful as “the average of 99th percentiles”.
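To make this concrete, here is a minimal sketch in Python. It assumes two hypothetical agents with made-up latencies, and it uses the nearest-rank method for percentiles (real agents may use other estimators). The names `percentile`, `agent_a` and `agent_b` are illustrative only and don't come from any vendor's tooling.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds (assumed data, not real measurements).
# Agent A serves most of the traffic and is healthy.
agent_a = [100] * 990 + [200] * 10
# Agent B only served 10 requests, but all of them were very slow.
agent_b = [3000] * 10

# What each agent would report if it only ships summaries.
p99_a = percentile(agent_a, 99)
p99_b = percentile(agent_b, 99)

# Two common ways of aggregating the pre-computed percentiles.
averaged_p99 = (p99_a + p99_b) / 2
max_p99 = max(p99_a, p99_b)

# The true 99th percentile can only be computed from the raw data.
true_p99 = percentile(agent_a + agent_b, 99)

print(f"average of p99s: {averaged_p99} ms")
print(f"max of p99s:     {max_p99} ms")
print(f"true p99:        {true_p99} ms")
```

With this data the true 99th percentile is 200ms, while the average of the two agent-level values is 1550ms and the max is 3000ms. Neither aggregate is close to the real number, because the summaries no longer carry how many requests each agent handled.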

This practice is so widespread that almost every monitoring tool I have tried does this. Honeycomb is one of the few exceptions because they actually ingest and process all the raw events.

Can’t tell how bad the bad times are

It’s great that we can use percentiles to monitor our compliance with SLOs/SLAs. When things are going well, they give us that warm and fuzzy feeling that all is well with the world.

But when they go wrong, and sometimes they go very wrong, we are left wondering just how bad things are. Are 10% of my users getting response times of 1s and above? Is it 20%? Could it be that 50% of my users are getting a bad experience? I just don’t know! I can use various percentiles as gates, but that approach only goes so far before it overwhelms my dashboards.

Most data points are not actionable

As much as I love to stare at those green tiles and line graphs and know that:

  1. We have done a good job, go team!
  2. Everything’s fine, there’s no need to do any­thing

Still, most of the information I consume when I look at the dashboard is not immediately actionable.

To be clear, I’m not saying that percentile latencies are not useful and that you shouldn’t show them on dashboards. But as the on-call engineer, my attention is heavily biased towards “what is wrong” rather than “what is right”. I want dashboards that match my focus and don’t force me to scan through tons of information and pay the cognitive price to separate the signal from the noise.

As an application developer, my definition of “what is wrong” is quite different: I’m looking for unexpected changes in application performance. If the latency profile of my service changes after a deployment, or another related event (e.g. a marketing campaign, or a new feature being toggled on), then I need to investigate.

This dichoto­my in what’s impor­tant for ops engi­neers and appli­ca­tion devel­op­ers means we should have sep­a­rate dash­boards for each. More on this lat­er.

What can we do instead?

What could we use instead of percentiles as the primary metric to monitor our application’s performance and to alert us when it starts to deteriorate?

If you go back to your SLOs or SLAs, you probably have something along the lines of “99% of requests should complete in 1s or less”. In other words, no more than 1% of requests are allowed to take more than 1s to complete.

So what if we mon­i­tor the per­cent­age of requests that are over the thresh­old instead? To alert us when our SLAs are vio­lat­ed, we can trig­ger alarms when that per­cent­age is greater than 1% over some pre­de­fined time win­dow.

Unlike per­centiles, this per­cent­age can be eas­i­ly aggre­gat­ed across mul­ti­ple agents:

  • Each agent sub­mits total request count and num­ber of requests over thresh­old
  • Sum the two num­bers across all agents
  • Divide total num­ber of requests over thresh­old by total request count and you have an accu­rate per­cent­age
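
The steps above can be sketched in a few lines of Python. This is only an illustration under assumed names: `AgentReport`, `percent_over_threshold` and `violates_slo` are hypothetical, and the 1% limit is taken from the example SLO of “99% of requests should complete in 1s or less”.

```python
from dataclasses import dataclass


@dataclass
class AgentReport:
    """What a single agent submits for one time window (hypothetical shape)."""
    total_requests: int
    over_threshold: int


def percent_over_threshold(reports):
    """Sum both counters across agents, then divide once at the end."""
    total = sum(r.total_requests for r in reports)
    slow = sum(r.over_threshold for r in reports)
    if total == 0:
        return 0.0  # no traffic in this window, so nothing is over threshold
    return 100 * slow / total


def violates_slo(reports, max_percent=1.0):
    """True when more than max_percent of requests were over the threshold."""
    return percent_over_threshold(reports) > max_percent


# Assumed reports from three agents for the same time window.
reports = [
    AgentReport(total_requests=1000, over_threshold=5),
    AgentReport(total_requests=10, over_threshold=10),
    AgentReport(total_requests=490, over_threshold=15),
]

print(f"{percent_over_threshold(reports)}% of requests were over the threshold")
print("SLO violated" if violates_slo(reports) else "SLO met")
```

Because the counts are summed before dividing, an agent that handled 10 requests carries 100 times less weight than one that handled 1000, which is exactly what averaging pre-computed percentiles fails to do.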

During an outage, when our SLAs are impacted, this metric tells us the proportion of requests that have been affected. Once we understand the blast radius of the outage, the percentile and max latencies then become useful metrics to gauge how much the user experience has been impacted.

Move aside, error count

We can apply the same approach to how we mon­i­tor errors. For any giv­en sys­tem, you have a small and finite num­ber of suc­cess cas­es. You also have a finite num­ber of known fail­ure cas­es, which you can active­ly mon­i­tor. But then there are the unknown unknowns — the fail­ure cas­es that you hadn’t even realised you have and wouldn’t know to mon­i­tor!

So instead of putting all your efforts into mon­i­tor­ing every sin­gle way your sys­tem can pos­si­bly fail, you should instead mon­i­tor for the absence of a suc­cess indi­ca­tor. For APIs, this can be the per­cent­age of requests that do not have a 2xx or 4xx response. For event pro­cess­ing sys­tems, it might be the per­cent­age of incom­ing events that do not have a cor­re­spond­ing out­go­ing event or observ­able side-effect.
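
As a rough sketch of both examples, assuming we have access to the response status codes for an API and to event IDs for an event processing system, the calculation might look like the following Python. The function names and the sample data are hypothetical, and matching events by ID is just one possible way to define a "corresponding outgoing event".

```python
def is_success(status_code):
    """Treat 2xx and 4xx as the API behaving as designed."""
    return 200 <= status_code < 300 or 400 <= status_code < 500


def percent_without_success(status_codes):
    """Percentage of API requests that did not get a 2xx or 4xx response."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    failures = sum(1 for code in codes if not is_success(code))
    return 100 * failures / len(codes)


def percent_unprocessed(incoming_ids, outgoing_ids):
    """Percentage of incoming events with no matching outgoing event."""
    incoming = set(incoming_ids)
    if not incoming:
        return 0.0
    missing = incoming - set(outgoing_ids)
    return 100 * len(missing) / len(incoming)


# Assumed sample data for one time window.
statuses = [200] * 90 + [404] * 5 + [500] * 3 + [502] * 2
print(f"{percent_without_success(statuses)}% of requests had no success indicator")

incoming = ["evt-1", "evt-2", "evt-3", "evt-4"]
outgoing = ["evt-1", "evt-2", "evt-3"]
print(f"{percent_unprocessed(incoming, outgoing)}% of events were not processed")
```

Note that neither function needs to know why a request or event failed. A timeout, a crash, or a failure mode nobody anticipated all show up the same way, as a missing success.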

This tells you at a high level that “something is wrong”, but not “what is wrong”. To figure out the “what”, you need to build observability into your system so you can ask arbitrary questions about its state and debug problems that you hadn’t thought of ahead of time.

Different dashboards for different disciplines

As we discussed earlier, different disciplines require different views of the system. One of the most important design principles of a dashboard is that it must present information that is actionable. And since the action you will likely take depends on your role in the organization, you really need dashboards that show you information that is actionable for you!

Don’t try to create the dashboard to rule them all by cramming every metric onto it. You will just end up with something nobody actually wants! Instead, consider creating a few specialised dashboards, one for each discipline, for instance:

  • Ops/SRE engi­neers care about out­ages and inci­dents first and fore­most. Action­able infor­ma­tion for them would help them detect inci­dents quick­ly and assess their sever­i­ties eas­i­ly. For exam­ple, per­cent­age of requests that are over the thresh­old, or the per­cent­age of requests that did not yield a suc­cess­ful response.
  • Devel­op­ers care about appli­ca­tion per­for­mance. Per­centile laten­cies are very rel­e­vant here, as are oth­er resource met­rics such as CPU and mem­o­ry usage met­rics.
  • Product owners and business analysts might also need their own dashboards. They care about business metrics such as retention, conversion rate, or sales.

Summary

When you go to see a doc­tor, the doc­tor would try to ascer­tain (as part of the diag­no­sis):

  • What, and where your symp­toms are.
  • The sever­i­ty of your symp­toms.
  • How long you have experienced these symptoms.
  • Any cor­re­lat­ed events that could have trig­gered the symp­toms.

The doctor would use this information to derive a treatment plan, or not, as the case may be. As the on-call engineer dealing with an incident, I go through the same process to figure out what went wrong and how I should respond.

In this post we discussed the shortcomings of percentile latencies, which make them a poor choice of metric in these scenarios:

  • They are usu­al­ly cal­cu­lat­ed at the agent lev­el, and aver­aged, which pro­duces a non­sen­si­cal val­ue that doesn’t reflect the true per­centile laten­cy of my sys­tem.
  • They don’t tell you the impact of an inci­dent.

We pro­posed an alter­na­tive approach — to mon­i­tor ser­vice health by track­ing the per­cent­age of requests whose response time is over the thresh­old. Unlike per­centiles, this met­ric aggre­gates well when sum­maris­ing results from mul­ti­ple agents, and gives us a clear pic­ture of the impact of an out­age.

We can apply the same approach to how we monitor errors. Instead of monitoring each and every error we know about, and missing all the errors we don’t know about, we should monitor for the absence of success indicators.

Unfortunately, this is not how existing monitoring tools work… For this vision to come to pass we need support from our vendors, and we need to change the way we handle monitoring data. The next time you meet with your vendor, let them know that you need something better than percentile latencies ;-) And if you know of any tools that let you implement the approach I outlined here, please let me know via the comments below!