Highlights of Monitorama PDX day 1

After watch­ing the talks at Mon­i­tora­ma for a few years, I final­ly have the chance to attend this year and share my insights on server­less observ­abil­i­ty on Wednes­day.

Day 1 was great (despite my massive jet lag!), and the talks were mostly focused on the human side of monitoring. Of the talks, two stood out for me: Logan’s “Optimizing for Learning” and Aditya’s talk on alert fatigue.

Here are my sum­maries from these two talks.

Optimizing for Learning

During an incident we often find ourselves overwhelmed by the sheer amount of data available to us. Expert intuition is what allows us to quickly filter out the noise and strike at the heart of the problem.

There­fore, to improve the over­all abil­i­ty of the team, we need to improve the way we learn from inci­dents to opti­mize for expert intu­ition.

Logan then touched on constraint theory (which I still need to read up on to understand myself) and the hierarchy of learning.


[Figure: the Hierarchy of Learning]

She talked about the importance of giving people opportunities to learn the rules of the world, which grant inroads to knowledge. As a senior engineer, you should vocalise the rationale for your actions during an incident. This helps beginners establish a mental model of the world they operate in.

Whilst many of us are good at memorising textbooks and passing exams, these methods give only “the illusion of learning”. They are ineffective at storing information in long-term memory, which is critical for expert intuition.

Instead, look to test yourself often in low-stakes environments, because in order to learn we need to exercise retrieval of information often. These retrieval exercises help strengthen the neural pathways in our brains.

It’s also more ben­e­fi­cial when the retrieval is effort­ful, so you should make these exer­cis­es dif­fi­cult and force your­self to strug­gle a lit­tle.

Logan went on to outline a simple technique for practising delayed retrieval and interleaving, using cards. Every time we acquire a new piece of information, we can write it down on a card and keep the cards handy.

From time to time you can use the cards to test yourself. Questions you often get right move up to the next box, which you test yourself with less frequently. Questions you often get wrong move down to the previous box, which you test yourself with more frequently.
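To make the box system concrete, here is a minimal Python sketch of how I understood it. The class names, the three-box schedule and the sample questions are my own assumptions rather than anything from Logan’s talk, and it moves a card after every single answer, which simplifies “often right” and “often wrong”.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    question: str
    answer: str
    box: int = 0  # box 0 is the one reviewed most often


@dataclass
class LeitnerDeck:
    # Assumed schedule: box 0 every session, box 1 every 2nd, box 2 every 4th.
    review_every: tuple = (1, 2, 4)
    cards: list = field(default_factory=list)

    def add(self, question, answer):
        card = Card(question, answer)
        self.cards.append(card)
        return card

    def due(self, session):
        """Cards to review in the given session number (1, 2, 3, ...)."""
        return [c for c in self.cards if session % self.review_every[c.box] == 0]

    def record(self, card, correct):
        """Move the card up a box when right, down a box when wrong."""
        last_box = len(self.review_every) - 1
        if correct:
            card.box = min(card.box + 1, last_box)
        else:
            card.box = max(card.box - 1, 0)


# Hypothetical cards, purely for illustration.
deck = LeitnerDeck()
first = deck.add("Which dashboard shows queue depth?", "The ingest overview")
second = deck.add("Who owns the billing alerts?", "The payments team")

deck.record(first, correct=True)  # first moves up to box 1
print([c.question for c in deck.due(1)])  # only the card still in box 0
print([c.question for c in deck.due(2)])  # both cards are due again
```

The widening gap between reviews is what gives you the delayed, effortful retrieval: by the time a card in a later box comes around again, you have to work to recall it.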

Logan also talked about Mem­o­ry Palace, which is a tech­nique that mem­o­ry ath­letes use to improve their mem­o­ry.

This section of the talk echoes the research that SuperMemo is built on, and Josh Kaufman’s TED talk about the 10k hours rule, where he specifically talked about the need for those exercises to be deliberate and challenging.

We need to build knowledge into mental models, which are the basis of expert intuition. Mental models prepare you for dealing with incidents and make you a better storyteller, and you become better at automatically recognising causality.

To build these mental models, we need to recognise patterns from observable events, and build up our mental models from those patterns.

But this is hard.

If these events and patterns are incomplete or false, so are the mental models you build on top of them.

For true observability, you also need to include historical context and decisions. This can be as simple as a document that outlines previous incidents and decisions, to help newcomers build a shared understanding of that historical context.

This is very sim­i­lar to the archi­tec­tur­al deci­sion record (ADR) used by some devel­op­ment teams.

Reg­u­lar reflec­tion also helps sus­tain mem­o­ry and learn­ing. Prac­tic­ing active reflec­tion helps you build men­tal mod­els faster.

Inci­dent reviews help with this.

You should go beyond a single mental model. Exposing yourself to multiple mental models, and therefore different ways of looking at the world, helps reinforce true understanding of the models.

Again, I couldn’t agree more with this! I have long held the belief that you need to learn different programming paradigms in order to open yourself up to solutions that you might not even be able to perceive otherwise.

Logan then talked about cul­tur­al mem­o­ry and the growth mind­set.

Cycling back to the earlier point on the need to document historical context: we need to share experiences from incidents with newcomers. Your operational ability is built on bad memories that leave a mark. Institutional knowledge is essential for newcomers.

Also, you need to recognise that learning requires a lot of failure.

Finally, Logan finished off on the importance of psychological safety, which, incidentally, is also the most important attribute of a successful team, based on research conducted by Google.

Build an envi­ron­ment where peo­ple feel safe to ask any ques­tion, even ones that might be per­ceived as naive.

Don’t car­ry blame, opti­mise for learn­ing instead.

Work­ing in envi­ron­ments with­out psy­cho­log­i­cal safe­ty can be detri­men­tal to your men­tal health. Your mem­o­ry suf­fers from bad men­tal health, which ulti­mate­ly impacts your abil­i­ty to respond quick­ly.

Alert Fatigue

If you have worked in the ops space, then you’re prob­a­bly famil­iar with alert fatigue and deci­sion fatigue.

They happen because we run out of executive function, which is the set of cognitive processes required for decision making.

Our executive function is limited over any given period of time. It does replenish over time, but it’s not an unlimited resource.

Alert and deci­sion fatigue are relat­ed because they’re both to do with our abil­i­ty to observe changes and our abil­i­ty to affect them.

Aditya used alert fatigue among hospital staff to draw many parallels with the software industry.

In the healthcare industry, between 72% and 99% of all alarms are false alarms!

If you receive mul­ti­ple false alarms for the same patient, it doesn’t just cre­ate alert fatigue for that patient. It great­ly impacts your alert fatigue lev­el sys­tem­i­cal­ly for ALL patients.

This is the same with software, where one frequently misfiring alarm can cause alert fatigue for all alarms. This is why we need to actively chase down edge cases that generate high numbers of false alarms. They are likely responsible for draining our executive function.
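One way to chase these down is to measure how often each alert turns out to be a false alarm. Below is a rough Python sketch of that idea. It assumes you keep a history of firings tagged with whether they needed action, which is my own assumption and not something from the talk; the alert names and the 50% threshold are made up too.

```python
from collections import Counter


def false_alarm_rates(firings):
    """firings: iterable of (alert_name, was_actionable) pairs.

    Returns {alert_name: fraction of its firings that were false alarms}.
    """
    total = Counter()
    false = Counter()
    for name, was_actionable in firings:
        total[name] += 1
        if not was_actionable:
            false[name] += 1
    return {name: false[name] / total[name] for name in total}


def noisy_alerts(firings, threshold=0.5):
    """Alerts whose false alarm rate is at or above the threshold, worst first."""
    rates = false_alarm_rates(firings)
    noisy = [(name, rate) for name, rate in rates.items() if rate >= threshold]
    return sorted(noisy, key=lambda item: item[1], reverse=True)


# Hypothetical firing history: (alert name, did it need action?)
history = [
    ("disk_usage_high", False),
    ("disk_usage_high", False),
    ("disk_usage_high", True),
    ("api_5xx_rate", True),
    ("api_5xx_rate", True),
    ("queue_lag", False),
]

for name, rate in noisy_alerts(history):
    print(f"{name}: {rate:.0%} false alarms")
```

With the sample history above, `queue_lag` and `disk_usage_high` come out as the noisy ones, while `api_5xx_rate` has never fired falsely and is left alone.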

Aditya then shared four steps to reduce alert fatigue: sup­port­ed, trust­wor­thy, action­able and triaged, or STAT for short.

For an alert to be supported, you need to be able to identify who owns the alert, and the owner needs to have the ability to affect it. Don’t forget, alert systems also include the people who respond to alerts, not just the alerts themselves.

A common bad practice is to not allow responders to change the thresholds of the alerts for fear of tampering. Remember, the responders already have the power to simply ignore the alerts.

Instead, we should empower people to change thresholds rather than forcing them to ignore alerts. We are not granting them any power they didn’t already have.

The person who responds to an alert must also have the ownership to affect the end result, which could be the business process that the alert is intended to protect.

Don’t keep alerts past their usefulness. If an alert is no longer relevant as our architecture changes, get rid of it. If the goal is to reduce alert fatigue, then we don’t want a system that systematically induces alert fatigue!

We need to be able to trust our alerts at a systemic level, not just individual alerts. People need to trust the system. Even one bad alert can corrode trust in the whole system.

We need to maximise the collective trustworthiness of the alerts, which is an important point to consider when using automated processes for generating alerts. Do these processes generate alerts that I can actually trust? Can I adjust the thresholds?

If you can’t trust your alerts, you have to make even more decisions when responding to one. The more decisions you have to make in response to an alert, the more chances there are for mistakes.

This cre­ates deci­sion fatigue, and you’re more like­ly to ignore these alerts in the future. In turn, it cre­ates a fear towards the alerts, and sets off alert fatigue before it even hap­pens.

Spell out the actions; use a decision tree. Decision trees are good.
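As an illustration of what spelling out the actions could look like, here is a small Python sketch of a decision tree attached to an alert. The questions, the actions and the `walk` helper are all hypothetical examples of mine, not something Aditya showed.

```python
# Hypothetical decision tree for one alert. Inner nodes ask a yes/no
# question, leaves are the action the responder should take.
TREE = {
    "question": "Is the error rate still above the threshold?",
    "yes": {
        "question": "Was there a deploy in the last hour?",
        "yes": "Roll back the deploy",
        "no": "Page the owning team",
    },
    "no": "Record it as a false alarm and review the threshold",
}


def walk(tree, answers):
    """Follow yes/no answers (True/False) down the tree and return the action."""
    node = tree
    for answer in answers:
        if isinstance(node, str):
            break  # already reached an action, ignore any extra answers
        node = node["yes" if answer else "no"]
    if not isinstance(node, str):
        raise ValueError("more answers are needed to reach an action")
    return node


print(walk(TREE, [True, False]))  # Page the owning team
```

The responder only has to answer the questions in front of them, so the decisions that drain executive function are made once, ahead of time, by whoever wrote the tree.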

An alert is not actionable without an actor. It needs a clear decision on who should be taking charge. A common bad pattern is for the alert to fire blindly into a Slack channel with no clear indication as to who should deal with it.

We also need a sense of sever­i­ty with alerts so we can triage them.
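Putting ownership, actor and severity together, here is a hedged Python sketch of how an alert definition could carry those attributes, plus a simple check and triage order. The field names, the severity levels and the sample alerts are my own assumptions, not a real alerting tool’s schema.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    # Assumed levels; a higher value means it should be handled first.
    LOW = 1
    HIGH = 2
    CRITICAL = 3


@dataclass
class Alert:
    name: str
    severity: Severity
    owner: Optional[str] = None      # who may change or retire the alert
    responder: Optional[str] = None  # who is expected to act when it fires
    runbook: Optional[str] = None    # the spelled-out actions or decision tree


def problems(alert):
    """List what is missing before the alert can be called supported and actionable."""
    found = []
    if not alert.owner:
        found.append("no owner")
    if not alert.responder:
        found.append("no responder")
    if not alert.runbook:
        found.append("no runbook")
    return found


def triage(alerts):
    """Order firing alerts so the most severe ones are handled first."""
    return sorted(alerts, key=lambda alert: alert.severity, reverse=True)


# Hypothetical alerts, purely for illustration.
firing = [
    Alert("disk_usage_high", Severity.LOW, owner="infra"),
    Alert("checkout_errors", Severity.CRITICAL, owner="payments",
          responder="payments on-call", runbook="checkout decision tree"),
]

for alert in triage(firing):
    print(alert.name, problems(alert))
```

In this sample, `checkout_errors` comes first and has nothing missing, while `disk_usage_high` is flagged for having no responder and no runbook, which is exactly the alert that would otherwise land in a channel with nobody to pick it up.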

Hav­ing a sys­tem for triag­ing alerts is great, but we also need to keep up-to-date on the dynam­ic things that change as our sys­tem and archi­tec­ture changes.

We need to regularly re-evaluate what we alert on, how severe each alert is, and what is causing us alert fatigue.

Right at the end of the day, Ian Ben­net of Twit­ter shared his expe­ri­ence of migrat­ing Twit­ter to their new observ­abil­i­ty stack.

It was interesting to hear about all the problems they ran into, and I was also struck by the scale of their infrastructure!

Alright, I’m still jet lagged as hell (I’m writing this at 2am..) but I’m looking forward to day 2!