CodeMotion 15 – Takeaways from “Measuring micro-services”

This talk by Richard Rodger (of nearForm) was my favourite at this year’s CodeMotion conference in Rome. He explained why we need to change the way we think about monitoring when it comes to measuring micro-services.

 

TL;DR

Iden­ti­fy invari­ants in your sys­tem and use them to mea­sure the health of your sys­tem.

 

When it comes to measuring the health of a human being, we don’t focus on the minute details; instead, we monitor emergent properties such as pulse, temperature and blood pressure.

Similarly, for micro-services, we need to go beyond the usual metrics of CPU and network flow and focus on the emergent properties of our system. When you have systems with tens or hundreds of moving parts, those rudimentary metrics alone can no longer tell you whether your systems are in rude health.

 

Message flow rate

Mes­sages are fun­da­men­tal to micro-ser­vices style archi­tec­tures and mes­sage behav­iour has emer­gent prop­er­ties that can be mea­sured.

For instance, you can monitor the rate of user logins, or the number of messages sent to downstream systems. When it comes time to do a rolling deployment (or canary deployment, or blue/green deployment, etc.), noticeable changes in these message rates are an indicator of bugs in your new code.
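As a rough illustration of watching message rates across a deployment, here is a minimal Python sketch. It is not from the talk: the message names (`user.login`, `email.send`), the one-minute window and the 20% tolerance are all assumptions made up for the example.

```python
from collections import Counter

def message_rates(events, window_seconds):
    """Count messages per type and convert the counts to per-second rates.

    `events` is an iterable of message-type strings seen during the window.
    """
    counts = Counter(events)
    return {kind: n / window_seconds for kind, n in counts.items()}

def rate_shifts(baseline, candidate, tolerance=0.2):
    """Return message types whose rate changed by more than `tolerance` (relative)."""
    shifted = {}
    for kind, base_rate in baseline.items():
        new_rate = candidate.get(kind, 0.0)
        change = (new_rate - base_rate) / base_rate
        if abs(change) > tolerance:
            shifted[kind] = change
    return shifted

# Hypothetical one-minute samples taken before and during a rolling deployment.
before = message_rates(["user.login"] * 120 + ["email.send"] * 60, 60)
during = message_rates(["user.login"] * 118 + ["email.send"] * 12, 60)

# Only message types that moved noticeably are reported.
shifts = rate_shifts(before, during)
print(shifts)
```

In this made-up sample the login rate barely moves while the `email.send` rate collapses, which is the kind of change that would prompt a closer look at the new code.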

 

Message Patterns

Here are some simplified message patterns that have emerged from Richard’s experience of building micro-services.

image

Interestingly, when a micro-services architecture grows organically, it tends to end up looking like the Tree pattern over time, even if you hadn’t designed it that way.

 

Why Micro-services?

As many oth­ers have talked about, micro-ser­vices archi­tec­tures are more com­plex, so why go down this road at all?

Because it helps us deal with deploy­ment risks and move away from the big-bang deploy­ments asso­ci­at­ed with mono­lith­ic sys­tems (a case in point being the Knight Cap­i­tal Group tragedy).

 

Risk is inevitable. There’s no way of completely removing the risk associated with a project, but we can reduce it. The current best practices of unit tests and code reviews aren’t good enough because they only cover the problems that we can anticipate, and they don’t really tell us by how much our risk has been reduced.

(side­bar : this is also why prop­er­ty-based test­ing is so great. It moves us away from think­ing about spe­cif­ic test cas­es (i.e. prob­lems that we can antic­i­pate), to think­ing about prop­er­ties and invari­ants about our sys­tem. Scott Wlaschin wrote two great posts on adopt­ing prop­er­ty-based test­ing in F#.)

 

Instead, we have to accept that the sys­tem can fail in ways we can’t antic­i­pate, so it’s more impor­tant to mea­sure impor­tant behav­iours in pro­duc­tion rather than try­ing to catch every pos­si­ble case in devel­op­ment.

(sidebar : this echoes the message that I have heard time and again from the Netflix guys. They have also built great tooling to make it easy for them to do canary deployments, measure and confirm system behaviour before rolling out changes, and quickly roll back if necessary.)

 

It’s possible to do this with monoliths too, with Facebook and GitHub being prime examples. The way they do it is to use feature flags. The downside of this approach is that you end up with messier code because it’s littered with if-else statements.
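To make the if-else point concrete, here is a small hypothetical sketch of a feature-flagged function in Python. The flag name, the in-process flag store and the tax rates are all invented for illustration; this is not how Facebook or GitHub implement their flags.

```python
# Hypothetical in-process flag store; a real system would usually load this from config.
FLAGS = {"new_tax_rules": False}

def sales_tax_cents(amount_cents, flags=FLAGS):
    """Compute sales tax, branching on a feature flag (the rates are made up)."""
    if flags.get("new_tax_rules", False):
        # New code path, switched on for a subset of traffic.
        return amount_cents * 23 // 100
    else:
        # Existing behaviour, kept until the flag is retired.
        return amount_cents * 20 // 100

old_tax = sales_tax_cents(10_000)                           # flag off
new_tax = sales_tax_cents(10_000, {"new_tax_rules": True})  # flag on
print(old_tax, new_tax)
```

Each change that ships behind a flag adds another branch like this one, and the branches stay in the code until someone removes the flag.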

With micro-services, however, your code doesn’t have to be messy anymore.

(sidebar : on a related note, Sam Newman also pointed out a number of other benefits of micro-services style architectures.)

 

Formal Methods

Richard proposed the use of Formal Methods to decide what to measure in a system. He then gave a shout-out to TLA+ by Leslie Lamport (of Paxos and logical clocks fame), which was used by AWS to verify DynamoDB and S3.

The impor­tant thing is to iden­ti­fy invari­ants in your sys­tem – things that should always be true about your sys­tem – and mea­sure those.

For example, given two services – shopping cart & sales tax – as below, the ratio of the message rates (i.e. requests) to these services should be 1:1.

Even as the mes­sage rates them­selves change through­out the day (as dic­tat­ed by user activ­i­ty), this 1:1 ratio should always hold. And that is an invari­ant of our sys­tem that we can mea­sure!

image

image
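The 1:1 invariant between the shopping cart and sales tax services could be checked along these lines. This is a minimal Python sketch under stated assumptions: the per-minute counts are invented, and the 5% tolerance (to allow for messages still in flight) is my own choice rather than anything from the talk.

```python
def ratio_invariant_holds(cart_count, tax_count, expected_ratio=1.0, tolerance=0.05):
    """Check that the cart:tax message counts stay near the expected ratio."""
    if tax_count == 0:
        # No sales tax messages: the invariant holds only if the cart was idle too.
        return cart_count == 0
    ratio = cart_count / tax_count
    return abs(ratio - expected_ratio) <= tolerance * expected_ratio

# Hypothetical per-minute message counts at different times of day.
samples = [
    ("03:00", 40, 40),     # quiet period
    ("12:00", 900, 893),   # busy period, a few messages still in flight
    ("12:01", 910, 610),   # sales tax requests are going missing
]

# The absolute rates vary a lot, but only the last sample breaks the ratio.
violations = [t for t, cart, tax in samples if not ratio_invariant_holds(cart, tax)]
print(violations)
```

Note that the check ignores how busy the system is and looks only at the relationship between the two rates, which is what makes it usable at any time of day.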

The same invariant applies to any system that follows the Chain pattern we mentioned earlier. In fact, each of the patterns we saw earlier gives rise to a natural invariant:

image

 

All your unit tests are check­ing what can go wrong. By look­ing for cause/effect rela­tion­ships and mea­sur­ing invari­ants, we turn that on its head and instead val­i­date what must go right in pro­duc­tion.

“ask not what can go wrong, ask what must go right…”

- Chris New­combe, AWS

It becomes an enabler for doing rapid deploy­ment.

When you do a partial deployment and see that it has broken your invariants, then you know it’s time to roll back.

Richard then talked about Gilt’s use of micro-services. Incidentally, Gilt gave a talk at CraftConf on just that, which you can read all about here.

 

In the oil industry, they have a rule-of-three that says if three of the safety measures are close to critical levels then it counts as a problem, even if none of the measures has gone over its critical level individually.

We can apply the same idea in our own systems, where we can use measures that are approaching failure as indicators of:

  • risk of doing a deploy­ment
  • risk of a sys­tem fail­ure in the near future
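A rule-of-three check of this kind might look like the following Python sketch. The measure names, their critical levels and the 10% "close to critical" margin are all hypothetical, and the sketch assumes that higher values are worse for every measure.

```python
from dataclasses import dataclass

@dataclass
class Measure:
    name: str
    value: float
    critical: float  # level at which this measure alone would count as a failure

def near_critical(measure, margin=0.1):
    """True if the value is within `margin` (as a fraction) of its critical level."""
    return measure.value >= measure.critical * (1 - margin)

def rule_of_three(measures, margin=0.1, limit=3):
    """Return the names of measures near critical if at least `limit` are, else []."""
    close = [m.name for m in measures if near_critical(m, margin)]
    return close if len(close) >= limit else []

# Hypothetical readings: none is over its critical level, but three are close to it.
readings = [
    Measure("error_rate_pct", 0.95, 1.0),
    Measure("p99_latency_ms", 470, 500),
    Measure("queue_depth", 9200, 10000),
    Measure("cpu_pct", 55, 90),
]
at_risk = rule_of_three(readings)
print(at_risk)
```

A non-empty result here could be treated as a reason to postpone a deployment, or as an early warning of a failure, even though no single alert has fired.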

 

(sidebar : one aspect of measurement that Richard didn’t touch on in his talk is the granularity of your measurements: minutes, seconds or milliseconds. It determines your time to discovery of a problem, and therefore the subsequent time to correction.

This has been a hot top­ic at the Mon­i­tora­ma con­fer­ences and ex-Net­flix cloud archi­tect Adri­an Cock­croft did a great keynote on it last year.)

 

 

Links