Metricano – simplifying application monitoring

On application monitoring

In the Gamesys social team, our view on application monitoring is that anything running in production needs to be monitored extensively, all the time: every service entry point, IO operation and CPU-intensive task. Sure, it comes at the cost of a few CPU cycles, which might mean running a few more servers at scale, but that’s a small cost to pay compared to:

  • lack of visibility of how your application is performing in production; or
  • inability to spot issues occurring on individual nodes amongst a large number of servers; or
  • longer time to discovery of production issues, which results in
    • longer time to recovery (i.e. longer downtime)
    • loss of customers (immediate result of downtime)
    • loss of customer confidence (longer term impact)

Ser­vices such as Stack­Driv­er and Ama­zon Cloud­Watch also allow you to set up alarms around your met­rics so that you can be noti­fied or some auto­mat­ed actions can be trig­gered in response to chang­ing con­di­tions in pro­duc­tion.

In Release It!: Design and Deploy Production-Ready Software (a great read, by the way), Michael T. Nygard discusses at length how unfavourable conditions in production can cause cracks to appear in your system, and how tight coupling and other anti-patterns let these cracks accelerate, spread across your entire application and eventually bring it to its knees.

 

In apply­ing exten­sive mon­i­tor­ing to our ser­vices we are able to:

  • see cracks appear­ing in pro­duc­tion ear­ly; and
  • collect an extensive amount of data for the post-mortem; and
  • use knowl­edge gained dur­ing post-mortems to iden­ti­fy ear­ly warn­ing signs and set up alarms accord­ing­ly

 

Introducing Metricano

With our empha­sis on mon­i­tor­ing, it should come as no sur­prise that we have long sought to make it easy for our devel­op­ers to mon­i­tor dif­fer­ent aspects of their ser­vice.

 

Now, we’ve made it easy for you to do the same with Metricano, an agent-based framework for collecting, aggregating and publishing metrics from your application. At a high level, the MetricsAgent class collects metrics and aggregates them into second-by-second summaries. These summaries are then published to all the publishers you have configured.
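To make the collect, aggregate and publish flow a bit more concrete, here is a rough sketch in Python. It is not Metricano’s code or API: ToyAgent, Summary, record and flush are names I made up for illustration, and I’m assuming a publisher is simply a callable that receives a batch of summaries.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Summary:
    """Aggregate of one metric over a single one-second window."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self) -> float:
        return self.total / self.count


class ToyAgent:
    """Hypothetical stand-in for an agent; NOT Metricano's MetricsAgent."""

    def __init__(self, publishers):
        self.publishers = list(publishers)
        # (whole second, metric name) -> Summary
        self.windows = defaultdict(Summary)

    def record(self, name: str, value: float, timestamp: float) -> None:
        # Truncating the timestamp puts every sample from the same second
        # into the same bucket.
        self.windows[(int(timestamp), name)].add(value)

    def flush(self) -> None:
        # Hand the accumulated summaries to every publisher, then start afresh.
        batch = dict(self.windows)
        self.windows.clear()
        for publish in self.publishers:
            publish(batch)


received = []
agent = ToyAgent(publishers=[received.append])
agent.record("db_query_ms", 12.0, timestamp=100.2)
agent.record("db_query_ms", 20.0, timestamp=100.9)
agent.record("db_query_ms", 7.0, timestamp=101.1)
agent.flush()
```

After the flush, the single publisher has received one batch containing two summaries: one for second 100 (two samples) and one for second 101 (one sample). The raw samples themselves are never shipped, only the aggregates.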

 

Recording Metrics

There are a num­ber of options for you to record met­rics with Met­ric­sAgent:

Man­u­al­ly

You can call the IncrementCountMetric or RecordTimeMetric methods on an instance of IMetricsAgent (you can use MetricsAgent.Default, or create a new instance with MetricsAgent.Create).

 

F# Work­flows

From F#, you can also use the built-in timeMetric and countMetric workflows.

 

Post­Sharp Aspects

Lastly, you can use the CountExecution and LogExecutionTime attributes from the Metricano.PostSharpAspects NuGet package, which can be applied at method, class and even assembly level.

The CountExecution attribute records a count metric under the fully qualified name of the method, whereas the LogExecutionTime attribute records execution times into a time metric under the fully qualified name of the method. When applied at class or assembly level, the attributes are multicast to all encompassed methods: private, public, instance and static. It’s also possible to target specific methods by name, visibility, etc.; please refer to PostSharp’s documentation for details.
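If you haven’t used aspects before, Python decorators are a loose analogy for what the two attributes do. The sketch below is only that, an analogy: PostSharp weaves its aspects in at compile time, and count_execution, log_execution_time and the two dictionaries are hypothetical names of my own, not part of Metricano or PostSharp.

```python
import functools
import time
from collections import defaultdict

counts = defaultdict(int)        # fully qualified name -> number of calls
timings_ms = defaultdict(list)   # fully qualified name -> execution times (ms)


def qualified_name(func) -> str:
    return f"{func.__module__}.{func.__qualname__}"


def count_execution(func):
    """Hypothetical analogue of a counting aspect: one increment per call."""
    name = qualified_name(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        counts[name] += 1
        return func(*args, **kwargs)

    return wrapper


def log_execution_time(func):
    """Hypothetical analogue of a timing aspect: one time sample per call."""
    name = qualified_name(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the elapsed time even when the call raises.
            timings_ms[name].append((time.perf_counter() - start) * 1000.0)

    return wrapper


class OrderService:
    @count_execution
    @log_execution_time
    def place_order(self, item: str) -> str:
        return f"ordered {item}"


service = OrderService()
service.place_order("sword")
service.place_order("shield")
```

Both metrics end up keyed by the same fully qualified method name, and the method body stays free of monitoring code, which is the main attraction of the aspect-based approach.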

 

Publishing Metrics

All the met­rics you record won’t do you much good if they just stay inside the mem­o­ry of your appli­ca­tion serv­er.


To get metrics out of your application server and into a monitoring service or dashboard, you need to tell Metricano to publish them with a set of publishers. There is a ready-made publisher for the Amazon CloudWatch service in the Metricano.CloudWatch NuGet package.

To add a publisher to the pipeline, use the static Publish.With method.

Since the finest granularity on Amazon CloudWatch is 1 minute, CloudWatchPublisher aggregates metrics locally and only publishes the aggregates once a minute. This optimization cuts down on the number of web requests (which also has a cost impact).
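The idea of rolling second-level summaries up into minute-level aggregates before sending anything over the wire can be sketched like this. It is a simplified illustration in Python, not CloudWatchPublisher’s actual implementation; the function name and the (epoch_second, count, total) tuple shape are assumptions of mine.

```python
from collections import defaultdict


def roll_up_to_minutes(second_summaries):
    """Merge (epoch_second, count, total) tuples into per-minute aggregates.

    Returns {minute_start_epoch: (count, total)}.
    """
    minutes = defaultdict(lambda: [0, 0.0])
    for epoch_second, count, total in second_summaries:
        # Snap the timestamp down to the start of its minute.
        minute_start = epoch_second - (epoch_second % 60)
        bucket = minutes[minute_start]
        bucket[0] += count
        bucket[1] += total
    return {minute: (count, total) for minute, (count, total) in minutes.items()}


# Three one-second summaries: two fall in the same minute, one in the next.
per_second = [(120, 4, 40.0), (150, 6, 90.0), (185, 1, 5.0)]
per_minute = roll_up_to_minutes(per_second)
requests_saved = len(per_second) - len(per_minute)
```

Here three summaries collapse into two data points. With a busy service producing a summary every second, a minute’s worth of data collapses into a single data point per metric.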

If you want to publish your metric data to another service (StackDriver or New Relic, for instance), you can create your own publisher by implementing the very simple IMetricsPublisher interface. This simple ConsolePublisher, for instance, calculates the 95th percentile execution time for each metric and prints it:

[image: the ConsolePublisher example]
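For readers who want something they can run, here is a rough Python approximation of what such a console publisher might look like. It does not implement IMetricsPublisher and makes no claim about that interface’s members; the class shape, the publish method and the {name: [times]} input are all assumptions for the sake of the sketch, and the percentile uses the simple nearest-rank definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


class ConsolePublisher:
    """Hypothetical publisher: takes {metric name: [execution times in ms]}."""

    def publish(self, time_metrics):
        lines = []
        for name, samples in sorted(time_metrics.items()):
            line = f"{name}: p95 = {percentile(samples, 95):.1f} ms"
            print(line)
            lines.append(line)
        return lines


samples = {"db_query": [float(ms) for ms in range(1, 101)]}  # 1.0 .. 100.0
output = ConsolePublisher().publish(samples)
# prints "db_query: p95 = 95.0 ms"
```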

In general I find the 95th/97th/99th percentile time metrics much more informative than simple averages, since averages are so easily skewed by even a small number of outlying data points.
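A quick worked example shows the effect. The numbers below are made up: 98 requests that take 10 ms and 2 that take a full second. The percentile helper is a plain nearest-rank calculation, which is only one of several common definitions.

```python
import math
import statistics


def nearest_rank(values, pct):
    # Smallest sample with at least pct percent of samples at or below it.
    ordered = sorted(values)
    return ordered[max(math.ceil(pct * len(ordered) / 100), 1) - 1]


# Made-up latencies: 98 fast requests and 2 slow outliers.
latencies_ms = [10.0] * 98 + [1000.0] * 2

mean_ms = statistics.mean(latencies_ms)
p95_ms = nearest_rank(latencies_ms, 95)
p99_ms = nearest_rank(latencies_ms, 99)
```

The mean comes out at 29.8 ms, almost three times what a typical request experiences, yet nowhere near what the slow requests experience. The 95th percentile (10 ms) tells you what most requests see, and the 99th percentile (1000 ms) exposes the outliers directly.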

 

Summary

I hope you have enjoyed this post and that you’ll find Metricano a useful addition to your application.

I highly recommend reading Release It! Many of the patterns and anti-patterns discussed in the book are becoming more and more relevant in today’s world, where we’re building smaller, more granular microservices. Even the simplest of applications has multiple integration points (social networks, cloud services, etc.), and these are the places where cracks are likely to appear before they spread to the rest of your application, unless you have taken measures to guard against such cascading failures.

If you decide to buy the book from Amazon, please use the link I provide below, or add the query string parameter tag=theburningmon-21 to the URL, so that I can get a small referral fee and put it towards buying more books and finding more interesting things to write about here :-)

 

Links

Met­ri­cano project page

Release It!: Design and Deploy Pro­duc­tion-Ready Soft­ware

Metricano NuGet package

Metricano.CloudWatch NuGet package

Metricano.PostSharpAspects NuGet package

Red-White Push – Con­tin­u­ous Deliv­ery at Gamesys Social