Metricano – simplifying application monitoring

On application monitoring

In the Gamesys social team, our view on application monitoring is such that anything that runs in production needs to be monitored extensively all the time – every service entry point, IO operations or CPU intensive tasks. Sure, it comes at the cost of a few CPU cycles which might mean that you have to run a few more servers at scale, but that’s small cost to pay compared to:

  • lack of visibility of how your application is performing in production; or
  • inability to spot issues occurring on individual nodes amongst large number of servers; or
  • longer time to discovery on production issues, which results in
    • longer time to recovery (i.e. longer downtime)
    • loss of customers (immediate result of downtime)
    • loss of customer confidence (longer term impact)

Services such as StackDriver and Amazon CloudWatch also allow you to set up alarms around your metrics so that you can be notified or some automated actions can be triggered in response to changing conditions in production.

In Michael T. Nygard’s Release It!: Design and Deploy Production-Ready Software (a great read by the way) he discussed at length how unfavourable conditions in production can cause cracks to appear in your system, and through tight coupling and other anti-patterns these cracks can accelerate and spread across your entire application and eventually bring it crashing down to its knees.


In applying extensive monitoring to our services we are able to:

  • see cracks appearing in production early; and
  • collect extensive amount of data for the post-mortem; and
  • use knowledge gained during post-mortems to identify early warning signs and set up alarms accordingly


Introducing Metricano

With our emphasis on monitoring, it should come as no surprise that we have long sought to make it easy for our developers to monitor different aspects of their service.


Now, we’ve made it easy for you to do the same with Metricano, an agent-based framework for collecting, aggregating and publishing metrics from your application. From a high-level, the MetricsAgent class collects metrics and aggregates them into second-by-second summaries. These summaries are then published to all the publishers you have configured.


Recording Metrics

There are a number of options for you to record metrics with MetricsAgent:


You can call the IncrementCountMetric, or RecordTimeMetric methods on an instance of IMetricsAgent (you can use MetricsAgent.Default or create a new instance with MetricsAgent.Create), for example:


F# Workflows

From F#, you can also use the built-in timeMetric and countMetric workflows:


PostSharp Aspects

Lastly, you can also use the CountExecution and LogExecutionTime attributes from the Metricano.PostSharpAspects nuget package, which can be applied at method, class and even assembly level.

The CountExecution attribute records a count metric with the fully qualified name of the method, whereas the LogExecutionTime attribute records execution times into a time metric with the fully qualified name of the method. When applied at class and assembly level, the attributes are multi-casted to all encompassed methods, private, public, instance and static. It’s possible to target specific methods, by name or visibility, etc. please refer to PostSharp’s documentation for detail.


Publishing Metrics

All the metrics you record won’t do you much good if they just stay inside the memory of your application server.

To get metrics out of your application server and into a monitoring service or dashboard, you need to tell Metricano to publish metrics with a set of publishers. There is a ready made publisher for Amazon CloudWatch service in the Metricano.CloudWatch nuget package.

To add a publisher to the pipeline, use the Publish.With static method, see example here.

Since the lowest granularity on Amazon CloudWatch is 1 minute, so as an optimization to cut down on the number of web requests (which also has a cost impact), CloudWatchPublisher will aggregate metrics locally and only publish the aggregates on a per minute basis.

If you want to publish your metric data to another service (StackDriver or New Relic for instance), you can create your own publisher by implementing the very simple IMetricsPublisher interface. This simple ConsolePublisher for instance, will calculate the 95 percentile execution time and print them:


In general I find the 95/97/99 percentile time metrics much more informative than simple averages, since averages are so easily biased by even a small number of outlying data points.



I hope you have enjoyed this post and that you’ll find Metricano a useful addition in your application.

I highly recommend reading Release It!, much of the patterns and anti-patterns discussed in the book are becoming more and more relevant in today’s world where we’re building smaller, more granular microservices. Even the simplest of applications have multiple integration points – social networks, cloud services, etc. – and they are places where cracks are likely to occur before they spread out to the rest of your application, unless, you have taken the measure to guard against such cascading failures.

If you decide to buy the book from amazon, please use the link I provide below or add the query string parameter tag=theburningmon-21 to the URL so that I can get a small referral fee and use it towards buying more books and finding more interesting things to write about here Smile



Metricano project page

Release It!: Design and Deploy Production-Ready Software

Metricano nuget package

Metricano.CloudWatch nuget package

Metricano.PostSharpAspects nuget package

Red-White Push – Continuous Delivery at Gamesys Social

Introducing, DSLs to query against Amazon CloudWatch metrics

If you have done any DevOps work on Amazon Web Services (AWS) then you should be familiar with Amazon CloudWatch, a service for tracking and viewing metrics (CPU, network in/out, etc.) about the various AWS services that you consume, or better still, custom metrics that you publish about your service.

On top of that, you can also set up alarms on any metrics and send out alerts via Amazon SNS, which is a pretty standard practice of monitoring your AWS-hosted application. There are of course many other paid services such as StackDriver and New Relic which offer you a host of value-added features, personally I was impressed with some of the predicative features from StackDriver.

The built-in Amazon management console for CloudWatch provides the rudimentary functionalities that lets you browse your metrics and view/overlap them on a graph, but it falls short once you have a decent number of metrics.

For starters, when trying to browse your metrics by namespace, you’re capped at 200 metrics so discovery is out of the question, you have to know what you’re looking for to be able to find it, which isn’t all that useful when you have hundreds of metrics to work with…


Also, there’s no way for you to filter metrics by the recorded datapoints, so to answer even simple questions such as

‘what other timespan metrics also spiked at mid-day when our service discovery latency spiked?’

you now have to manually go through all the relevant metrics (and of course you have to find them first!) and then visually check the graph to try and find any correlations.


After being frustrated by this manual process for one last time I decided to write some tooling myself to make my life (and hopefully others) a bit easier, and in comes Amazon.CloudWatch.Selector, a set of DSLs and CLI for querying against Amazon CloudWatch.



With this simple library you will get:

  • an internal DSL which is intended to be used from F# but still usable from C# although syntactically not as intuitive
  • an external DSL which can be embedded into a command line or web tool


Both DSLs support the same set of filters, e.g.

NamespaceIs Filters metrics by the specified namespace.
NamespaceLike Filters metrics using a regex pattern against their namespaces.
NameIs Filters metrics by the specified name.
NameLike Filters metrics using a regex pattern against their names.
UnitIs Filters metrics against the unit they’re recorded in, e.g. Count, Bytes, etc.
Average Filters metrics by the recorded average data points, e.g. average > 300 looks for metrics whose average in the specified timeframe exceeded 300 at any time.
Min Same as above but for the minimum data points.
Max Same as above but for the maximum data points.
Sum Same as above but for the sum data points.
SampleCount Same as above but for the sample count data points.
DimensionContains Filters metrics by the dimensions they’re recorded with, please refer to the CloudWatch docs on how this works.
DuringLast Specifies the timeframe of the query to be the last X minutes/hours/days. Note: CloudWatch only keeps up to 14 days worth of data so there’s no point going any further back then that.
Since Specifies the timeframe of the query to be since the specified timestamp till now.
Between Specifies the timeframe of the query to be between the specified start and end timestamp.
IntervalOf Specifies the ‘period’ in which the data points will be aggregated into, i.e. 5 minutes, 15 minutes, 1 hour, etc.

Here’s some code snippet on how to use the DSLs:


In addition to the DSLs, you’ll also find a simple CLI tool as part of the project which you can start by setting the credentials in the start_cli.cmd script and running it up. It allows you to query CloudWatch metrics using the external DSL.

Here’s a quick demo of using the CLI to select some CPU metrics for ElasiCache and then plotting them on a graph.


As a side note, one of the reasons why we have so many metrics is because we have made it super easy for ourselves to record new metrics (see this recorded webinar for more information) to gives ourselves a very granular set of metrics so that any CPU-intensive or IO work is monitored as well as any top-level entry points to our services.