Metricano – simplifying application monitoring

On application monitoring

In the Gamesys social team, our view on application monitoring is that anything running in production needs to be monitored extensively, all the time – every service entry point, IO operation and CPU-intensive task. Sure, it comes at the cost of a few CPU cycles, which might mean you have to run a few more servers at scale, but that’s a small cost to pay compared to:

  • lack of visibility into how your application is performing in production; or
  • inability to spot issues occurring on individual nodes amongst a large number of servers; or
  • longer time to discovery of production issues, which results in
    • longer time to recovery (i.e. longer downtime)
    • loss of customers (immediate result of downtime)
    • loss of customer confidence (longer-term impact)

Services such as StackDriver and Amazon CloudWatch also allow you to set up alarms around your metrics so that you can be notified or some automated actions can be triggered in response to changing conditions in production.

In Release It!: Design and Deploy Production-Ready Software (a great read, by the way), Michael T. Nygard discusses at length how unfavourable conditions in production can cause cracks to appear in your system, and how, through tight coupling and other anti-patterns, these cracks can accelerate and spread across your entire application until they eventually bring it down.

 

In applying extensive monitoring to our services we are able to:

  • see cracks appearing in production early; and
  • collect an extensive amount of data for the post-mortem; and
  • use knowledge gained during post-mortems to identify early warning signs and set up alarms accordingly

 

Introducing Metricano

With our emphasis on monitoring, it should come as no surprise that we have long sought to make it easy for our developers to monitor different aspects of their service.

 

Now, we’ve made it easy for you to do the same with Metricano, an agent-based framework for collecting, aggregating and publishing metrics from your application. At a high level, the MetricsAgent class collects metrics and aggregates them into second-by-second summaries. These summaries are then published to all the publishers you have configured.

 

Recording Metrics

There are a number of options for you to record metrics with MetricsAgent:

Manually

You can call the IncrementCountMetric or RecordTimeMetric methods on an instance of IMetricsAgent (you can use MetricsAgent.Default or create a new instance with MetricsAgent.Create), for example:
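Here’s a minimal F# sketch of what that might look like – the metric names are made up, and the exact method signatures may differ slightly from the released package:

    open System.Diagnostics
    open Metricano

    // use the default, shared agent (or MetricsAgent.Create() for a new one)
    let agent = MetricsAgent.Default

    // record a count metric, e.g. how many times an endpoint was hit
    agent.IncrementCountMetric("Leaderboard.TopScores")

    // record a time metric, e.g. how long a piece of work took
    let stopwatch = Stopwatch.StartNew()
    // ... do the work being measured here ...
    stopwatch.Stop()
    agent.RecordTimeMetric("Leaderboard.TopScores", stopwatch.Elapsed)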

 

F# Workflows

From F#, you can also use the built-in timeMetric and countMetric workflows:
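As a rough sketch only – the exact workflow syntax is documented on the project page, and the metric names and the work inside the braces below are made up:

    open Metricano

    // the body of a timeMetric workflow is timed and recorded automatically
    let topScores () =
        timeMetric "Leaderboard.TopScores" {
            // ... query the data store here; execution time is recorded ...
            return [ 1000; 950; 900 ]
        }

    // each run of a countMetric workflow bumps the named count metric
    let logIn () =
        countMetric "Players.LogIn" {
            // ... perform the log-in here ...
            return true
        }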

 

PostSharp Aspects

Lastly, you can also use the CountExecution and LogExecutionTime attributes from the Metricano.PostSharpAspects nuget package, which can be applied at method, class and even assembly level.

The CountExecution attribute records a count metric with the fully qualified name of the method, whereas the LogExecutionTime attribute records execution times into a time metric with the fully qualified name of the method. When applied at class or assembly level, the attributes are multicast to all encompassed methods – private, public, instance and static. It’s also possible to target specific methods by name, visibility, etc.; please refer to PostSharp’s documentation for details.
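For instance, at method level (a sketch – the service type and method bodies here are invented for illustration):

    open Metricano.PostSharpAspects

    type LeaderboardService() =
        // count how many times TopScores is called, under its fully qualified name
        [<CountExecution>]
        member this.TopScores(count : int) = List.init count (fun i -> 1000 - i)

        // record execution times of SaveScore into a time metric
        [<LogExecutionTime>]
        member this.SaveScore(score : int) = printfn "saved score %d" score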

 

Publishing Metrics

All the metrics you record won’t do you much good if they just stay inside the memory of your application server.

To get metrics out of your application server and into a monitoring service or dashboard, you need to tell Metricano to publish metrics with a set of publishers. There is a ready-made publisher for the Amazon CloudWatch service in the Metricano.CloudWatch nuget package.

To add a publisher to the pipeline, use the Publish.With static method, see example here.
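A minimal sketch of wiring up the CloudWatch publisher – note that the namespace and the CloudWatchPublisher constructor arguments shown here (metric namespace and AWS credentials) are assumptions, so check the package documentation for the actual signature:

    open Metricano
    open Metricano.Publisher   // assumed namespace for CloudWatchPublisher

    // register the CloudWatch publisher with the publishing pipeline
    Publish.With(new CloudWatchPublisher("MyApp.Production", "AWS_KEY", "AWS_SECRET"))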

The lowest granularity on Amazon CloudWatch is 1 minute, so as an optimization to cut down on the number of web requests (which also has a cost impact), CloudWatchPublisher aggregates metrics locally and only publishes the aggregates on a per-minute basis.

If you want to publish your metric data to another service (StackDriver or New Relic, for instance), you can create your own publisher by implementing the very simple IMetricsPublisher interface. This simple ConsolePublisher, for instance, calculates the 95th percentile execution times and prints them:

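Something along these lines – a self-contained sketch rather than the actual ConsolePublisher source; the MetricSummary shape and the Publish signature below are stand-ins for Metricano’s real types:

    // stand-in for the metric summaries the agent hands to publishers
    type MetricSummary = { Name : string; Values : float[] }

    // stand-in for Metricano's IMetricsPublisher
    type IMetricsPublisher =
        abstract Publish : MetricSummary[] -> unit

    // nearest-rank percentile over the raw values (in milliseconds)
    let percentile p (values : float[]) =
        let sorted = Array.sort values
        let rank   = int (ceil (p / 100.0 * float sorted.Length)) - 1
        sorted.[max 0 rank]

    type ConsolePublisher() =
        interface IMetricsPublisher with
            member __.Publish summaries =
                for s in summaries do
                    if s.Values.Length > 0 then
                        printfn "%s : 95th percentile = %.2f ms" s.Name (percentile 95.0 s.Values)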

In general I find the 95th/97th/99th percentile time metrics much more informative than simple averages, since averages are so easily skewed by even a small number of outlying data points. For example, 99 requests at 10ms plus a single request at 10 seconds gives an average of roughly 110ms, while the 95th percentile is still 10ms.

 

Summary

I hope you have enjoyed this post and that you’ll find Metricano a useful addition to your application.

I highly recommend reading Release It!; many of the patterns and anti-patterns discussed in the book are becoming more and more relevant in today’s world, where we’re building smaller, more granular microservices. Even the simplest of applications have multiple integration points – social networks, cloud services, etc. – and these are the places where cracks are likely to appear and spread to the rest of your application, unless you have taken measures to guard against such cascading failures.

If you decide to buy the book from Amazon, please use the link I provide below or add the query string parameter tag=theburningmon-21 to the URL, so that I can get a small referral fee and use it towards buying more books and finding more interesting things to write about here :)

 

Links

Metricano project page

Release It!: Design and Deploy Production-Ready Software

Metricano nuget package

Metricano.CloudWatch nuget package

Metricano.PostSharpAspects nuget package

Red-White Push – Continuous Delivery at Gamesys Social

Introducing DSLs to query against Amazon CloudWatch metrics

If you have done any DevOps work on Amazon Web Services (AWS) then you should be familiar with Amazon CloudWatch, a service for tracking and viewing metrics (CPU, network in/out, etc.) about the various AWS services that you consume, or better still, custom metrics that you publish about your service.

On top of that, you can also set up alarms on any metric and send out alerts via Amazon SNS, which is a pretty standard way of monitoring your AWS-hosted application. There are of course many other paid services, such as StackDriver and New Relic, which offer a host of value-added features; personally I was impressed with some of the predictive features from StackDriver.

The built-in Amazon management console for CloudWatch provides rudimentary functionality that lets you browse your metrics and view/overlap them on a graph, but it falls short once you have a decent number of metrics.

For starters, when trying to browse your metrics by namespace you’re capped at 200 metrics, so discovery is out of the question – you have to know what you’re looking for to be able to find it, which isn’t all that useful when you have hundreds of metrics to work with…


Also, there’s no way for you to filter metrics by the recorded datapoints, so to answer even simple questions such as

‘what other timespan metrics also spiked at mid-day when our service discovery latency spiked?’

you now have to manually go through all the relevant metrics (and of course you have to find them first!) and then visually check the graph to try and find any correlations.

 

After being frustrated by this manual process one last time, I decided to write some tooling to make my life (and hopefully others’) a bit easier – and in comes Amazon.CloudWatch.Selector, a set of DSLs and a CLI for querying against Amazon CloudWatch.

 

Amazon.CloudWatch.Selector

With this simple library you will get:

  • an internal DSL which is intended to be used from F#, but is still usable from C#, although syntactically not as intuitive
  • an external DSL which can be embedded into a command line or web tool

 

Both DSLs support the same set of filters:

NamespaceIs – filters metrics by the specified namespace.
NamespaceLike – filters metrics using a regex pattern against their namespaces.
NameIs – filters metrics by the specified name.
NameLike – filters metrics using a regex pattern against their names.
UnitIs – filters metrics by the unit they’re recorded in, e.g. Count, Bytes, etc.
Average – filters metrics by the recorded average data points, e.g. average > 300 looks for metrics whose average in the specified timeframe exceeded 300 at any time.
Min – same as above but for the minimum data points.
Max – same as above but for the maximum data points.
Sum – same as above but for the sum data points.
SampleCount – same as above but for the sample count data points.
DimensionContains – filters metrics by the dimensions they’re recorded with; please refer to the CloudWatch docs on how this works.
DuringLast – specifies the timeframe of the query to be the last X minutes/hours/days. Note: CloudWatch only keeps up to 14 days’ worth of data, so there’s no point going any further back than that.
Since – specifies the timeframe of the query to be from the specified timestamp until now.
Between – specifies the timeframe of the query to be between the specified start and end timestamps.
IntervalOf – specifies the ‘period’ into which the data points are aggregated, e.g. 5 minutes, 15 minutes, 1 hour, etc.

Here’s a code snippet showing how to use the DSLs:
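A sketch only – the Select extension method and the exact query grammar below are assumptions built from the filter names above, and the credentials and metric names are placeholders; see the project page for the real syntax:

    open Amazon
    open Amazon.CloudWatch.Selector

    let cloudWatch =
        AWSClientFactory.CreateAmazonCloudWatchClient("AWS_KEY", "AWS_SECRET", RegionEndpoint.USEast1)

    // external DSL: find CPU metrics whose average exceeded 80 in the last
    // 12 hours, aggregated into 15-minute intervals
    let results =
        cloudWatch.Select "nameLike 'cpu' and average > 80 duringLast 12 hours at intervalOf 15 minutes"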

 

In addition to the DSLs, you’ll also find a simple CLI tool as part of the project, which you can start by setting your credentials in the start_cli.cmd script and running it. It allows you to query CloudWatch metrics using the external DSL.

Here’s a quick demo of using the CLI to select some CPU metrics for ElastiCache and then plotting them on a graph.

 

As a side note, one of the reasons why we have so many metrics is that we have made it super easy for ourselves to record new ones (see this recorded webinar for more information). This gives us a very granular set of metrics, so that any CPU-intensive or IO work is monitored, as well as all the top-level entry points to our services.

 

Links

Introducing log4net.Kinesis, a log4net appender for Amazon Kinesis

Just under three weeks ago, Amazon announced the public availability of their new Kinesis service, which is designed to allow real-time processing of streaming big data.

As an experiment I have put together a simple, actor-based custom appender for log4net which allows you to publish your log messages into a configured Kinesis stream. You can then have another cluster of machines fetch the data from the stream and do whatever processing or aggregation you like.

You can download and install the appender from Nuget here or check out the source code here.

The implementation is done in F# in around 100 lines of code, and is very simple, easy to reason about, fully asynchronous and thread-safe.
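The gist of the approach looks something like this minimal sketch – not the library’s actual source; configuration and error handling are omitted, and an actor (MailboxProcessor) serializes the writes so the appender stays thread-safe without blocking the logging call site:

    open System.IO
    open System.Text
    open Amazon.Kinesis
    open Amazon.Kinesis.Model
    open log4net.Appender
    open log4net.Core

    type KinesisAppender() =
        inherit AppenderSkeleton()

        // credentials are picked up from the default configuration chain
        let client = new AmazonKinesisClient()

        // the actor publishes one log message at a time to the stream
        let agent = MailboxProcessor<string * string>.Start(fun inbox -> async {
            while true do
                let! (stream, message) = inbox.Receive()
                let req = new PutRecordRequest(
                            StreamName   = stream,
                            PartitionKey = string (message.GetHashCode()),
                            Data         = new MemoryStream(Encoding.UTF8.GetBytes message))
                client.PutRecord req |> ignore
        })

        // set from the log4net XML configuration
        member val StreamName = "" with get, set

        override this.Append(loggingEvent : LoggingEvent) =
            agent.Post(this.StreamName, base.RenderLoggingEvent loggingEvent)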

 

Once you have pushed your log messages into the stream, you’ll need to use the AWSSDK to fetch and process the data. For Java, there’s a client application which takes care of most of the heavy lifting – e.g. tracking your progress, handling failover and load balancing. Unfortunately, at the time of writing, there’s no equivalent client application in the current version of the .Net AWSSDK.

So to help make it easier for us .Net folks to build real-time data processing applications on top of Amazon Kinesis, I have started an Rx-based .Net client library called ReactoKinesiX (I really wanted to get Rx into the name!) – more details to follow.

 

I think the introduction of Kinesis is very exciting and opens up many possibilities, and at its current pricing it also represents a very cost-effective alternative to some of the competing, more polished services out there.

DynamoDB.SQL 2.0.0 is out!

Hi everyone, happy new year!

I was really glad to find a couple of days to work on some of my open source projects and put together a new version of DynamoDB.SQL, which brings it in line with the latest version of the .Net AWSSDK, amongst other things. You can download and install it from Nuget here.

 

Breaking Changes

There are two breaking changes:

  1. DynamoDB v1 APIs are no longer supported, as they have been deprecated in the AWSSDK. This means the v1 syntax (which uses the special keywords @hashkey and @rangekey to refer to the table’s hash and range keys) is also deprecated, and you should use the v2 syntax going forward (see the sketch after this list).
  2. The clumsy and frankly unnecessary DynamoDbV2.SQL.Execution namespace is gone! Instead, the extension methods for AmazonDynamoDBClient and DynamoDBContext now live in the same namespaces as those types, so you no longer have to import another namespace just to use them.
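As a sketch of the difference (the table and attribute names below are made up), where the deprecated v1 syntax used the special keywords, the v2 syntax refers to the key attributes by name:

    v1 (deprecated):  SELECT * FROM GameScores WHERE @hashkey = "Starship X" AND @rangekey >= 1000
    v2:               SELECT * FROM GameScores WHERE GameTitle = "Starship X" AND TopScore >= 1000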

 

Bug Fixes

Selecting specific attributes in a Scan now works; please see the respective C# and F# examples.

The old InvalidQuery and InvalidScan exceptions (which didn’t play well with C#, since the error messages were not very useful) have been replaced with the C#-friendly InvalidQueryException and InvalidScanException types, which expose the underlying parsing errors in their error messages.

 

Global Secondary Index

AWS announced Global Secondary Index support on December 12th, 2013, and it’s supported in DynamoDB.SQL via the existing INDEX query option, for example:

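The query looks roughly like this – an illustrative sketch, with made-up table, index and attribute names; the exact shape of the Index option may differ, so see the project README:

    SELECT GameTitle, TopScore
    FROM GameScores
    WHERE GameTitle = "Starship X" AND TopScore >= 1000
    WITH (Index(TopScoreIndex, true), NoConsistentRead)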

However, global indexes work very differently to local secondary indexes. For starters, they require their own throughput rather than using the table’s existing throughput (for more details, refer to the documentation).

Also, a global secondary index does not support consistent reads, so when querying against the index you must add the NoConsistentRead option to your query, otherwise you’ll receive an error from the DynamoDB service.

Lastly, when you create a global secondary index you have to choose which attributes are projected into it. Unlike with local secondary indexes, attributes that have not been projected into the index will not be retrieved from the table at extra read-unit cost; instead, you will receive an error from the service. Please refer to the guidelines page for Global Secondary Indexes.

 

Finally…

I’ve also revamped the README to make it more detailed and useful, and added a bunch more examples for both C# and F# – hope you like the new layout.

 

Links

Adrian Cockcroft on Dystopia-as-a-Service

And finally, a good summation of the talk here.