If you have done any DevOps work on Amazon Web Services (AWS) then you should be familiar with Amazon CloudWatch, a service for tracking and viewing metrics (CPU, network in/out, etc.) about the various AWS services that you consume, or better still, custom metrics that you publish about your service.
On top of that, you can set up alarms on any metric and send out alerts via Amazon SNS, which is pretty standard practice for monitoring an AWS-hosted application. There are of course many paid services, such as StackDriver and New Relic, that offer a host of value-added features; personally, I was impressed with some of the predictive features from StackDriver.
The built-in Amazon management console for CloudWatch provides rudimentary functionality that lets you browse your metrics and view/overlay them on a graph, but it falls short once you have a decent number of metrics.
For starters, when you browse your metrics by namespace you’re capped at 200 metrics, so discovery is out of the question: you have to know what you’re looking for to be able to find it, which isn’t all that useful when you have hundreds of metrics to work with…
Also, there’s no way for you to filter metrics by the recorded datapoints, so to answer even simple questions such as
‘what other timespan metrics also spiked at mid-day when our service discovery latency spiked?’
you now have to manually go through all the relevant metrics (and of course you have to find them first!) and then visually check the graph to try and find any correlations.
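To make that question concrete, here is a minimal Python sketch of the check being done by eye. It is not part of any AWS SDK or of the library introduced below: it assumes the datapoints for each metric have already been fetched into plain lists of `(timestamp, average)` tuples, and the metric names and values are made up for illustration.

```python
from datetime import datetime, timedelta

def spiked_between(datapoints, start, end, threshold):
    """True if any datapoint inside [start, end] has a value above threshold."""
    return any(start <= ts <= end and value > threshold
               for ts, value in datapoints)

def metrics_spiking_with(metrics, start, end, threshold):
    """Return the names of the metrics that spiked in the window, sorted."""
    return sorted(name for name, points in metrics.items()
                  if spiked_between(points, start, end, threshold))

# Hypothetical, hand-written sample data: metric name -> [(timestamp, average ms)]
noon = datetime(2014, 1, 1, 12, 0)
sample_metrics = {
    "ServiceDiscovery.Latency": [(noon - timedelta(hours=1), 120.0), (noon, 950.0)],
    "Login.Latency":            [(noon - timedelta(hours=1), 80.0), (noon, 640.0)],
    "Leaderboard.Latency":      [(noon - timedelta(hours=1), 90.0), (noon, 95.0)],
}

# Look for anything that went above 500 ms within 15 minutes of mid-day.
window_start = noon - timedelta(minutes=15)
window_end = noon + timedelta(minutes=15)
spiking = metrics_spiking_with(sample_metrics, window_start, window_end, 500.0)
print(spiking)
```

With the sample data above, only the two latency metrics that rose at mid-day are reported. The hard part in practice is the step this sketch skips, which is finding and fetching all the relevant metrics in the first place.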
After being frustrated by this manual process one time too many, I decided to write some tooling myself to make my life (and hopefully others’) a bit easier, and in comes Amazon.CloudWatch.Selector, a set of DSLs and a CLI for querying against Amazon CloudWatch.
With this simple library you will get:
- an internal DSL, intended for use from F# but still usable from C#, although the syntax is not as intuitive there
- an external DSL which can be embedded into a command line or web tool
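For readers unfamiliar with the distinction, the sketch below illustrates the general idea in Python: an internal DSL builds a query out of host-language functions and operators, while an external DSL is plain text that is parsed into the same query at runtime. Everything here (the `Filter` and `Query` types, the helper functions and the text syntax) is invented for illustration and is not the syntax or API of Amazon.CloudWatch.Selector.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Filter:
    """One query term, e.g. field='namespace', op='is', value='AWS/EC2'."""
    field: str
    op: str
    value: str

    def __and__(self, other):
        # Internal DSL: terms are combined with the host language's & operator.
        return Query((self,)) & other

@dataclass(frozen=True)
class Query:
    terms: tuple

    def __and__(self, other):
        extra = other.terms if isinstance(other, Query) else (other,)
        return Query(self.terms + extra)

# Hypothetical helpers that make up the internal DSL's vocabulary.
def namespace_is(value):
    return Filter("namespace", "is", value)

def name_like(pattern):
    return Filter("name", "like", pattern)

# External DSL: the same query written as text (made-up syntax).
TERM = re.compile(r"(namespace|name)\s+(is|like)\s+'([^']*)'")

def parse(text):
    """Parse terms joined by ' and ' into a Query; quoted values must not contain ' and '."""
    terms = []
    for part in text.split(" and "):
        match = TERM.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"cannot parse term: {part!r}")
        terms.append(Filter(*match.groups()))
    return Query(tuple(terms))

internal = namespace_is("AWS/ElastiCache") & name_like("cpu")
external = parse("namespace is 'AWS/ElastiCache' and name like 'cpu'")
print(internal == external)
```

Both routes end up with the same `Query` value, which is why a text-based form is convenient to embed in a command line or web tool: the tool only needs to pass a string through to the parser.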
Both DSLs support the same set of filters, e.g.
|Filter|Description|
|---|---|
|NamespaceIs|Filters metrics by the specified namespace.|
|NamespaceLike|Filters metrics using a regex pattern against their namespaces.|
|NameIs|Filters metrics by the specified name.|
|NameLike|Filters metrics using a regex pattern against their names.|
|UnitIs|Filters metrics by the unit they’re recorded in, e.g. Count, Bytes, etc.|
|Average|Filters metrics by the recorded average data points, e.g. average > 300 looks for metrics whose average in the specified timeframe exceeded 300 at any time.|
|Min|Same as above but for the minimum data points.|
|Max|Same as above but for the maximum data points.|
|Sum|Same as above but for the sum data points.|
|SampleCount|Same as above but for the sample count data points.|
|DimensionContains|Filters metrics by the dimensions they’re recorded with; please refer to the CloudWatch docs for how dimensions work.|
|DuringLast|Specifies the timeframe of the query to be the last X minutes/hours/days. Note: at the time of writing, CloudWatch only keeps up to 14 days’ worth of data, so there’s no point going any further back than that.|
|Since|Specifies the timeframe of the query to be from the specified timestamp until now.|
|Between|Specifies the timeframe of the query to be between the specified start and end timestamps.|
|IntervalOf|Specifies the ‘period’ over which the data points are aggregated, e.g. 5 minutes, 15 minutes, 1 hour, etc.|
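To illustrate the semantics described in the table, here is a rough Python sketch of how a name filter, a statistic filter and a timeframe could combine. It is only a model of the behaviour described above, not the library’s implementation: the function names, the dictionary shape of a metric and the sample data are all assumptions, and the case-insensitive regex match is my own choice.

```python
import re
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=14)  # retention limit quoted in the table above

def during_last(amount, unit, now):
    """Resolve 'last X minutes/hours/days' to (start, end), clamped to retention."""
    span = min(timedelta(**{unit: amount}), RETENTION)
    return now - span, now

def name_like(pattern):
    """Predicate: the metric name matches the regex (case-insensitive by assumption)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return lambda metric: regex.search(metric["name"]) is not None

def average_above(threshold, start, end):
    """Predicate: the average exceeded the threshold at any time in [start, end]."""
    return lambda metric: any(
        start <= point["timestamp"] <= end and point["average"] > threshold
        for point in metric["datapoints"])

def select(metrics, *predicates):
    """Keep only the metrics that satisfy every predicate."""
    return [m for m in metrics if all(p(m) for p in predicates)]

# Hypothetical sample data standing in for fetched CloudWatch datapoints.
now = datetime(2014, 1, 15, 12, 0, tzinfo=timezone.utc)
start, end = during_last(12, "hours", now)
metrics = [
    {"name": "Api.Login.Latency",
     "datapoints": [{"timestamp": now - timedelta(hours=2), "average": 420.0}]},
    {"name": "Api.Search.Latency",
     "datapoints": [{"timestamp": now - timedelta(hours=2), "average": 120.0}]},
    {"name": "Api.Chat.Latency",   # spiked, but outside the 12-hour timeframe
     "datapoints": [{"timestamp": now - timedelta(days=3), "average": 900.0}]},
    {"name": "Cache.Hits",         # above 300, but the name does not match
     "datapoints": [{"timestamp": now - timedelta(hours=1), "average": 5000.0}]},
]

matches = select(metrics, name_like("latency"), average_above(300, start, end))
print([m["name"] for m in matches])
```

Only `Api.Login.Latency` survives both filters here. Note the "at any time" semantics: a single datapoint above the threshold inside the timeframe is enough for a metric to match.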
Here’s a code snippet showing how to use the DSLs:
In addition to the DSLs, you’ll also find a simple CLI tool as part of the project, which you can start by setting your credentials in the start_cli.cmd script and running it. It lets you query CloudWatch metrics using the external DSL.
The demo video linked below shows the CLI being used to select some CPU metrics for ElastiCache and then plot them on a graph.
As a side note, one of the reasons we have so many metrics is that we have made it super easy for ourselves to record new ones (see this recorded webinar for more information). This gives us a very granular set of metrics, so that any CPU-intensive or IO work is monitored, as are the top-level entry points to our services.
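The webinar covers how we actually do this. As a generic illustration only, the Python sketch below shows the aspect-style idea of wrapping a function so that its execution time is recorded as a metric without touching the function body. The `timed` decorator, the in-memory `recorded` list and the metric name are all hypothetical; a real setup would publish the values to CloudWatch rather than append them to a list.

```python
import functools
import time

recorded = []  # hypothetical in-memory sink standing in for a metrics publisher

def timed(metric_name):
    """Decorator that records how long each call takes, in milliseconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised, so failures are timed too.
                elapsed_ms = (time.perf_counter() - started) * 1000.0
                recorded.append((metric_name, "Milliseconds", elapsed_ms))
        return wrapper
    return decorator

@timed("LoadProfile.Duration")
def load_profile(user_id):
    # Placeholder for real CPU-intensive or IO work.
    return {"user_id": user_id}

result = load_profile(42)
print(recorded[0][0], recorded[0][1])
```

Because adding a new metric is a one-line change at the call site, it is easy to end up with hundreds of them, which is exactly the situation where browsing metrics in the console stops being practical.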
- Amazon.CloudWatch.Selector project page
- Amazon.CloudWatch.Selector Nuget package
- Demo video of the CLI
- Webinar – performance monitoring with AOP & AWS
- Slides for the webinar