Introducing DSLs to query against Amazon CloudWatch metrics

If you have done any DevOps work on Amazon Web Services (AWS) then you should be familiar with Amazon CloudWatch, a service for tracking and viewing metrics (CPU, network in/out, etc.) about the various AWS services that you consume, or better still, custom metrics that you publish about your service.

On top of that, you can also set up alarms on any metric and send out alerts via Amazon SNS, which is a pretty standard practice for monitoring your AWS-hosted application. There are of course many paid services such as StackDriver and New Relic which offer a host of value-added features on top. Personally, I was impressed with some of the predictive features from StackDriver.

The built-in Amazon management console for CloudWatch provides rudimentary functionality that lets you browse your metrics and view/overlap them on a graph, but it falls short once you have a decent number of metrics.

For starters, when browsing your metrics by namespace you're capped at 200 metrics, so discovery is out of the question: you have to know what you're looking for to be able to find it, which isn't all that useful when you have hundreds of metrics to work with…


Also, there's no way to filter metrics by their recorded datapoints, so to answer even a simple question such as

‘what other metrics also spiked at mid-day when our service discovery latency spiked?’

you now have to manually go through all the relevant metrics (and of course you have to find them first!) and then visually check the graphs to try and find any correlations.

 

After being frustrated by this manual process one last time, I decided to write some tooling to make my life (and hopefully others') a bit easier. In comes Amazon.CloudWatch.Selector, a set of DSLs and a CLI for querying against Amazon CloudWatch.

 

Amazon.CloudWatch.Selector

With this simple library you will get:

  • an internal DSL, intended to be used from F# but still usable from C#, although syntactically not as intuitive
  • an external DSL, which can be embedded into a command line or web tool

 

Both DSLs support the same set of filters:

  • NamespaceIs: filters metrics by the specified namespace.
  • NamespaceLike: filters metrics using a regex pattern against their namespaces.
  • NameIs: filters metrics by the specified name.
  • NameLike: filters metrics using a regex pattern against their names.
  • UnitIs: filters metrics by the unit they're recorded in, e.g. Count, Bytes, etc.
  • Average: filters metrics by the recorded average data points, e.g. average > 300 looks for metrics whose average in the specified timeframe exceeded 300 at any time.
  • Min: same as above but for the minimum data points.
  • Max: same as above but for the maximum data points.
  • Sum: same as above but for the sum data points.
  • SampleCount: same as above but for the sample count data points.
  • DimensionContains: filters metrics by the dimensions they're recorded with; please refer to the CloudWatch docs on how this works.
  • DuringLast: specifies the timeframe of the query to be the last X minutes/hours/days. Note: CloudWatch only keeps up to 14 days' worth of data, so there's no point going any further back than that.
  • Since: specifies the timeframe of the query to be from the specified timestamp until now.
  • Between: specifies the timeframe of the query to be between the specified start and end timestamps.
  • IntervalOf: specifies the 'period' the data points are aggregated into, e.g. 5 minutes, 15 minutes, 1 hour, etc.
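To give you a feel for how these filters compose, a query in the external DSL might read like the following. Treat this as an illustrative sketch: the keywords mirror the filter names above, but the exact casing and grammar may differ, so check the project's README for the definitive syntax.

```
namespaceLike 'elasticache' and nameLike 'cpu' and max > 80.0 duringLast 24 hours at intervalOf 15 minutes
```

In words: find every ElastiCache metric with 'cpu' in its name whose maximum datapoint exceeded 80 at any point in the last 24 hours, with the datapoints aggregated into 15-minute intervals.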

Here's a code snippet showing how to use the DSLs:
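As a minimal sketch, querying with the internal DSL from F# might look something like this. The details are illustrative: the select function and the + / @ combinators are my best reading of the API described above, so consult the project's README for the exact operators.

```fsharp
open Amazon
open Amazon.CloudWatch.Selector

// create a CloudWatch client with your AWS credentials
// (AWSClientFactory is part of the AWS SDK for .NET)
let cloudWatch =
    AWSClientFactory.CreateAmazonCloudWatchClient(
        "AWS_KEY", "AWS_SECRET", RegionEndpoint.USEast1)

// find metrics recorded in milliseconds whose average exceeded 1000
// at any point during the last 12 hours, aggregated into 5-minute
// intervals (combinator names here mirror the filter list above and
// may not match the library verbatim)
let results =
    cloudWatch
    |> select (unitIs "milliseconds"
               + average (>) 1000.0
               @ last 12 hours
               |> intervalOf 5 minutes)
```

The query returns the matching metrics along with their datapoints over the specified timeframe, so you can answer the 'what else spiked?' question without eyeballing graphs one at a time.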

 

In addition to the DSLs, you'll also find a simple CLI tool as part of the project, which you can start by setting your credentials in the start_cli.cmd script and running it. It allows you to query your CloudWatch metrics using the external DSL.

Here's a quick demo of using the CLI to select some CPU metrics for ElastiCache and then plotting them on a graph.

 

As a side note, one of the reasons we have so many metrics is that we've made it super easy for ourselves to record new metrics (see this recorded webinar for more information), giving ourselves a very granular set of metrics so that any CPU-intensive or IO work is monitored, as well as any top-level entry points to our services.
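Under the hood, recording a custom metric boils down to a PutMetricData call against the CloudWatch API. As a rough sketch (the recordMetric helper below is hypothetical, not part of the library or our codebase), a thin wrapper is all it takes to make adding a new metric a one-liner:

```fsharp
open System
open Amazon.CloudWatch
open Amazon.CloudWatch.Model

// hypothetical helper: records a single datapoint against a custom metric
// (synchronous call here; newer versions of the AWS SDK for .NET expose
// PutMetricDataAsync instead)
let recordMetric (cloudWatch : IAmazonCloudWatch) ns name metricUnit value =
    let datum = MetricDatum(MetricName = name,
                            Unit      = metricUnit,
                            Value     = value,
                            Timestamp = DateTime.UtcNow)
    let req = PutMetricDataRequest(Namespace  = ns,
                                   MetricData = ResizeArray [ datum ])
    cloudWatch.PutMetricData(req) |> ignore

// e.g. record how long a piece of IO work took
recordMetric cloudWatch "MyService" "SaveUserProfile.TimeTaken"
             StandardUnit.Milliseconds 42.0
```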

 

Links