Auto-create CloudWatch Alarms for APIs with Lambda

In a previous post we discussed how to auto-subscribe a CloudWatch Log Group to a Lambda function using CloudWatch Events, so that we don’t need a manual process to ensure all Lambda logs go to our log aggregation service.

Whilst this is useful in its own right, it only scratches the surface of what we can do. CloudTrail and CloudWatch Events make it easy to automate many day-to-day operational steps. With the help of Lambda of course ;-)

I work with API Gateway and Lambda heavily. Whenever you create a new API, or make changes, there are several things you need to do:

  • enable Detailed Metrics for the deployment stage
  • set up a dashboard in CloudWatch, showing request count, latencies and error counts
  • set up CloudWatch Alarms for p99 latencies and error counts

Because these are manual steps, they often get missed.

Have you ever forgotten to update the dashboard after adding a new endpoint to your API? And did you also remember to set up a p99 latency alarm on this new endpoint? How about alarms on the number of 4XX or 5XX errors?

Most teams I have dealt with have some conventions around these, but without a way to enforce them. The result is that the convention is applied in patches and cannot be relied upon. I find this approach doesn’t scale with the size of the team.

It works when you’re a small team. Everyone has a shared understanding, and the necessary discipline to follow the convention. When the team gets bigger, you need automation to help enforce these conventions.

Fortunately, we can automate away these manual steps using the same pattern. In the Monitoring unit of my course Production-Ready Serverless, I demonstrated how you can do this in 3 simple steps:

  1. CloudTrail captures the CreateDeployment request to API Gateway.
  2. A CloudWatch Events pattern matches against this captured request.
  3. A Lambda function a) enables detailed metrics, and b) creates alarms for each endpoint.
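As a rough sketch of step 2, the CloudWatch Events rule could use a pattern like the one below. The field values follow the usual shape of events that CloudTrail delivers for API calls. The `matches` helper is hypothetical and deliberately simplified; it is only there to illustrate which events the pattern selects, and is not how CloudWatch Events evaluates patterns internally.

```python
# Event pattern for a CloudWatch Events rule that fires on CreateDeployment
# calls to API Gateway, as recorded by CloudTrail.
CREATE_DEPLOYMENT_PATTERN = {
    "source": ["aws.apigateway"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["apigateway.amazonaws.com"],
        "eventName": ["CreateDeployment"],
    },
}


def matches(pattern, event):
    """Simplified matcher (illustration only): every key in the pattern must
    be present in the event, and the event's value must be one of the listed
    values. Nested dicts are matched recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict):
                return False
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True
```

Anything the pattern does not mention, such as `requestParameters`, is ignored when matching, so the rule still fires regardless of which API or stage was deployed.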

If you use the Serverless framework, then you might have a function that looks like this:

A couple of things to note from the code above:

  • I’m using the serverless-iam-roles-per-function plugin to give the function a tailored IAM role
  • The function needs the apigateway:PATCH permission to enable detailed metrics
  • The function needs the apigateway:GET permission to get the API name and REST endpoints
  • The function needs the cloudwatch:PutMetricAlarm permission to create the alarms
  • The environment variables specify SNS topics for the CloudWatch Alarms
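To make the permissions listed above concrete, here is a minimal sketch of the IAM statements, written as a Python structure rather than in the Serverless framework's own configuration format. The `build_iam_statements` helper, the default region and the wildcard resources are illustrative assumptions; you would likely want to scope them more tightly.

```python
def build_iam_statements(region="us-east-1"):
    """Return IAM policy statements covering the three permissions the
    function needs. Region default and wildcards are assumptions."""
    # API Gateway ARNs have an empty account field, hence the double colon.
    rest_apis_arn = f"arn:aws:apigateway:{region}::/restapis/*"
    return [
        {
            # GET to read the API name and endpoints, PATCH to update the stage
            "Effect": "Allow",
            "Action": ["apigateway:GET", "apigateway:PATCH"],
            "Resource": [rest_apis_arn],
        },
        {
            # Needed to create (or overwrite) the CloudWatch Alarms
            "Effect": "Allow",
            "Action": ["cloudwatch:PutMetricAlarm"],
            "Resource": ["*"],
        },
    ]
```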

The captured event looks like this:

We can find the restApiId and stageName inside the detail.requestParameters attribute. That’s all we need to figure out what endpoints are there, and so what alarms we need to create.
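A minimal sketch of pulling those two values out of the event might look like the following. The `get_deployment_target` helper is hypothetical, and so is the sample event, which is heavily trimmed and uses made-up IDs. I'm also assuming that stageName may be nested one level down under a `createDeploymentInput` object, so the helper checks both places.

```python
def get_deployment_target(event):
    """Pull restApiId and stageName out of a captured CreateDeployment event."""
    params = event["detail"]["requestParameters"]
    rest_api_id = params["restApiId"]
    # Assumption: stageName sits either directly under requestParameters
    # or nested under createDeploymentInput, so check both.
    nested = params.get("createDeploymentInput") or {}
    stage_name = params.get("stageName") or nested.get("stageName")
    if not stage_name:
        raise ValueError("stageName not found in requestParameters")
    return rest_api_id, stage_name


# Hypothetical, heavily trimmed event; the IDs are made up.
sample_event = {
    "source": "aws.apigateway",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "apigateway.amazonaws.com",
        "eventName": "CreateDeployment",
        "requestParameters": {
            "restApiId": "abc123",
            "createDeploymentInput": {"stageName": "dev"},
        },
    },
}

rest_api_id, stage_name = get_deployment_target(sample_event)
```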

Inside the handler function, which you can find here, we perform a few steps:

  1. enable detailed metrics with an updateStage call to API Gateway
  2. get the list of REST endpoints with a getResources call to API Gateway
  3. get the REST API name with a getRestApi call to API Gateway
  4. for each of the REST endpoints, create a p99 latency alarm in the AWS/ApiGateway namespace
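The four steps above can be sketched in Python as follows. This is not the handler linked above; it is an approximation using the boto3 equivalents of those calls (`update_stage`, `get_resources`, `get_rest_api`, `put_metric_alarm`). The function names, the alarm name format and the alarm settings (p99 latency over 1000 ms for five consecutive one-minute periods) are my assumptions. The clients are passed in as arguments so the logic can be exercised without AWS credentials.

```python
def enable_detailed_metrics(apigw, rest_api_id, stage_name):
    """Step 1: turn on detailed metrics for every method in the stage."""
    apigw.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=[
            {"op": "replace", "path": "/*/*/metrics/enabled", "value": "true"}
        ],
    )


def list_endpoints(apigw, rest_api_id):
    """Step 2: return (path, method) pairs, following pagination."""
    endpoints, position = [], None
    while True:
        kwargs = {"restApiId": rest_api_id, "limit": 500}
        if position:
            kwargs["position"] = position
        page = apigw.get_resources(**kwargs)
        for item in page.get("items", []):
            # Resources without any methods have no resourceMethods key.
            for method in item.get("resourceMethods", {}):
                endpoints.append((item["path"], method))
        position = page.get("position")
        if not position:
            return endpoints


def create_latency_alarms(apigw, cloudwatch, rest_api_id, stage_name, alarm_actions):
    """Run all four steps and return the number of alarms created."""
    enable_detailed_metrics(apigw, rest_api_id, stage_name)
    endpoints = list_endpoints(apigw, rest_api_id)
    # Step 3: the ApiName dimension uses the API's name, not its ID.
    api_name = apigw.get_rest_api(restApiId=rest_api_id)["name"]
    # Step 4: one p99 latency alarm per endpoint.
    for path, method in endpoints:
        cloudwatch.put_metric_alarm(
            AlarmName=f"API [{api_name}] stage [{stage_name}] {method} {path} : p99 > 1s",
            Namespace="AWS/ApiGateway",
            MetricName="Latency",
            Dimensions=[
                {"Name": "ApiName", "Value": api_name},
                {"Name": "Resource", "Value": path},
                {"Name": "Method", "Value": method},
                {"Name": "Stage", "Value": stage_name},
            ],
            ExtendedStatistic="p99",
            Period=60,             # seconds
            EvaluationPeriods=5,   # five periods in a row
            Threshold=1000,        # Latency is reported in milliseconds
            ComparisonOperator="GreaterThanThreshold",
            AlarmActions=alarm_actions,
        )
    return len(endpoints)
```

In a real Lambda function you would create the clients with `boto3.client("apigateway")` and `boto3.client("cloudwatch")`, and read the SNS topic ARNs for `alarm_actions` from the function's environment variables.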

Now, every time I create a new API, I will have CloudWatch Alarms to alert me when the 99th percentile latency for an endpoint goes over 1 second, for 5 minutes in a row. All this, with just a few lines of code :-)

You can take this further, and have other Lambda functions to:
  • create CloudWatch Alarms for 5xx errors for each endpoint
  • create a CloudWatch Dashboard for the API
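For the 5xx case, a sketch of the alarm parameters might look like this. The `build_5xx_alarm_params` helper is hypothetical, and the threshold (at least one 5xx response within a single one-minute period) and the alarm name format are assumptions you would tune to your own traffic. The resulting dict is meant to be passed to boto3's `put_metric_alarm` as keyword arguments.

```python
def build_5xx_alarm_params(api_name, stage_name, path, method, alarm_actions, threshold=1):
    """Build put_metric_alarm keyword arguments for a per-endpoint 5xx alarm."""
    return {
        "AlarmName": f"API [{api_name}] stage [{stage_name}] {method} {path} : 5xx errors",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Resource", "Value": path},
            {"Name": "Method", "Value": method},
            {"Name": "Stage", "Value": stage_name},
        ],
        "Statistic": "Sum",          # count of 5xx responses in the period
        "Period": 60,                # seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No traffic means no data points; treat that as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": alarm_actions,
    }


params = build_5xx_alarm_params("demo", "dev", "/users", "GET", ["arn:aws:sns:us-east-1:123456789012:alerts"])
# cloudwatch.put_metric_alarm(**params)  # with a boto3 CloudWatch client
```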

So there you have it, a useful pattern for automating away manual ops tasks!

And before you even have to ask, yes I’m aware of this serverless plugin by the ACloudGuru folks. It looks neat, but it’s ultimately still something the developer has to remember to do. That requires discipline. My experience tells me that you cannot rely on discipline, ever, which is why I prefer to have a platform in place that will generate these alarms automatically.
