Tips and tricks for logging and monitoring AWS Lambda functions

The common practice of using agents/daemons to buffer and batch send logs and metrics is no longer applicable in the world of serverless. Here are some tips to help you get the most out of your logging and monitoring infrastructure for your functions.

This is part 2 of a 3-part mini-series on managing your AWS Lambda logs.

If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure for shipping logs from CloudWatch Logs detailed in that post.

part 1: centralise logging

part 3: tracking correlation IDs

New paradigm, new problems

Much has changed with the serverless paradigm. It solves many of the old problems we faced and replaces them with new problems that (I think) are easier to deal with.

Consequently, many of the old practices are no longer applicable, e.g. using agents/daemons to buffer and batch send metrics and logs to monitoring and log aggregation services. However, even as we throw away these old practices for the new world of serverless, we are still after the same qualities that made our old tools “good”:

  1. able to collect a rich set of system and application metrics and logs
  2. publishing metrics and logs should not add user-facing latency (i.e. they should be performed in the background)
  3. metrics and logs should appear in real time (i.e. within a few seconds)
  4. metrics should be granular

Unfortunately, the current tooling for Lambda (CloudWatch metrics and CloudWatch Logs) falls short on a few of these, some more so than others:

  • publishing custom metrics requires additional network calls that have to be made during the function’s execution, adding to user-facing latency
  • CloudWatch metrics for AWS services are only granular down to a 1-minute interval (custom metrics can be granular down to 1 second)
  • CloudWatch metrics are often a few minutes behind (though custom metrics might have less lag now that they can be recorded at 1-second intervals)
  • CloudWatch Logs are usually more than 10s behind (not a precise measurement, but based on personal observation)

With Lambda, we have to rely on AWS to improve CloudWatch in order to bring us parity with existing “server-ful” services.

Many vendors have announced support for Lambda, such as Datadog and Wavefront. However, as they use the same metrics from CloudWatch, they will have the same lag.

IOPipe is a popular alternative for monitoring Lambda functions, and they do things slightly differently: they give you a wrapper function around your code so they can inject monitoring code (a familiar pattern to those who have used AOP frameworks in the past).
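For illustration, the wrapper pattern looks roughly like this. This is a generic sketch of the idea, not IOPipe’s actual API, and the names are made up:

    // A generic sketch of the wrapper (AOP) pattern, not IOPipe's actual API.
    // withMonitoring takes your handler and returns a new handler that records
    // timing around the original invocation.
    const withMonitoring = (handler) => async (event, context) => {
      const start = Date.now();
      try {
        return await handler(event, context);
      } finally {
        // in a real implementation, this is where metrics would be sent out
        console.log(`invocation took ${Date.now() - start}ms`);
      }
    };

    module.exports.handler = withMonitoring(async (event, context) => {
      // your business logic here
      return { statusCode: 200, body: 'hello world' };
    });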

For their 1.0 release they also announced support for tracing (see the demo video below), which I think is interesting as AWS already offers X-Ray, a more complete tracing solution (despite its own shortcomings, as I mentioned in this post).

IOPipe seems like a viable alternative to CloudWatch, especially if you’re new to AWS Lambda and just want to get started quickly. I can totally see the value of that simplicity.

However, I have some serious reservations about IOPipe’s approach:

  • A wrapper around every one of my functions? This level of pervasive access to my entire application requires a serious amount of trust that has to be earned, especially in times like this.
  • CloudWatch collects logs and metrics asynchronously without adding to my function’s execution time. With IOPipe, the metrics have to be sent to their own system during my function’s execution, which adds to user-facing latency (for APIs).
  • Further to the above, it’s another thing that can cause my function to error or time out even after my code has successfully executed. Perhaps they’re doing something smart to minimise that risk, but it’s hard for me to know for sure and I have to anticipate failures.

Of all the above, the latency overhead is my biggest concern. With API Gateway and Lambda I already have to deal with cold starts and the latency of the hop between API Gateway and Lambda. As your microservice architecture expands and the number of inter-service communications grows, these latencies compound further.


For background tasks this is less of a concern, but a sizeable portion of the Lambda functions I have written handle HTTP requests, and I need to keep the execution time as low as possible for those functions.

Sending custom metrics asynchronously

I find Datadog’s approach to sending custom metrics very interesting. Essentially, you write custom metrics as specially-formatted log messages that Datadog will process (you have to set up IAM permissions for CloudWatch to invoke their function) and track as metrics.

Datadog allows you to send custom metrics using log messages in their DogStatsD format.

It’s a simple and elegant approach, and one that we can adopt for ourselves even if we decide to use another monitoring service.

In part 1 we established an infrastructure to ship logs from CloudWatch Logs to a log aggregation service of our choice. We can extend the log shipping function to look for log messages that look like these:

Log custom metrics as specially formatted log messages

For these log messages, we will interpret them as:

MONITORING|metric_value|metric_unit|metric_name|metric_namespace

Instead of sending them to the log aggregation service, we’ll send them as metrics to our monitoring service. In this particular case, I’m using CloudWatch in my demo (see link below), so the format of the log message reflects the fields I need to pass along in the PutMetricData call.
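As a rough sketch of the shipping side, assuming the MONITORING| format above and the aws-sdk v2 client (the function and variable names here are illustrative, not taken from the demo repo):

    // Rough sketch of the shipping side; names are illustrative, not from the demo repo.
    // logMessage is a single decoded message from the CloudWatch Logs event (see part 1).
    const AWS = require('aws-sdk');
    const cloudWatch = new AWS.CloudWatch();

    const tryParseMetric = (logMessage) => {
      const parts = logMessage.trim().split('|');
      if (parts.length !== 5 || parts[0] !== 'MONITORING') {
        return null; // not a custom metric, ship it as a normal log message
      }
      const [, value, unit, name, namespace] = parts;
      return { namespace, datum: { MetricName: name, Unit: unit, Value: parseFloat(value) } };
    };

    const publishMetric = async ({ namespace, datum }) =>
      cloudWatch.putMetricData({ Namespace: namespace, MetricData: [datum] }).promise();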

To send custom metrics, we write them as log messages. Again, there is no latency overhead, as the Lambda service collects these for us and sends them to CloudWatch Logs in the background.
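The emitting side can be as simple as a console.log in the agreed format. A hypothetical helper (the metric name and namespace below are just example values):

    // Hypothetical helper for emitting a custom metric as a log message,
    // using the MONITORING| format described above.
    const logMetric = (value, unit, name, namespace) =>
      console.log(`MONITORING|${value}|${unit}|${name}|${namespace}`);

    // e.g. inside a handler
    logMetric(42, 'Count', 'something-happened', 'my-app');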

And moments later they’re available in CloudWatch metrics.

Custom metrics are recorded in CloudWatch as expected.

Take a look at the custom-metrics function in this repo.

Tracking the memory usage and billed duration of your AWS Lambda functions in CloudWatch

Lambda reports the amount of memory used and the billed duration at the end of every invocation. Whilst these are not published as metrics in CloudWatch, you can find them as log messages in CloudWatch Logs.

At the end of every invocation, Lambda publishes a REPORT log message detailing the max amount of memory used by your function during this invocation, and how much time is billed (Lambda charges in 100ms blocks).
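Such a message looks roughly like this (the values here are illustrative):

    REPORT RequestId: <request id>  Duration: 42.51 ms  Billed Duration: 100 ms  Memory Size: 512 MB  Max Memory Used: 38 MB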

I rarely find memory usage to be an issue as Node.js functions have such a small footprint. My choice of memory allocation is primarily about getting the right balance between cost and performance. In fact, Alex Casalboni of CloudAcademy wrote a very nice blog post on using Step Functions to help you find that sweet spot.

Billed Duration, on the other hand, is a useful metric when viewed side by side with Invocation Duration. It gives me a rough idea of the amount of wastage I have. For example, if the average Invocation Duration of a function is 42ms but the average Billed Duration is 100ms, then there is 58% wastage, and maybe I should consider running the function with a lower memory allocation.
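To make the arithmetic explicit, wastage is simply the unused share of the billed time (a trivial, hypothetical helper):

    // wastage = the proportion of billed time that was not actually used
    const wastage = (billedMs, invocationMs) => (billedMs - invocationMs) / billedMs;

    wastage(100, 42); // 0.58, i.e. 58% of the billed duration is unused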

Interestingly, IOPipe records these in their dashboard out of the box.

IOPipe records a number of additional metrics that are not available in CloudWatch, such as Memory Usage and CPU Usage over time, as well as cold starts.

However, we don’t need to add IOPipe just to get these metrics. We can apply a similar technique to the one in the previous section and publish them as custom metrics to our monitoring service.

To do that, we have to look out for these REPORT log messages and parse the relevant information out of them. Each message contains 3 pieces of information we want to extract:

  • Billed Duration (Milliseconds)
  • Memory Size (MB)
  • Memory Used (MB)

We will parse these log messages and return an array of CloudWatch metric data for each, so that we can flat map over them afterwards.

This is a function in the “parse” module, which maps a log message to an array of CloudWatch metric data.
Flat map over the CloudWatch metric data returned by the above parse.usageMetrics function and publish them.
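A minimal sketch of what that might look like, based on the REPORT format shown earlier (this is my own approximation; the actual code in the demo repo may differ, and the namespace and metric names below are illustrative):

    // My own approximation of the parse + publish steps; the demo repo's code may differ.
    const AWS = require('aws-sdk');
    const cloudWatch = new AWS.CloudWatch();

    // maps a REPORT log message to an array of CloudWatch metric data
    const usageMetrics = (functionName, message) => {
      const billed = message.match(/Billed Duration: (\d+) ms/);
      const memSize = message.match(/Memory Size: (\d+) MB/);
      const memUsed = message.match(/Max Memory Used: (\d+) MB/);
      if (!billed || !memSize || !memUsed) {
        return [];
      }
      const dimensions = [{ Name: 'FunctionName', Value: functionName }];
      return [
        { MetricName: 'BilledDuration', Unit: 'Milliseconds', Value: parseInt(billed[1]), Dimensions: dimensions },
        { MetricName: 'MemorySize', Unit: 'Megabytes', Value: parseInt(memSize[1]), Dimensions: dimensions },
        { MetricName: 'MemoryUsed', Unit: 'Megabytes', Value: parseInt(memUsed[1]), Dimensions: dimensions }
      ];
    };

    // flat map over all decoded log messages and publish the metric data
    const publishUsageMetrics = async (functionName, messages) => {
      const metricData = messages
        .map(msg => usageMetrics(functionName, msg))
        .reduce((acc, xs) => acc.concat(xs), []);
      if (metricData.length > 0) {
        await cloudWatch.putMetricData({ Namespace: 'aws-lambda-usage', MetricData: metricData }).promise();
      }
    };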

And sure enough, after subscribing the log group for an API (created in the same demo project to test this) and invoking the API, I’m able to see these new metrics show up in CloudWatch metrics.

Looking at the graph, maybe I can reduce my cost by running it on a much smaller memory size.

Take a look at the usage-metrics function in this repo.

Mind the concurrency!

When processing CloudWatch Logs with Lambda functions, you need to be mindful of the number of concurrent executions it creates, so as not to run afoul of the concurrent execution limit.

Since this is an account-wide limit, your log-shipping function can cause cascading failures throughout your entire application. Critical functions can be throttled because too many executions are used to push logs out of CloudWatch Logs: not a good way to go down ;-)

What we need is a more fine-grained throttling mechanism for Lambda. It’s fine to have an account-wide limit, but we should be able to create pools of functions that each get a slice of that limit. For example, tier-1 functions (those serving the core business needs) get 90% of the available concurrent executions, whilst tier-2 functions (BI, monitoring, etc.) get the other 10%.

As things stand, we don’t have that, and the best you can do is to keep the execution of your log-shipping function brief. Maybe that means fire-and-forget when sending logs and metrics, or sending the decoded log messages into a Kinesis stream where you have more control over parallelism, as sketched below.
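For instance, the log-shipping function could simply hand the decoded messages off to a Kinesis stream and let a downstream consumer do the heavier lifting. A sketch assuming the aws-sdk v2 client; the stream name is illustrative:

    // Sketch: forward decoded log events to a Kinesis stream so the
    // log-shipping function itself stays brief. Stream name is illustrative.
    const AWS = require('aws-sdk');
    const kinesis = new AWS.Kinesis();

    const forwardToKinesis = async (logGroup, logEvents) =>
      kinesis.putRecords({
        StreamName: 'log-shipping-stream',
        Records: logEvents.map(evt => ({
          Data: JSON.stringify({ logGroup, ...evt }),
          PartitionKey: logGroup
        }))
      }).promise();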

Or maybe you’ll monitor the execution count of these tier-2 functions, and when the number of executions per minute breaches some threshold you’ll temporarily unsubscribe log groups from the log-shipping function to alleviate the problem.
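Unsubscribing a log group comes down to deleting its subscription filter. A sketch using the CloudWatch Logs API; the filter name is whatever you used when creating the subscription:

    // Sketch: temporarily unsubscribe a log group from the log-shipping function
    // by deleting its subscription filter.
    const AWS = require('aws-sdk');
    const cloudWatchLogs = new AWS.CloudWatchLogs();

    const unsubscribe = async (logGroupName, filterName) =>
      cloudWatchLogs.deleteSubscriptionFilter({ logGroupName, filterName }).promise();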

Or maybe you’ll install some bulkheads by moving these tier-2 functions into a separate AWS account and use cross-account invocation to trigger them. But this seems a really heavy-handed way to work around the problem!

The point is, it’s not a solved problem and I haven’t come across a satisfying workaround yet. AWS is aware of this gap, and hopefully they’ll add support for better control over concurrent executions.