Mind the 75GB limit on AWS Lambda deployment packages

Gotta clean up those old Lambda deployment packages!

With AWS Lambda and the Serverless framework, deploying your code has become so simple and frictionless.

As you move more and more of your architecture to run on Lambda, you might find that, in addition to getting things done faster you are also deploying your code more frequently.

That’s awesome!

But, as you rejoice in this new found superpower to make your users and stakeholders happy, you need to keep an eye out for that regional limit of 75GB for all the uploaded deployment packages.


At Yubl, me and a small team of 6 server engineers managed to rack up nearly 20GB of deployment packages in 3 months.

We wrote all of our Lambda functions in Nodejs, and deployment packages were typically less than 2MB. But the frequency of deployments made sure that the overall size of deployment packages went up steadily.

Now that I’m writing most of my Lambda functions in Scala (it’s the weapon of choice for the Space Ape Games server team), I’m dealing with deployment packages that are significantly bigger!

When authoring Lambda functions in Java, be prepared to significantly bigger deployment packages.

Serverless framework: disable versionFunctions

By default, the Serverless framework would create a new version of your function every time you deploy.

In Serverless 0.X, this is (kinda) needed because it used function alias. For example, I can have multiple deployment stages for the same function?—?devstaging and production. But in the Lambda console there is only one function, and each stage is simply an alias pointing to a different version of the same function.

Unfortunately this behaviour also made it difficult to manage the IAM permissions because multiple versions of the same function share the same IAM role. Since you can’t version the IAM role with the function, this makes it hard for you to add or remove permissions without breaking older versions.

Fortunately, the developers listened to the community and since the 1.0 release each stage is deployed as a separate function.

Essentially, this allows you to “version” IAM roles with deployment stages since each stage gets a separate IAM role. So there’s technically no need for you to create a new version for every deployment anymore. But, that is still the default behaviour, unless you explicitly disable it in your serverless.ymlby setting versionFunctions to false.

You might argue that having old versions of the function in production makes it quicker to rollback.

In that case, enable it for the production stage only. To do that, here’s a handy trick to allow a default configuration in your serverless.yml to be overridable by deployment stage.

In my personal experience though, unless you have taken great care and used aliases to tag the production releases it’s actually quite hard to know which version correlates to what. Assuming that you have reproducible builds, I would have much more confidence if we rollback by deploying from a hotfixor support branch of our code.

Clean up old versions with janitor-lambda

If disabling versionFunctions in the serverless.yml for all of your projects is hard to enforce, another approach would be to retroactively delete old versions of functions that are no longer referenced by an alias.

To do that, you can create a cron job (ie. scheduled CloudWatch event + Lambda) that will scan through your functions and look for versions that are not referenced and delete them.

I took some inspiration from Netflix’s Janitor Monkey and created a Janitor Lambda function that you can deploy to your AWS environment to clean unused versions of your functions.

After we employed this Janitor Lambda function, our total deployment package went from 20GB to ~1GB (we had a lot of functions…).

You should use SSM Parameter Store over Lambda env variables

AWS Lambda announced native support for environment variables at the end of 2016. But even before that, the Serverless framework had supported environment variables and I was using them happily as me and my team at the time migrated our monolithic Node.js backend to serverless.

However, as our architecture expanded we found several drawbacks with managing configurations with environment variables.

Hard to share configs across projects

The biggest problem for us was the inability to share configurations across projects since environment variables are function specific at runtime.

The Serverless framework has the notion of services, which is just a way of grouping related functions together. You can specify service-wide environment variables as well as function-specific ones.

A sample serverless.yml that specifies both service-wide as well as function-specific environment variables.

However, we often found that configurations need to be shared across multiple services. When these configurations change we had to update and redeploy all functions that depend on them – which in itself was becoming a challenge to track these dependencies across many Github repos that are maintained by different members of the team.

For example, as we were migrating from a monolithic system piece by piece whilst delivering new features, we weren’t able to move away from the monolithic MongoDB database in one go. It meant that lots of functions shared MongoDB connection strings. When one of these connection strings changed – and it did several times – pain and suffering followed.

Another configurable value we often share are the root URL of intermediate services. Being a social network, many of our user-initiated operations depend on relationship data, so many of our microservices depend on the Relationship API. Instead of hardcoding the URL to the Relationship API in every service (one of the deadly microservice anti-patterns), it should be stored in a central configuration service.

Hard to implement fine-grained access control

When you need to configure sensitive data such as credentials, API keys or DB connection strings, the rule of thumb are:

  1. data should be encrypted at rest (includes not checking them into source control in plain text)
  2. data should be encrypted in-transit
  3. apply the principle of least privilege to function’s and personnel’s access to data

If you’re operating in a heavily regulated environment then point 3. might be more than a good practice but a regulatory requirement. I know of many fintech companies and financial juggernauts where access to production credentials are tightly controlled and available only to a handful of people in the company.

Whilst efforts such as the serverless-secrets-plugin delivers on point 1. it couples one’s ability to deploy Lambda functions with one’s access to sensitive data – ie. he who deploys the function must have access to the sensitive data too. This might be OK for many startups, as everyone has access to everything, ideally your process for managing access to data can evolve with the company’s needs as it grows up.

SSM Parameter Store

My team outgrew environment variables, and I started looking at other popular solutions in this space – etcd, consul, etc. But I really didn’t fancy these solutions because:

  • they’re costly to run: you need to run several EC2 instances in multi-AZ setting for HA
  • you have to manage these servers
  • they each have a learning curve with regards to both configuring the service as well as the CLI tools
  • we needed a fraction of the features they offer

This was 5 months before Amazon announced SSM Parameter Store at re:invent 2016, so at the time we built our own Configuration API with API Gateway and Lambda.

Nowadays, you should just use the SSM Parameter Store because:

  • it’s a fully managed service
  • sharing configurations is easy, as it’s a centralised service
  • it integrates with KMS out-of-the-box
  • it offers fine-grained control via IAM
  • it records a history of changes
  • you can use it via the console, AWS CLI as well as via its HTTPS API

In short, it ticks all our boxes.

You have fine-grained control over what parameters a function is allowed to access.

There are couple of service limits to be aware of:

  • max 10,000 parameters per account
  • max length of parameter value is 4096 characters
  • max 100 past values for a parameter

Client library

Having a centralised place to store parameters is just one side of the coin. You should still invest effort into making a robust client library that is easy to use, and supports:

  • caching & cache expiration
  • hot-swapping configurations when source config value has changed

Here is one such client library that I put together for a demo:

To use it, you can create config objects with the loadConfigs function. These objects will expose properties that return the config values as Promise (hence the yield, which is the magic power we get with co).

You can have different config values with different cache expiration too.

If you want to play around with using SSM Parameter Store from Lambda (or to see this cache client in action), then check out this repo and deploy it to your AWS environment. I haven’t included any HTTP events, so you’d have to invoke the functions from the console.

Update 15/09/2017: the Serverless framework release 1.22.0 which introduced support for SSM parameters out of the box.

With this latest version of the Serverless framework, you can specify the value of environment variables to come from SSM parameter store directly.

Compared to many of the existing approaches, it has some benefits:

  • avoid checking in sensitive data in plain text in source control
  • avoid duplicating the same config values in multiple services

However, it still falls short on many fronts (based on my own requirements):

  • since it’s fetching the SSM parameter values at deployment time, it still couples your ability to deploy your function with access to sensitive configuration data
  • the configuration values are stored in plain text as Lambda environment variables, which means you don’t need the KMS permissions to access them, you can see it the Lambda console in plain sight
  • further to the above, if the function is compromised by an attacker (who would then have access to process.env) then they’ll be able to easily find the decrypted values during the initial probe (go to 13:05 mark on this video where I gave a demo of how easily this can be done)
  • because the values are baked at deployment time, it doesn’t allow you to easily propagate config value changes. To make a config value change, you will need to a) identify all dependent functions; and b) re-deploying all these functions

Of course, your requirement might be very different from mine, and I certainly think it’s an improvement over many of the approaches I have seen. But, personally I still think you should:

  1. fetch SSM parameter values at runtime
  2. cache these values, and hot-swap when source values change

Using Protocol Buffers with API Gateway and AWS Lambda

AWS announced binary support for API Gateway in late 2016, which opened up the door for you to use more efficient binary formats such as Google’s Protocol Buffers and Apache Thrift.


Compared to JSON – which is the bread and butter for APIs built with API Gateway and Lambda – these binary formats can produce significantly smaller payloads.

At scale, they can make a big difference to your bandwidth cost.

In restricted environments such as low-end devices or in countries with poor mobile connections, sending smaller payloads can also improve your user experience by improving the end-to-end network latency, and possibly processing time on the device too.

Comparison of serializer performance between Proto Buffers and JSON in .Net


Follow these 3 simple steps (assuming you’re using Serverless framework):

  1. install the awesome serverless-apigw-binary plugin
  2. add application/x-protobuf to binary media types (see screenshot below)
  3. add function that returns Protocol Buffers as base64 encoded response

The serverless-apigw-binary plugin has made it really easy to add binary support to API Gateway

To encode & decode Protocol Buffers payload in Nodejs, you can use the protobufjs package from NPM.

It lets you work with your existing .proto files, or you can use JSON descriptors. Give the docs a read to see how you can get started.

In the demo project (link at the bottom of the post) you’ll find a Lambda function that always returns a response in Protocol Buffers.

Couple of things to note from this function:

  • we set the Content-Type header to application/x-protobuf
  • body is base64 encoded representation of the Protocol Buffers payload
  • isBase64Encoded is set to true

you need to do all 3 of these things to make API Gateway return the response as binary data.

Consider them the magic incantation for making API Gateway return binary data, and, the caller also has to set the Accept header to application/x-protobuf.

In the same project, there’s also a JSON endpoint that returns the same payload as comparison.

The response from this JSON endpoint looks like this:

{"players":[{"id":"eb66db14992e06b36282d607cf0134ce4fe45f50","name":"Calvin Ortiz","scores":[57,12,100,56,47,78,20,37,32,48]},{"id":"7b9b38e535453d120e706ff57fef41f6fee991cb","name":"Marcus Cummings","scores":[40,57,24,15,45,54,25,67,59,23]},{"id":"db34a2a5f4d16e77a6d3d6154a8b8bb6760b3b99","name":"Harry James","scores":[61,85,14,70,8,80,14,22,76,87]},{"id":"e21018c4f43eef10771e0fa71bc54156b00a64dd","name":"Gregory Bishop","scores":[51,31,27,47,72,75,61,28,100,41]},{"id":"b3ee29ee49b640ce15be1737d0dca60e48108ee1","name":"Ann Evans","scores":[69,17,48,99,85,8,75,55,78,46]},{"id":"9c1e6d4d46bb0c0d2c92bab11e5dbd5f4ab0c619","name":"Juan Perez","scores":[71,34,60,84,21,98,60,8,91,92]},{"id":"d8de89222633c61393931457c1e72558eba48639","name":"Loretta Harvey","scores":[15,40,73,92,42,65,58,30,26,84]},{"id":"141dad672ec559431f808964391d128d2c3274bf","name":"Ian Powell","scores":[17,21,14,84,64,14,22,22,34,92]},{"id":"8a97e85e2e5385c45fc31f24bfe781c26f78c0b7","name":"Steve Gibson","scores":[33,97,6,1,20,1,78,3,77,19]},{"id":"6b3ca6924e17cd5fd9d91b36d49b36a5d542c9ea","name":"Harold Ferguson","scores":[31,32,4,10,37,85,46,86,39,17]}]}

As you can see, it’s just a bunch of randomly generated names and GUIDs, and integers. The same response in Protocol Buffers is nearly 40% smaller.

Problem with the protobufjs package

Before we move on, there is one important detail about using the protobufjspacakge in a Lambda function – you need to npm install the package on a Linux system.

This is because it has a dependency that is distributed as native binaries, so if you installed the packaged on OSX then the binaries that are packaged and deployed to Lambda will not run on the Lambda execution environment.

I had similar problems with other Google libraries in the past. I find the best way to deal with this is to take a leaf out of aws-serverless-go-shim’s approach and deploy your code inside a Docker container.

This way, you would locally install a compatible version of the native binaries for your OS so you can continue to run and debug your function with sls invoke local (see this post for details).

But, during deployment, a script would run npm install --force in a Docker container running a compatible Linux distribution. This would then install a version of the native binaries that can be executed in the Lambda execution environment. The script would then use sls deploy to deploy the function.

The deployment script can be something simple like this:

In the demo project, I also have a docker-compose.yml file:

The Serverless framework requires my AWS credentials, hence why I’ve attached the $HOME/.aws directory to the container for the AWSSDK to find at runtime.

To deploy, run docker-compose up.

Use HTTP content negotiation

Whilst binary formats are more efficient when it comes to payload size, they do have one major problem: they’re really hard to debug.

Imagine the scenario – you have observed a bug, but you’re not sure if the problem is in the client app or the server. But hey, let’s just observe the HTTP conversation with a HTTP proxy such as Charles or Fiddler.

This workflow works great for JSON but breaks down when it comes to binary formats such as Protocol Buffers as the payloads are not human readable.

As we have discussed in this post, the human readability of JSON comes with the cost of heavier bandwidth usage. For most network communications, be it service-to-service, or service-to-client, unless a human is actively “reading” the payloads it’s not worth paying the cost. But when a human is trying to read it, that human readability is very valuable.

Fortunately, HTTP’s content negotiation mechanism means we can have the best of both worlds.

In the demo project, there is a contentNegotiated function which returns either JSON or Protocol Buffers payloads based on what the Accept header.

By default, you should use Protocol Buffers for all your network communications to minimise bandwidth use.

But, you should build in a mechanism for toggling the communication to JSON when you need to observe the communications. This might mean:

  • for debug builds of your mobile app, allow super users (devs, QA, etc.) the ability to turn on debug mode, which would switch the networking layer to send Accept header as application/json
  • for services, include a configuration option to turn on debug mode (see this post on configuring functions with SSM parameters and cache client for hot-swapping) to make service-to-service calls use JSON too, so you can capture and analyze the request and responses more easily

As usual, you can try out the demo code yourself, the repo is available here.

Capture and forward correlation IDs through different Lambda event sources

Serverless architectures are microservices by default, you need correlation IDs to help debug issues that spans across multiple functions, and possibly different event source types – asynchronous, synchronous and streams.

This is the last of a 3-part mini series on managing your AWS Lambda logs.

If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.

part 1 : centralise logging

part 2: tips and tricks

Why correlation IDs?

As your architecture becomes more complex, many services have to work together in order to deliver the features your users want.

Microservice death stars, circa 2015.

When everything works, it’s like watching an orchestra, lots of small pieces all acting independently whilst at the same time collaborating to form a whole that’s greater than the sum of its parts.

However, when things don’t work, it’s a pain in the ass to debug. Finding that one clue is like finding needle in the haystack as there are so many moving parts, and they’re all constantly moving.

Imagine you’re an engineer at Twitter and trying to debug why a user’s tweet was not delivered to one of his followers’ timeline.

“Let me cross reference the logs from hundreds of services and find the logs that mention the author’s user ID, the tweet ID, or the recipient’s user ID, and put together a story of how the tweet flowed through our system and why it wasn’t delivered to the recipient’s timeline.”

“What about logs that don’t explicitly mention those fields?”

“mm… let me get back to you on that…”

Needle in the haystack.

This is the problem that correlation IDs solve in the microservice world – to tag every log message with the relevant context so that it’s easy to find them later on.

Aside from common IDs such as user ID, order ID, tweet ID, etc. you might also want to include the X-Ray trace ID in every log message. That way, if you’re using X-Ray with Lambda then you can use it to quickly load up the relevant trace in the X-Ray console.

By default, Lambda automatically generates a _X_AMZN_TRACE_ID value in the environment variable.

Also, if you’re going to add a bunch of correlation IDs to every log message then you should consider switching to JSON. Then you need to update the ship-logs function we introduced in part 1 to handle log messages that are formatted as JSON.

Enable debug logging on entire call chain

Another common problem people run into, is that by the time we realise there’s a problem in production we find out that the crucial piece of information we need to debug the problem is logged as DEBUG, and we disable DEBUG logs in production because they’re too noisy.

“Darn it, now we have to enable debug logging and redeploy all these services! What a pain!”

“Don’t forget to disable debug logging and redeploy them, after you’ve found the problem ;-)”

Fortunately it doesn’t have to be a catch-22 situation. You can enable DEBUG logging on the entire call chain by:

  1. make the decision to enable DEBUG logging (for say, 5% of all requests) at the edge service
  2. pass the decision on all outward requests alongside the correlation IDs
  3. on receiving the request from the edge service, possibly through async event sources such as SNS, the intermediate services will capture this decision and turn on DEBUG logging if asked to do so
  4. the intermediate services will also pass that decision on all outward requests alongside the correlation IDs

The edge service decides to turn DEBUG logging on for 5% of requests, that decision is captured and passed along throughout the entire call chain, through HTTP requests, SNS message and Kinesis events.

Capture and forward correlation IDs

With that out of the way, let’s dive into some code to see how you can actually make it work. If you want to follow along, then the code is available in this repo, and the architecture of the demo project looks like this:

The demo project consists of an edge API, api-a, which initialises a bunch of correlation IDs as well as the decision on whether or not to turn on debug logging. It’ll pass these along through HTTP requests to api-b, Kinesis events and SNS messages. Each of these downstream function would in turn capture and pass them along to api-c.

We can take advantage of the fact that concurrency is now managed by the platform, which means we can safely use global variables to store contextual information relevant for the current invocation.

In the handler function we can capture incoming correlation IDs in global variables, and then include them in log messages, as well as any outgoing messages/HTTP requests/events, etc.

To abstract away the implementation details, let’s create a requestContextmodule that makes it easy to fetch and update these context data:

And then add a log module which:

  • disables DEBUG logging by default
  • enables DEBUG logging if explicitly overriden via environment variables or a Debug-Log-Enabled field was captured in the incoming request alongside other correlation IDs
  • logs messages as JSON

Once we start capturing correlation IDs, our log messages would look something like this:

Notice that I have also captured the User-Agent from the incoming request, as well as the decision to not enable DEBUG logging.

Now let’s see how we can capture and forward correlation IDs through API Gateway and outgoing HTTP requests.

API Gateway

You can capture and pass along correlation IDs via HTTP headers. The trick is making sure that everyone in the team follows the same conventions.

To standardise these conventions (what to name headers that are correlation IDs, etc.) you can provide a factory function that your developers can use to create API handlers. Something like this perhaps:

When you need to implement another HTTP endpoint, pass your handler code to this factory function. Now, with minimal change, all your logs will have the captured correlation IDs (as well as User-Agent, whether to enable debug logging, etc.).

The api-a function in our earlier architecture looks something like this:

Since this is the API on the edge, so it initialises the x-correlation-id using the AWS Request ID for its invocation. This, along with several other pieces of contextual information is recorded with every log message.

By adding a custom HTTP module like this one, you can also make it easy to include these contextual information in outgoing HTTP requests. Encapsulating these conventions in an easy-to-use library also helps you standardise the approach across your team.

In the api-a function above, we made a HTTP request to the api-bendpoint. Looking in the logs, you can see the aforementioned contextual information has been passed along.

In this case, we also have the User-Agent from the original user-initiated request to api-a. This is useful because when I look at the logs for intermediate services, I often miss the context of what platform the user is using which makes it harder to correlate the information I gather from the logs to the symptoms the user describes in their bug reports.

When the api-b function (see here) makes its own outbound HTTP request to api-c it’ll pass along all of these contextual information plus anything we add in the api-b function itself.

Log message for when api-b calls api-c with the custom HTTP module. Notice it includes the “x-correlation-character-b” header which is set by the api-b function.

When you see the corresponding log message in api-c’s logs, you’ll see all the context from both api-a and api-b.


To capture and forward correlation IDs through SNS messages, you can use message attributes.

In the api-a function above, we also published a message to SNS (omitted from the code snippet above) with a custom sns module which includes the captured correlation IDs as message attributes, see below.

When this SNS message is delivered to a Lambda function, you can see the correlation IDs in the MessageAttributes field of the SNS event.

Let’s create a snsHandler factory function to standardise the process of capturing incoming correlation IDs via SNS message attributes.

We can use this factory function to quickly create SNS handler functions. The log messages from these handler functions will have access to the captured correlation IDs. If you use the aforementioned custom httpmodule to make outgoing HTTP requests then they’ll be included as HTTP headers automatically.

For instance, the following SNS handler function would capture incoming correlation IDs, include them in log messages, and pass them on when making a HTTP request to api-c (see architecture diagram).

Those correlation IDs (including the one added by the SNS handler function) are included as HTTP headers.

Kinesis Streams

Unfortunately, with Kinesis and DynamoDB Streams, there’s no way to tag additional information with the payload. Instead, in order to pass correlation IDs along, we’d have to modify the actual payload itself.

Let’s create a kinesis module for sending events to a Kinesis stream, so that we can insert a __context field to the payload to carry the correlation IDs.

On the receiving end, we can take it out, use it to set the current requestContext, and delete this __context field before passing it on to the Kinesis handler function for processing. The sender and receiver functions won’t even notice we modified the payload.

Wait, there’s one more problem – our Lambda function will receive a batch of Kinesis records, each with its own context. How will we consolidate that?

The simplest way is to force the handler function to process records one at a time. That’s what we’ve done in the kinesisHandler factory function here.

The handler function (created with the kinesisHandler factory function) would process one record at at time, and won’t have to worry about managing the request context. All of its log messages would have the right correlation IDs, and outgoing HTTP requests, SNS messages and Kinesis events would also pass those correlation IDs along.

When api-c receives the invocation event, you can see the correlation IDs have been passed along via HTTP headers.

This approach is simple, developers working on Kinesis handler functions won’t have to worry about the implementation details of how correlation IDs are captured and passed along, and things “just work”.

However, it also removes the opportunity to optimize by processing all the records in a batch. Perhaps your handler function has to persist the events to a persistence store that’s better suited for storing large payloads rather than lots of small ones.

This simple approach is not the right fit for every situation, an alternative would be to leave the __context field on the Kinesis records and let the handler function deal with them as it sees fit. In which case you would also need to update the shared libraries – the loghttpsns and kinesismodules we have talked about so far – to give the caller to option to pass in a requestContext as override.

This way, the handler function can process the Kinesis records in a batch. Where it needs to log or make a network call in the context of a specific record, it can extract and pass the request context along as need be.

The End

That’s it, folks. A blueprint for how to capture and forward correlation IDs through 3 of the most commonly used event sources for Lambda.

Here’s an annotated version of the architecture diagram earlier, showing the flow of data as they’re captured and forwarded from one invocation to another, through HTTP headers, message attributes, Kinesis record data.

You can find a deployable version of the code you have seen in this post in this repo. It’s intended for demo sessions in my O’Reilly course detailed below, so documentation is seriously lacking at the moment, but hopefully this post gives you a decent idea of how the project is held together.

Other event sources

There are plenty of event sources that we didn’t cover in this post.

It’s not possible to pass correlation IDs through every event source, as some do not originate from your system – eg. CloudWatch Events that are triggered by API calls made by AWS service.

And it might be hard to pass correlation IDs through, say, DynamoDB Streams – the only way (that I can think of) for it to work is to include the correlation IDs as fields in the row (which, might not be such a bad idea but it does have cost implications).

Tips and tricks for logging and monitoring AWS Lambda functions

The common practice of using agents/daemons to buffer and batch send logs and metrics are no longer applicable in the world of serverless. Here are some tips to help you get the most out of your logging and monitoring infrastructure for your functions.

This is part 2 of a 3-part mini series on managing your AWS Lambda logs.

If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.

part 1 : centralise logging

part 3 : tracking correlation IDs

New paradigm, new problems

Much have changed with the serverless paradigm, and it solves many of the old problems we face and replaced them with some new problems that (I think) are easier to deal with.

Consequently, many of the old practices are no longer applicable – eg. using agents/daemons to buffer and batch send metrics and logs to monitoring and log aggregation services. However, even as we throw away these old practices for the new world of serverless, we are still after the same qualities that made our old tools “good”:

  1. able to collect rich set of system and application metrics and logs
  2. publishing metrics and logs should not add user-facing latency (ie. they should be performed in the background)
  3. metrics and logs should appear in realtime (ie. within a few seconds)
  4. metrics should be granular

Unfortunately, the current tooling for Lambda – CloudWatch metrics & CloudWatch Logs – are failing on a few of these, some more so than others:

  • publishing custom metrics requires additional network calls that need to be made during the function’s execution, adding to user-facing latency
  • CloudWatch metrics for AWS services are only granular down to 1 minute interval (custom metrics can be granular down to 1 second)
  • CloudWatch metrics are often a few minutes behind (though custom metrics might have less lag now that they can be recorded at 1 second interval)
  • CloudWatch Logs are usually more than 10s behind (not precise measurement, but based on personal observation)

With Lambda, we have to rely on AWS to improve CloudWatch in order to bring us parity with existing “server-ful” services.

Many vendors have announced support for Lambda, such as Datadog and Wavefront. However, as they are using the same metrics from CloudWatch they will have the same lag.

IOPipe is a popular alternative for monitoring Lambda functions and they do things slightly differently – by giving you a wrapper function around your code so they can inject monitoring code (it’s a familiar pattern to those who have used AOP frameworks in the past).

For their 1.0 release they also announced support for tracing (see the demo video below), which I think it’s interesting as AWS already offers X-Ray and it’s a more complete tracing solution (despite its own shortcomings as I mentioned in this post).

IOPipe seems like a viable alternative to CloudWatch, especially if you’re new to AWS Lambda and just want to get started quickly. I can totally see the value of that simplicity.

However, I have some serious reservations with IOPipe’s approach:

  • A wrapper around every one of my functions? This level of pervasive access to my entire application requires a serious amount of trust that has to be earned, especially in times like this.
  • CloudWatch collects logs and metrics asynchronously without adding to my function’s execution time. But with IOPipe they have to send the metrics to their own system, and they have to do so during my function’s execution time and hence adding to user-facing latency (for APIs).
  • Further to the above points, it’s another thing that can cause my function to error or time out even after my code has successfully executed. Perhaps they’re doing something smart to minimise that risk but it’s hard for me to know for sure and I have to anticipate failures.

Of all the above, the latency overhead is the biggest concern for me. Between API Gateway and Lambda I already have to deal with cold start and the latency between API Gateway and Lambda. As your microservice architecture expands and the no. of inter-service communications grows, these latencies will compound further.

For background tasks this is less a concern, but a sizeable portion of Lambda functions I have written have to handle HTTP requests and I need to keep the execution time as low as possible for these functions.

Sending custom metrics asynchronously

I find Datadog’s approach for sending custom metrics very interesting. Essentially you write custom metrics as specially-formatted log messages that Datadog will process (you have to set up IAM permissions for CloudWatch to call their function) and track them as metrics.

Datadog allows you to send custom metrics using log messages in their DogStatsD format.

It’s a simple and elegant approach, and one that we can adopt for ourselves even if we decide to use another monitoring service.

In part 1 we established an infrastructure to ship logs from CloudWatch Logs to a log aggregation service of our choice. We can extend the log shipping function to look for log messages that look like these:

Log custom metrics as specially formatted log messages

For these log messages, we will interpret them as:


And instead of sending them to the log aggregation service, we’ll send them as metrics to our monitoring service instead. In this particular case, I’m using CloudWatch in my demo (see link below), so the format of the log message reflects the fields I need to pass along in the PutMetricData call.

To send custom metrics, we write them as log messages. Again, no latency overhead as Lambda service collects these for us and sends them to CloudWatch in the background.

And moments later they’re available in CloudWatch metrics.

Custom metrics are recorded in CloudWatch as expected.

Take a look at the custom-metrics function in this repo.

Tracking the memory usage and billed duration of your AWS Lambda functions in CloudWatch

Lambda reports the amount of memory used, and the billed duration at the end of every invocation. Whilst these are not published as metrics in CloudWatch, you can find them as log messages in CloudWatch Logs.

At the end of every invocation, Lambda publishes a REPORT log message detailing the max amount of memory used by your function during this invocation, and how much time is billed (Lambda charges at 100ms blocks).

I rarely find memory usage to be an issue as Nodejs functions have such a small footprint. My choice of memory allocation is primarily based on getting the right balance between cost and performance. In fact, Alex Casalboni of CloudAcademy wrote a very nice blog post on using Step Functions to help you find that sweet spot.

The Billed Duration on the other hand, is a useful metric when viewed side by side with Invocation Duration. It gives me a rough idea of the amount of wastage I have. For example, if the average Invocation Durationof a function is 42ms but the average Billed Duration is 100ms, then there is a 58% wastage and maybe I should consider running the function on a lower memory allocation.

Interestingly, IOPipe records these in their dashboard out of the box.

IOPipes records a number of additional metrics that are not available in CloudWatch, such as Memory Usage and CPU Usage over time, as well as coldstarts.

However, we don’t need to add IOPipe just to get these metrics. We can apply a similar technique to the previous section and publish them as custom metrics to our monitoring service.

To do that, we have to look out for these REPORT log messages and parse the relevant information out of them. Each message contains 3 pieces of information we want to extract:

  • Billed Duration (Milliseconds)
  • Memory Size (MB)
  • Memory Used (MB)

We will parse these log messages and return an array of CloudWatch metric data for each, so we can flat map over them afterwards.

This is a function in the “parse” module, which maps a log message to an array of CloudWatch metric data.
Flat map over the CloudWatch metric data returned by the above parse.usageMetrics function and publish them.

And sure enough, after subscribing the log group for an API (created in the same demo project to test this) and invoking the API, I’m able to see these new metrics show up in CloudWatch metrics.

Looking at the graph, maybe I can reduce my cost by running it on a much smaller memory size.

Take a look at the usage-metrics function in this repo.

Mind the concurrency!

When processing CloudWatch Logs with Lambda functions, you need to be mindful of the no. of concurrent executions it creates so to not run foul of the concurrent execution limit.

Since this is an account-wide limit, it means your log-shipping function can cause cascade failures throughout your entire application. Critical functions can be throttled because too many executions are used to push logs out of CloudWatch Logs – not a good way to go down ;-)

What we need is a more fine-grained throttling mechanism for Lambda. It’s fine to have an account-wide limit, but we should be able to create pools of functions that can have slices of that limit. For example, tier-1 functions (those serving the core business needs) gets 90% of the available concurrent executions. Whilst tier-2 functions (BI, monitoring, etc.) gets the other 10%.

As things stand, we don’t have that, and the best you can do is to keep the execution of your log-shipping function brief. Maybe that means fire-and-forget when sending logs and metrics; or send the decoded log messages into a Kinesis stream where you have more control over parallelism.

Or, maybe you’ll monitor the execution count of these tier-2 functions and when the no. of executions/minute breaches some threshold you’ll temporarily unsubscribe log groups from the log-shipping function to alleviate the problem.

Or, maybe you’ll install some bulkheads by moving these tier-2 functions into a separate AWS account and use cross-account invocation to trigger them. But this seems a really heavy-handed way to workaround the problem!

Point is, it’s not a solved problem and I haven’t come across a satisfying workaround yet. AWS is aware of this gap and hopefully they’ll add support for better control over concurrent executions.