Serverless architectures are microservices by default, you need correlation IDs to help debug issues that spans across multiple functions, and possibly different event source types – asynchronous, synchronous and streams.
This is the last of a 3-part mini series on managing your AWS Lambda logs.
If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.
part 1 : centralise logging
part 2: tips and tricks
Why correlation IDs?
As your architecture becomes more complex, many services have to work together in order to deliver the features your users want.
When everything works, it’s like watching an orchestra, lots of small pieces all acting independently whilst at the same time collaborating to form a whole that’s greater than the sum of its parts.
However, when things don’t work, it’s a pain in the ass to debug. Finding that one clue is like finding needle in the haystack as there are so many moving parts, and they’re all constantly moving.
Imagine you’re an engineer at Twitter and trying to debug why a user’s tweet was not delivered to one of his followers’ timeline.
“Let me cross reference the logs from hundreds of services and find the logs that mention the author’s user ID, the tweet ID, or the recipient’s user ID, and put together a story of how the tweet flowed through our system and why it wasn’t delivered to the recipient’s timeline.”
“What about logs that don’t explicitly mention those fields?”
“mm… let me get back to you on that…”
Needle in the haystack.
This is the problem that correlation IDs solve in the microservice world – to tag every log message with the relevant context so that it’s easy to find them later on.
Aside from common IDs such as user ID, order ID, tweet ID, etc. you might also want to include the X-Ray trace ID in every log message. That way, if you’re using X-Ray with Lambda then you can use it to quickly load up the relevant trace in the X-Ray console.
Also, if you’re going to add a bunch of correlation IDs to every log message then you should consider switching to JSON. Then you need to update the
ship-logs function we introduced in part 1 to handle log messages that are formatted as JSON.
Enable debug logging on entire call chain
Another common problem people run into, is that by the time we realise there’s a problem in production we find out that the crucial piece of information we need to debug the problem is logged as DEBUG, and we disable DEBUG logs in production because they’re too noisy.
“Darn it, now we have to enable debug logging and redeploy all these services! What a pain!”
“Don’t forget to disable debug logging and redeploy them, after you’ve found the problem ;-)”
Fortunately it doesn’t have to be a catch-22 situation. You can enable DEBUG logging on the entire call chain by:
- make the decision to enable DEBUG logging (for say, 5% of all requests) at the edge service
- pass the decision on all outward requests alongside the correlation IDs
- on receiving the request from the edge service, possibly through async event sources such as SNS, the intermediate services will capture this decision and turn on DEBUG logging if asked to do so
- the intermediate services will also pass that decision on all outward requests alongside the correlation IDs
Capture and forward correlation IDs
With that out of the way, let’s dive into some code to see how you can actually make it work. If you want to follow along, then the code is available in this repo, and the architecture of the demo project looks like this:
We can take advantage of the fact that concurrency is now managed by the platform, which means we can safely use global variables to store contextual information relevant for the current invocation.
In the handler function we can capture incoming correlation IDs in global variables, and then include them in log messages, as well as any outgoing messages/HTTP requests/events, etc.
To abstract away the implementation details, let’s create a
requestContextmodule that makes it easy to fetch and update these context data:
And then add a
log module which:
- disables DEBUG logging by default
- enables DEBUG logging if explicitly overriden via environment variables or a
Debug-Log-Enabledfield was captured in the incoming request alongside other correlation IDs
- logs messages as JSON
Once we start capturing correlation IDs, our log messages would look something like this:
Notice that I have also captured the
User-Agent from the incoming request, as well as the decision to not enable DEBUG logging.
Now let’s see how we can capture and forward correlation IDs through API Gateway and outgoing HTTP requests.
You can capture and pass along correlation IDs via HTTP headers. The trick is making sure that everyone in the team follows the same conventions.
To standardise these conventions (what to name headers that are correlation IDs, etc.) you can provide a factory function that your developers can use to create API handlers. Something like this perhaps:
When you need to implement another HTTP endpoint, pass your handler code to this factory function. Now, with minimal change, all your logs will have the captured correlation IDs (as well as
User-Agent, whether to enable debug logging, etc.).
api-a function in our earlier architecture looks something like this:
Since this is the API on the edge, so it initialises the
x-correlation-id using the AWS Request ID for its invocation. This, along with several other pieces of contextual information is recorded with every log message.
By adding a custom HTTP module like this one, you can also make it easy to include these contextual information in outgoing HTTP requests. Encapsulating these conventions in an easy-to-use library also helps you standardise the approach across your team.
api-a function above, we made a HTTP request to the
api-bendpoint. Looking in the logs, you can see the aforementioned contextual information has been passed along.
In this case, we also have the
User-Agent from the original user-initiated request to
api-a. This is useful because when I look at the logs for intermediate services, I often miss the context of what platform the user is using which makes it harder to correlate the information I gather from the logs to the symptoms the user describes in their bug reports.
api-b function (see here) makes its own outbound HTTP request to
api-c it’ll pass along all of these contextual information plus anything we add in the
api-b function itself.
When you see the corresponding log message in
api-c’s logs, you’ll see all the context from both
To capture and forward correlation IDs through SNS messages, you can use message attributes.
api-a function above, we also published a message to SNS (omitted from the code snippet above) with a custom
sns module which includes the captured correlation IDs as message attributes, see below.
When this SNS message is delivered to a Lambda function, you can see the correlation IDs in the
MessageAttributes field of the SNS event.
Let’s create a
snsHandler factory function to standardise the process of capturing incoming correlation IDs via SNS message attributes.
We can use this factory function to quickly create SNS handler functions. The log messages from these handler functions will have access to the captured correlation IDs. If you use the aforementioned custom
httpmodule to make outgoing HTTP requests then they’ll be included as HTTP headers automatically.
For instance, the following SNS handler function would capture incoming correlation IDs, include them in log messages, and pass them on when making a HTTP request to
api-c (see architecture diagram).
Unfortunately, with Kinesis and DynamoDB Streams, there’s no way to tag additional information with the payload. Instead, in order to pass correlation IDs along, we’d have to modify the actual payload itself.
Let’s create a
kinesis module for sending events to a Kinesis stream, so that we can insert a
__context field to the payload to carry the correlation IDs.
On the receiving end, we can take it out, use it to set the current
requestContext, and delete this
__context field before passing it on to the Kinesis handler function for processing. The sender and receiver functions won’t even notice we modified the payload.
Wait, there’s one more problem – our Lambda function will receive a batch of Kinesis records, each with its own context. How will we consolidate that?
The simplest way is to force the handler function to process records one at a time. That’s what we’ve done in the
kinesisHandler factory function here.
The handler function (created with the
kinesisHandler factory function) would process one record at at time, and won’t have to worry about managing the request context. All of its log messages would have the right correlation IDs, and outgoing HTTP requests, SNS messages and Kinesis events would also pass those correlation IDs along.
This approach is simple, developers working on Kinesis handler functions won’t have to worry about the implementation details of how correlation IDs are captured and passed along, and things “just work”.
However, it also removes the opportunity to optimize by processing all the records in a batch. Perhaps your handler function has to persist the events to a persistence store that’s better suited for storing large payloads rather than lots of small ones.
This simple approach is not the right fit for every situation, an alternative would be to leave the
__context field on the Kinesis records and let the handler function deal with them as it sees fit. In which case you would also need to update the shared libraries – the
kinesismodules we have talked about so far – to give the caller to option to pass in a
requestContext as override.
This way, the handler function can process the Kinesis records in a batch. Where it needs to log or make a network call in the context of a specific record, it can extract and pass the request context along as need be.
That’s it, folks. A blueprint for how to capture and forward correlation IDs through 3 of the most commonly used event sources for Lambda.
Here’s an annotated version of the architecture diagram earlier, showing the flow of data as they’re captured and forwarded from one invocation to another, through HTTP headers, message attributes, Kinesis record data.
You can find a deployable version of the code you have seen in this post in this repo. It’s intended for demo sessions in my O’Reilly course detailed below, so documentation is seriously lacking at the moment, but hopefully this post gives you a decent idea of how the project is held together.
Other event sources
There are plenty of event sources that we didn’t cover in this post.
It’s not possible to pass correlation IDs through every event source, as some do not originate from your system – eg. CloudWatch Events that are triggered by API calls made by AWS service.
And it might be hard to pass correlation IDs through, say, DynamoDB Streams – the only way (that I can think of) for it to work is to include the correlation IDs as fields in the row (which, might not be such a bad idea but it does have cost implications).