Capture and forward correlation IDs through different Lambda event sources

Server­less archi­tec­tures are microser­vices by default, you need cor­re­la­tion IDs to help debug issues that spans across mul­ti­ple func­tions, and pos­si­bly dif­fer­ent event source types — asyn­chro­nous, syn­chro­nous and streams.

This is the last of a 3-part mini series on man­ag­ing your AWS Lamb­da logs.

If you haven’t read part 1 yet, please give it a read now. We’ll be build­ing on top of the basic infra­struc­ture of ship­ping logs from Cloud­Watch Logs detailed in that post.

part 1 : cen­tralise log­ging

part 2: tips and tricks

Why correlation IDs?

As your archi­tec­ture becomes more com­plex, many ser­vices have to work togeth­er in order to deliv­er the fea­tures your users want.

Microser­vice death stars, cir­ca 2015.

When every­thing works, it’s like watch­ing an orches­tra, lots of small pieces all act­ing inde­pen­dent­ly whilst at the same time col­lab­o­rat­ing to form a whole that’s greater than the sum of its parts.

How­ev­er, when things don’t work, it’s a pain in the ass to debug. Find­ing that one clue is like find­ing nee­dle in the haystack as there are so many mov­ing parts, and they’re all con­stant­ly mov­ing.

Imag­ine you’re an engi­neer at Twit­ter and try­ing to debug why a user’s tweet was not deliv­ered to one of his fol­low­ers’ time­line.

Let me cross ref­er­ence the logs from hun­dreds of ser­vices and find the logs that men­tion the author’s user ID, the tweet ID, or the recipient’s user ID, and put togeth­er a sto­ry of how the tweet flowed through our sys­tem and why it wasn’t deliv­ered to the recipient’s time­line.”

What about logs that don’t explic­it­ly men­tion those fields?”

mm… let me get back to you on that…”

Nee­dle in the haystack.

This is the prob­lem that cor­re­la­tion IDs solve in the microser­vice world — to tag every log mes­sage with the rel­e­vant con­text so that it’s easy to find them lat­er on.

Aside from com­mon IDs such as user ID, order ID, tweet ID, etc. you might also want to include the X-Ray trace ID in every log mes­sage. That way, if you’re using X-Ray with Lamb­da then you can use it to quick­ly load up the rel­e­vant trace in the X-Ray con­sole.

By default, Lamb­da auto­mat­i­cal­ly gen­er­ates a _X_AMZN_TRACE_ID val­ue in the envi­ron­ment vari­able.

Also, if you’re going to add a bunch of cor­re­la­tion IDs to every log mes­sage then you should con­sid­er switch­ing to JSON. Then you need to update the ship-logs func­tion we intro­duced in part 1 to han­dle log mes­sages that are for­mat­ted as JSON.

Enable debug logging on entire call chain

Anoth­er com­mon prob­lem peo­ple run into, is that by the time we realise there’s a prob­lem in pro­duc­tion we find out that the cru­cial piece of infor­ma­tion we need to debug the prob­lem is logged as DEBUG, and we dis­able DEBUG logs in pro­duc­tion because they’re too noisy.

Darn it, now we have to enable debug log­ging and rede­ploy all these ser­vices! What a pain!”

Don’t for­get to dis­able debug log­ging and rede­ploy them, after you’ve found the prob­lem ;-)”

For­tu­nate­ly it doesn’t have to be a catch-22 sit­u­a­tion. You can enable DEBUG log­ging on the entire call chain by:

  1. make the deci­sion to enable DEBUG log­ging (for say, 5% of all requests) at the edge ser­vice
  2. pass the deci­sion on all out­ward requests along­side the cor­re­la­tion IDs
  3. on receiv­ing the request from the edge ser­vice, pos­si­bly through async event sources such as SNS, the inter­me­di­ate ser­vices will cap­ture this deci­sion and turn on DEBUG log­ging if asked to do so
  4. the inter­me­di­ate ser­vices will also pass that deci­sion on all out­ward requests along­side the cor­re­la­tion IDs

The edge ser­vice decides to turn DEBUG log­ging on for 5% of requests, that deci­sion is cap­tured and passed along through­out the entire call chain, through HTTP requests, SNS mes­sage and Kine­sis events.

Capture and forward correlation IDs

With that out of the way, let’s dive into some code to see how you can actu­al­ly make it work. If you want to fol­low along, then the code is avail­able in this repo, and the archi­tec­ture of the demo project looks like this:

The demo project con­sists of an edge API, api-a, which ini­tialis­es a bunch of cor­re­la­tion IDs as well as the deci­sion on whether or not to turn on debug log­ging. It’ll pass these along through HTTP requests to api-b, Kine­sis events and SNS mes­sages. Each of these down­stream func­tion would in turn cap­ture and pass them along to api-c.

We can take advan­tage of the fact that con­cur­ren­cy is now man­aged by the plat­form, which means we can safe­ly use glob­al vari­ables to store con­tex­tu­al infor­ma­tion rel­e­vant for the cur­rent invo­ca­tion.

In the han­dler func­tion we can cap­ture incom­ing cor­re­la­tion IDs in glob­al vari­ables, and then include them in log mes­sages, as well as any out­go­ing messages/HTTP requests/events, etc.

To abstract away the imple­men­ta­tion details, let’s cre­ate a requestContextmod­ule that makes it easy to fetch and update these con­text data:

And then add a log mod­ule which:

  • dis­ables DEBUG log­ging by default
  • enables DEBUG log­ging if explic­it­ly over­ri­den via envi­ron­ment vari­ables or a Debug-Log-Enabled field was cap­tured in the incom­ing request along­side oth­er cor­re­la­tion IDs
  • logs mes­sages as JSON

Once we start cap­tur­ing cor­re­la­tion IDs, our log mes­sages would look some­thing like this:

Notice that I have also cap­tured the User-Agent from the incom­ing request, as well as the deci­sion to not enable DEBUG log­ging.

Now let’s see how we can cap­ture and for­ward cor­re­la­tion IDs through API Gate­way and out­go­ing HTTP requests.

API Gateway

You can cap­ture and pass along cor­re­la­tion IDs via HTTP head­ers. The trick is mak­ing sure that every­one in the team fol­lows the same con­ven­tions.

To stan­dard­ise these con­ven­tions (what to name head­ers that are cor­re­la­tion IDs, etc.) you can pro­vide a fac­to­ry func­tion that your devel­op­ers can use to cre­ate API han­dlers. Some­thing like this per­haps:

When you need to imple­ment anoth­er HTTP end­point, pass your han­dler code to this fac­to­ry func­tion. Now, with min­i­mal change, all your logs will have the cap­tured cor­re­la­tion IDs (as well as User-Agent, whether to enable debug log­ging, etc.).

The api-a func­tion in our ear­li­er archi­tec­ture looks some­thing like this:

Since this is the API on the edge, so it ini­tialis­es the x-correlation-id using the AWS Request ID for its invo­ca­tion. This, along with sev­er­al oth­er pieces of con­tex­tu­al infor­ma­tion is record­ed with every log mes­sage.

By adding a cus­tom HTTP mod­ule like this one, you can also make it easy to include these con­tex­tu­al infor­ma­tion in out­go­ing HTTP requests. Encap­su­lat­ing these con­ven­tions in an easy-to-use library also helps you stan­dard­ise the approach across your team.

In the api-a func­tion above, we made a HTTP request to the api-bend­point. Look­ing in the logs, you can see the afore­men­tioned con­tex­tu­al infor­ma­tion has been passed along.

In this case, we also have the User-Agent from the orig­i­nal user-ini­ti­at­ed request to api-a. This is use­ful because when I look at the logs for inter­me­di­ate ser­vices, I often miss the con­text of what plat­form the user is using which makes it hard­er to cor­re­late the infor­ma­tion I gath­er from the logs to the symp­toms the user describes in their bug reports.

When the api-b func­tion (see here) makes its own out­bound HTTP request to api-c it’ll pass along all of these con­tex­tu­al infor­ma­tion plus any­thing we add in the api-b func­tion itself.

Log mes­sage for when api-b calls api-c with the cus­tom HTTP mod­ule. Notice it includes the “x-cor­re­la­tion-char­ac­ter-b” head­er which is set by the api-b func­tion.

When you see the cor­re­spond­ing log mes­sage in api-c’s logs, you’ll see all the con­text from both api-a and api-b.

SNS

To cap­ture and for­ward cor­re­la­tion IDs through SNS mes­sages, you can use mes­sage attrib­ut­es.

In the api-a func­tion above, we also pub­lished a mes­sage to SNS (omit­ted from the code snip­pet above) with a cus­tom sns mod­ule which includes the cap­tured cor­re­la­tion IDs as mes­sage attrib­ut­es, see below.

When this SNS mes­sage is deliv­ered to a Lamb­da func­tion, you can see the cor­re­la­tion IDs in the MessageAttributes field of the SNS event.

Let’s cre­ate a snsHandler fac­to­ry func­tion to stan­dard­ise the process of cap­tur­ing incom­ing cor­re­la­tion IDs via SNS mes­sage attrib­ut­es.

We can use this fac­to­ry func­tion to quick­ly cre­ate SNS han­dler func­tions. The log mes­sages from these han­dler func­tions will have access to the cap­tured cor­re­la­tion IDs. If you use the afore­men­tioned cus­tom httpmod­ule to make out­go­ing HTTP requests then they’ll be includ­ed as HTTP head­ers auto­mat­i­cal­ly.

For instance, the fol­low­ing SNS han­dler func­tion would cap­ture incom­ing cor­re­la­tion IDs, include them in log mes­sages, and pass them on when mak­ing a HTTP request to api-c (see archi­tec­ture dia­gram).

Those cor­re­la­tion IDs (includ­ing the one added by the SNS han­dler func­tion) are includ­ed as HTTP head­ers.

Kinesis Streams

Unfor­tu­nate­ly, with Kine­sis and DynamoDB Streams, there’s no way to tag addi­tion­al infor­ma­tion with the pay­load. Instead, in order to pass cor­re­la­tion IDs along, we’d have to mod­i­fy the actu­al pay­load itself.

Let’s cre­ate a kinesis mod­ule for send­ing events to a Kine­sis stream, so that we can insert a __context field to the pay­load to car­ry the cor­re­la­tion IDs.

On the receiv­ing end, we can take it out, use it to set the cur­rent requestContext, and delete this __context field before pass­ing it on to the Kine­sis han­dler func­tion for pro­cess­ing. The sender and receiv­er func­tions won’t even notice we mod­i­fied the pay­load.

Wait, there’s one more prob­lem — our Lamb­da func­tion will receive a batch of Kine­sis records, each with its own con­text. How will we con­sol­i­date that?

The sim­plest way is to force the han­dler func­tion to process records one at a time. That’s what we’ve done in the kinesisHandler fac­to­ry func­tion here.

The han­dler func­tion (cre­at­ed with the kinesisHandler fac­to­ry func­tion) would process one record at at time, and won’t have to wor­ry about man­ag­ing the request con­text. All of its log mes­sages would have the right cor­re­la­tion IDs, and out­go­ing HTTP requests, SNS mes­sages and Kine­sis events would also pass those cor­re­la­tion IDs along.

When api-c receives the invo­ca­tion event, you can see the cor­re­la­tion IDs have been passed along via HTTP head­ers.

This approach is sim­ple, devel­op­ers work­ing on Kine­sis han­dler func­tions won’t have to wor­ry about the imple­men­ta­tion details of how cor­re­la­tion IDs are cap­tured and passed along, and things “just work”.

How­ev­er, it also removes the oppor­tu­ni­ty to opti­mize by pro­cess­ing all the records in a batch. Per­haps your han­dler func­tion has to per­sist the events to a per­sis­tence store that’s bet­ter suit­ed for stor­ing large pay­loads rather than lots of small ones.

This sim­ple approach is not the right fit for every sit­u­a­tion, an alter­na­tive would be to leave the __context field on the Kine­sis records and let the han­dler func­tion deal with them as it sees fit. In which case you would also need to update the shared libraries — the loghttpsns and kinesismod­ules we have talked about so far — to give the caller to option to pass in a requestContext as over­ride.

This way, the han­dler func­tion can process the Kine­sis records in a batch. Where it needs to log or make a net­work call in the con­text of a spe­cif­ic record, it can extract and pass the request con­text along as need be.

The End

That’s it, folks. A blue­print for how to cap­ture and for­ward cor­re­la­tion IDs through 3 of the most com­mon­ly used event sources for Lamb­da.

Here’s an anno­tat­ed ver­sion of the archi­tec­ture dia­gram ear­li­er, show­ing the flow of data as they’re cap­tured and for­ward­ed from one invo­ca­tion to anoth­er, through HTTP head­ers, mes­sage attrib­ut­es, Kine­sis record data.

You can find a deploy­able ver­sion of the code you have seen in this post in this repo. It’s intend­ed for demo ses­sions in my O’Reilly course detailed below, so doc­u­men­ta­tion is seri­ous­ly lack­ing at the moment, but hope­ful­ly this post gives you a decent idea of how the project is held togeth­er.

Other event sources

There are plen­ty of event sources that we didn’t cov­er in this post.

It’s not pos­si­ble to pass cor­re­la­tion IDs through every event source, as some do not orig­i­nate from your sys­tem — eg. Cloud­Watch Events that are trig­gered by API calls made by AWS ser­vice.

And it might be hard to pass cor­re­la­tion IDs through, say, DynamoDB Streams — the only way (that I can think of) for it to work is to include the cor­re­la­tion IDs as fields in the row (which, might not be such a bad idea but it does have cost impli­ca­tions).

Like what you’re read­ing? Check out my video course Pro­duc­tion-Ready Server­less and learn the essen­tials of how to run a server­less appli­ca­tion in pro­duc­tion.

We will cov­er top­ics includ­ing:

  • authen­ti­ca­tion & autho­riza­tion with API Gate­way & Cog­ni­to
  • test­ing & run­ning func­tions local­ly
  • CI/CD
  • log aggre­ga­tion
  • mon­i­tor­ing best prac­tices
  • dis­trib­uted trac­ing with X-Ray
  • track­ing cor­re­la­tion IDs
  • per­for­mance & cost opti­miza­tion
  • error han­dling
  • con­fig man­age­ment
  • canary deploy­ment
  • VPC
  • secu­ri­ty
  • lead­ing prac­tices for Lamb­da, Kine­sis, and API Gate­way

You can also get 40% off the face price with the code ytcui. Hur­ry though, this dis­count is only avail­able while we’re in Manning’s Ear­ly Access Pro­gram (MEAP).