aws x-ray and lambda: the good, the bad and the ugly

AWS announced general availability of AWS Lambda support for AWS X-Ray back in May. It's taken me a while to try it out, and whilst I see plenty of value, I think its current limitations significantly restrict its usefulness in a complex system.

I found the Lambda-specific documentation to be fragmented, and I had to rely on experimentation and piecing together clues from several sources.

I also found that recording annotations and metadata didn't work as advertised in the sample (although admittedly I could be doing something wrong…).


Update 03/07/2017: after this post was published, the folks at AWS got in touch and kindly cleared up some of the issues highlighted here, which were caused by poor documentation that they'll rectify in the near future. Scroll down to see the clarifications on the relevant sections.


The Sample Application

The sample project I created centres around a Lambda function called service-a, which in turn calls a number of downstream systems:

  • publishing to an SNS topic
  • GETting and PUTting an object in S3
  • GETting and PUTting a row in DynamoDB
  • invoking another Lambda function (service-c) using the Lambda API
  • making an HTTP request to an API Gateway endpoint backed by another Lambda function (one of the service-b, error and timeout functions in the diagram above, which represent the success, error and timeout cases respectively)

You can find all the source code here.
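To give a feel for how these downstream calls get traced, here's a minimal sketch (not the actual source code; the environment variables and payloads are illustrative) of wrapping the AWS SDK with the X-Ray SDK so that the SNS, S3, DynamoDB and Lambda calls show up as subsegments:

const AWSXRay = require('aws-xray-sdk');
// wrapping the AWS SDK means calls made through it are recorded as subsegments
const AWS = AWSXRay.captureAWS(require('aws-sdk'));

const sns = new AWS.SNS();
const s3 = new AWS.S3();

module.exports.handler = (event, context, callback) => {
  // both of these calls appear in service-a's trace
  sns.publish({ TopicArn: process.env.TOPIC_ARN, Message: 'hello world' }).promise()
    .then(() => s3.getObject({ Bucket: process.env.BUCKET, Key: 'test.txt' }).promise())
    .then(() => callback(null, { statusCode: 200 }))
    .catch(callback);
};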

The Good

Once I figured out the magic incantations I was able to get the results I was after. It took more time and energy than it should have, but by and large most features worked as advertised at the first (or second) time of asking.

This is a trace of the service-a function, which includes the time it takes for Lambda to initialise the function, and the various downstream systems it talked to, all nested under custom subsegments. It even includes the trace of the service-c function (and the time it spent publishing to SNS), which was invoked using the Lambda API.

The service map for service-a includes service-c as a downstream dependency, as well as service-c's dependency on SNS.

The Bad

It’s always 200…

When the service-a function is invoked through its API Gateway endpoint and errors, the corresponding trace still reports a 200 response code.

Presumably what X-Ray sees is a 200 response from the Lambda service whose payload indicates a 502 response to the API Gateway invocation, and so it thought "hey, it's a 200!".

Here, I can see the service-a endpoint returned a 502 in Postman…

…but the trace reported a 200 response code.

Oddly enough, the trace for the error function also reports a 200, even though its own status field indicates it had errored.

This behaviour is both confusing and inconsistent to me; perhaps I have misunderstood how it works. Sadly, X-Ray's concepts page also does not explain the difference between an Error and a Fault.

Whilst this might seem like a small nuisance now, the inability to quickly identify error traces will hurt you most when you need to diagnose problems in production, possibly when you're under the most time pressure.


Update 03/07/2017: AWS confirmed that the reason the errors are reported as 200 is that the Lambda service returns a 200 response (with a payload that indicates an error). One workaround is to use the filter expression service() { fault }, which returns all traces that contain a fault.


Traces don’t span over API Gateway

When the service-a function makes an outgoing HTTP request to an API Gateway endpoint, the trace stops at the API Gateway endpoint and doesn't extend to the Lambda functions that are triggered by API Gateway.

This behaviour was consistent with all three endpoints I tested: service-b, error and timeout.

For this test, I followed the X-Ray documentation and used the X-Ray SDK to wrap the Node.js https module when making the HTTP request.
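For reference, that wrapping looks roughly like this (a sketch, not the actual source; the hostname and path are placeholders). The X-Ray SDK patches the https module so the outgoing request is recorded as a subsegment and the trace context is propagated in the X-Amzn-Trace-Id header:

const AWSXRay = require('aws-xray-sdk');
// wrapping https means outgoing requests are traced and the
// X-Amzn-Trace-Id header is added to propagate the trace context
const https = AWSXRay.captureHTTPs(require('https'));

// placeholder hostname/path for the API Gateway endpoint
const options = {
  hostname: 'xxx.execute-api.us-east-1.amazonaws.com',
  path: '/dev/service-b',
  method: 'GET'
};

https.request(options, (res) => {
  console.log(`status: ${res.statusCode}`);
}).end();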

I can see the trace IDs are correctly passed along in the outgoing HTTP request and received by the handling Lambda function.

This is the service map I expected to see in this case, where service-a's trace follows through the HTTP request to API Gateway and includes the invocation of the timeout function.

p.s. this is not an actual screenshot; it's an image I composed to show what I expected to see!

Instead, the actual service map stops at the API Gateway.

However, when invoking another Lambda function directly (using the Lambda API and the wrapped AWS SDK), the trace worked as expected.
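That direct invocation was done through the instrumented SDK client, along these lines (a sketch; the function name matches the demo project but the payload is illustrative):

const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const lambda = new AWS.Lambda();

// invoking service-c through the wrapped SDK produces a connected trace
lambda.invoke({
  FunctionName: 'service-c',
  InvocationType: 'RequestResponse',
  Payload: JSON.stringify({})
}).promise()
  .then(res => console.log(res.StatusCode));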

Perhaps the limitation lies with API Gateway?

The Ugly

No sampling

According to Lambda's documentation on X-Ray, requests should be sampled at 1 request per minute.

However, that wasn't the case in my experiments. EVERY request was sampled, as you can see from the Age of the traces in the screenshot below.

This behaviour was consistent when invoking Lambda via API Gateway as well as via the Lambda management console.

Whilst the X-Ray service is not expensive per se—$5.00 per million traces—it's nonetheless a cost that can easily spring up on you if you are unwillingly tracing every request through your system. As an example, I worked on a moderately successful social game at Gamesys with ~1M DAU. At roughly 250M user requests per day, X-Ray would have cost $5 * 250 * 30 days = $37,500 per month, which was more than our entire AWS bill at the time!
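To make the back-of-the-envelope maths explicit (the request volume here is the example above, not a measured figure):

// estimated monthly cost of tracing every request at $5 per million traces
const tracesPerDay = 250e6;   // ~250M user requests per day
const days = 30;
const costPerMillion = 5;     // USD

const monthlyCost = (tracesPerDay * days / 1e6) * costPerMillion;
console.log(monthlyCost);     // 37500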


Update 03/07/2017: this turns out to be a problem with the documentation, which doesn't mention that sampling is volume-based and only kicks in once you reach a certain volume of requests/s.


Annotations and Metadata only work on subsegments

The one thing that just refused to work (even though I followed the examples) was adding annotations and metadata to the root segment:

const AWSXRay = require('aws-xray-sdk');

module.exports.handler = (event, context, callback) => {
  ...
  // fetch what should be the root segment for this invocation
  let segment = AWSXRay.getSegment();
  let n = Math.random() * 3;
  segment.addMetadata('random', `${n}`);      // this doesn't work
  segment.addAnnotation('path', event.path);  // this doesn't work
  ...
}

Interestingly, adding annotations and metadata to subsegments works just fine.
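For instance, something along these lines (a sketch with illustrative names, not the actual demo code) records the annotation and metadata without any problems:

const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const sns = new AWS.SNS();

module.exports.handler = (event, context, callback) => {
  // annotations and metadata attached to a custom subsegment do show up
  AWSXRay.captureAsyncFunc('publish-to-sns', (subsegment) => {
    subsegment.addAnnotation('path', event.path);              // works
    subsegment.addMetadata('random', `${Math.random() * 3}`);  // works

    sns.publish({ TopicArn: process.env.TOPIC_ARN, Message: 'hello' }, (err) => {
      subsegment.close(err);
      callback(err, err ? null : { statusCode: 200 });
    });
  });
};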



Looking at the logs, I found something interesting: the segment ID for the root segment doesn't match the segment ID in the X-Ray trace.

For instance, I fetch the root segment for my function in the handler and log it to CloudWatch Logs.

const AWSXRay = require('aws-xray-sdk');
module.exports.handler = (event, context, callback) => {
  ...
  // this should be the root segment for my function
  let segment = AWSXRay.getSegment();
  console.log(JSON.stringify(segment));
  ...
}

In the logs I can see the segment ID is 05b2b9ac6c9e5682.

But in the X-Ray trace, the segment ID for the root segment is 2b7d5b4a2a2d96e9.


Furthermore, the trace ID is also different:

  • in the logs it's 1-59504311-d765e7accb8558871fa89d6d
  • in the X-Ray console it's 1-59504312-5ef2a3eda0c1b2c4d64dda00

This was very odd, so I decided to track the trace ID in the logs vs in the X-Ray console, starting with a cold start.

Bingo! It looks like a bug in the X-Ray SDK for Node.js where AWSXRay.getSegment() returns the root segment from the previous invocation…


Update 03/07/2017: whilst there was a bug in the X-Ray SDK with regard to tracking the trace ID, adding annotations and metadata to the root segment is simply not supported, which the documentation doesn't explicitly state. You can work around this by creating a subsegment that covers the entire span of your function invocation to act as your custom root segment, and attaching any annotations and metadata related to the invocation there.
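A minimal sketch of that workaround might look like this (doWork is a hypothetical placeholder for the real handler logic; the annotation and metadata values are illustrative):

const AWSXRay = require('aws-xray-sdk');

// hypothetical placeholder for the real handler logic
const doWork = async (event) => ({ statusCode: 200 });

module.exports.handler = (event, context, callback) => {
  // a subsegment spanning the whole invocation acts as a custom "root"
  // to hang invocation-level annotations and metadata off
  AWSXRay.captureAsyncFunc('invocation', (subsegment) => {
    subsegment.addAnnotation('path', event.path);
    subsegment.addMetadata('random', `${Math.random() * 3}`);

    doWork(event)
      .then(result => {
        subsegment.close();
        callback(null, result);
      })
      .catch(err => {
        subsegment.close(err);
        callback(err);
      });
  });
};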


Conclusion

So there you have it, my weekend escapade with AWS X-Ray and Lambda :-)

Overall I'm impressed with what I saw, and feel X-Ray would have added a lot of value to the serverless architecture I was building at Yubl. However, the inability to span traces over API Gateway endpoints makes it a far less useful addition to our ecosystem.

Furthermore, the X-Ray service is focused on execution time and helping you identify performance bottlenecks. However, there's another important aspect to distributed tracing: helping you debug your system by ensuring a set of correlation IDs is captured in all log messages. X-Ray does not concern itself with this; although you can probably use the trace IDs X-Ray provides, it's still up to you to capture them in all log messages and to propagate all your logs to one easily searchable place. We invested some effort into enabling distributed tracing in our serverless architecture at Yubl, which you can read about in detail in this post.

Are you using X-Ray in production? I'd love to hear your thoughts and feedback on the service, and any pitfalls to look out for.

Like what you're reading? Check out my video course Production-Ready Serverless and learn the essentials of how to run a serverless application in production.

We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we're in Manning's Early Access Program (MEAP).