AWS Lambda — use the invocation context to better handle slow HTTP responses

With API Gateway and Lambda, you're forced to use relatively short timeouts on the server side:

  • API Gateway has a 30s max timeout on all integration points
  • the Serverless framework uses a default timeout of 6s for AWS Lambda functions

However, since you have limited influence over a Lambda function's cold start time and no control over the latency overhead API Gateway introduces, the actual client-facing latency of a calling function is far less predictable.

See the API Gateway metrics (in particular Latency and IntegrationLatency) for a way to measure this overhead:

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/api-gateway-metrics-dimensions.html

To prevent slow HTTP responses from causing the calling function to time out (and therefore degrading the user experience we offer), we should stop waiting for a response before the calling function itself times out.

“the goal of the timeout strategy is to give HTTP requests the best chance to succeed, provided that doing so does not cause the calling function itself to err”

- me

Most of the time, I see folks use fixed timeout values (either hard-coded or specified via config), and the right value is often tricky to decide:

  • too short, and you won't give the request the best chance to succeed, e.g. there's 5s left in the invocation but we had set the timeout to 3s
  • too long, and you run the risk of letting the request time out the calling function, e.g. there's 5s left in the invocation but we had set the timeout to 6s

This challenge of choosing the right timeout value is further complicated by the fact that we often perform more than one HTTP request during a function invocation. For example: read from DynamoDB, talk to some internal API, then save changes to DynamoDB, which adds up to 3 HTTP requests in one invocation.
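For illustration, this is roughly what the fixed-timeout approach looks like. This is a minimal sketch, assuming Node.js 18+ (where fetch and AbortSignal.timeout are available globally); the endpoint URL, the function name and the 3000ms value are made up for the example.

// A minimal sketch of the fixed-timeout approach, assuming Node.js 18+ where
// fetch and AbortSignal.timeout are available globally. The endpoint URL and
// the 3000ms value are made up for illustration.
const FIXED_TIMEOUT_MS = 3000;

async function getUserProfile(userId: string): Promise<unknown> {
  const url = `https://internal-api.example.com/users/${userId}`;
  // the request is aborted after 3s, no matter how much invocation time is actually left
  const res = await fetch(url, { signal: AbortSignal.timeout(FIXED_TIMEOUT_MS) });
  return res.json();
}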

Let's look at two common approaches for picking timeout values, and the scenarios where they fall short of meeting this goal:

  • requests are not given the best chance to succeed, because the fixed timeout is shorter than the invocation time actually available
  • requests are allowed too much time to execute and end up timing out the calling function

Instead, we should set the request timeout based on the amount of invocation time left, whilst taking into account the time required to perform any recovery steps, e.g. returning a meaningful error with an application-specific error code in the response body, or returning a fallback result instead.

You can easily find out how much time is left in the current invocation through the context object your function is invoked with, via its getRemainingTimeInMillis() method.

https://docs.aws.amazon.com/lambda/latest/dg/nodejs-prog-model-context.html

For example, if a function's timeout is 6s, but by the time you make the HTTP request you're already 1s into the invocation (perhaps you had to do some expensive computation first), and we reserve 500ms for recovery, then that leaves us 4.5s to wait for the HTTP response.
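Here's a minimal sketch of that calculation in a Node.js function. The requestTimeoutMs helper, the 500ms buffer and the endpoint URL are all illustrative; context.getRemainingTimeInMillis() is part of the standard Lambda context object, and the Context type comes from the @types/aws-lambda package.

import type { Context } from 'aws-lambda';

// time reserved for recovery steps: logging, metrics, building a fallback response
const RECOVERY_BUFFER_MS = 500;

// derive the request timeout from the time left in the current invocation
function requestTimeoutMs(context: Context): number {
  return Math.max(context.getRemainingTimeInMillis() - RECOVERY_BUFFER_MS, 0);
}

export const handler = async (_event: unknown, context: Context) => {
  // e.g. 4500ms if 5s of invocation time remain and we reserve 500ms for recovery
  const timeoutMs = requestTimeoutMs(context);
  const res = await fetch('https://internal-api.example.com/some-resource', {
    signal: AbortSignal.timeout(timeoutMs),
  });
  return { statusCode: 200, body: await res.text() };
};

The same helper can be reused for every HTTP request in the invocation, so later requests automatically get however much time is genuinely left.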

With this approach, we get the best of both worlds:

  • allow requests the best chance to succeed, based on the actual amount of invocation time we have left; and
  • prevent slow responses from timing out the function, which gives us a window of opportunity to perform recovery actions


But what are you going to do AFTER you time out these requests? Aren't you still going to have to respond with an HTTP error, since you couldn't finish whatever operations you needed to perform?

At a minimum, the recovery actions should include:

  • log the timeout incident with as much context as possible (e.g. how much time the request had), including all the relevant correlation IDs
  • track a custom metric for serviceX.timedout so it can be monitored and the team can be alerted if the situation escalates
  • return an application error code in the response body (see the example below), along with the request ID, so the user-facing client app can display a user-friendly message like “Oops, looks like this feature is currently unavailable, please try again later. If this is urgent, please contact us at xxx@domain.com and quote the request ID f19a7dca. Thank you for your cooperation :-)”
{
  "errorCode": 10021,
  "requestId": "f19a7dca",
  "message": "service X timed out"
}
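Putting those recovery steps together, here's a hedged sketch of what the calling code might look like. The error code, metric name and correlationIds shape mirror the example above but are not a prescribed schema, and the structured log line simply stands in for whatever metrics mechanism you use.

// A sketch of the recovery steps above; assumes Node.js 18+ and treats a
// structured log line as the custom metric (e.g. picked up by a metric filter).
async function callServiceX(
  url: string,
  timeoutMs: number,
  requestId: string,
  correlationIds: Record<string, string>
) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { statusCode: 200, body: await res.text() };
  } catch (err) {
    if ((err as Error).name !== 'TimeoutError') throw err; // only handle timeouts here

    // 1. log the timeout incident with as much context as possible
    console.error('service X timed out', { url, timeoutMs, requestId, correlationIds });

    // 2. track a custom metric for serviceX.timedout so it can be monitored
    console.log(JSON.stringify({ metric: 'serviceX.timedout', count: 1 }));

    // 3. return an application error code and the request ID for the client app
    return {
      statusCode: 502,
      body: JSON.stringify({ errorCode: 10021, requestId, message: 'service X timed out' }),
    };
  }
}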

In some cases, you can also recover even more gracefully using fallbacks.

Netflix's Hystrix library, for instance, supports several flavours of fallbacks via the Command pattern it employs so heavily. In fact, if you haven't read its wiki page already, I strongly recommend giving it a thorough read: there's a ton of useful information and ideas there.

At the very least, every command lets you specify a fallback action.

You can also chain fallbacks together by chaining commands via their respective getFallback methods.

For example:

  1. execute a DynamoDB read inside CommandA
  2. in its getFallback method, execute CommandB, which would return a previously cached response if available
  3. if there is no cached response, CommandB fails and triggers its own getFallback method
  4. execute CommandC, which returns a stubbed response
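Hystrix itself is a Java library, but the same chain of fallbacks is easy to express directly. Below is a minimal TypeScript sketch of the idea, not Hystrix's actual API; readFromDynamo, readFromCache and stubbedResponse in the usage comment are hypothetical stand-ins for CommandA, CommandB and CommandC.

// A sketch of chained fallbacks: try the primary call, then each fallback in
// turn, mirroring CommandA -> CommandB -> CommandC above. Not Hystrix's API.
type Fetcher<T> = (id: string) => Promise<T>;

async function withFallbacks<T>(id: string, primary: Fetcher<T>, ...fallbacks: Fetcher<T>[]): Promise<T> {
  let lastError: unknown;
  for (const attempt of [primary, ...fallbacks]) {
    try {
      return await attempt(id);
    } catch (err) {
      lastError = err; // remember the failure and move on to the next fallback
    }
  }
  throw lastError; // every fallback failed
}

// usage (readFromDynamo, readFromCache and stubbedResponse are hypothetical):
// const item = await withFallbacks(id, readFromDynamo, readFromCache, stubbedResponse);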

Anyway, check out Hystrix if you haven't already. Most of the patterns baked into Hystrix can be easily adopted in our serverless applications to help make them more resilient to failures, something I'm actively exploring in a separate series on applying principles of chaos engineering to Serverless.

Like what you're reading but want more help? I'm happy to offer my services as an independent consultant and help you with your serverless project: architecture reviews, code reviews, building proof-of-concepts, or advice on leading practices and tools.

I'm based in London, UK, and I'm currently the only UK-based AWS Serverless Hero. I have nearly 10 years of experience running production workloads in AWS at scale. I operate predominantly in the UK, but I'm open to travelling for engagements that are longer than a week. To see how we might work together, tell me more about the problems you are trying to solve here.

I can also run an in-house workshop to help you get production-ready with your serverless architecture. You can find out more about the two-day workshop here; it takes you from the basics of AWS Lambda all the way through to common operational patterns for log aggregation, distributed tracing and security best practices.

If you prefer to study at your own pace, then you can also find all the same content of the workshop as a video course I have produced for Manning. We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we're in Manning's Early Access Program (MEAP).