Hello! Sorry for the lack of posts recently; it’s been a pretty hectic time here at Yubl, with plenty of exciting work happening and even more on the way. Hopefully I will be able to share with you some of the cool things we have done and the valuable lessons we have learnt from working with AWS Lambda and Serverless in production.
Today’s post is one such lesson, a slightly baffling and painful one at that.
We noticed that the Lambda function behind one of our APIs in Amazon API Gateway was timing out consistently (the function is configured with a 6s timeout, which is what you see in the diagram below).
Looking in the logs, it appeared that one instance of our function was constantly timing out (based on the frequency of the timeouts, I could deduce that AWS Lambda had 3 instances of my function running at the same time).
What’s even more baffling is that, after the first timeout, the subsequent Lambda invocations never even enter the body of my handler function!
Considering that this is a Node.js function (running on the Node.js 4.3 runtime), this symptom is similar to what one’d expect if a synchronous operation is blocking the event queue so that nothing else gets a chance to run. (oh, how I miss Erlang VM’s pre-emptive scheduling at this point!)
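To illustrate the kind of blocking I mean, here is a minimal sketch in plain Node.js. The names blockFor and measureTimerDelay are made up for this illustration; it only shows that a timer cannot fire while synchronous code is hogging the thread, and it is not a claim about what actually happens inside the Lambda container.

```javascript
'use strict';

// Busy-waits synchronously for roughly `ms` milliseconds.
// While this loop spins, nothing else on the event loop can run.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

// Schedules a timer for `timerMs`, then blocks synchronously for `blockMs`.
// Resolves with how long the timer actually took to fire.
function measureTimerDelay(timerMs, blockMs) {
  return new Promise(resolve => {
    const start = Date.now();
    setTimeout(() => resolve(Date.now() - start), timerMs);
    blockFor(blockMs);
  });
}

measureTimerDelay(10, 200).then(actual => {
  console.log(`a 10ms timer fired after ${actual}ms`);
});
```

The timer asks for 10ms but can only fire once the blocking loop has finished, so the measured delay is at least as long as the loop.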
So, to summarise, here are the symptoms that we observed:
- the function times out the first time
- all subsequent invocations time out without executing the handler function
- it continues to time out until Lambda recycles the underlying resource that runs your function
which, as you can imagine, is pretty scary – one strike, and you’re out!
Oh, and I managed to reproduce the symptoms with Lambda functions with other event source types too, so it’s not specific to API Gateway endpoints.
Bluebird – the likely culprit
After investigating the issue some more, I was able to isolate the problem to the use of bluebird Promises.
I was able to replicate the issue with the simple example below, where the function itself is configured to time out after 1s.
As you can see from the log messages below, as I repeatedly hit the API Gateway endpoint, the invocations continue to time out without printing the hello~~~ message.
At this point, your options are:
a) wait it out, or
b) do a dummy update with no code change
On the other hand, a hand-rolled delay function using vanilla Promise works as expected with regard to timeouts.
The obvious workaround is not to use bluebird, or any library that uses bluebird under the hood – e.g. promised-mongo.
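A hand-rolled delay along these lines might look like the sketch below. The 2000ms delay and the shape of the response are assumptions for illustration; the only point is that it is built on the native Promise and setTimeout, with no bluebird involved.

```javascript
'use strict';

// Hand-rolled delay built on the native Promise (no bluebird).
function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Illustrative handler: the 2000ms delay is an assumed value, chosen to be
// longer than a 1s function timeout so that every invocation times out.
function handler(event, context, callback) {
  console.log('hello~~~');
  delay(2000).then(() => callback(null, { message: 'done' }));
}

exports.handler = handler;
```

With this version each invocation still times out, but the next invocation enters the handler and prints hello~~~ again.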
Which sucks, because:
- bluebird is actually quite useful, and we use both bluebird and co quite heavily in our codebase
- we’d have to check every dependency to make sure it’s not using bluebird under the hood
- we can’t use other useful libraries that use bluebird internally
However, I did find that if you specify an explicit timeout using bluebird’s own timeout function (the .timeout method on a promise), then it’s able to recover correctly. Presumably using bluebird’s own timeout function gives it a clean timeout, whereas being forcibly timed out by the Lambda runtime screws with the internal state of its Promises.
The following example works as expected:
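As a rough sketch of the idea, assuming bluebird's Promise.delay and .timeout APIs: the createHandler factory and the delayMs/timeoutMs options are hypothetical names introduced here, and the promise library is passed in as an argument purely so the sketch does not hard-code the dependency. The 900ms default is an assumed value that sits just under a 1s function timeout.

```javascript
'use strict';

// Builds a Lambda handler around a bluebird-style promise library `P`.
// `P` is expected to provide P.delay(ms) and promise.timeout(ms), as
// bluebird does. The option names and defaults are assumptions.
function createHandler(P, options) {
  const opts = options || {};
  const delayMs = opts.delayMs || 2000;    // slow operation being simulated
  const timeoutMs = opts.timeoutMs || 900; // must be below the Lambda timeout

  return function handler(event, context, callback) {
    console.log('hello~~~');
    P.delay(delayMs)
      .timeout(timeoutMs) // the library times the promise out itself
      .then(
        () => callback(null, { message: 'done' }),
        err => callback(err) // e.g. the library's timeout error
      );
  };
}

// In the Lambda function itself this would be wired up roughly as:
//   exports.handler = createHandler(require('bluebird'));
```

The important detail is that the explicit timeout is shorter than the function's configured timeout, so the promise rejects before the Lambda runtime gets a chance to kill the invocation.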
But it wouldn’t be a workaround if it didn’t have its own caveats.
It means you now have one more error that you need to handle gracefully (e.g. by mapping the response in API Gateway to a 5XX HTTP status code), otherwise you’ll end up sending these kinda unhelpful responses back to your callers.
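One way to handle it, sketched below, is to translate the error into a message with a status prefix that an API Gateway integration response can pick out with an error regex. The toGatewayError helper, the "[504]" prefix convention and the wording of the messages are all assumptions for illustration, as is identifying the timeout by the error's name being 'TimeoutError'.

```javascript
'use strict';

// Maps an error to one whose message starts with a status code in square
// brackets, e.g. "[504] ...". An API Gateway integration response could then
// select on a regex such as ".*\[504\].*" (a hypothetical convention).
function toGatewayError(err) {
  // Assumption: the promise library marks timeouts with name 'TimeoutError'.
  const isTimeout = Boolean(err) && err.name === 'TimeoutError';
  const status = isTimeout ? 504 : 500;
  const reason = isTimeout
    ? 'Upstream operation timed out'
    : 'Internal server error';
  return new Error(`[${status}] ${reason}`);
}

// Rough usage inside a handler:
//   somePromise.timeout(900).then(
//     res => callback(null, res),
//     err => callback(toGatewayError(err))
//   );
```

Callers then get a deliberate 5XX response with a sensible message instead of whatever the raw error happens to serialise to.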
So there, a painful lesson we learnt whilst running Node.js Lambda functions in production. Hopefully you have found this post in time before running into the issue yourself!