AWS Lambda – constant timeout when using bluebird Promise

Hello! Sorry for the lack of posts recently, it’s been a pretty hectic time here at Yubl, with plenty of exciting work happening and even more on the way. Hopefully I will be able to share with you some of the cool things we have done and valuable lessons we have learnt from working with AWS Lambda and Serverless in production.

Today’s post is one such lesson, a slightly baffling and painful one at that.

 

The Symptoms

We noticed that the Lambda function behind one of our APIs in Amazon API Gateway was timing out consistently (the function is configured with a 6s timeout, which is what you see in the diagram below).

[image: lambda-bluebird-latency-spike]

Looking at the logs, it appeared that one instance of our function was constantly timing out (based on the frequency of the timeouts, I could deduce that AWS Lambda had 3 instances of my function running at the same time).

What’s even more baffling is that, after the first timeout, the subsequent Lambda invocations never even enter the body of my handler function!

Considering that this is a Node.js function (running on the Node.js 4.3 runtime), this symptom is similar to what one would expect if a synchronous operation were blocking the event loop so that nothing else gets a chance to run. (Oh, how I miss the Erlang VM’s pre-emptive scheduling at this point!)

So, to summarise, here are the symptoms we observed:

  1. the function times out the first time
  2. all subsequent invocations time out without executing the handler function
  3. the invocations continue to time out until Lambda recycles the underlying resource that runs your function

which, as you can imagine, is pretty scary – one strike, and you’re out.

Oh, and I managed to reproduce the symptoms with Lambda functions with other event source types too, so it’s not specific to API Gateway endpoints.

 

Bluebird – the Likely Culprit

After investigating the issue some more, I was able to isolate the problem to the use of bluebird Promises.

I was able to replicate the issue with the simple example below, where the function itself is configured to time out after 1s.

[image: lambda-bluebird-timeout-example]
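The code from the screenshot isn’t reproduced here, but the gist of it looks something like the following sketch (a hypothetical reconstruction – the exact code may differ). It uses bluebird’s Promise.delay to sleep well past the function’s 1s timeout, so the Lambda runtime is forced to kill the invocation:

const Promise = require('bluebird');

module.exports.handler = (event, context, callback) => {
  // the function is configured with a 1s timeout, so this 5s delay
  // guarantees the invocation is forcibly timed out by the Lambda runtime
  Promise.delay(5000).then(() => {
    console.log('hello~~~');
    callback(null, 'hello~~~');
  });
};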

As you can see from the log messages below, as I repeatedly hit the API Gateway endpoint, the invocations continue to timeout without printing the hello~~~ message.

[image: lambda-bluebird-timeout-example-log]

At this point, your options are:

a) wait it out, or

b) do a dummy update with no code change

 

On the other hand, a hand-rolled delay function using the vanilla Promise works as expected with regard to timeouts.

[image: lambda-bluebird-timeout-example-2]
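Again, a sketch of what the hand-rolled version might look like (not the exact code from the screenshot):

module.exports.handler = (event, context, callback) => {
  // hand-rolled delay using the built-in Promise
  const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

  delay(5000).then(() => {
    console.log('hello~~~');
    callback(null, 'hello~~~');
  });
};

With this version, once the first invocation is timed out by the runtime, subsequent invocations execute the handler as normal.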

[image: lambda-bluebird-timeout-example-log-2]

 

Workarounds

The obvious workaround is to not use bluebird, or any library that uses bluebird under the hood – e.g. promised-mongo.

Which sucks, because:

  1. bluebird is actually quite useful, and we use both bluebird and co quite heavily in our codebase
  2. we’d have to check every dependency to make sure it’s not using bluebird under the hood
  3. we couldn’t use other useful libraries that use bluebird internally

However, I did find that if you specify an explicit timeout using bluebird’s .timeout function, then it’s able to recover correctly. Presumably, using bluebird’s own timeout function gives it a clean timeout, whereas being forcibly timed out by the Lambda runtime screws with the internal state of its Promises.

The following example works as expected:

[image: lambda-bluebird-timeout-example-3]
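Roughly speaking (again, a sketch rather than the exact code in the screenshot), the trick is to let bluebird time itself out before the Lambda runtime does:

const Promise = require('bluebird');

module.exports.handler = (event, context, callback) => {
  Promise.delay(5000)
    .timeout(900) // time ourselves out before the 1s Lambda timeout kicks in
    .then(() => callback(null, 'hello~~~'))
    .catch(Promise.TimeoutError, err => callback(err)); // bluebird's filtered catch
};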

[image: lambda-bluebird-timeout-example-log-3]

But it wouldn’t be a workaround if it didn’t have its own caveats.

It means you now have one more error that you need to handle gracefully (e.g. by mapping the response in API Gateway to a 5XX HTTP status code), otherwise you’ll end up sending this kind of unhelpful response back to your callers.

[image: lambda-bluebird-timeout-example-log-4]

 

So there you have it, a painful lesson we learnt whilst running Node.js Lambda functions in production. Hopefully you have found this post before running into the issue yourself!

AWS Lambda – use recursive function to process SQS messages (Part 1)

It’s been a year since the release of AWS Lambda service and here at Yubl we’re at the start of an exciting journey to move our stack to Lambda (using the awesome Serverless framework).

One feature that’s sadly missing, though, is support for SQS. Every AWS evangelist I have spoken to tells me that it’s one of the most requested features from their customers and that it’s coming. It’s not the first time I have heard this kind of non-committal response from those involved with Amazon, and experience tells me not to expect it to happen anytime soon.

But, if you’re itching to use AWS Lambda with SQS and don’t wanna wait an unspecified amount of time to get there, you have some options right now:

  • use SNS or Kinesis instead
  • do it yourself with a recursive Lambda function that polls and processes SQS messages

 

TL;DR

Whilst you can use SNS and Kinesis with Lambda already, SQS’s support for dead letter queues still makes it a better choice in situations where eventual consistency can be tolerated but outright data loss should be mitigated.

Whilst you can process SQS using EC2-hosted applications already, Lambda delivers a potential cost saving for low-traffic environments and more granular cost control as you scale out. It also provides an easy and fast deployment pipeline, with support for versioning and rollbacks.

Finally, we’ll walk through a simple example and write a recursive Lambda function in Node.js using the Serverless framework.

 

Lambda + SNS/Kinesis vs. Lambda + SQS

As a compromise, both SNS and Kinesis can be used with Lambda to allow you to delay the execution of some work till later. But semantically there are some important differences to consider.

 

Retries

SQS has built-in support for dead letter queues – i.e. if a message is received N times and still not processed successfully then it is moved to a specified dead letter queue. Messages in the dead letter queue likely require some manual intervention and you typically would set up CloudWatch alarms to alert you when messages start to pour into the dead letter queue.

If the cause is temporary – for example, there are outages to downstream systems (the DB is unavailable, or other services are down/timing out) – then the dead letter queue helps:

  1. prevent the build-up of messages in the primary queue; and
  2. give you a chance to retry failed messages after the downstream systems have recovered (see the redrive policy sketch below).
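For reference, you attach a dead letter queue to the primary queue via a redrive policy. A sketch of one (the queue name and receive count here are made up for illustration):

{
  "maxReceiveCount": 5,
  "deadLetterTargetArn": "arn:aws:sqs:eu-west-1:<account-id>:HelloWorld-DLQ"
}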

With SNS, messages are retried 3 times and then lost forever. Whether or not this behaviour is a showstopper for using SNS needs to be judged against your requirements.

With Kinesis, the question becomes slightly more complicated. When using Kinesis with Lambda, your degree of parallelism (discussed in more detail below) is equal to the no. of shards in the stream. When your Lambda function fails to process a batch of events, it’ll be called again with the same batch of events, because AWS keeps track of your function’s position in that shard.

[image: lambda+kinesis-v2]

In essence, this means the retry strategy is up to you, but your choices are limited to:

  1. fail and always retry the whole batch (even if some messages were processed successfully) until either the fault heals itself or the messages in the batch are no longer available (Kinesis only keeps messages for up to 24 hours)
  2. never retry a failed batch

If you choose option 1 (which is the default behaviour), then you also have to ensure that messages are processed in a way that’s idempotent. If you choose option 2, then there’s a significant chance for data loss.

Neither option is very appealing, which is why I would normally use Kinesis in conjunction with SQS:

  • process a batch of messages, and queue failed messages into SQS
  • allow the processing of the shard to move on in spite of the failed messages
  • SQS messages are processed by the same logic, which requires the processing logic to be decoupled from the delivery of payloads (SNS, SQS, Kinesis, tests, etc.)

 

Parallelism

SNS executes your function for every notification, and as such has the highest degree of parallelism possible.

[image: lambda-parallelism-sns-01]

There’s a soft limit of 100 concurrent Lambda executions per region (which you can increase by raising a support ticket), though according to the documentation, AWS might increase the concurrent execution limit on your behalf in order to execute your function at least once per notification. However, as a safety precaution you should still set up CloudWatch alarms on the Throttles metric for your Lambda functions.

 

With Kinesis (and DynamoDB Streams for that matter), the degree of parallelism is the same as the no. of shards.

[image: lambda-parallelism-kinesis]

If you’re working with SQS today, the degree of parallelism would equal the no. of poll loops you’re running in your cluster.

[image: sqs-parallelism]

For example, if you’re running a tight poll loop for each core, and you have 3 quad-core EC2 instances in a cluster, then the degree of parallelism would be 3 instances * 4 cores = 12.

Moving forward, if you choose to use recursive Lambda functions to process SQS messages then you can choose the degree of parallelism you want to use.

 

Lambda + SQS vs. EC2 + SQS

Which brings us to the next question: if you can already use EC2 instances to process SQS messages, why bother moving to Lambda? I think the potential cost savings, and the ease and speed of deployment, are the main benefits.

 

Cost

If you use the smallest production-ready EC2 class – Linux t2.micro – it will cost you $10.25/month in the eu-west-1 region (Ireland).

Whilst the auto-scaling service is free to use, the default EC2 health checks cannot be relied upon to detect when your application has stopped working. To solve this problem, you’ll typically set up an ELB and use ELB health checks to trigger application-level checks that ensure the application is still running and processing messages.

The ELB health checks also enable the auto-scaling service to automatically replace unhealthy instances.

A minimum production deployment would therefore cost you $30.75 a month.

 

A recursive Lambda function running non-stop 24/7 would run for 2678400 seconds a month.

60 s * 60 m * 24 hr * 31 days = 2678400 s

If you assign 128MB to your function, then it would cost you $5.61 a month:

Monthly Compute Charge

  = Total Compute (GB-seconds) * ($0.00001667 / GB-second)

  = (2678400 s * 128 MB / 1024 MB) * $0.00001667

  = 334800 GB-seconds * $0.00001667

  = $5.581116

Monthly Request Charge

  = Total Requests * ($0.20 / Million Requests)

  = (2678400 s / 20 s) / 1000000 * $0.20

  = 133920 Requests / 1000000 * $0.20

  = $0.026784

Monthly Charge (Total)

  = Monthly Compute Charge + Monthly Request Charge

  = $5.581116 + $0.026784

  = $5.6079

Since Lambda’s free tier does not expire after the first 12 months, this would also fall within the free tier of 400000 GB-seconds per month.

 

However, there are other aspects to consider:

  • you can process several SQS queues on one EC2 instance
  • the cost of an ELB is amortised as the size of the cluster increases
  • moving up an EC2 instance class jumps in cost but also offers more compute and networking capability
  • in order to auto-scale your SQS-processing Lambda functions, you’ll need to provision other resources

The exact cost saving from using Lambda for SQS is not as clear-cut as I first thought. But at the very least, you have more granular control of your cost as you scale out.

A recursively executed, 128MB Lambda function would cost $5.61/month, whereas an auto-scaled cluster of t2.micro instances would go up in monthly cost in increments of $10.25.

[image: lambda-vs-ec2-cost]

 

Deployment

In the IaaS world, the emergence of container technologies has vastly improved the deployment story.

But, as a developer, I now have another set of technologies to come to grips with. The challenge of navigating this fast-changing space and making sensible decisions about a set of overlapping technologies (Kubernetes, Mesos, Nomad, Docker, Rocket, just to name a few) is not an easy one, and the consequences of these decisions will have a long-lasting impact on your organization.

Don’t get me wrong, I think container technologies are amazing and I’m excited to see the pace of innovation in that space. But they are a solution, not the goal; the goal is, and has always been, to deliver value to users quickly and safely.

Keeping up with all that is happening in the container space is overwhelming, and at times I can’t help but feel that I am trading one set of problems for another.

As a developer, I want to deliver value to my users first and foremost. I want all the benefits container technologies bring, but make their complexities someone else’s problem!

 

One of the best things about AWS Lambda – besides all the reactive programming goodness – is that deployment and scaling become Amazon’s problem.

I don’t have to think about provisioning VMs and clustering applications together; I don’t have to think about scaling the cluster and deploying my code onto them. I don’t have to rely on an ops team to monitor and manage my cluster and streamline our deployment. All of them, Amazon‘s problem! 

All I have to do to deploy a new version of my code to AWS Lambda is upload a zip file and hook up my Lambda function to the relevant event sources, and the job’s done!

Life becomes so much simpler 

With the Serverless framework, things get even easier!

Lambda supports the concept of versions and aliases, where an alias is a named version of your code that has its own ARN. Serverless uses aliases to implement the concept of stages – i.e. dev, staging, prod – to mirror the concept of stages in Amazon API Gateway.

To deploy a new version of your code to staging:

  • you publish a new version with your new code
  • update the staging alias to point to that new version
  • and that’s it! Instant deployment with no downtime!

Similarly, to rollback staging to a previous version:

  • update the staging alias to point to the previous version
  • sit back and admire the instant rollback, again with no downtime!

What’s more, Serverless streamlines these processes into a single command!

 

Writing a recursive Lambda function to process SQS messages with Serverless

(p.s. you can find the source code for the experiment here.)

First, let’s create a queue in SQS called HelloWorld.

[image: recursive_sqs_01]

Notice that although we have specified the default visibility timeout and receive message wait time (for long polling) values here, we’ll override them in the ReceiveMessage request later.

Then we’ll create a new Serverless project, and add a function called say-hello.

Our project structure looks roughly like this:

[image: recursive_sqs_004]

In the handler.js module, let’s add the following.

[image: recursive_sqs_000]
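The screenshot isn’t reproduced here, but the top of handler.js presumably looks something along these lines (a sketch):

'use strict';

const AWS    = require('aws-sdk');
const SQS    = new AWS.SQS();
const Lambda = new AWS.Lambda();

// both of these are populated by Serverless (see below)
const QUEUE_URL = process.env.QUEUE_URL;
const FUNC_NAME = process.env.FUNC_NAME;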

Notice we’re relying on two environment variables here – QUEUE_URL and FUNC_NAME. Both will be populated by Serverless using values that we specify in s-function.json (to understand how this works, check out Serverless’s documentation).

 

Next, we’ll write the handler code for our Lambda function.

Here, we will:

  1. make a ReceiveMessage request to SQS using long polling (20s)
  2. for every message we receive, we’ll process it with a sayHello function (which we’ll write next)
  3. the sayHello function will return a Promise
  4. when all the messages have been processed, we’ll recurse by invoking this Lambda function again

[image: recursive_sqs_001]
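A hedged reconstruction of what the screenshot shows (the batch size and visibility timeout here are assumptions):

module.exports.handler = (event, context, callback) => {
  const params = {
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10, // the maximum SQS allows
    VisibilityTimeout: 6,    // overrides the queue's default
    WaitTimeSeconds: 20      // long polling
  };

  SQS.receiveMessage(params).promise()
    .then(res => Promise.all((res.Messages || []).map(sayHello)))
    .then(() => recurse())              // invoke this function again
    .then(() => callback(null, 'done'))
    .catch(err => callback(err));       // fail the current execution
};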

In the sayHello function, we’ll log a message and delete the message in SQS.

One caveat to remember is that Promise.all will reject immediately if any of the Promises rejects, which is why I’m handling any error related to deleting the message in SQS here with .catch – it restores the chain rather than allowing the rejection to bubble up.

[image: recursive_sqs_002]
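Something along these lines (a sketch, assuming the message body carries the name to greet):

function sayHello(message) {
  console.log(`Hello, ${message.Body} of message ID [${message.MessageId}]`);

  const delParams = {
    QueueUrl: QUEUE_URL,
    ReceiptHandle: message.ReceiptHandle
  };

  return SQS.deleteMessage(delParams).promise()
    .catch(err => console.error(err)); // restore the chain so Promise.all doesn't reject
}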

This implementation, however, doesn’t handle errors arising from the processing of the message itself (i.e. logging a message in this case). To do that, you’ll have to wrap the processing logic in a Promise, e.g.

return new Promise((resolve, reject) => {
    // processing logic goes here
    resolve();
})
.then(() => SQS.deleteMessage(delParams).promise())
.then(…)
.catch(…);

 

Finally, let’s add a recurse function to invoke this function again.

[image: recursive_sqs_003]
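Something like this sketch:

function recurse() {
  const params = {
    FunctionName: FUNC_NAME,
    InvokeArgs: JSON.stringify({})
  };

  // no .catch here – let any rejection bubble up to the main handler
  return Lambda.invokeAsync(params).promise();
}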

A couple of things to note about this function:

  1. unlike sayHello, it doesn’t catch its own errors; this allows the rejection to bubble up to the main handler function, which will then fail the current execution
    • by failing the current execution this way, we can use the Errors metric in CloudWatch to re-trigger this Lambda function (more on this in Part 2)
  2. we’re calling invokeAsync instead of invoke so we don’t have to (in this case, we can’t!) wait for the function to return

 

Deploying the Lambda function

You’ll notice that we haven’t provided any values for the two environment variables – QUEUE_URL and FUNC_NAME – we mentioned earlier. That’s because we don’t have the ARN for the Lambda function yet!

So, let’s deploy the function using Serverless. Run serverless dash deploy and follow the prompt.

[image: recursive_sqs_02]

Aha, there’s the ARN for our Lambda function!

Go back to the project, open _meta/variables/s-variables-dev-euwest1.json and add the variables queueUrl and funcName.

[image: recursive_sqs_03]
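i.e. something like the following (abbreviated; the account ID is elided and your values will differ):

{
  ...
  "queueUrl": "https://sqs.eu-west-1.amazonaws.com/<account-id>/HelloWorld",
  "funcName": "arn:aws:lambda:eu-west-1:<account-id>:function:recursive-lambda-for-sqs-say-hello"
}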

However, there’s another layer of abstraction we need to address to ensure our environment variables are populated correctly.

Open processors/say-hello/s-function.json.

In the environment section, add QUEUE_URL and FUNC_NAME like below:

[image: recursive_sqs_14]
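i.e. mapping each environment variable to the Serverless variable we just defined (a sketch; the ${...} templates are substituted from the _meta/variables files):

"environment": {
  "QUEUE_URL": "${queueUrl}",
  "FUNC_NAME": "${funcName}"
}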

Now do you see how things fit together?

[image: serverless-env-vars]

Since we’ve made changes to environment variables, we’ll need to redeploy our Lambda function.

 

Testing the deployed Lambda function

Once your Lambda function is deployed, you can test it from the management console.

[image: recursive_sqs_04]

See that blue Test button on the top left? Click that.

Since our function doesn’t make use of the incoming payload, you can send it anything.

Oops.

If you run the function as it stands right now, you’ll get a permission denied error, because the IAM role our function executes with doesn’t have permission to use the SQS queue or to invoke a Lambda function.

That’s great, because it tells me I have granular control over what my Lambda functions can do and I can control it using IAM roles.

But we do need to go back to the project and give our function the necessary permissions.

 

Configuring execution role permissions

Back in the Serverless project, take a look at s-resources-cf.json in the root directory. This is a templated CloudFormation file that’s used to create the AWS resources for your function, including the execution role for your functions.

Wherever you see ${stage} or ${region}, these will be substituted by variables of the same name under _meta/variables/xxx.json files. Have a read of the Serverless documentation if you wanna dig deeper into how the templating system works.

By default, Serverless creates an execution role that can log to CloudWatch Logs.

{
  "Effect": "Allow",
  "Action": [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents"
  ],
  "Resource": "arn:aws:logs:${region}:*:*"
}

You can edit s-resources-cf.json to grant the execution role permissions to use our queue and invoke the function to recurse. Add the following statements:

{
  "Effect": "Allow",
  "Action": [
    "sqs:ReceiveMessage",
    "sqs:DeleteMessage"
  ],
  "Resource": "arn:aws:sqs:${region}:*:HelloWorld"
},
{
  "Effect": "Allow",
  "Action": "lambda:InvokeFunction",
  "Resource": "arn:aws:lambda:${region}:*:function:recursive-lambda-for-sqs-say-hello:${stage}"
}

We can update the existing execution role by deploying the CloudFormation change. To do that with Serverless, run serverless resources deploy (you can also use the sls shorthand to save yourself some typing).

[image: recursive_sqs_06]

If you test the function again, you’ll see in the logs (which you can find in the CloudWatch Logs management console) that the function is running and recursing.

 

Now, pop over to the SQS management console and select the HelloWorld queue we created earlier. Under the Queue Actions drop down, select Send Message.

In the following popup, you can send a message into the queue.

[image: recursive_sqs_07]

Once the message is sent, you can check the CloudWatch Logs again and see that the message was received and a “Hello, Yan of message ID […]” message was logged.

[image: recursive_sqs_08]

And voila, our recursive Lambda function is processing SQS messages!

[image: recurse-lambda-for-sqs-v0]

 

In Part 2, we’ll look at a simple approach to autoscale the no. of concurrent executions of our function, and restart failed functions.

Javascript – string replace all without Regex

I have been working with Node.js and Serverless heavily. It’s the first time I’ve really spent a serious amount of time in Javascript, and I’m making plenty of beginner mistakes and learning lots.

One peculiar thing I find in Javascript is that String.replace only replaces the first instance of the substring, unless you use a Regex with the global (g) flag. Fortunately, I found a clever little pattern to do this without Regex: String.split(subString).join(replaceString).

So suppose you want to remove the hyphen (-) in an AWS region name, you could write:

let region = 'eu-west-1';

let regionNoHyphen = region.split('-').join('');

Much easier than using Regex, wouldn’t you agree?
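If you find yourself doing this a lot, you could wrap the trick in a tiny helper, e.g.:

function replaceAll(str, subString, replaceString) {
  return str.split(subString).join(replaceString);
}

replaceAll('eu-west-1', '-', ''); // 'euwest1'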

Serverless – enable caching on query string parameters in API Gateway

Since I started working at Yubl less than 2 weeks ago, I have been doing a lot of work with Amazon API Gateway & Lambda with the help of the Serverless framework. So far that experience has been really great.

One little caveat I ran into was that it wasn’t clear how to enable caching on query string and request path parameters. For instance, if I declare in my s-function.json file that my API Gateway endpoint has a query string parameter called query:

"requestParameters": {
    "integration.request.querystring.query": "method.request.querystring.query"
},

When the endpoint is deployed, you will see in the API Gateway console that ‘Caching’ is not enabled:

[screenshot: the query parameter shows Caching as not enabled]

Which means when I enable caching on my deployed stage, the query string parameter would not be taken into account.

[image: api-gateway-dev-stage]

Users visiting the following URLs would get the same cached response back:

    https://my-awesome-service.com/yubl?query=one-awesome-yubl

    https://my-awesome-service.com/yubl?query=another-awesome-yubl

That’s obviously not OK.

Sadly, the same applied to request path parameters too:

"path": "yubl/{query}",
"method": "GET",
"type": "AWS",

[screenshot: the request path parameter also shows Caching as not enabled]

Thankfully, the guys working on Serverless have been very active in responding to their issues page, and one of the team members picked up my ticket and offered some insight on how to make this work.

API Gateway’s own documentation alludes to two Integration parameters called cacheNamespace and cacheKeyParameters, although to say they’re poorly documented would be an understatement…

[image: api-gateway-cache-params]

So, after some fiddling around:

  1. create an API with cached query string param
  2. deploy it
  3. export as Swagger + API Gateway Extensions
  4. inspect the JSON to see what values are output for cacheNamespace and cacheKeyParameters

[image: api-gateway-cache-params-example]

Interestingly, when I put the above value – “method.input.params.query” – in my s-function.json, I get the following error during deployment:

Invalid cache key parameter specified

What did work, though, is the same value I had specified in my requestParameters dictionary, which is “method.request.querystring.query” in this case, i.e.

"requestParameters": {
    "integration.request.querystring.query": "method.request.querystring.query"
},
"cacheKeyParametes" [
    "method.request.querystring.query"
]

However, applying the same approach to request path parameters proved fruitless… I guess we’ll just have to wait for the official documentation to be updated to see how it should work, which should be soon by the sound of things!

Ticket for Craft Conf for sale at Early Bird price

[image: craftconf2016]

It’s less than 2 weeks to this year’s Craft Conf and once again the line up looks amazing – Adrian Cockcroft, Adam Tornhill, Amir Chaudhry, Bart De Smet, Bodil Stokke, Dan North and plenty of other big names!

I really enjoyed myself at last year’s event (here are my takeaways), but unfortunately with the new job I won’t be able to make it this year. Let me know if you’re interested in going in my stead – I’m selling my ticket for the 2 session days at the early bird price I paid ($275), which is a good deal cheaper than the late bird tickets available right now ($699).