AWS X-Ray and Lambda : the good, the bad and the ugly

AWS announced general availability of AWS Lambda support for AWS X-Ray back in May. It’s taken me a while to try it out, and whilst I see plenty of values I think its current limitations significantly restricts its usefulness in a complex system.

I found Lambda-specific documentations to be fragmented and I had to rely on experimentation and piece together clues from several sources:

I also found recording annotations and metadata didn’t work as advertised in the sample (although admittedly I could be doing something wrong…).


Update 03/07/2017 : after this post was published the folks at AWS got in touch and kindly cleared up some of the issues highlighted here which were caused by poor documentation which they’ll rectify in the near future. Scroll down to see the clarification on the relevant sections.


The Sample Application

The sample project I created centres around a Lambda function called service-a, which in term calls a number of downstream systems:

  • publishing to a SNS topic
  • GET’ng and PUT’ng an object in S3
  • GET’ng and PUT’ng a row in DynamoDB
  • invoking another Lambda function (service-c) using the Lambda API
  • making a HTTP request to an API Gateway endpoint backed by another Lambda function (one of service-b, error and timeout functions in the diagram above, which represents the success, error and timeout cases respectively)

You can find all the source code here.

The Good

Once I figured out the magic incantations I was able to get the results I’m after. It took more time and energy than should have, but by and large most features worked as advertised at the first (or second) time of asking.

This is a trace of the service-a function, which includes the time it takes for Lambda to initialise the function, and the various downstream systems it talked to, all nested by under custom subsegments. It even includes the trace of the service-c function (and the time it spent publishing to SNS) which was invoked using the Lambda API.

The service map for service-a includes service-c as a downstream dependency, as well as service-c’s dependency on SNS.

The Bad

It’s always 200…

When the service-a function is invoked through its API Gateway endpoint and errors, the corresponding trace still reports a 200 response code.

Presumably what X-Ray sees is a 200 response from the Lambda service whose payload indicates a 502 response to the API Gateway invocation and so it thought “hey, it’s a 200!”.

Here, I can see the service-a endpoint returned a 502 in Postman..

..but the trace reported a 200 response code.

Oddly enough, the trace for the error function also reports a 200 even though its own status field indicates it had errored.

This behaviour is both confusing and inconsistent to me, perhaps I have misunderstood how it works. Sadly, the X-Ray’s concepts page also does not explain the difference between an Error and a Fault

Whilst this might seem like a small nuisance now, the inability to quickly identify error traces will hurt you most when you need to diagnose problems in production, possibly when you’re under the most time pressure.


Update 03/07/2017 : AWS confirmed that the reason the errors are reported as 200 is due to Lambda service returning a 200 response (with payload that indicates an error). One workaround is to use the filter expression service() { fault } which returns all traces that contains a fault.


Traces don’t span over API Gateway

When the service-a function makes an outgoing HTTP request to an API Gateway endpoint the trace stops at the API Gateway endpoint and doesn’t extend to the Lambda functions that are triggered by API Gateway.

This behaviour was consistent with all 3 endpoints I tested—service-b, error and timeout.

For this test, I have followed the X-Ray documentation and used the X-Ray SDK to wrap the Nodejs https module when making the HTTP request.

I can see the trace IDs are correctly passed along in the outgoing HTTP request and received by the handling Lambda function.

This is the service map I expected to see in this case—where service-a’s trace follows through the HTTP request to API Gateway and includes the invocation of the timeout function.

ps. this is not an actual screenshot, it’s an image I composed together to show what I expected to see!

Instead, the actual service map stops at the API Gateway.

However, when invoking another Lambda function directly (using the Lambda API and wrapped AWS SDK) the trace worked as expected.

Perhaps the limitation lies with API Gateway?

The Ugly

No sampling

According to the Lambda’s documentation on X-Ray, requests should be sampled at 1 request per minute.

However, that wasn’t the case in my experiments. EVERY request was sampled, as you can see from the Age of the traces in the screenshot below.

This behaviour was consistent when invoking Lambda via API Gateway as well as via the Lambda management console.

Whilst the X-Ray service is not expensive per se—$5.00 per million traces—it’s nonetheless a cost that can easily spring up on you if you are unwillingly tracing every request through your system. As an example, I worked on a moderately successful social game at Gamesys with ~1M DAU. At roughly 250M user requests per day, X-Ray would have cost $5 * 250 * 30 days = $37500, which was more than our entire AWS bill at the time!


Update 03/07/2017 : this turns out to be a problem with the documentation, which doesn’t mention that sampling is volume-based and only kicks in once you reach a certain volume of requests/s.


Annotations and Metadata only work on subsegments

The one thing that just refused to work (even though I have followed the examples) was adding annotation and metadata to the root segment:

module.exports.handler = (event, context, callback) => {
  ...
  let segment = AWSXRay.getSegment();
  let n = Math.random() * 3;
  segment.addMetadata('random', `${n}`);      // this doesn't work
  segment.addAnnotation('path', event.path);  // this doesn't work
  ...
}

Interestingly, adding annotations and metadata to subsegments works just fine.



Looking at the logs, I found something interesting: the segment ID for the root segment doesn’t match the segment ID in the X-Ray trace.

For instance, I fetch the root segment for my function in the handler and logs it to CloudWatch Logs.

const AWSXRay = require('aws-xray-sdk');
module.exports.handler = (event, context, callback) => {
  ...
  // this should be the root segment for my function
  let segment = AWSXRay.getSegment();
  console.log(JSON.stringify(segment));
  ...
}

In the logs I can see the segment ID is 05b2b9ac6c9e5682.

But in the X-Ray trace, the segment ID for the root segment is 2b7d5b4a2a2d96e9.


Furthermore, the trace ID is also different:

  • in the logs it’s 1–59504311-d765e7accb8558871fa89d6d
  • in the X-Ray console it’s 1–59504312–5ef2a3eda0c1b2c4d64dda00

This was very odd, so I decided to track the trace ID in the logs vs in the X-Ray console, starting with a coldstart.

Bingo! Looks like it’s a bug in the X-Ray SDK for Nodejs where AWSXray.getSegment() returns the root segment from the previous invocation..


Update 03/07/2017 : whilst there was a bug in the X-Ray SDK wrt tracking the trace-id, adding annotations and metadata to the root segment is simply not supported, which the doc doesn’t explicitly state. You can work around this by creating a subsegment that covers the entire span of your function invocation to act as your custom root segment and attach any annotation and metadata related to the invocation there.


Conclusion

So there you have it, my weekend escapade with AWS X-Ray and Lambda :-)

Overall I’m impressed with what I saw, and feel X-Ray would have added a lot of value to the serverless architecture I was building at Yubl. However, the inability to span traces over API Gateway endpoints makes it a far less useful addition to our ecosystem.

Furthermore, the X-Ray service is focused on execution time and helping you identify performance bottlenecks. However, there’s another important aspect to distributed tracing—helping you debug your system by ensuring a set of correlation IDs are captured in all log messages. X-Ray does not concern itself with this, although you can probably use the trace ids X-Ray provides you with it’s still up to you to capture them in all log messages and propagating all your logs to one easily searchable place. We invested some effort into enabling distributed tracing in our serverless architecture at Yubl, which you can read about in detail in this post.

Are you using X-Ray in production? I’d love to hear your thoughts and feedbacks on the service, and any pitfalls to look out for.

Finding coldstarts : how long does AWS Lambda keep your idle functions around?

In the last post I compared the coldstart time for Lambda functions with different language, memory and code size. One of the things I learnt was that idle functions are no longer terminated after 5 minutes of inactivity.

AWS Lambda – compare coldstart time with different languages, memory and code sizes

It is a fantastic news and something that Amazon has quietly changed behind the scene. However, it lead me to ask some follow up questions:

  1. what’s the new idle time that would trigger a coldstart?
  2. does it differ by memory allocation?
  3. are functions still recycled 4 hours from the creation of host VM?

To answer the first 2 questions, I devised an experiment.

First, here are my hypotheses going into the experiment.


WARNING: this experiment is intended to help us glimpse into implementation details of the AWS Lambda platform, they are fun and satisfy my curiosity but you shouldn’t build your application with the results in mind as AWS can change these implementation details without notice!


Hypotheses

Hypothesis 1 : there is an upper bound to how long Lambda allows your function to stay idle before reclaiming the associated resources

This should be a given. Idle functions occupy resources that can be used to help other AWS customers scale up to meet their needs (and not to mention the first customer is not paying for his idle functions!), it simply wouldn’t make any sense for AWS to keep idle functions around forever.

Hypothesis 2 : the idle timeout is not a constant

From an implementor’s point-of-view, it might be simpler to keep this timeout a constant?—?ie. functions are always terminated after X mins of inactivity. However, I’m sure AWS will vary this timeout to optimise for higher utilisation and keep the utilisation levels more evenly distributed across its fleet of physical servers.

For example, if there’s an elevated level of resource contention in a region, why not terminate idle functions earlier to free up space?

Hypothesis 3 : the upper bound for inactivity varies by memory allocation

An idle function with 1536 MB of memory allocation is wasting a lot more resource than an idle function with 128 MB of memory, so it makes sense for AWS to terminate idle functions with higher memory allocation earlier.

Experiment : find the upper bound for inactivity

To find the upper bound for inactivity, we need a Lambda function to act as the system-under-test and report when it has experienced a coldstart. We also need a mechanism to progressively increase the interval between invocations until we arrive at an interval where each invocation is guaranteed to be a coldstart?—?the upper bound. We will determine the upper bound when we see 10 consecutive coldstarts when invoked X minutes apart.

To answer hypothesis 3 we will also replicate the system-under-test function with different memory allocations.

This experiment is a time consuming process, it requires discipline and a degree of precision in timing. Suffice to say I won’t be doing this by hand!

My first approach was to use a CloudWatch Schedule to trigger the system-under-test function, and let the function dynamically adjust the schedule based on whether it’s experienced a coldstart. It failed miserably?—?whenever the system-under-test updates the schedule the schedule will fire shortly after rather than wait for the newly specified interval…

Instead, I turned to Step Functions for help.

AWS Step Functions allows you to create a state machine where you can invoke Lambda functions, wait for a specified amount of time, execute parallel tasks, retry, catch errors, etc.

A Wait state allows you to drive the no. of seconds to wait using data (see SecondsPath param in the documentation). Which means I can start the state machine with an input like this:

{ 
    “target”: “when-will-i-coldstart-dev-system-under-test-128”, 
    “interval”: 600, 
    “coldstarts”: 0 
}

The input is passed to another find-idle-timeout function as invocation event. The function will invoke the target (which is one of the variants of the system-under-test function with different memory allocations) and increase the interval if the system-under-test function doesn’t report a coldstart. The find-idle-timeout function will return a new piece of data for the Step Function execution:

{ 
    “target”: “when-will-i-coldstart-dev-system-under-test-128”, 
    “interval”: 660, 
    “coldstarts”: 0 
}

Now, the Wait state will use the interval value and wait 660 seconds before switching back to the FindIdleTimeout state where it’ll invoke our system-under-test function again (with the previous output as input).

"Wait": {
    "Type": "Wait",
    "SecondsPath": "$.interval",
    "Next": "FindIdleTimeout"
},

With this setup I’m able to kick off multiple executions?—?one for each memory setting.

Along the way I have plenty of visibility into what’s happening, all from the comfort of the Step Functions management console.

Here are the results of the experiment:

From the data, it’s clear that AWS Lambda shuts down idle functions around the hour mark. It’s also interesting to note that the function with 1536 MB memory is terminate over 10 mins earlier, this supports hypothesis 3.

I also collected data on all the idle intervals where we saw a coldstart and categorised them into 5 minute brackets.

Even though the data is serious lacking, but from what little data I managed to collect you can still spot some high level trends:

  • over 60% of coldstarts (prior to hitting the upper bound) happened after 45 mins of inactivity
  • the function with 1536 MB memory sees significantly fewer no. of cold starts prior to hitting the upper bound (worth noting that it also has a lower upper bound (48 mins) than other functions in this test

The data supports hypothesis 2 though there’s no way for us to figure out the reason behind these coldstarts or if there’s significance to the 45 mins barrier.

Conclusions

To summary the findings from our little experiment in one line:

AWS Lambda will generally terminate functions after 45–60 mins of inactivity, although idle functions can sometimes be terminated a lot earlier to free up resources needed by other customers.

I hope you find this experiment interesting, but please do not build applications on the assumptions that:

    a) these results are valid, and

    b) they will remain valid for the foreseeable future

I cannot stress enough that this experiment is meant for fun and to satisfy a curious mind, and nothing more!

The results from this experiment also deserve further investigation. For instance, the 1536 MB function exhibited very different behaviour to other functions, but is it a special case or would functions with more than 1024 MB of memory all share these traits? I’d love to find out, maybe I’ll write a follow up to this experiment in the future.

Watch this space ;-)

AWS Lambda – compare coldstart time with different languages, memory and code sizes

A while back we looked at the performance difference between the language runtimes AWS Lambda supports natively.

AWS Lambda – comparing platform performances

We intentionally omitted coldstart time from that experiment as we were interested in performance differences when a function is “warm”.

However, coldstart is still an important performance consideration, so let’s take a closer look with some experiments designed to measure only coldstart times.

Methodology

From my personal experience running Lambda functions in production, coldstarts happen when a function is idle for ~5 mins. Additionally, functions will be recycled 4 hours after it starts – which was also backed up by analysis by the folks at IO Pipe.

However, the 5 mins rule seems to have changed. After a few tests, I was not able to see coldstart even after a function had been idle for more than 30 mins.

I needed a more reliable way to trigger coldstart.

After a few failed attempts, I settled on a surefire way to cause coldstart : by deploying a new version of my functions before invoking them.

I have a total of 45 functions for both experiments. Using a simple script (see below) I’m able to:

  1. deploy all 45 functions using the Serverless framework
  2. after each round of deployments, invoke the functions programmatically

the deploy + invoke loop takes around 3 mins. I ran the experiment for over 24 hours to collect a meaningful amount of data points. Thankfully the Serverless framework made it easy to create variants of the same function with different memory sizes and to deploy them quickly.

Hypothesis

Here were my hypothesis before the experiments, based on the knowledge that the amount of CPU resource you get is proportional to the amount of memory you allocate to a AWS Lambda function.

  1. C# and Java have higher coldstart time
  2. memory size affects coldstart time linearly
  3. code size affects coldstart time linearly

Let’s see if the experiments support these hypothesis.

Experiment 1 : coldstart time by runtime & memory

For this experiment, I created 20 functions with 5 variants (different memory sizes) for each language runtime – C#, Java, Python and Nodejs.

After running the experiment for a little over 24 hours, I collected a bunch of metric data (which you can download yourself here).

Here is how they look.

Observation #1 : C# and Java have much higher coldstart time

The most obvious trend is that statically typed languages (C# and Java) have over 100 times higher coldstart time. This clearly supports our hypothesis, although to a much greater extent than I anticipated.

Observation #2 : Python has ridiculously low codstart time

I’m pleasantly surprised by how little coldstart the Python runtime experiences. OK, there were some outlier data points that heavily influenced some of the 99 percentile and standard deviations, but you can’t argue with a 0.41ms coldstart time at the 95 percentile of a 128MB function.

Observation #3 : memory size improves coldstart time linearly

The more memory you allocate to your function, the smaller the coldstart time and the less standard deviation in coldstart time too. This is most obvious with the C# and Java runtimes as the baseline (128MB) coldstart time for both are very significant.

Again, the data from this experiment clearly supports our hypothesis.

Experiment 2: coldstart time by code size & memory

For this second experiment, I decided to fix the runtime to Nodejs and create variants with different deployment package size and memory.

Here are the results.

Observation #1 : memory size improves coldstart time linearly

As with the first experiment, the memory size improves the coldstart time (and standard deviation) in a roughly linear fashion.

Observation #2 : code size improves coldstart time

Interestingly the size of the deployment package does not increase the coldstart time (bigger package = more time to download & unzip, or so one might assume). Instead it seems to have a positive effect and decreases the overall coldstart time.

I would love to see someone else repeat the experiment with another language runtime to see if the behaviour is consistent.

Conclusions

The things I learnt from these experiments are:

  • functions are no longer recycled after ~5 mins of idleness, which makes coldstarts far less punishing than before
  • memory size improves coldstart time linearly
  • C# and Java runtimes experience ~100 times the coldstart time of Python and suffer from much higher standard deviation too
  • as a result of the above you should consider running your C#/Java Lambda functions with a higher memory allocation than you would Nodejs/Python functions
  • bigger deployment package size does not increase coldstart time

ps. the source code used for these experiments can be found here, including the scripts used to calculate the stats and generate the plot.ly box charts.

Serverless 1.X – enable API Gateway caching on request parameters

Having previously blogged about the untrodden path to enable caching on API Gateway request parameters in the Serverless framework 0.5.X, it’s a little disappointing that it’s still not officially fixed in the 1.X versions…

The Problem

The problem is two-fold:

  1. there’s currently no way to specify caching should be enabled for path & query string parameters
  2. the CloudFormation template Serverless 1.X generates for API Gateway is missing a few optional fields, these missing fields stop you from manually enable caching in the API Gateway management console too

After you deploy your Lambda function with associated API, if you go to the management console and enable caching on path or request parameters you will get an error saying “Invalid cache key parameter specified”.

The Workaround

A friend pointed me to a neat trick to modify the CloudFormation template that Serverless 1.X auto-generates for you.

After the project is deployed, you can go to CloudFormation and view the template that Serverless has generated. These templates are pretty big (and poorly formatted), so I find it easier to open them up in the Designer view and use that view to navigate to the endpoint I’m looking for.

Once you find the resource template for the endpoint, write down its name. Now go back to the serverless.yml file in your project, and add the resource name to the resources section at the bottom. You only need to include fields that you want to update or add to the template.

The CloudFormation syntax for an API Gateway method looks like this:

We also need to fill in some blanks for the Integration section:

For more details on the CloudFormation syntax, see here and here.

After some trial-and-error, the minimum set of fields I had to add are:

Redeploy with Serverless and the path parameter is enabled for caching:

Wrap Up

I hope you have found this post useful, though I’m surprised by the lack of information out there during my research and the lack of official support from the Serverless framework.

You know of a better way to do this, please let me know in the comments.

Links

Yubl’s road to Serverless architecture – Part 4 – building a scalable push notification system

The Road So Far

part 1 : overview

part 2 : testing and continuous delivery strategies

part 3 : ops

 

Just before Yubl’s untimely demise we did an interesting piece of work to redesign the system for sending targeted push notifications to our users to improve retention.

The old system relied on MixPanel for both selecting users as well as sending out the push notifications. Whilst MixPanel was great for getting us basic analytics quickly, we soon found our use cases outgrew MixPanel. The most pressing limitation was that we were not able to query users based on their social graph to create target push notifications – eg. notify an influencer’s followers when he/she publishes a new post or runs a new social media campaign.

Since all of our analytics events are streamed to Google BigQuery (using a combination of Kinesis Firehose, S3 and Lambda) we have all the data we need to support the complex use cases the product team has.

What we needed, was a push notification system that can integrate with BigQuery results and is capable of sending millions of push notifications in a batch.

Design Goals

From a high level, we need to support 2 types of notifications.

Ad-hoc notifications are driven by the marketing team, working closely with influencers and the BI team to match users with influencers or contents that they might be interested in. Example notifications include:

  • users who follow Accessorize and other fashion brands might be interested to know when another notable fashion brand joins the platform
  • users who follow an influencer might be interested to know when the influencer publishes a new post or is running a social media campaign (usually with give-away prizes, etc.)
  • users who have shared/liked music related contents might be interested to know that Tinie Tempah has joined the platform

Scheduled notifications are driven by the product team, these notifications are designed to nudge users to finish the sign up process or to come back to the platform after they have lapsed. Example notifications include:

  • day-1 unfinished sign up : notify users who didn’t finish the sign up process to come back to complete the process
  • day-2 engagement : notify users to come back and follow more people or invite friends on day 2
  • day-21 inactive : notify users who have not logged into the app for 21 days to come back and check out what’s new

A/B testing

For the scheduled notifications, we want to test out different messages/layouts to optimise their effectiveness over time. To do that, we wanted to support A/B testing as part of the new system (which MixPanel already supports).

We should be able to create multiple variants (each with a percentage), along with a control group who will not receive any push notifications.

Oversight vs Frictionless

For the ad-hoc notifications, we don’t want to get in the way of the marketing team doing their job, so the process for creating ad-hoc push notifications should be as frictionless as possible. However, we also don’t want the marketing team to operate completely without oversight and run the risk of long term damage by spamming users with unwanted push notifications (which might cause users to disable notifications or even rage quit the app).

The compromise we reached was an automated approval process whereby:

  1. the marketing team will work with BI on a query to identify users (eg. followers of Tinie Tempah)
  2. fill in a request form, which informs designated approvers via email
  3. approvers can send themselves a test push notification to see how it will be formatted on both Android and iOS
  4. approvers can approve or reject the request
  5. once approved, the request will be executed

Implementation

We decided to use S3 as the source for a send-batch-notifications function because it allows us to pass large list of users (remember, the goal is to support sending push notifications to millions of users in a batch) without having to worry about pagination or limits on payload size.

The function will work with any JSON file in the right format, and that JSON file can be generated in many ways:

  • by the cron jobs that generate scheduled notifications
  • by the approval system after an ad-hoc push notification is approved
  • by the approval system to send a test push notification to the approvers (to visually inspect how the message will appear on both Android and iOS devices)
  • by members of the engineering team when manual interventions are required

We also considered moving to SNS but decided against it in the end because it doesn’t provide useful enough an abstraction to justify the effort to migrate (involves client work) and the additional cost for sending push notifications. Instead, we used node-gcm and apn to communicate with GCM and APN directly.

Recursive Functions FTW

Lambda has a hard limit of 5 mins execution time (it might be softened in the near future), and that might not be enough time to send millions of push notifications.

Our approach to long-running tasks like this is to run the Lambda function as a recursive function.

A naive recursive function would process the payload in fixed size batches and recurse at the end of each batch whilst passing along a token/position to allow the next invocation to continue from where it left off. In this particular case, we have additional considerations because the total number of work items can be very large:

  • minimising the no. of recursions required, which equates to no. of Invoke requests to Lambda and carries a cost implication at scale
  • caching the content of the JSON file to improve performance (by avoiding loading and parsing a large JSON file more than once) and reduce S3 cost

To minimise the no. of recursions, our function would:

  1. process the list of users in small batches of 500
  2. at the end of each batch, call context.getRemainingTimeInMillis() to check how much time is left in this invocation
  3. if there is more than 1 min left in the invocation then process another batch; otherwise recurse

When caching the content of the JSON file from S3, we also need to compare the ETAG to ensure that the content of the file hasn’t changed.

With this set up the system was able to easily handle JSON files with more than 1 million users during our load test (sorry Apple and Google for sending all those fake device tokens :-P).