Applying principles of chaos engineering to AWS Lambda with latency injection

This is part 2 of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions.


The most common issues I have encountered in production are latency/performance related. They can be symptomatic of a host of underlying causes, ranging from AWS network issues (which can also manifest themselves as latency/error-rate spikes in any of the AWS services) and overloaded servers to simple GC pauses.

Latency spikes are inevitable – as much as you can improve the performance of your application, things will go wrong, eventually, and often they’re out of your control.

So you must design for them, and degrade the quality of your application gracefully to minimize the impact on your users.

In the case of API Gateway and Lambda, there are additional considerations:

  • API Gateway has a hard limit of 29s timeout for integration points, so even if your Lambda function can run for up to 5 mins, API Gateway will timeout way before that

If you use moderate timeout settings for your API functions (and you should!) then you need to consider the effects of cold starts when calling an intermediate service.

Where to inject latency

Suppose our client application communicates directly with 2 public-facing APIs, which in turn depend on an internal API.

In this setup, I can think of 3 places where we can inject latency and each would validate a different hypothesis.

Inject latency at HTTP clients

The first, and easiest place to inject latency is in the HTTP client library we use to communicate with the internal API.

This will test that our function has an appropriate timeout on this HTTP communication and can degrade gracefully when the request times out.

We can inject latency to the HTTP client libraries for our internal APIs, hence validating that the caller function has configured appropriate timeout and error handling for timeouts.

Furthermore, this practice should also be applied to other 3rd party services we depend on, such as DynamoDB. We will discuss how we can inject latency to these 3rd party libraries later in the post.

We can also inject latency to 3rd party client libraries for other managed services we depend on.

This is a reasonably safe place to inject latency as the immediate blast radius is limited to this function.

However, you can (and arguably, should) consider applying this type of latency injection to intermediate services as well. Doing so does carry extra risk as it has a broader blast radius in failure cases – ie. if the function under test does not degrade gracefully then it can cause unintended problems for outer services. In this case, the blast radius for these failure cases is the same as if you’re injecting latency into the intermediate functions directly.

Inject latency to intermediate functions

You can also inject latency directly into the functions themselves (we’ll look at how later on). This has the same effect as injecting latency into the HTTP client of each of its dependents, except it’ll affect all its dependents at once.

We can inject latency into a function’s invocation. If that function is behind an internal API that is used by multiple public-facing APIs then it can cause all of its dependents to experience timeouts.

This might seem risky (it can be), but is an effective way to validate that every service that depends on this API endpoint is expecting, and handling timeouts gracefully.

It makes most sense when applied to intermediate APIs that are part of a bounded context (or, a microservice), maintained by the same team of developers. That way, you avoid unleashing chaos upon unsuspecting developers who might not be ready to deal with the chaos.

That said, I think there is a good counter-argument for doing just that.

We often fall into the pitfall of using the performance characteristics of dev environments as a predictor for the production environment. Whilst we seldom experience load-related latency problems in the dev environments (because we don’t have enough load in those environments to begin with), production is quite another story. Which means we’re not programmed to think about these failure modes during development.

So, a good way to hack the brains of your fellow developers and to programme them to expect timeouts is to expose them to these failure modes regularly in the dev environments, by injecting latency to our internal APIs in those environments.

We can figuratively hold up a sign and tell other developers to expect latency spikes and timeouts by literally exposing them to these scenarios in dev environments, regularly, so they know to expect it.

In fact, if we make our dev environments exhibit the most hostile and turbulent conditions that our systems should be expected to handle, then we know for sure that any system that makes its way to production is ready to face what awaits it in the wild.

Inject latency to public-facing functions

So far, we have focused on validating the handling of latency spikes and timeouts in our APIs. The same validation is needed for our client application.

We can apply all the same arguments mentioned above here. By injecting latency to our public-facing API functions (in both production as well as dev environments), we can:

  • validate the client application handles latency spikes and timeouts gracefully, and offers the best UX possible in these situations
  • train our client developers to expect latency spikes and timeouts

When I was working on an MMORPG at Gamesys years ago, we uncovered a host of frailties in the game when we injected latency spikes and faults into our APIs. The game would crash during startup if any of the first handful of requests failed. In some cases, if the response time was longer than a few seconds then the game would also get into a weird state because of race conditions.

Turns out I was setting my colleagues up for failure in production because the dev environment was so forgiving and gave them a false sense of comfort.

With that, let’s talk about how we can apply the practice of latency injection.

But wait, can’t you inject latency in the client-side HTTP clients too?

Absolutely! And you should! However, for the purpose of this post we are going to look at how and where we can inject latency to our Lambda functions only, hence why I have willfully ignored this part of the equation.

How to inject latency

There are 2 aspects to actually injecting latencies:

  1. adding delays to operations
  2. configuring how often and how much delay to add

If you read my previous posts on capturing and forwarding correlation IDs and managing configurations with SSM Parameter Store, then you have already seen the basic building blocks we need to do both.

How to inject latency to HTTP client

Since you are unlikely to write an HTTP client from scratch, I consider the problem of injecting latency into the HTTP client and 3rd party clients (such as the AWS SDK) to be one and the same.

A couple of solutions jump to mind:

  • in static languages, you can consider using a static weaver such as AspectJ or PostSharp, this is the approach I took previously
  • in static languages, you can consider using dynamic proxies, which many IoC frameworks offer (another form of AOP)
  • you can create a wrapper for the client, either manually or with a factory function (bluebirdjs’s promisifyAll function is a good example)

Since I’m going to use Node.js as the example, I’m going to focus on wrappers.

For the HTTP client, given the relatively small number of methods you will need, it’s feasible to craft the wrapper by hand, especially if you have a particular API design in mind.

Using the HTTP client I created for the correlation ID post as base, I modified it to accept a configuration object to control the latency injection behaviour.

{
  "isEnabled": true,
  "probability": 0.5,
  "minDelay": 100,
  "maxDelay": 5000
}

You can find this modified HTTP client here; below is a simplified version of this client (which uses superagent under the hood).
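
The simplified client isn’t embedded here; what follows is a minimal sketch of the idea, assuming superagent under the hood and the configuration shape above (the real client linked above does more, eg. forwarding correlation IDs).

const superagent = require('superagent');

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// decide whether to delay at all, and by how much, based on the config
const injectLatency = (config) => {
  if (!config || !config.isEnabled) { return Promise.resolve(); }
  if (Math.random() > config.probability) { return Promise.resolve(); }

  const delayMs = config.minDelay +
    Math.floor(Math.random() * (config.maxDelay - config.minDelay));
  console.log(`injecting ${delayMs}ms of latency`);
  return delay(delayMs);
};

// usage: yield http({ method, uri, headers, body, latencyInjectionConfig })
const http = ({ method, uri, headers = {}, body, latencyInjectionConfig }) =>
  injectLatency(latencyInjectionConfig).then(() => {
    const request = superagent(method, uri).set(headers);
    if (body) { request.send(body); }
    return request; // superagent requests are thenable, so this resolves to the response
  });

module.exports = http;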

To configure the function and the latency injection behaviour, we can use the configClient I first created in the SSM Parameter Store post.

First, let’s create the configs in the SSM Parameter Store.

You can create and optionally encrypt parameter values in the SSM Parameter Store.

The configs contain the URL for the internal API, as well as a chaosConfig object. For now, we just have a httpClientLatencyInjectionConfig property, which is used to control the HTTP client’s latency injection behaviour.

{ 
  "internalApi": "https://xx.amazonaws.com/dev/internal", 
  "chaosConfig": {
    "httpClientLatencyInjectionConfig": {
      "isEnabled": true,
      "probability": 0.5,
      "minDelay": 100,
      "maxDelay": 5000
    }
  } 
}

Using the aforementioned configClient, we can fetch the JSON config from SSM Parameter Store at runtime.

const configKey = "public-api-a.config";
const configObj = configClient.loadConfigs([ configKey ]);

let config = JSON.parse(yield configObj["public-api-a.config"]);
let internalApiUrl = config.internalApi;
let chaosConfig = config.chaosConfig || {};
let injectionConfig = chaosConfig.httpClientLatencyInjectionConfig;

let reply = yield http({ 
  method : 'GET', 
  uri : internalApiUrl, 
  latencyInjectionConfig: injectionConfig 
});

The above configuration gives us a 50% chance of injecting a latency of between 100ms and 5 seconds when we make the HTTP request to internal-api.

This is reflected in the following X-Ray traces.

How to inject latency to the AWS SDK

With the AWS SDK, it’s not feasible to craft the wrapper by hand. Instead, we could do with a factory function like bluebird’s promisifyAll.

We can apply the same approach here, and I made a crude attempt at doing just that. I must add that, whilst I consider myself a competent Node.js programmer, I’m sure there’s a better way to implement this factory function.

My factory function will only work with promisified objects (told you it’s crude..), and replaces their xxxAsync functions with a wrapper that takes in one more argument of the shape:

{
  "isEnabled": true,
  "probability": 0.5,
  "minDelay": 100,
  "maxDelay": 3000
}

Again, it’s clumsy, but we can take the DocumentClient from the AWS SDK, promisify it with bluebird, then wrap the promisified object with our own wrapper factory. Then, we can call its async functions with an optional argument to control the latency injection behaviour.
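
The factory function itself is linked above rather than reproduced; a rough sketch of the idea might look like the following (the table name in the usage comment is a placeholder).

const Promise = require('bluebird');

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

const injectLatency = (config) => {
  if (!config || !config.isEnabled) { return Promise.resolve(); }
  if (Math.random() > config.probability) { return Promise.resolve(); }

  const delayMs = config.minDelay +
    Math.floor(Math.random() * (config.maxDelay - config.minDelay));
  return delay(delayMs);
};

// takes a bluebird-promisified object and returns a wrapper whose xxxAsync
// functions accept an extra, optional latencyInjectionConfig argument
const wrapPromisifiedClient = (client) => {
  const wrapped = Object.create(client);

  for (const prop in client) {
    if (typeof client[prop] === 'function' && prop.endsWith('Async')) {
      wrapped[prop] = (params, latencyInjectionConfig) =>
        injectLatency(latencyInjectionConfig)
          .then(() => client[prop].call(client, params));
    }
  }

  return wrapped;
};

module.exports = wrapPromisifiedClient;

// usage (inside a co.wrap'd handler or async function):
//   const AWS = require('aws-sdk');
//   const dynamodb = wrapPromisifiedClient(
//     Promise.promisifyAll(new AWS.DynamoDB.DocumentClient()));
//   const res = yield dynamodb.getAsync(
//     { TableName: 'chaos-demo', Key: { id: '42' } },
//     latencyInjectionConfig);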

You can see this in action in the handler function for public-api-b.

For some reason, the wrapped function is not able to record subsegments in X-Ray. I suspect it’s some nuance about Javascript or the X-Ray SDK that I do not fully understand.

Nonetheless, judging from the logs, I can confirm that the wrapped function does indeed inject latency to the getAsync call to DynamoDB.

If you know of a way to improve the factory function, or to get the X-Ray tracing work with the wrapped function, please let me know in the comments.

How to inject latency to function invocations

The apiHandler factory function I created in the correlation ID post is a good place to apply common implementation patterns that we want from our API functions, including:

  • log the event source as debug
  • log the response and/or error from the invocation (which, surprisingly, Lambda doesn’t capture by default)
  • initialize global context (eg. for tracking correlation IDs)
  • handle serialization for the response object
  • etc..
// this is how you use the apiHandler factory function to create a
// handler function for API Gateway event source
module.exports.handler = apiHandler(
  co.wrap(function* (event, context) {
    ... // do a bunch of stuff
    // instead of invoking the callback directly, you return the
    // response you want to send, and the wrapped handler function
    // would handle the serialization and invoking callback for you
    // also, it takes care of other things for you, like logging
    // the event source, and logging unhandled exceptions, etc.
    return { message : "everything is awesome" };
  })
);

In this case, it’s also a good place for us to inject latency to the API function.

However, to do that, we need to access the configuration for the function. Time to lift the responsibility for fetching configurations into the apiHandler factory then!

The full apiHandler factory function can be found here; below is a simplified version that illustrates the point.
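
A simplified sketch might look like this, assuming the configClient from the SSM Parameter Store post, a hypothetical log helper, and that the factory takes the config key as a second argument (the real factory may derive it differently).

const co = require('co');
const log = require('./log');                    // hypothetical logging helper
const configClient = require('./configClient');  // the SSM-backed config client

const apiHandler = (f, configKey) => co.wrap(function* (event, context, callback) {
  try {
    log.debug('incoming event', { event });

    // load the (cached) configuration for this function from SSM Parameter Store
    const configObj = configClient.loadConfigs([ configKey ]);
    const config = JSON.parse(yield configObj[configKey]);

    // hand the parsed config to the inner handler so it doesn't have to fetch it
    const result = yield f(event, context, config);
    callback(null, { statusCode: 200, body: JSON.stringify(result) });
  } catch (err) {
    log.error('unhandled exception', {}, err);
    callback(null, { statusCode: 500, body: JSON.stringify({ message: err.message }) });
  }
});

module.exports = apiHandler;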

Now, we can write our API function like the following.
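
For instance, something along these lines, assuming (as in the sketch above) that the factory takes the config key as a second argument and hands the parsed config to the inner handler.

module.exports.handler = apiHandler(
  co.wrap(function* (event, context, config) {
    // the parsed config (including chaosConfig) is handed to us by the factory
    return { message : "everything is awesome" };
  }),
  "public-api-a.config"
);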

Now that the apiHandler has access to the config for the function, it can access the chaosConfig object too.

Let’s extend the definition for the chaosConfig object to add a functionLatencyInjectionConfig property.

"chaosConfig": {
  "functionLatencyInjectionConfig": {
    "isEnabled": true,
    "probability": 0.5,
    "minDelay": 100,
    "maxDelay": 5000
  },
  "httpClientLatencyInjectionConfig": {
    "isEnabled": true,
    "probability": 0.5,
    "minDelay": 100,
    "maxDelay": 5000
  }
}

With this additional configuration, we can modify the apiHandler factory function to use it to inject latency into a function’s invocation, much like what we did in the HTTP client.
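
The modified factory is linked above rather than embedded; the relevant addition might look roughly like this, inside the factory’s generator function (reusing the hypothetical log helper from the earlier sketch).

// after the config has been loaded, before invoking the inner handler
const chaosConfig = config.chaosConfig || {};
const injectionConfig = chaosConfig.functionLatencyInjectionConfig;

if (injectionConfig && injectionConfig.isEnabled &&
    Math.random() <= injectionConfig.probability) {
  const delayMs = injectionConfig.minDelay +
    Math.floor(Math.random() * (injectionConfig.maxDelay - injectionConfig.minDelay));
  log.info(`chaos: injecting ${delayMs}ms of latency into this invocation`);
  yield new Promise(resolve => setTimeout(resolve, delayMs));
}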

Just like that, we can now inject latency to function invocations via configuration. This will work for any API function that is created using the apiHandler factory.

With this change and both kinds of latency injections enabled, I can observe all the expected scenarios through X-Ray:

  • no latency was injected

  • latency was injected to the function invocation only

  • latency was injected to the HTTP client only

  • latency was injected to both the HTTP client and the function invocation, but the invocation did not time out as a result

  • latency was injected to both the HTTP client and the function invocation, and the invocation timed out as a result

I can get further confirmation of the expected behaviour through logs, and the metadata recorded in the X-Ray traces.

Recap, and future works

In this post we discussed:

  • why you should consider applying the practice of latency injection to APIs created with API Gateway and Lambda
  • additional considerations specific to API Gateway and Lambda
  • where you can inject latencies, and why you should consider injecting latency at each of these places
  • how you can inject latency in HTTP clients, AWS SDK, as well as the function invocation

The approach we have discussed here is driven by configuration, and the configuration is refreshed every 3 mins by default.

We can go much further with this.

Fine grained configuration

The configurations can be more fine-grained, and allow you to control latency injection for specific resources.

For example, instead of a blanket httpClientLatencyInjectionConfig for all HTTP requests (including those requests to AWS services), the configuration can be specific to an API, or a DynamoDB table.

Automation

The configurations can be changed by an automated process to:

  • run routine validations daily
  • stop all latency injections during off hours, and holidays
  • forcefully stop all latency injections, eg. during an actual outage
  • orchestrate complex scenarios that are difficult to manage by hand, eg. enable latency injection at several places at once

Again, we can look to Netflix for inspiration for such an automated platform.

Usually, you would want to enable one latency injection in a bounded context at a time. This helps contain the blast radius of unintended damages, and make sure your experiments are actually controlled. Also, when latency is injected at several places, it is harder to understand the causality we observe as there are multiple variables to consider.

Unless, of course, you’re validating against specific hypothesis such as:

The system can tolerate outage to both the primary store (DynamoDB) as well as the backup store (S3) for user preferences, and would return a hardcoded default value in that case.

Better communication

Another good thing to do is to inform the caller of the fact that latency has been added to the invocation by design.

This might take the form of an HTTP header in the response to tell the caller how much latency was injected in total. If you’re using an automated process to generate these experiments, then you should also include the id/tag/name for the specific instance of the experiment as an HTTP header as well.
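
For example (purely illustrative, with hypothetical header and variable names), the wrapped handler could surface this information on the response it returns:

callback(null, {
  statusCode: 200,
  headers: {
    "x-injected-latency-ms": `${injectedDelayMs}`, // total delay added by design
    "x-chaos-experiment-id": experimentId          // which experiment caused it
  },
  body: JSON.stringify(result)
});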

What’s next?

As I mentioned in the previous post, you need to apply common sense when deciding when and where you apply chaos engineering practices.

Don’t attempt an exercise that you know is beyond your abilities.

Before you even consider applying latency injection to your APIs in production, you need to think about how you can deal with these latency spikes given the inherent constraints of API Gateway and Lambda.

Unfortunately, we have run out of time to talk about this in this post, but come back in 2 weeks and we will talk about several strategies you can employ in part 3.

The code for the demo in this post is available on github here. Feel free to play around with it and let me know if you have any suggestions for improvement!


Using Protocol Buffers with API Gateway and AWS Lambda

AWS announced binary support for API Gateway in late 2016, which opened up the door for you to use more efficient binary formats such as Google’s Protocol Buffers and Apache Thrift.

Why?

Compared to JSON – which is the bread and butter for APIs built with API Gateway and Lambda – these binary formats can produce significantly smaller payloads.

At scale, they can make a big difference to your bandwidth cost.

In restricted environments such as low-end devices or in countries with poor mobile connections, sending smaller payloads can also improve your user experience by improving the end-to-end network latency, and possibly processing time on the device too.

Comparison of serializer performance between Protocol Buffers and JSON in .NET

How

Follow these 3 simple steps (assuming you’re using the Serverless framework):

  1. install the awesome serverless-apigw-binary plugin
  2. add application/x-protobuf to the binary media types (see screenshot below)
  3. add a function that returns a Protocol Buffers payload as a base64-encoded response

The serverless-apigw-binary plugin has made it really easy to add binary support to API Gateway
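
The screenshot isn’t reproduced here; in serverless.yml the plugin’s configuration looks roughly like this (treat it as a sketch and check the plugin’s docs for the exact keys):

plugins:
  - serverless-apigw-binary

custom:
  apigwBinary:
    types:
      - 'application/x-protobuf'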

To encode & decode Protocol Buffers payloads in Node.js, you can use the protobufjs package from NPM.

It lets you work with your existing .proto files, or you can use JSON descriptors. Give the docs a read to see how you can get started.

In the demo project (link at the bottom of the post) you’ll find a Lambda function that always returns a response in Protocol Buffers.

A couple of things to note from this function:

  • we set the Content-Type header to application/x-protobuf
  • the body is the base64-encoded representation of the Protocol Buffers payload
  • isBase64Encoded is set to true

You need to do all 3 of these things to make API Gateway return the response as binary data.

Consider them the magic incantation for making API Gateway return binary data; the caller also has to set the Accept header to application/x-protobuf.
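
The handler itself isn’t embedded here; a sketch of what it might look like with protobufjs, assuming a hypothetical scores.proto file that defines a demo.PlayerScores message:

const protobuf = require('protobufjs');

module.exports.handler = (event, context, callback) => {
  protobuf.load('scores.proto')                               // hypothetical .proto file
    .then(root => {
      const PlayerScores = root.lookupType('demo.PlayerScores');
      const payload = { players: [ /* ... */ ] };
      const buffer = PlayerScores.encode(PlayerScores.create(payload)).finish();

      callback(null, {
        statusCode: 200,
        headers: { 'Content-Type': 'application/x-protobuf' }, // 1. content type
        body: Buffer.from(buffer).toString('base64'),          // 2. base64 encoded body
        isBase64Encoded: true                                  // 3. tell API Gateway it's binary
      });
    })
    .catch(callback);
};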

In the same project, there’s also a JSON endpoint that returns the same payload as comparison.

The response from this JSON endpoint looks like this:

{"players":[{"id":"eb66db14992e06b36282d607cf0134ce4fe45f50","name":"Calvin Ortiz","scores":[57,12,100,56,47,78,20,37,32,48]},{"id":"7b9b38e535453d120e706ff57fef41f6fee991cb","name":"Marcus Cummings","scores":[40,57,24,15,45,54,25,67,59,23]},{"id":"db34a2a5f4d16e77a6d3d6154a8b8bb6760b3b99","name":"Harry James","scores":[61,85,14,70,8,80,14,22,76,87]},{"id":"e21018c4f43eef10771e0fa71bc54156b00a64dd","name":"Gregory Bishop","scores":[51,31,27,47,72,75,61,28,100,41]},{"id":"b3ee29ee49b640ce15be1737d0dca60e48108ee1","name":"Ann Evans","scores":[69,17,48,99,85,8,75,55,78,46]},{"id":"9c1e6d4d46bb0c0d2c92bab11e5dbd5f4ab0c619","name":"Juan Perez","scores":[71,34,60,84,21,98,60,8,91,92]},{"id":"d8de89222633c61393931457c1e72558eba48639","name":"Loretta Harvey","scores":[15,40,73,92,42,65,58,30,26,84]},{"id":"141dad672ec559431f808964391d128d2c3274bf","name":"Ian Powell","scores":[17,21,14,84,64,14,22,22,34,92]},{"id":"8a97e85e2e5385c45fc31f24bfe781c26f78c0b7","name":"Steve Gibson","scores":[33,97,6,1,20,1,78,3,77,19]},{"id":"6b3ca6924e17cd5fd9d91b36d49b36a5d542c9ea","name":"Harold Ferguson","scores":[31,32,4,10,37,85,46,86,39,17]}]}

As you can see, it’s just a bunch of randomly generated names and GUIDs, and integers. The same response in Protocol Buffers is nearly 40% smaller.

Problem with the protobufjs package

Before we move on, there is one important detail about using the protobufjs package in a Lambda function – you need to npm install the package on a Linux system.

This is because it has a dependency that is distributed as native binaries, so if you installed the package on OSX then the binaries that are packaged and deployed to Lambda will not run in the Lambda execution environment.

I had similar problems with other Google libraries in the past. I find the best way to deal with this is to take a leaf out of aws-serverless-go-shim’s approach and deploy your code inside a Docker container.

This way, you would locally install a compatible version of the native binaries for your OS so you can continue to run and debug your function with sls invoke local (see this post for details).

But, during deployment, a script would run npm install --force in a Docker container running a compatible Linux distribution. This would then install a version of the native binaries that can be executed in the Lambda execution environment. The script would then use sls deploy to deploy the function.

The deployment script can be something simple like this:
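
The script itself isn’t embedded here; conceptually it only needs to do two things inside the container (paths are illustrative):

#!/bin/bash
# reinstall dependencies against the container's Linux environment,
# then deploy with the Serverless framework
npm install --force
./node_modules/.bin/sls deploy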

In the demo project, I also have a docker-compose.yml file:
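
The file isn’t reproduced here either; a minimal sketch might look like this (the image tag and script name are assumptions):

version: '2'
services:
  deploy:
    image: node:6.10                # roughly the Lambda Node.js runtime of the time
    working_dir: /app
    volumes:
      - .:/app
      - $HOME/.aws:/root/.aws       # give the AWS SDK access to my credentials
    command: ./deploy.sh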

The Serverless framework requires my AWS credentials, hence why I’ve attached the $HOME/.aws directory to the container for the AWS SDK to find at runtime.

To deploy, run docker-compose up.

Use HTTP content negotiation

Whilst binary formats are more efficient when it comes to payload size, they do have one major problem: they’re really hard to debug.

Imagine the scenario – you have observed a bug, but you’re not sure if the problem is in the client app or the server. But hey, let’s just observe the HTTP conversation with a HTTP proxy such as Charles or Fiddler.

This workflow works great for JSON but breaks down when it comes to binary formats such as Protocol Buffers as the payloads are not human readable.

As we have discussed in this post, the human readability of JSON comes with the cost of heavier bandwidth usage. For most network communications, be it service-to-service, or service-to-client, unless a human is actively “reading” the payloads it’s not worth paying the cost. But when a human is trying to read it, that human readability is very valuable.

Fortunately, HTTP’s content negotiation mechanism means we can have the best of both worlds.

In the demo project, there is a contentNegotiated function which returns either JSON or Protocol Buffers payloads based on the Accept header.

By default, you should use Protocol Buffers for all your network communications to minimise bandwidth use.

But, you should build in a mechanism for toggling the communication to JSON when you need to observe the communications. This might mean:

  • for debug builds of your mobile app, allow super users (devs, QA, etc.) the ability to turn on debug mode, which would switch the networking layer to send Accept header as application/json
  • for services, include a configuration option to turn on debug mode (see this post on configuring functions with SSM parameters and cache client for hot-swapping) to make service-to-service calls use JSON too, so you can capture and analyze the request and responses more easily

As usual, you can try out the demo code yourself, the repo is available here.

Capture and forward correlation IDs through different Lambda event sources

Serverless architectures are microservices by default, so you need correlation IDs to help debug issues that span across multiple functions, and possibly different event source types – asynchronous, synchronous and streams.

This is the last of a 3-part mini series on managing your AWS Lambda logs.

If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.

part 1: centralise logging

part 2: tips and tricks

Why correlation IDs?

As your architecture becomes more complex, many services have to work together in order to deliver the features your users want.

Microservice death stars, circa 2015.

When everything works, it’s like watching an orchestra, lots of small pieces all acting independently whilst at the same time collaborating to form a whole that’s greater than the sum of its parts.

However, when things don’t work, it’s a pain in the ass to debug. Finding that one clue is like finding a needle in a haystack as there are so many moving parts, and they’re all constantly moving.

Imagine you’re an engineer at Twitter trying to debug why a user’s tweet was not delivered to one of his followers’ timelines.

“Let me cross reference the logs from hundreds of services and find the logs that mention the author’s user ID, the tweet ID, or the recipient’s user ID, and put together a story of how the tweet flowed through our system and why it wasn’t delivered to the recipient’s timeline.”

“What about logs that don’t explicitly mention those fields?”

“mm… let me get back to you on that…”

Needle in the haystack.

This is the problem that correlation IDs solve in the microservice world – to tag every log message with the relevant context so that it’s easy to find them later on.

Aside from common IDs such as user ID, order ID, tweet ID, etc. you might also want to include the X-Ray trace ID in every log message. That way, if you’re using X-Ray with Lambda then you can use it to quickly load up the relevant trace in the X-Ray console.

By default, Lambda automatically generates a _X_AMZN_TRACE_ID value in the environment variables.

Also, if you’re going to add a bunch of correlation IDs to every log message then you should consider switching to JSON. Then you need to update the ship-logs function we introduced in part 1 to handle log messages that are formatted as JSON.

Enable debug logging on entire call chain

Another common problem people run into is that, by the time we realise there’s a problem in production, we find out that the crucial piece of information we need to debug the problem is logged as DEBUG, and we disable DEBUG logs in production because they’re too noisy.

“Darn it, now we have to enable debug logging and redeploy all these services! What a pain!”

“Don’t forget to disable debug logging and redeploy them, after you’ve found the problem ;-)”

Fortunately it doesn’t have to be a catch-22 situation. You can enable DEBUG logging on the entire call chain by:

  1. make the decision to enable DEBUG logging (for say, 5% of all requests) at the edge service
  2. pass the decision on all outward requests alongside the correlation IDs
  3. on receiving the request from the edge service, possibly through async event sources such as SNS, the intermediate services will capture this decision and turn on DEBUG logging if asked to do so
  4. the intermediate services will also pass that decision on all outward requests alongside the correlation IDs

The edge service decides to turn DEBUG logging on for 5% of requests, that decision is captured and passed along throughout the entire call chain, through HTTP requests, SNS message and Kinesis events.

Capture and forward correlation IDs

With that out of the way, let’s dive into some code to see how you can actually make it work. If you want to follow along, then the code is available in this repo, and the architecture of the demo project looks like this:

The demo project consists of an edge API, api-a, which initialises a bunch of correlation IDs as well as the decision on whether or not to turn on debug logging. It’ll pass these along through HTTP requests to api-b, Kinesis events and SNS messages. Each of these downstream functions would in turn capture and pass them along to api-c.

We can take advantage of the fact that concurrency is now managed by the platform, which means we can safely use global variables to store contextual information relevant for the current invocation.

In the handler function we can capture incoming correlation IDs in global variables, and then include them in log messages, as well as any outgoing messages/HTTP requests/events, etc.

To abstract away the implementation details, let’s create a requestContext module that makes it easy to fetch and update this context data:
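
The module itself isn’t embedded here; a minimal sketch might look like this, relying on the fact that a container only ever handles one invocation at a time:

// requestContext.js - holds contextual information for the current invocation
let context = {};

// wipe and replace the context at the start of every invocation
const replaceAllWith = ctx => { context = ctx || {}; };

// add a single piece of context, eg. an x-correlation-xxx ID
const set = (key, value) => { context[key] = value; };

// return a copy of the current context
const get = () => Object.assign({}, context);

module.exports = { replaceAllWith, set, get };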

And then add a log module which:

  • disables DEBUG logging by default
  • enables DEBUG logging if explicitly overridden via environment variables, or if a Debug-Log-Enabled field was captured in the incoming request alongside other correlation IDs
  • logs messages as JSON
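
The log module isn’t embedded here either; a sketch, building on the requestContext sketch above (the log_level env var and the Debug-Log-Enabled key are illustrative conventions):

// log.js - structured JSON logging with correlation IDs baked in
const requestContext = require('./requestContext');

const LogLevels = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 };

// default to INFO so DEBUG messages are suppressed unless explicitly enabled
const currentLevel = () => LogLevels[process.env.log_level] !== undefined
  ? LogLevels[process.env.log_level]
  : LogLevels.INFO;

const debugEnabledForRequest = () =>
  requestContext.get()['Debug-Log-Enabled'] === 'true';

const log = (levelName, message, params) => {
  const shouldLog = LogLevels[levelName] >= currentLevel() ||
    (levelName === 'DEBUG' && debugEnabledForRequest());
  if (!shouldLog) { return; }

  // every log message carries the captured correlation IDs
  const logMsg = Object.assign({}, requestContext.get(), params, {
    level: levelName,
    message
  });
  console.log(JSON.stringify(logMsg));
};

module.exports = {
  debug: (msg, params) => log('DEBUG', msg, params),
  info:  (msg, params) => log('INFO', msg, params),
  warn:  (msg, params) => log('WARN', msg, params),
  error: (msg, params, err) => log('ERROR', msg,
    Object.assign({}, params, err ? { errorName: err.name, errorMessage: err.message } : {}))
};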

Once we start capturing correlation IDs, our log messages would look something like this:

Notice that I have also captured the User-Agent from the incoming request, as well as the decision to not enable DEBUG logging.

Now let’s see how we can capture and forward correlation IDs through API Gateway and outgoing HTTP requests.

API Gateway

You can capture and pass along correlation IDs via HTTP headers. The trick is making sure that everyone in the team follows the same conventions.

To standardise these conventions (what to name headers that are correlation IDs, etc.) you can provide a factory function that your developers can use to create API handlers. Something like this perhaps:
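
The factory isn’t embedded here; below is a sketch of what it could look like, building on the requestContext and log sketches above (the header names and the 5% sampling rate are illustrative conventions, not a prescribed API).

const requestContext = require('./requestContext');
const log = require('./log');

module.exports = f => (event, context, callback) => {
  const headers = event.headers || {};
  const captured = {};

  // capture any incoming x-correlation-xxx headers
  for (const header of Object.keys(headers)) {
    if (header.toLowerCase().startsWith('x-correlation-')) {
      captured[header.toLowerCase()] = headers[header];
    }
  }

  // as the edge API, initialise the correlation ID if there isn't one yet
  if (!captured['x-correlation-id']) {
    captured['x-correlation-id'] = context.awsRequestId;
  }

  // capture the User-Agent and the sampling decision for DEBUG logging
  if (headers['User-Agent']) { captured['User-Agent'] = headers['User-Agent']; }
  captured['Debug-Log-Enabled'] =
    headers['Debug-Log-Enabled'] || (Math.random() < 0.05 ? 'true' : 'false');

  requestContext.replaceAllWith(captured);
  log.debug('incoming event', { event });

  Promise.resolve(f(event, context))
    .then(result => callback(null, { statusCode: 200, body: JSON.stringify(result) }))
    .catch(err => {
      log.error('unhandled exception', {}, err);
      callback(null, { statusCode: 500 });
    });
};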

When you need to implement another HTTP endpoint, pass your handler code to this factory function. Now, with minimal change, all your logs will have the captured correlation IDs (as well as User-Agent, whether to enable debug logging, etc.).

The api-a function in our earlier architecture looks something like this:

Since this is the API on the edge, it initialises the x-correlation-id using the AWS Request ID for its invocation. This, along with several other pieces of contextual information, is recorded with every log message.

By adding a custom HTTP module like this one, you can also make it easy to include these contextual information in outgoing HTTP requests. Encapsulating these conventions in an easy-to-use library also helps you standardise the approach across your team.

In the api-a function above, we made a HTTP request to the api-b endpoint. Looking in the logs, you can see the aforementioned contextual information has been passed along.

In this case, we also have the User-Agent from the original user-initiated request to api-a. This is useful because when I look at the logs for intermediate services, I often miss the context of what platform the user is using, which makes it harder to correlate the information I gather from the logs to the symptoms the user describes in their bug reports.

When the api-b function (see here) makes its own outbound HTTP request to api-c, it’ll pass along all of this contextual information plus anything we add in the api-b function itself.

Log message for when api-b calls api-c with the custom HTTP module. Notice it includes the “x-correlation-character-b” header which is set by the api-b function.

When you see the corresponding log message in api-c’s logs, you’ll see all the context from both api-a and api-b.

SNS

To capture and forward correlation IDs through SNS messages, you can use message attributes.

In the api-a function above, we also published a message to SNS (omitted from the code snippet above) with a custom sns module which includes the captured correlation IDs as message attributes, see below.
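
The sns module isn’t embedded here; a sketch of the idea (module layout and naming are assumptions):

// sns.js - publishes messages with the current correlation IDs as attributes
const AWS = require('aws-sdk');
const requestContext = require('./requestContext');

const snsClient = new AWS.SNS();

const messageAttributes = () => {
  const context = requestContext.get();
  const attributes = {};
  for (const key of Object.keys(context)) {
    attributes[key] = { DataType: 'String', StringValue: `${context[key]}` };
  }
  return attributes;
};

const publish = (topicArn, message) =>
  snsClient.publish({
    TopicArn: topicArn,
    Message: JSON.stringify(message),
    MessageAttributes: messageAttributes()
  }).promise();

module.exports = { publish };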

When this SNS message is delivered to a Lambda function, you can see the correlation IDs in the MessageAttributes field of the SNS event.

Let’s create a snsHandler factory function to standardise the process of capturing incoming correlation IDs via SNS message attributes.
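
A sketch of such a factory function might look like this (SNS delivers one record per invocation, so there is only one set of attributes to capture):

const requestContext = require('./requestContext');

module.exports = f => (event, context, callback) => {
  const sns = event.Records[0].Sns;
  const attributes = sns.MessageAttributes || {};

  // rebuild the request context from the message attributes
  const captured = {};
  for (const key of Object.keys(attributes)) {
    captured[key] = attributes[key].Value;
  }
  requestContext.replaceAllWith(captured);

  Promise.resolve(f(JSON.parse(sns.Message), context))
    .then(result => callback(null, result))
    .catch(callback);
};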

We can use this factory function to quickly create SNS handler functions. The log messages from these handler functions will have access to the captured correlation IDs. If you use the aforementioned custom http module to make outgoing HTTP requests then they’ll be included as HTTP headers automatically.

For instance, the following SNS handler function would capture incoming correlation IDs, include them in log messages, and pass them on when making a HTTP request to api-c (see architecture diagram).

Those correlation IDs (including the one added by the SNS handler function) are included as HTTP headers.

Kinesis Streams

Unfortunately, with Kinesis and DynamoDB Streams, there’s no way to tag additional information with the payload. Instead, in order to pass correlation IDs along, we’d have to modify the actual payload itself.

Let’s create a kinesis module for sending events to a Kinesis stream, so that we can insert a __context field into the payload to carry the correlation IDs.
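
The kinesis module isn’t embedded here; a sketch of the idea (function names are assumptions):

// kinesis.js - tucks the current requestContext into a __context field
const AWS = require('aws-sdk');
const requestContext = require('./requestContext');

const kinesis = new AWS.Kinesis();

const putRecord = (streamName, partitionKey, event) => {
  // piggyback the correlation IDs onto the event payload itself
  const payload = Object.assign({}, event, { __context: requestContext.get() });

  return kinesis.putRecord({
    StreamName: streamName,
    PartitionKey: partitionKey,
    Data: JSON.stringify(payload)
  }).promise();
};

module.exports = { putRecord };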

On the receiving end, we can take it out, use it to set the current requestContext, and delete this __context field before passing it on to the Kinesis handler function for processing. The sender and receiver functions won’t even notice we modified the payload.

Wait, there’s one more problem – our Lambda function will receive a batch of Kinesis records, each with its own context. How will we consolidate that?

The simplest way is to force the handler function to process records one at a time. That’s what we’ve done in the kinesisHandler factory function here.
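
The factory isn’t embedded here; below is a sketch of how it could process records one at a time while swapping the request context in and out.

const requestContext = require('./requestContext');

module.exports = f => (event, context, callback) => {
  // Kinesis record data arrives base64 encoded
  const records = event.Records.map(record =>
    JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf8')));

  // process the records sequentially, one at a time
  const processOne = (index) => {
    if (index >= records.length) { return Promise.resolve(); }

    const record = records[index];
    const recordContext = record.__context || {};
    delete record.__context; // hide the plumbing from the handler

    // swap in the correlation IDs captured for this record
    requestContext.replaceAllWith(recordContext);

    return Promise.resolve(f(record, context))
      .then(() => processOne(index + 1));
  };

  processOne(0)
    .then(() => callback(null, 'ok'))
    .catch(callback);
};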

The handler function (created with the kinesisHandler factory function) would process one record at a time, and won’t have to worry about managing the request context. All of its log messages would have the right correlation IDs, and outgoing HTTP requests, SNS messages and Kinesis events would also pass those correlation IDs along.

When api-c receives the invocation event, you can see the correlation IDs have been passed along via HTTP headers.

This approach is simple, developers working on Kinesis handler functions won’t have to worry about the implementation details of how correlation IDs are captured and passed along, and things “just work”.

However, it also removes the opportunity to optimize by processing all the records in a batch. Perhaps your handler function has to persist the events to a persistence store that’s better suited for storing large payloads rather than lots of small ones.

This simple approach is not the right fit for every situation. An alternative would be to leave the __context field on the Kinesis records and let the handler function deal with it as it sees fit. In that case you would also need to update the shared libraries – the log, http, sns and kinesis modules we have talked about so far – to give the caller the option to pass in a requestContext as an override.

This way, the handler function can process the Kinesis records in a batch. Where it needs to log or make a network call in the context of a specific record, it can extract and pass the request context along as need be.

The End

That’s it, folks. A blueprint for how to capture and forward correlation IDs through 3 of the most commonly used event sources for Lambda.

Here’s an annotated version of the architecture diagram earlier, showing the flow of data as they’re captured and forwarded from one invocation to another, through HTTP headers, message attributes and Kinesis record data.

You can find a deployable version of the code you have seen in this post in this repo. It’s intended for demo sessions in my O’Reilly course detailed below, so documentation is seriously lacking at the moment, but hopefully this post gives you a decent idea of how the project is held together.

Other event sources

There are plenty of event sources that we didn’t cover in this post.

It’s not possible to pass correlation IDs through every event source, as some do not originate from your system – eg. CloudWatch Events that are triggered by API calls made by AWS services.

And it might be hard to pass correlation IDs through, say, DynamoDB Streams – the only way (that I can think of) for it to work is to include the correlation IDs as fields in the row (which might not be such a bad idea, but it does have cost implications).

Yubl’s road to Serverless – Part 5 – building better recommendations with Lambda, BigQuery and GrapheneDB

Note: see here for the rest of the series.

 

When I joined Yubl in April 2016, it had launched just 2 months earlier, after a long and chaotic development cycle that lasted more than 2 years – all the while there was a fully armed sales team before there was even a product!

Some seriously bad decisions happened at Yubl.. and judging by Silicon Valley this kind of decision making is far more common than we realised.

That said, many good things also happened at Yubl, and I had the pleasure to work with some of the best people I have met in my career. This post is about one of the ailing features we were able to quickly turn around with the power of AWS Lambda and using the right tool for the job.


Fans of Silicon Valley probably remember that scene from Season 3 when Richard and co walked into their shiny new Pied Piper office to find that “Action” Jack Barker had hired an army of sales people before they even had a product.


A Broken Feature

Upon joining the company, I found out the app already had a Find People feature although it didn’t do what I expected. The likes of Twitter and Facebook would employ sophisticated algorithms to find people with shared interest to you. Our feature on the other hand would return the first 30 users in MongoDB that you aren’t already following, by the order of account creation time. For most users this list would equate to the first 30 Yubl employees that installed the app… talk about rigging the game!

One of the devs made a valiant attempt to improve the feature by returning only users who have shared connections with you – either you both follow X or you are both followed by X.

However, the implementation was a series of expensive (and complicated) MongoDB queries per user request. Ultimately it was an approach that would not scale with throughput nor complexity as it’s using the wrong tool for the job.

Lambda + GrapheneDB = Efficient Graph Queries

I had previously worked with Neo4j at Gamesys and used it to analyze and model the complex in-game economy of an MMORPG.

A graph database like Neo4j is the perfect place to store our social graph, and allows us to efficiently perform the kind of graph queries we need in order to find users you should follow, eg. 2nd/3rd degree connections.

GrapheneDB offers a hosted Neo4j database as a service, with built-in monitoring, dashboards, automated backups and scaling up. It was the perfect choice to get us going and start delivering value to our users quickly.

At this point in time we were already streaming all state changes in the system into Kinesis. To export all of our social graph into GrapheneDB and to keep it in sync with MongoDB we:

  1. ran a one-off task to export all the relationship data into GrapheneDB
  2. subscribed a Lambda function to the Relationship Kinesis stream to process any subsequent relationship changes and update the social graph (in GrapheneDB) in real time

We then exposed the data via API Gateway and Lambda so that the client app and other internal services can use it to easily find suggested users for a user to follow.

Future Plans

Given the limitation that Neo4j requires all of your graph to be stored on one machine (and it has pretty taxing hardware requirements too), it was not the long term solution for us.

Based on my estimates, the biggest instance available on GrapheneDB would suffice until we had more than 10M users. This was calculated based on the average no. of connections per user on our platform, using Twitter’s user stats as a guideline for where we might be at 10M users.

We can push that ceiling much further by moving to a batch model and preprocessing recommendations for each user, to reduce the no. of live queries against a large graph. The recommendations can be restricted to active users only (eg. users that have logged in in the last X days), and only refreshed when:

  • the recommendations are stale, ie. not acted upon by the user for more than X days so they might not be what the user wants; or
  • when the user’s extended social graph has changed, ie. followers/followees have new connections

From what I was able to gather, all the big social networks use a batch model for scalability and cost reasons.

As for a long term solution, we hadn’t settled on anything. I looked at Facebook’s Giraph briefly but it’s far more sophisticated than we were ready for. There are other “fantasy” ideas like the Mosaic system described in this paper. It would have been a fantastic challenge had we got that far.

Finding Trending Users

Because we were still a small social network – with just over 800K installs – it wasn’t sufficient to make recommendations based on a user’s social graph alone, as most users have a pretty small social graph.

To bridge the gap we decided to also include trending users on the platform in your recommendations.

Thankfully, all of our events (eg. X followed Y, X liked Y’s post, etc.) are streamed into Google BigQuery. We chose BigQuery because AWS Athena hadn’t been announced yet and RedShift is not the right model for making ad-hoc, live queries that need to respond quickly. Also, I had many years of experience using BigQuery at Gamesys so it was a no-brainer at the time.

ps. if you’re curious about the difference between Athena and BigQuery, Lynn Langit gave a comprehensive comparison at Serverless Austin this year.

To find trending users, we worked with the product team to create a formula to calculate a user’s “trendiness” based on no. of new followers in the last 24 hours. The follower count is weighted exponentially by how recently the user was followed. For instance, a follower that followed you in the past hour gives you a score of 1, but a follower that followed you 3 hours ago would only earn you a score of 0.1.
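
The actual formula isn’t shown here; purely as an illustration, one exponential decay consistent with the two examples above (a follow within the last hour scores 1, a follow 3 hours ago scores 0.1) would be:

// illustrative only - not the production formula
const followerScore = (hoursSinceFollow) =>
  hoursSinceFollow <= 1 ? 1 : Math.pow(0.1, (hoursSinceFollow - 1) / 2);

// a user's "trendiness" is then the sum of the weighted scores of all
// new followers gained in the last 24 hours
const trendiness = (followAgesInHours) =>
  followAgesInHours.reduce((sum, h) => sum + followerScore(h), 0);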

We created a cron job with CloudWatch Events and Lambda to perform the aforementioned query against BigQuery every 3 hours. To save on cost, our query would only process events that were inserted in the last 24 hours.

The results are then saved into a DynamoDB table, which is overwritten at the end of each run.

Once again, we exposed the data via API Gateway and Lambda.

Migration to new APIs

Now, we have 2 new APIs to provide live suggestions based on a user’s social graph, and to find users who are currently trending on our platform.

However, the client apps would need to be updated to take advantage of these new APIs. Instead of waiting for the client teams to catch up, we updated the legacy API’s suggestion endpoint to use results from both so we can provide value to our users earlier.

“The lead time to someone saying thank you is the only reputation metric that matters.”

– Dan North

This is how it looks when we put everything together:

One of the most satisfying aspects of this work was how quickly we were able to turn this feature around and deploy the new system into production. Everything came together in less than 2 weeks, which is largely because we were able to focus on our business needs and let services such as Lambda, BigQuery and GrapheneDB deal with the undifferentiated efforts.

Serverless 1.X – enable API Gateway caching on request parameters

Having previously blogged about the untrodden path to enable caching on API Gateway request parameters in the Serverless framework 0.5.X, it’s a little disappointing that it’s still not officially fixed in the 1.X versions…

The Problem

The problem is two-fold:

  1. there’s currently no way to specify caching should be enabled for path & query string parameters
  2. the CloudFormation template Serverless 1.X generates for API Gateway is missing a few optional fields; these missing fields stop you from manually enabling caching in the API Gateway management console too

After you deploy your Lambda function with its associated API, if you go to the management console and enable caching on path or request parameters, you will get an error saying “Invalid cache key parameter specified”.

The Workaround

A friend pointed me to a neat trick to modify the CloudFormation template that Serverless 1.X auto-generates for you.

After the project is deployed, you can go to CloudFormation and view the template that Serverless has generated. These templates are pretty big (and poorly formatted), so I find it easier to open them up in the Designer view and use that view to navigate to the endpoint I’m looking for.

Once you find the resource template for the endpoint, write down its name. Now go back to the serverless.yml file in your project, and add the resource name to the resources section at the bottom. You only need to include fields that you want to update or add to the template.

The CloudFormation syntax for an API Gateway method looks like this:

We also need to fill in some blanks for the Integration section:

For more details on the CloudFormation syntax, see here and here.

After some trial-and-error, the minimum set of fields I had to add are:
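
The exact fields aren’t embedded here and will depend on your endpoint, but the shape of the override in serverless.yml is roughly this (the logical ID and parameter name below are placeholders for what you noted down from the Designer view):

resources:
  Resources:
    ApiGatewayMethodMyResourceIdVarGet:          # the logical ID you noted down
      Type: AWS::ApiGateway::Method
      Properties:
        RequestParameters:
          method.request.path.id: true           # accept the path parameter
        Integration:
          CacheNamespace: MyResourceIdVarCacheNS # any unique string
          CacheKeyParameters:
            - method.request.path.id             # include it in the cache key
          # depending on your template, you may also need to restate fields
          # such as Type, IntegrationHttpMethod and Uri here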

Redeploy with Serverless and the path parameter is enabled for caching:

Wrap Up

I hope you have found this post useful, though I’m surprised by the lack of information out there during my research and the lack of official support from the Serverless framework.

If you know of a better way to do this, please let me know in the comments.

Links