Applying principles of chaos engineering to AWS Lambda with latency injection

This is part 2 of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions.


The most common issues I have encountered in production are latency/performance related. They can be symptomatic of a host of underlying causes, ranging from AWS network issues (which can also manifest themselves as latency/error-rate spikes in any of the AWS services) and overloaded servers, to simple GC pauses.

Latency spikes are inevitable – as much as you can improve the performance of your application, things will go wrong, eventually, and often they’re out of your control.

So you must design for them, and degrade the quality of your application gracefully to minimize the impact on your users.

In the case of API Gateway and Lambda, there are additional considerations:

  • API Gateway has a hard limit of 29s for integration timeouts, so even if your Lambda function can run for up to 5 mins, API Gateway will time out way before that

  • if you use moderate timeout settings for your API functions (and you should!), then you need to consider the effects of cold starts when calling an intermediate service

Where to inject latency

Suppose our client application communicates directly with 2 public-facing APIs, which in turn depend on an internal API.

In this setup, I can think of 3 places where we can inject latency and each would validate a different hypothesis.

Inject latency at HTTP clients

The first, and easiest place to inject latency is in the HTTP client library we use to communicate with the internal API.

This will test that our function has an appropriate timeout on this HTTP communication and can degrade gracefully when the request times out.

We can inject latency to the HTTP client libraries for our internal APIs, hence validating that the caller function has configured an appropriate timeout and error handling for timeouts.

Furthermore, this practice should also be applied to other 3rd party services we depend on, such as DynamoDB. We will discuss how we can inject latency to these 3rd party libraries later in the post.

We can also inject latency to 3rd party client libraries for other managed services we depend on.

This is a reasonably safe place to inject latency as the immediate blast radius is limited to this function.

However, you can (and arguably, should) consider applying this type of latency injection to intermediate services as well. Doing so does carry extra risk, as it has a broader blast radius in failure cases – i.e. if the function under test does not degrade gracefully, then it can cause unintended problems for outer services. In that case, the blast radius is the same as if you were injecting latency to the intermediate functions directly.

Inject latency to intermediate functions

You can also inject latency directly to the functions themselves (we’ll look at how later on). This has the same effect as injecting latency to the HTTP client of each of its dependents, except it’ll affect all its dependents at once.

We can inject latency to a function’s invocation. If that function is behind an internal API that is used by multiple public-facing APIs, then it can cause all its dependents to experience timeouts.

This might seem risky (it can be), but is an effective way to validate that every service that depends on this API endpoint is expecting, and handling timeouts gracefully.

It makes the most sense when applied to intermediate APIs that are part of a bounded context (or, a microservice), maintained by the same team of developers. That way, you avoid unleashing chaos upon unsuspecting developers who might not be ready to deal with it.

That said, I think there is a good counter-argument for doing just that.

We often fall into the pitfall of using the performance characteristics of dev environments as a predictor for the production environment. Whilst we seldom experience load-related latency problems in dev environments (because we don’t have enough load in those environments to begin with), production is quite another story. Which means we’re not programmed to think about these failure modes during development.

So, a good way to hack the brains of your fellow developers and programme them to expect timeouts is to expose them to these failure modes regularly, by injecting latency to our internal APIs in the dev environments.

We can figuratively hold up a sign and tell other developers to expect latency spikes and timeouts by literally exposing them to these scenarios in dev environments, regularly, so they know to expect it.

In fact, if we make our dev environments exhibit the most hostile and turbulent conditions that our systems should be expected to handle, then we know for sure that any system that makes its way to production is ready to face what awaits it in the wild.

Inject latency to public-facing functions

So far, we have focused on validating the handling of latency spikes and timeouts in our APIs. The same validation is needed for our client application.

We can apply all the same arguments mentioned above here. By injecting latency to our public-facing API functions (in both production as well as dev environments), we can:

  • validate that the client application handles latency spikes and timeouts gracefully, and offers the best UX possible in these situations
  • train our client developers to expect latency spikes and timeouts

When I was working on an MMORPG at Gamesys years ago, we uncovered a host of frailties in the game when we injected latency spikes and faults to our APIs. The game would crash during startup if any of the first handful of requests failed. In some cases, if the response time was longer than a few seconds, the game would also get into a weird state because of race conditions.

Turns out I was setting my colleagues up for failure in production because the dev environment was so forgiving and gave them a false sense of comfort.

With that, let’s talk about how we can apply the practice of latency injection.

But wait, can’t you inject latency in the client-side HTTP clients too?

Absolutely! And you should! However, for the purpose of this post we are going to look at how and where we can inject latency to our Lambda functions only, hence why I have willfully ignored this part of the equation.

How to inject latency

There are 2 aspects to actually injecting latencies:

  1. adding delays to operations
  2. configuring how often and how much delay to add

If you read my previous posts on capturing and forwarding correlation IDs and managing configurations with SSM Parameter Store, then you have already seen the basic building blocks we need to do both.
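
To make the first building block concrete, here is a minimal sketch of such a delay helper (my own sketch, not the exact code from the demo repo; the examples that follow will reuse it):

// latency-injection.js : a minimal sketch of the first building block – deciding whether
// to add a delay, and how much, based on a config object of the shape
// { isEnabled, probability, minDelay, maxDelay } used throughout this post
const injectIfEnabled = (config) => {
  // no-op unless latency injection is enabled and this invocation is selected
  if (!config || !config.isEnabled || Math.random() > config.probability) {
    return Promise.resolve();
  }

  // pick a random delay between minDelay and maxDelay (both in milliseconds)
  const delay = Math.floor(Math.random() * (config.maxDelay - config.minDelay)) + config.minDelay;
  console.log(`injecting ${delay}ms of latency`);

  return new Promise(resolve => setTimeout(resolve, delay));
};

module.exports = { injectIfEnabled };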

How to inject latency to HTTP client

Since you are unlikely to write an HTTP client from scratch, I consider the problem of injecting latency to the HTTP client and to 3rd party clients (such as the AWS SDK) to be one and the same.

A couple of solutions jump to mind:

  • in static languages, you can consider using a static weaver such as AspectJ or PostSharp; this is the approach I took previously
  • in static languages, you can consider using dynamic proxies, which many IoC frameworks offer (another form of AOP)
  • you can create a wrapper for the client, either manually or with a factory function (bluebird’s promisifyAll function is a good example)

Since I’m going to use Node.js as the example, I’m going to focus on wrappers.

For the HTTP client, given the relatively small number of methods you will need, it’s feasible to craft the wrapper by hand, especially if you have a particular API design in mind.

Using the HTTP client I created for the correlation ID post as base, I modified it to accept a configuration object to control the latency injection behaviour.

{
  "isEnabled": true,
  "probability": 0.5,
  "minDelay": 100,
  "maxDelay": 5000
}

You can find this modified HTTP client here; below is a simplified version of this client (which uses superagent under the hood).
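
Roughly speaking, the wrapper boils down to something like this (a sketch that assumes the injectIfEnabled helper from earlier and the superagent package, rather than the exact code from the repo):

// http.js : a simplified, hypothetical version of the wrapper
const superagent = require('superagent');
const { injectIfEnabled } = require('./latency-injection'); // the helper sketched earlier

// options : { method, uri, headers, body, latencyInjectionConfig }
const http = (options) =>
  // optionally inject latency before the request is made
  injectIfEnabled(options.latencyInjectionConfig)
    .then(() => new Promise((resolve, reject) => {
      const request = superagent(options.method || 'GET', options.uri);

      if (options.headers) { request.set(options.headers); }
      if (options.body) { request.send(options.body); }

      request.end((err, res) => err ? reject(err) : resolve(res));
    }));

module.exports = http;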

To configure the function and the latency injection behaviour, we can use the configClient I first created in the SSM Parameter Store post.

First, let’s create the configs in the SSM Parameter Store.

You can create and optionally encrypt parameter values in the SSM Parameter Store.

The config contains the URL for the internal API, as well as a chaosConfig object. For now, we just have a httpClientLatencyInjectionConfig property, which is used to control the HTTP client’s latency injection behaviour.

{ 
  "internalApi": "https://xx.amazonaws.com/dev/internal", 
  "chaosConfig": {
    "httpClientLatencyInjectionConfig": {
      "isEnabled": true,
      "probability": 0.5,
      "minDelay": 100,
      "maxDelay": 5000
    }
  } 
}
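
Assuming the JSON above is saved locally as config.json, a parameter like this can be created with the AWS CLI (use the SecureString type and a KMS key instead if the config contains sensitive values):

aws ssm put-parameter \
  --name "public-api-a.config" \
  --type String \
  --value file://config.json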

Using the aforementioned configClient, we can fetch the JSON config from SSM Parameter Store at runtime.

const configKey = "public-api-a.config";
const configObj = configClient.loadConfigs([ configKey ]);

let config = JSON.parse(yield configObj["public-api-a.config"]);
let internalApiUrl = config.internalApi;
let chaosConfig = config.chaosConfig || {};
let injectionConfig = chaosConfig.httpClientLatencyInjectionConfig;

let reply = yield http({ 
  method : 'GET', 
  uri : internalApiUrl, 
  latencyInjectionConfig: injectionConfig 
});

The above configuration gives us a 50% chance of injecting a latency between 100ms and 5sec when we make the HTTP request to internal-api.

This is reflected in the following X-Ray traces.

How to inject latency to the AWS SDK

With the AWS SDK, it’s not feasible to craft the wrapper by hand. Instead, we could do with a factory function like bluebird’s promisifyAll.

We can apply the same approach here, and I made a crude attempt at doing just that. I must add that, whilst I consider myself a competent Node.js programmer, I’m sure there’s a better way to implement this factory function.

My factory function will only work with promisified objects (told you it’s crude..), and replaces their xxxAsync functions with a wrapper that takes in one more argument of the shape:

{
  "isEnabled": true,
  "probability": 0.5,
  "minDelay": 100,
  "maxDelay": 3000
}

Again, it’s clumsy, but we can take the DocumentClient from the AWS SDK, promisify it with bluebird, then wrap the promisified object with our own wrapper factory. Then, we can call its async functions with an optional argument to control the latency injection behaviour.
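
A rough, hypothetical sketch of such a wrapper factory might look like this (it assumes a bluebird-promisified object, the injectIfEnabled helper from earlier, and the AWS SDK convention of a single params object per call – it is not the exact code from the repo):

// latency-injection-wrapper.js : a crude sketch of the wrapper factory
const { injectIfEnabled } = require('./latency-injection');

const wrap = (promisifiedObj) => {
  const wrapped = Object.create(promisifiedObj);

  for (const prop in promisifiedObj) {
    if (typeof promisifiedObj[prop] === 'function' && prop.endsWith('Async')) {
      // shadow each xxxAsync function with one that accepts an optional
      // latencyInjectionConfig as its last argument
      wrapped[prop] = (params, latencyInjectionConfig) =>
        injectIfEnabled(latencyInjectionConfig)
          .then(() => promisifiedObj[prop].call(promisifiedObj, params));
    }
  }

  return wrapped;
};

// usage (sketch): const dynamodb = wrap(Promise.promisifyAll(new AWS.DynamoDB.DocumentClient()));
//                 yield dynamodb.getAsync(params, latencyInjectionConfig);
module.exports = { wrap };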

You can see this in action in the handler function for public-api-b.

For some reason, the wrapped function is not able to record subsegments in X-Ray. I suspect it’s some nuance about Javascript or the X-Ray SDK that I do not fully understand.

Nonetheless, judging from the logs, I can confirm that the wrapped function does indeed inject latency to the getAsync call to DynamoDB.

If you know of a way to improve the factory function, or to get the X-Ray tracing work with the wrapped function, please let me know in the comments.

How to inject latency to function invocations

The apiHandler factory function I created in the correlation ID post is a good place to apply common implementation patterns that we want from our API functions, including:

  • log the event source as debug
  • log the response and/or error from the invocation (which, surprisingly, Lambda doesn’t capture by default)
  • initialize global context (eg. for tracking correlation IDs)
  • handle serialization for the response object
  • etc.

// this is how you use the apiHandler factory function to create a
// handler function for API Gateway event source
module.exports.handler = apiHandler(
  co.wrap(function* (event, context) {
    ... // do bunch of stuff
    // instead of invoking the callback directly, you return the
    // response you want to send, and the wrapped handler function
    // would handle the serialization and invoking callback for you
    // also, it takes care of other things for you, like logging
    // the event source, and logging unhandled exceptions, etc.
   return { message : "everything is awesome" };
  })
);

In this case, it’s also a good place for us to inject latency to the API function.

However, to do that, we need to access the configuration for the function. Time to lift the responsibility for fetching configurations into the apiHandler factory then!

The full apiHandler factory function can be found here; below is a simplified version that illustrates the point.
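
A rough approximation of such a factory might look like this (my own sketch rather than the exact code from the repo; it assumes the co package and the configClient from the SSM Parameter Store post, and takes the config key explicitly, whereas the real thing could derive it from the function name):

// handler-factory.js : a sketch of the apiHandler factory
const co = require('co');
const configClient = require('./config-client');

// handlerFn should return a Promise, e.g. a co.wrap'd generator function
const apiHandler = (configKey, handlerFn) => (event, context, callback) =>
  co(function* () {
    // log the event source as debug
    console.log(JSON.stringify({ message: 'received event', event }));

    // fetch (and cache) this function's config from SSM Parameter Store
    const configObj = configClient.loadConfigs([ configKey ]);
    const config = JSON.parse(yield configObj[configKey]);

    // invoke the inner handler, giving it access to the parsed config (incl. chaosConfig)
    const response = yield handlerFn(event, context, config);

    // handle serialization of the response object
    callback(null, { statusCode: 200, body: JSON.stringify(response) });
  }).catch(err => {
    // log unhandled exceptions, then return a generic 500
    console.error(JSON.stringify({ message: 'unhandled exception', error: err.message }));
    callback(null, { statusCode: 500, body: JSON.stringify({ error: 'internal server error' }) });
  });

module.exports = apiHandler;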

Now, we can write our API function like the following.
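
For example (again a sketch using the factory above; the config key matches the earlier example):

const co = require('co');
const apiHandler = require('./handler-factory'); // the factory sketched above

module.exports.handler = apiHandler('public-api-a.config', co.wrap(function* (event, context, config) {
  // the factory has already fetched and parsed the config for us,
  // e.g. config.internalApi and config.chaosConfig are available here
  return { message: 'everything is awesome' };
}));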

Now that the apiHandler has access to the config for the function, it can access the chaosConfig object too.

Let’s extend the definition for the chaosConfig object to add a functionLatencyInjectionConfig property.

"chaosConfig": {
  "functionLatencyInjectionConfig": {
    "isEnabled": true,
    "probability": 0.5,
    "minDelay": 100,
    "maxDelay": 5000
  },
  "httpClientLatencyInjectionConfig": {
    "isEnabled": true,
    "probability": 0.5,
    "minDelay": 100,
    "maxDelay": 5000
  }
}

With this additional configuration, we can modify the apiHandler factory function to use it to inject latency to a function’s invocation much like what we did in the HTTP client.
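
Sticking with the earlier sketches, the change can be as small as awaiting the latency-injection helper just before the inner handler is invoked (again, an approximation rather than the exact code from the repo):

// inside the co block of the sketched apiHandler, right before calling handlerFn:
// (injectIfEnabled is the helper sketched earlier in this post)
const chaosConfig = config.chaosConfig || {};

// inject latency to the function invocation itself, if so configured
yield injectIfEnabled(chaosConfig.functionLatencyInjectionConfig);

const response = yield handlerFn(event, context, config);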

Just like that, we can now inject latency to function invocations via configuration. This will work for any API function that is created using the apiHandler factory.

With this change and both kinds of latency injections enabled, I can observe all the expected scenarios through X-Ray:

  • no latency was injected

  • latency was injected to the function invocation only

  • latency was injected to the HTTP client only

  • latency was injected to both the HTTP client and the function invocation, but the invocation did not time out as a result

  • latency was injected to both the HTTP client and the function invocation, and the invocation timed out as a result

I can get further confirmation of the expected behaviour through logs, and the metadata recorded in the X-Ray traces.

Recap, and future work

In this post we discussed:

  • why you should consider applying the practice of latency injection to APIs created with API Gateway and Lambda
  • additional considerations specific to API Gateway and Lambda
  • where you can inject latencies, and why you should consider injecting latency at each of these places
  • how you can inject latency in HTTP clients, AWS SDK, as well as the function invocation

The approach we have discussed here is driven by configuration, and the configuration is refreshed every 3 mins by default.

We can go much further with this.

Fine grained configuration

The configurations can be more fine-grained, and allow you to control latency injection for specific resources.

For example, instead of a blanket httpClientLatencyInjectionConfig for all HTTP requests (including those requests to AWS services), the configuration can be specific to an API, or a DynamoDB table.
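
Purely as an illustration (these property names are made up, not part of the demo), a more fine-grained configuration might look something like this:

"chaosConfig": {
  "latencyInjectionConfigs": {
    "internal-api": {
      "isEnabled": true,
      "probability": 0.5,
      "minDelay": 100,
      "maxDelay": 5000
    },
    "dynamodb:user-preferences-table": {
      "isEnabled": true,
      "probability": 0.1,
      "minDelay": 50,
      "maxDelay": 1000
    }
  }
}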

Automation

The configurations can be changed by an automated process to:

  • run routine validations daily
  • stop all latency injections during off hours, and holidays
  • forcefully stop all latency injections, eg. during an actual outage
  • orchestrate complex scenarios that are difficult to manage by hand, eg. enable latency injection at several places at once

Again, we can look to Netflix for inspiration for such an automated platform.

Usually, you would want to enable one latency injection in a bounded context at a time. This helps contain the blast radius of unintended damage, and makes sure your experiments are actually controlled. Also, when latency is injected at several places, it is harder to understand the causality we observe as there are multiple variables to consider.

Unless, of course, you’re validating against a specific hypothesis such as:

The system can tolerate outage to both the primary store (DynamoDB) as well as the backup store (S3) for user preferences, and would return a hardcoded default value in that case.

Better communication

Another good thing to do is to inform the caller that latency has been added to the invocation by design.

This might take the form of an HTTP header in the response to tell the caller how much latency was injected in total. If you’re using an automated process to generate these experiments, then you should also include the id/tag/name for the specific instance of the experiment as an HTTP header.
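
As a sketch (the header names here are made up), the response could carry something like:

// surfacing injected latency to the caller; header names are hypothetical
const makeResponse = (body, injectedDelayMs, experimentId) => ({
  statusCode: 200,
  headers: {
    'x-injected-latency': String(injectedDelayMs), // total latency added by design, in ms
    'x-chaos-experiment-id': experimentId          // which experiment instance produced it
  },
  body: JSON.stringify(body)
});

module.exports = { makeResponse };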

What’s next?

As I mentioned in the previous post, you need to apply common sense when deciding when and where you apply chaos engineering practices.

Don’t attempt an exercise that you know is beyond your abilities.

Before you even consider applying latency injection to your APIs in production, you need to think about how you can deal with these latency spikes given the inherent constraints of API Gateway and Lambda.

Unfortunately, we have run out of time to talk about this in this post, but come back in 2 weeks and we will talk about several strategies you can employ in part 3.

The code for the demo in this post is available on github here. Feel free to play around with it and let me know if you have any suggestions for improvement!

References

How can we apply the principles of chaos engineering to AWS Lambda?

This is the first of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions.

  • part 1: how can we apply principles of chaos engineering to Lambda?
  • part 2: latency injection for APIs
  • part 3: fault injection for Lambda functions

All the way back in 2011, Simon Wardley identified Chaos Engines as a practice that would be employed by the next generation of tech companies, along with continuous deployment, being data-driven, and being organised around small, autonomous teams (think microservices & inverse Conway’s law).

There’s no question about it, Netflix has popularised the principles of chaos engineering. By open sourcing some of their tools – notably the Simian Army – they have also helped others build confidence in their system’s capability to withstand turbulent conditions in production.

There seems to be a renewed interest in chaos engineering recently. As Russ Miles noted in a recent post, perhaps many companies have finally come to understand that chaos engineering is not about “hurting production”, but about building a better understanding of, and confidence in, a system’s resilience through controlled experiments.

This trend has been helped by the valuable (and freely available) information that Netflix has published, such as the Chaos Engineering e-book, and principlesofchaos.org.

Tools such as chaos-lambda by Shoreditch Ops (the folks behind the Artillery load test tool) look to replicate Netflix’s Chaos Monkey, but execute from inside a Lambda function instead of an EC2 instance – hence bringing you the cost savings and convenience Lambda offers.

I want to ask a different question however: how can one apply the principles of chaos engineering and some of the current practices to a serverless architecture comprised of Lambda functions?

When your system runs on EC2 instances, you naturally build resilience by designing for the most likely failure mode – server crashes (due to both hardware and software issues). Hence, a controlled experiment to validate the resilience of your system would artificially recreate the failure scenario by terminating EC2 instances, then AZs, and then entire regions.

AWS Lambda, however, is a higher-level abstraction and has different failure modes to its EC2 counterparts. Hypotheses that focus on “what if we lose these EC2 instances” no longer apply, as the platform handles these failure modes for you out of the box.

We need to ask different questions in order to understand the weaknesses within our serverless architectures.

More inherent chaos, not less

We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.

— Principles of Chaos Engineering

Having built and operated a non-trivial serverless architecture, I have some understanding of the dangers awaiting you in this new world.

If anything, there is a lot more inherent chaos and complexity in these systems built around Lambda functions.

  • modularity (unit of deployment) shifts from “services” to “functions”, and there are a lot more of them
  • it’s harder to harden around the boundaries, because you need to harden around each function as opposed to a service which encapsulates a set of related functionalities
  • there are more intermediary services (eg. Kinesis, SNS, API Gateway just to name a few), each with their own failure modes
  • there are more configurations overall (timeout, IAM permissions, etc.), and therefore more opportunities for misconfiguration

Also, since we have traded off more control of our infrastructure* it means we now face more unknown failure modes** and often there’s little we can do when an outage does occur***.


* For better scalability, availability, cost efficiency and convenience, which I, for one, think is a fair trade in most cases.

** Everything the platform does for you – scheduling containers, scaling, polling Kinesis, retrying failed invocations, etc. – has its own failure modes. These are often not obvious to us since they’re implementation details that are typically undocumented and are prone to change without notice.

*** For example, if an outage happens and prevents Lambda functions from processing Kinesis events, then we have no meaningful alternative than to wait for AWS to fix the problem. Since the current position on the shards is abstracted away and unavailable to us, we can’t even replace the Lambda functions with KCL processors that run on EC2.


Applying chaos to AWS Lambda

A good exercise regime would continuously push you to your limits but never actually put you over the limit and cause injury. If there’s an exercise that is clearly beyond your current abilities then surely you would not attempt it as the only possible outcome is getting yourself hurt!

The same common sense should be applied when designing controlled experiments for your serverless architecture. In order to “know” what the experiments tell us about the resilience of our system we also need to decide what metrics to monitor – ideally using client-side metrics, since the most important metric is the quality of service our users experience.

There are plenty of failure modes that we know about and can design for, and we can run simple experiments to validate our design. For example, since a serverless architecture is (almost always) also a microservice architecture, many of its inherent failure modes still apply:

  • improperly tuned timeouts, especially for intermediate services, which can cause services at the edge to also time out

Intermediate services should have stricter timeout settings than services at the edge.

  • missing error handling, which allows exceptions from downstream services to escape

  • missing fallbacks for when a downstream service is unavailable or experiences an outage

Over the next couple of posts, we will explore how we can apply the practices of latency and fault injection to Lambda functions in order to simulate these failure modes and validate our design.

Further reading:

Mind the 75GB limit on AWS Lambda deployment packages

Gotta clean up those old Lambda deployment packages!

With AWS Lambda and the Serverless framework, deploying your code has become so simple and frictionless.

As you move more and more of your architecture to run on Lambda, you might find that, in addition to getting things done faster you are also deploying your code more frequently.

That’s awesome!

But, as you rejoice in this new found superpower to make your users and stakeholders happy, you need to keep an eye out for that regional limit of 75GB for all the uploaded deployment packages.

http://docs.aws.amazon.com/lambda/latest/dg/limits.html

At Yubl, a small team of 6 server engineers and I managed to rack up nearly 20GB of deployment packages in 3 months.

We wrote all of our Lambda functions in Node.js, and deployment packages were typically less than 2MB. But the frequency of deployments made sure that the overall size of deployment packages went up steadily.

Now that I’m writing most of my Lambda functions in Scala (it’s the weapon of choice for the Space Ape Games server team), I’m dealing with deployment packages that are significantly bigger!

When authoring Lambda functions in Java, be prepared for significantly bigger deployment packages.

Serverless framework: disable versionFunctions

By default, the Serverless framework would create a new version of your function every time you deploy.

In Serverless 0.X, this is (kinda) needed because it used function aliases. For example, I could have multiple deployment stages for the same function (dev, staging and production). But in the Lambda console there is only one function, and each stage is simply an alias pointing to a different version of the same function.

Unfortunately this behaviour also made it difficult to manage the IAM permissions because multiple versions of the same function share the same IAM role. Since you can’t version the IAM role with the function, this makes it hard for you to add or remove permissions without breaking older versions.

Fortunately, the developers listened to the community and since the 1.0 release each stage is deployed as a separate function.

Essentially, this allows you to “version” IAM roles with deployment stages since each stage gets a separate IAM role. So there’s technically no need for you to create a new version for every deployment anymore. But that is still the default behaviour, unless you explicitly disable it in your serverless.yml by setting versionFunctions to false.

You might argue that having old versions of the function in production makes it quicker to rollback.

In that case, enable it for the production stage only. To do that, here’s a handy trick to allow a default configuration in your serverless.yml to be overridable by deployment stage.
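
One possible shape for that trick (a sketch only; verify the variable syntax against the version of the framework you’re using):

custom:
  versioning:
    dev: false
    production: true

provider:
  name: aws
  # default to the dev behaviour when no stage is specified
  versionFunctions: ${self:custom.versioning.${opt:stage, 'dev'}}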

In my personal experience though, unless you have taken great care and used aliases to tag the production releases, it’s actually quite hard to know which version correlates to what. Assuming that you have reproducible builds, I would have much more confidence in rolling back by deploying from a hotfix or support branch of our code.

Clean up old versions with janitor-lambda

If disabling versionFunctions in the serverless.yml for all of your projects is hard to enforce, another approach would be to retroactively delete old versions of functions that are no longer referenced by an alias.

To do that, you can create a cron job (ie. scheduled CloudWatch event + Lambda) that will scan through your functions and look for versions that are not referenced and delete them.
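
A rough sketch of the core logic might look like this (my own approximation using the Node.js AWS SDK; the real Janitor Lambda mentioned below also handles paging, error handling and so on):

// janitor.js : a sketch of deleting Lambda versions that no alias references
const co = require('co');
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

const cleanUpFunction = co.wrap(function* (functionName) {
  // versions currently referenced by an alias must be kept
  const { Aliases } = yield lambda.listAliases({ FunctionName: functionName }).promise();
  const referenced = new Set(Aliases.map(alias => alias.FunctionVersion));

  const { Versions } = yield lambda.listVersionsByFunction({ FunctionName: functionName }).promise();

  for (const version of Versions) {
    // never delete $LATEST, and keep anything an alias still points at
    if (version.Version === '$LATEST' || referenced.has(version.Version)) {
      continue;
    }

    console.log(`deleting ${functionName} version ${version.Version}`);
    yield lambda.deleteFunction({ FunctionName: functionName, Qualifier: version.Version }).promise();
  }
});

// triggered by a scheduled CloudWatch event
module.exports.handler = (event, context, callback) => {
  co(function* () {
    // a real janitor would page through listFunctions; this sketch handles the first page only
    const { Functions } = yield lambda.listFunctions().promise();
    for (const fn of Functions) {
      yield cleanUpFunction(fn.FunctionName);
    }
  }).then(() => callback(null), callback);
};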

I took some inspiration from Netflix’s Janitor Monkey and created a Janitor Lambda function that you can deploy to your AWS environment to clean unused versions of your functions.

After we employed this Janitor Lambda function, our total deployment package went from 20GB to ~1GB (we had a lot of functions…).

You should use SSM Parameter Store over Lambda env variables

AWS Lambda announced native support for environment variables at the end of 2016. But even before that, the Serverless framework had supported environment variables, and I was using them happily as my team and I migrated our monolithic Node.js backend to serverless.

However, as our architecture expanded we found several drawbacks with managing configurations with environment variables.

Hard to share configs across projects

The biggest problem for us was the inability to share configurations across projects since environment variables are function specific at runtime.

The Serverless framework has the notion of services, which is just a way of grouping related functions together. You can specify service-wide environment variables as well as function-specific ones.

A sample serverless.yml that specifies both service-wide as well as function-specific environment variables.

However, we often found that configurations needed to be shared across multiple services. When these configurations changed, we had to update and redeploy all functions that depended on them – and tracking these dependencies across many GitHub repos maintained by different members of the team was becoming a challenge in itself.

For example, as we were migrating from a monolithic system piece by piece whilst delivering new features, we weren’t able to move away from the monolithic MongoDB database in one go. It meant that lots of functions shared MongoDB connection strings. When one of these connection strings changed – and it did several times – pain and suffering followed.

Another configurable value we often shared was the root URL of intermediate services. Being a social network, many of our user-initiated operations depended on relationship data, so many of our microservices depended on the Relationship API. Instead of hardcoding the URL to the Relationship API in every service (one of the deadly microservice anti-patterns), it should be stored in a central configuration service.

Hard to implement fine-grained access control

When you need to configure sensitive data such as credentials, API keys or DB connection strings, the rules of thumb are:

  1. data should be encrypted at rest (includes not checking them into source control in plain text)
  2. data should be encrypted in-transit
  3. apply the principle of least privilege to function’s and personnel’s access to data

If you’re operating in a heavily regulated environment, then point 3. might be not just a good practice but a regulatory requirement. I know of many fintech companies and financial juggernauts where access to production credentials is tightly controlled and available only to a handful of people in the company.

Whilst efforts such as the serverless-secrets-plugin deliver on point 1., they couple one’s ability to deploy Lambda functions with one’s access to sensitive data – i.e. whoever deploys the function must have access to the sensitive data too. This might be OK for many startups where everyone has access to everything, but ideally your process for managing access to data should be able to evolve with the company’s needs as it grows up.

SSM Parameter Store

My team outgrew environment variables, and I started looking at other popular solutions in this space – etcd, consul, etc. But I really didn’t fancy these solutions because:

  • they’re costly to run: you need to run several EC2 instances in a multi-AZ setting for HA
  • you have to manage these servers
  • they each have a learning curve with regards to both configuring the service as well as the CLI tools
  • we needed a fraction of the features they offer

This was 5 months before Amazon announced SSM Parameter Store at re:invent 2016, so at the time we built our own Configuration API with API Gateway and Lambda.

Nowadays, you should just use the SSM Parameter Store because:

  • it’s a fully managed service
  • sharing configurations is easy, as it’s a centralised service
  • it integrates with KMS out-of-the-box
  • it offers fine-grained control via IAM
  • it records a history of changes
  • you can use it via the console, AWS CLI as well as via its HTTPS API

In short, it ticks all our boxes.

You have fine-grained control over what parameters a function is allowed to access.
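
For example, a function’s IAM role could be limited to reading only its own parameters (the region, account ID and parameter name below are placeholders):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [ "ssm:GetParameters" ],
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/public-api-a*"
    }
  ]
}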

There are a couple of service limits to be aware of:

  • max 10,000 parameters per account
  • max length of parameter value is 4096 characters
  • max 100 past values for a parameter

Client library

Having a centralised place to store parameters is just one side of the coin. You should still invest effort into making a robust client library that is easy to use, and supports:

  • caching & cache expiration
  • hot-swapping configurations when source config value has changed

Here is one such client library that I put together for a demo:
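
A rough approximation of such a client might look like this (my own sketch, not the exact client from the demo; it is meant to be initialized once per container so the cache survives across invocations):

// config-client.js : a sketch of a caching SSM config client
const AWS = require('aws-sdk');
const ssm = new AWS.SSM();

const DEFAULT_EXPIRY = 3 * 60 * 1000; // cache values for 3 minutes by default

const loadConfigs = (keys, expiryMs = DEFAULT_EXPIRY) => {
  const cache = {}; // key -> { value, expiresAt }

  const getValue = (key) => {
    const entry = cache[key];
    if (entry && entry.expiresAt > Date.now()) {
      return Promise.resolve(entry.value); // still fresh, use the cached value
    }

    // fetch (or refresh) the parameter; decrypt in case it's a SecureString
    return ssm.getParameters({ Names: [ key ], WithDecryption: true })
      .promise()
      .then(res => {
        const param = res.Parameters[0];
        if (!param) {
          throw new Error(`missing SSM parameter: ${key}`);
        }
        cache[key] = { value: param.Value, expiresAt: Date.now() + expiryMs };
        return param.Value;
      });
  };

  // expose each key as a property whose getter returns a Promise of the (cached) value,
  // so config values are hot-swapped once the cache entry expires
  const configs = {};
  keys.forEach(key => Object.defineProperty(configs, key, { get: () => getValue(key) }));
  return configs;
};

module.exports = { loadConfigs };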

To use it, you can create config objects with the loadConfigs function. These objects will expose properties that return the config values as a Promise (hence the yield, which is the magic power we get with co).

You can have different config values with different cache expiration too.

If you want to play around with using SSM Parameter Store from Lambda (or to see this cache client in action), then check out this repo and deploy it to your AWS environment. I haven’t included any HTTP events, so you’d have to invoke the functions from the console.

Update 15/09/2017: the Serverless framework released 1.22.0, which introduced support for SSM parameters out of the box.

With this latest version of the Serverless framework, you can specify the value of environment variables to come from SSM Parameter Store directly.

Compared to many of the existing approaches, it has some benefits:

  • avoid checking in sensitive data in plain text in source control
  • avoid duplicating the same config values in multiple services

However, it still falls short on many fronts (based on my own requirements):

  • since it’s fetching the SSM parameter values at deployment time, it still couples your ability to deploy your function with access to sensitive configuration data
  • the configuration values are stored in plain text as Lambda environment variables, which means you don’t need the KMS permissions to access them; you can see them in the Lambda console in plain sight
  • further to the above, if the function is compromised by an attacker (who would then have access to process.env), then they’ll be able to easily find the decrypted values during the initial probe (go to the 13:05 mark in this video where I gave a demo of how easily this can be done)
  • because the values are baked in at deployment time, it doesn’t allow you to easily propagate config value changes. To make a config value change, you will need to a) identify all dependent functions; and b) re-deploy all these functions

Of course, your requirement might be very different from mine, and I certainly think it’s an improvement over many of the approaches I have seen. But, personally I still think you should:

  1. fetch SSM parameter values at runtime
  2. cache these values, and hot-swap when source values change

Using Protocol Buffers with API Gateway and AWS Lambda

AWS announced binary support for API Gateway in late 2016, which opened up the door for you to use more efficient binary formats such as Google’s Protocol Buffers and Apache Thrift.

Why?

Compared to JSON – which is the bread and butter for APIs built with API Gateway and Lambda – these binary formats can produce significantly smaller payloads.

At scale, they can make a big difference to your bandwidth cost.

In restricted environments such as low-end devices or in countries with poor mobile connections, sending smaller payloads can also improve your user experience by improving the end-to-end network latency, and possibly processing time on the device too.

Comparison of serializer performance between Protocol Buffers and JSON in .NET

How

Follow these 3 simple steps (assuming you’re using Serverless framework):

  1. install the awesome serverless-apigw-binary plugin
  2. add application/x-protobuf to binary media types (see screenshot below)
  3. add function that returns Protocol Buffers as base64 encoded response

The serverless-apigw-binary plugin has made it really easy to add binary support to API Gateway
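
Steps 1 and 2 then boil down to a few lines of serverless.yml (double-check the exact configuration key against the plugin’s documentation for the version you install):

plugins:
  - serverless-apigw-binary

custom:
  apigwBinary:
    types:
      - 'application/x-protobuf'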

To encode & decode Protocol Buffers payloads in Node.js, you can use the protobufjs package from NPM.

It lets you work with your existing .proto files, or you can use JSON descriptors. Give the docs a read to see how you can get started.

In the demo project (link at the bottom of the post) you’ll find a Lambda function that always returns a response in Protocol Buffers.

A couple of things to note from this function:

  • we set the Content-Type header to application/x-protobuf
  • body is base64 encoded representation of the Protocol Buffers payload
  • isBase64Encoded is set to true

You need to do all 3 of these things to make API Gateway return the response as binary data.

Consider them the magic incantation for making API Gateway return binary data; the caller also has to set the Accept header to application/x-protobuf.
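
To make those three points concrete, a handler along those lines might look like this (a sketch, not the exact function from the demo project; the .proto file, message type and payload are made up, and the protobufjs calls follow its standard load/lookupType/encode API):

// a hypothetical Lambda function that always responds with Protocol Buffers
const protobuf = require('protobufjs');

module.exports.handler = (event, context, callback) => {
  // 'players.proto' and 'demo.PlayerList' are placeholders for your own schema
  protobuf.load('players.proto')
    .then(root => {
      const PlayerList = root.lookupType('demo.PlayerList');
      const payload = { players: [ { id: '42', name: 'Calvin Ortiz', scores: [ 57, 12, 100 ] } ] };

      // encode the message, then base64 it for API Gateway
      const buffer = PlayerList.encode(PlayerList.create(payload)).finish();

      callback(null, {
        statusCode: 200,
        headers: { 'Content-Type': 'application/x-protobuf' }, // 1. the content type
        body: Buffer.from(buffer).toString('base64'),          // 2. base64 encoded body
        isBase64Encoded: true                                  // 3. tell API Gateway it's binary
      });
    })
    .catch(callback);
};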

In the same project, there’s also a JSON endpoint that returns the same payload as comparison.

The response from this JSON endpoint looks like this:

{"players":[{"id":"eb66db14992e06b36282d607cf0134ce4fe45f50","name":"Calvin Ortiz","scores":[57,12,100,56,47,78,20,37,32,48]},{"id":"7b9b38e535453d120e706ff57fef41f6fee991cb","name":"Marcus Cummings","scores":[40,57,24,15,45,54,25,67,59,23]},{"id":"db34a2a5f4d16e77a6d3d6154a8b8bb6760b3b99","name":"Harry James","scores":[61,85,14,70,8,80,14,22,76,87]},{"id":"e21018c4f43eef10771e0fa71bc54156b00a64dd","name":"Gregory Bishop","scores":[51,31,27,47,72,75,61,28,100,41]},{"id":"b3ee29ee49b640ce15be1737d0dca60e48108ee1","name":"Ann Evans","scores":[69,17,48,99,85,8,75,55,78,46]},{"id":"9c1e6d4d46bb0c0d2c92bab11e5dbd5f4ab0c619","name":"Juan Perez","scores":[71,34,60,84,21,98,60,8,91,92]},{"id":"d8de89222633c61393931457c1e72558eba48639","name":"Loretta Harvey","scores":[15,40,73,92,42,65,58,30,26,84]},{"id":"141dad672ec559431f808964391d128d2c3274bf","name":"Ian Powell","scores":[17,21,14,84,64,14,22,22,34,92]},{"id":"8a97e85e2e5385c45fc31f24bfe781c26f78c0b7","name":"Steve Gibson","scores":[33,97,6,1,20,1,78,3,77,19]},{"id":"6b3ca6924e17cd5fd9d91b36d49b36a5d542c9ea","name":"Harold Ferguson","scores":[31,32,4,10,37,85,46,86,39,17]}]}

As you can see, it’s just a bunch of randomly generated names and GUIDs, and integers. The same response in Protocol Buffers is nearly 40% smaller.

Problem with the protobufjs package

Before we move on, there is one important detail about using the protobufjs package in a Lambda function – you need to npm install the package on a Linux system.

This is because it has a dependency that is distributed as native binaries, so if you installed the package on OSX, then the binaries that are packaged and deployed to Lambda will not run in the Lambda execution environment.

I had similar problems with other Google libraries in the past. I find the best way to deal with this is to take a leaf out of aws-serverless-go-shim’s approach and deploy your code inside a Docker container.

This way, you would locally install a compatible version of the native binaries for your OS so you can continue to run and debug your function with sls invoke local (see this post for details).

But, during deployment, a script would run npm install --force in a Docker container running a compatible Linux distribution. This would then install a version of the native binaries that can be executed in the Lambda execution environment. The script would then use sls deploy to deploy the function.

The deployment script can be something simple like this:
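
For example (a sketch; it assumes the Serverless framework CLI is available inside the container):

#!/bin/bash
set -e

# install dependencies inside the Linux container so that any native
# binaries match the Lambda execution environment
npm install --force

# deploy with the Serverless framework
sls deploy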

In the demo project, I also have a docker-compose.yml file:
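
A minimal version might look something like this (image tag and paths are illustrative, not the exact file from the repo):

version: '2'
services:
  deploy:
    image: node:6.10              # an image matching the Lambda Node.js runtime
    working_dir: /app
    volumes:
      - ./:/app                   # the project folder
      - $HOME/.aws:/root/.aws     # AWS credentials for the Serverless framework / AWS SDK
    command: ./deploy.sh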

The Serverless framework requires my AWS credentials, hence why I’ve attached the $HOME/.aws directory to the container for the AWSSDK to find at runtime.

To deploy, run docker-compose up.

Use HTTP content negotiation

Whilst binary formats are more efficient when it comes to payload size, they do have one major problem: they’re really hard to debug.

Imagine the scenario – you have observed a bug, but you’re not sure if the problem is in the client app or the server. But hey, let’s just observe the HTTP conversation with an HTTP proxy such as Charles or Fiddler.

This workflow works great for JSON but breaks down when it comes to binary formats such as Protocol Buffers as the payloads are not human readable.

As we have discussed in this post, the human readability of JSON comes with the cost of heavier bandwidth usage. For most network communications, be it service-to-service, or service-to-client, unless a human is actively “reading” the payloads it’s not worth paying the cost. But when a human is trying to read it, that human readability is very valuable.

Fortunately, HTTP’s content negotiation mechanism means we can have the best of both worlds.

In the demo project, there is a contentNegotiated function which returns either JSON or Protocol Buffers payloads based on the Accept header.
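
A sketch of that branching check (header casing and helper name are illustrative, not the demo’s exact code):

// content negotiation: JSON on request, Protocol Buffers otherwise
const wantsJson = (event) => {
  const headers = event.headers || {};
  const accept = headers.Accept || headers.accept || '';
  return accept.includes('application/json');
};

// in the handler: if wantsJson(event), return a plain JSON response;
// otherwise return the base64-encoded Protocol Buffers response shown earlier
module.exports = { wantsJson };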

By default, you should use Protocol Buffers for all your network communications to minimise bandwidth use.

But, you should build in a mechanism for toggling the communication to JSON when you need to observe the communications. This might mean:

  • for debug builds of your mobile app, allow super users (devs, QA, etc.) the ability to turn on debug mode, which would switch the networking layer to send Accept header as application/json
  • for services, include a configuration option to turn on debug mode (see this post on configuring functions with SSM parameters and cache client for hot-swapping) to make service-to-service calls use JSON too, so you can capture and analyze the request and responses more easily

As usual, you can try out the demo code yourself, the repo is available here.