Applying principles of chaos engineering to AWS Lambda with latency injection

This is part 2 of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions.

The most common issues I have encountered in production are latency/performance related. They can be symptomatic of a host of underlying causes, ranging from AWS network issues (which can also manifest themselves as latency/error-rate spikes in any of the AWS services) and overloaded servers, to simple GC pauses.

Latency spikes are inevitable. As much as you can improve the performance of your application, things will go wrong eventually, and often they're out of your control.

So you must design for them, and degrade the quality of your application gracefully to minimize the impact on your users.

In the case of API Gateway and Lambda, there are additional considerations:

  • API Gateway has a hard 29s timeout limit for integration points, so even if your Lambda function can run for up to 5 mins, API Gateway will time out long before that

If you use moderate timeout settings for your API functions (and you should!) then you need to consider the effects of cold starts when calling an intermediate service.

Where to inject latency

Suppose our client application communicates directly with 2 public-facing APIs, which in turn depend on an internal API.

In this setup, I can think of 3 places where we can inject latency, and each would validate a different hypothesis.

Inject latency at HTTP clients

The first, and easiest, place to inject latency is in the HTTP client library we use to communicate with the internal API.

This will test that our function has an appropriate timeout on this HTTP communication and can degrade gracefully when the request times out.

We can inject latency into the HTTP client libraries for our internal APIs, thereby validating that the caller function has configured an appropriate timeout and handles timeouts gracefully.

Furthermore, this practice should also be applied to other 3rd party services we depend on, such as DynamoDB. We will discuss how we can inject latency into these 3rd party libraries later in the post.

We can also inject latency into 3rd party client libraries for other managed services we depend on.

This is a reasonably safe place to inject latency as the immediate blast radius is limited to this function.

However, you can (and arguably, should) consider applying this type of latency injection to intermediate services as well. Doing so carries extra risk as it has a broader blast radius in failure cases, i.e. if the function under test does not degrade gracefully then it can cause unintended problems in outer services. In this case, the blast radius for these failure cases is the same as if you were injecting latency into the intermediate functions directly.

Inject latency to intermediate functions

You can also inject latency directly into the functions themselves (we'll look at how later on). This has the same effect as injecting latency into the HTTP client of each of its dependents, except it'll affect all its dependents at once.

We can inject latency into a function's invocation. If that function is behind an internal API that is used by multiple public-facing APIs, then it can cause all its dependents to experience timeouts.

This might seem risky (it can be), but it is an effective way to validate that every service that depends on this API endpoint is expecting, and handling, timeouts gracefully.

It makes most sense when applied to intermediate APIs that are part of a bounded context (or, a microservice) maintained by the same team of developers. That way, you avoid unleashing chaos upon unsuspecting developers who might not be ready to deal with the chaos.

That said, I think there is a good counter-argument for doing just that.

We often fall into the pitfall of using the performance characteristics of dev environments as a predictor for the production environment. Whilst we seldom experience load-related latency problems in the dev environments (because we don't have enough load in those environments to begin with), production is quite another story. Which means we're not programmed to think about these failure modes during development.

So, a good way to hack the brains of your fellow developers, and to programme them to expect timeouts, is to expose them to these failure modes regularly in the dev environments, by injecting latency into our internal APIs in those environments.

We can figuratively hold up a sign and tell other developers to expect latency spikes and timeouts by literally exposing them to these scenarios in dev environments, regularly, so they know to expect them.

In fact, if we make our dev environments exhibit the most hostile and turbulent conditions that our systems should be expected to handle, then we know for sure that any system that makes its way to production is ready to face what awaits it in the wild.

Inject latency to public-facing functions

So far, we have focused on validating the handling of latency spikes and timeouts in our APIs. The same validation is needed for our client application.

We can apply all the same arguments mentioned above here. By injecting latency into our public-facing API functions (in both production as well as dev environments), we can:

  • validate that the client application handles latency spikes and timeouts gracefully, and offers the best UX possible in these situations
  • train our client developers to expect latency spikes and timeouts

When I was working on an MMORPG at Gamesys years ago, we uncovered a host of frailties in the game when we injected latency spikes and faults into our APIs. The game would crash during startup if any of the first handful of requests failed. In some cases, if the response time was longer than a few seconds, the game would also get into a weird state because of race conditions.

Turns out I was setting my colleagues up for failure in production because the dev environment was so forgiving and gave them a false sense of comfort.

With that, let's talk about how we can apply the practice of latency injection.

But wait, can't you inject latency in the client-side HTTP clients too?

Absolutely! And you should! However, for the purpose of this post we are going to look at how and where we can inject latency into our Lambda functions only, which is why I have willfully ignored this part of the equation.

How to inject latency

There are 2 aspects to actually injecting latency:

  1. adding delays to operations
  2. configuring how often and how much delay to add

If you read my previous posts on capturing and forwarding correlation IDs and managing configurations with SSM Parameter Store, then you have already seen the basic building blocks we need to do both.
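
The first aspect boils down to a simple delay helper plus a probability check. Here is a minimal sketch of that building block (the injectLatency name and module layout are mine, and not necessarily what the demo code uses):

// inject-latency.js - a minimal sketch of the delay-injection building block
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// takes a config of the shape { isEnabled, probability, minDelay, maxDelay }
// and returns a promise that resolves after a random delay (or immediately)
const injectLatency = (config) => {
  if (!config || !config.isEnabled) { return Promise.resolve(); }
  if (Math.random() > config.probability) { return Promise.resolve(); }

  const ms = config.minDelay + Math.random() * (config.maxDelay - config.minDelay);
  return delay(ms);
};

module.exports = { injectLatency };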

How to inject latency to HTTP client

Since you are unlikely to write an HTTP client from scratch, I consider the problem of injecting latency into the HTTP client and into 3rd party clients (such as the AWS SDK) to be one and the same.

A couple of solutions jump to mind:

  • in static languages, you can consider using a static weaver such as AspectJ or PostSharp; this is the approach I took previously
  • in static languages, you can consider using dynamic proxies, which many IoC frameworks offer (another form of AOP)
  • you can create a wrapper for the client, either manually or with a factory function (bluebirdjs's promisifyAll function is a good example)

Since I'm going to use Node.js as the example, I'm going to focus on wrappers.

For the HTTP client, given the relatively small number of methods you will need, it's feasible to craft the wrapper by hand, especially if you have a particular API design in mind.

Using the HTTP client I created for the correlation ID post as a base, I modified it to accept a configuration object to control the latency injection behaviour:

  "isEnabled": true,
  "probability": 0.5,
  "minDelay": 100,
  "maxDelay": 5000

You can find this modified HTTP client here; below is a simplified version of this client (which uses superagent under the hood).
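
Since the embedded snippet doesn't reproduce well here, the following is a minimal sketch of what that simplified client could look like; superagent's request(method, url) API is real, but the exact shape of the client in the repo may differ:

// http.js - a simplified sketch of the HTTP client with latency injection
const superagent = require('superagent');
const { injectLatency } = require('./inject-latency'); // helper sketched earlier

// returns a promise of the response; latencyInjectionConfig is optional
const http = ({ method = 'GET', uri, headers = {}, body, latencyInjectionConfig }) =>
  injectLatency(latencyInjectionConfig).then(() => {
    const request = superagent(method, uri).set(headers);
    return body ? request.send(body) : request;
  });

module.exports = http;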

To configure the function and the latency injection behaviour, we can use the configClient I first created in the SSM Parameter Store post.

First, let's create the configs in the SSM Parameter Store.

You can create and optionally encrypt parameter values in the SSM Parameter Store.

The config contains the URL for the internal API, as well as a chaosConfig object. For now, we just have a httpClientLatencyInjectionConfig property, which is used to control the HTTP client's latency injection behaviour.

  "internalApi": "", 
  "chaosConfig": {
    "httpClientLatencyInjectionConfig": {
      "isEnabled": true,
      "probability": 0.5,
      "minDelay": 100,
      "maxDelay": 5000

Using the aforementioned configClient, we can fetch the JSON config from SSM Parameter Store at runtime.

const configKey = "public-api-a.config";
const configObj = configClient.loadConfigs([ configKey ]);

let config = JSON.parse(yield configObj[configKey]);
let internalApiUrl = config.internalApi;
let chaosConfig = config.chaosConfig || {};
let injectionConfig = chaosConfig.httpClientLatencyInjectionConfig;

let reply = yield http({
  method : 'GET',
  uri : internalApiUrl,
  latencyInjectionConfig : injectionConfig
});

The above configuration gives us a 50% chance of injecting a latency between 100ms and 5s when we make the HTTP request to internal-api.

This is reflected in the following X-Ray traces.

How to inject latency to the AWS SDK

With the AWS SDK, it's not feasible to craft the wrapper by hand. Instead, we could do with a factory function like bluebird's promisifyAll.

We can apply the same approach here, and I made a crude attempt at doing just that. I must add that, whilst I consider myself a competent Node.js programmer, I'm sure there's a better way to implement this factory function.

My factory function will only work with promisified objects (told you it's crude..), and replaces their xxxAsync functions with a wrapper that takes in one more argument of the shape:

  "isEnabled": true,
  "probability": 0.5,
  "minDelay": 100,
  "maxDelay": 3000

Again, it's clumsy, but we can take the DocumentClient from the AWS SDK, promisify it with bluebird, then wrap the promisified object with our own wrapper factory. Then, we can call its async functions with an optional argument to control the latency injection behaviour.
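
The gist doesn't reproduce here, but a sketch of such a factory function, and of wrapping the DocumentClient with it, might look like the following (injectLatencyToClient is my name for it, and the table name in the usage example is made up):

const Promise = require('bluebird');
const AWS = require('aws-sdk');
const { injectLatency } = require('./inject-latency'); // helper sketched earlier

// replaces every promisified xxxAsync function on the client with a wrapper
// that accepts one extra, optional latencyInjectionConfig argument
const injectLatencyToClient = (client) => {
  const asyncKeys = [];
  for (const key in client) { // for..in also finds methods on the prototype chain
    if (key.endsWith('Async') && typeof client[key] === 'function') {
      asyncKeys.push(key);
    }
  }

  for (const key of asyncKeys) {
    const original = client[key].bind(client);
    client[key] = (params, latencyInjectionConfig) =>
      injectLatency(latencyInjectionConfig).then(() => original(params));
  }

  return client;
};

const dynamodb = injectLatencyToClient(
  Promise.promisifyAll(new AWS.DynamoDB.DocumentClient()));

// usage: the optional last argument controls the latency injection behaviour
const getUser = id => dynamodb.getAsync(
  { TableName : 'users-dev', Key : { Id : id } },
  { isEnabled : true, probability : 0.5, minDelay : 100, maxDelay : 3000 });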

You can see this in action in the handler function for public-api-b.

For some reason, the wrapped function is not able to record subsegments in X-Ray. I suspect it's some nuance about JavaScript or the X-Ray SDK that I do not fully understand.

Nonetheless, judging from the logs, I can confirm that the wrapped function does indeed inject latency into the getAsync call to DynamoDB.

If you know of a way to improve the factory function, or to get X-Ray tracing to work with the wrapped function, please let me know in the comments.

How to inject latency to function invocations

The apiHandler factory function I created in the correlation ID post is a good place to apply common implementation patterns that we want from our API functions, including:

  • log the event source as debug
  • log the response and/or error from the invocation (which, surprisingly, Lambda doesn't capture by default)
  • initialize global context (e.g. for tracking correlation IDs)
  • handle serialization for the response object
  • etc.

// this is how you use the apiHandler factory function to create a
// handler function for the API Gateway event source
module.exports.handler = apiHandler(
  co.wrap(function* (event, context) {
    // ... do a bunch of stuff
    // instead of invoking the callback directly, you return the
    // response you want to send, and the wrapped handler function
    // handles the serialization and invokes the callback for you;
    // it also takes care of other things, like logging the event
    // source and logging unhandled exceptions
    return { message : "everything is awesome" };
  }));

In this case, it's also a good place for us to inject latency into the API function.

However, to do that, we need access to the configuration for the function. Time to lift the responsibility for fetching configurations into the apiHandler factory then!

The full apiHandler factory function can be found here; below is a simplified version that illustrates the point.
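
Here is a minimal sketch of that simplified version; the "<function name>.config" key convention and the way the config is passed on to the wrapped handler are my assumptions, not necessarily the exact implementation:

const co = require('co');
const configClient = require('./config-client'); // from the SSM Parameter Store post

const apiHandler = (f) => co.wrap(function* (event, context, callback) {
  // fetch this function's config from SSM Parameter Store, assuming a
  // "<function name>.config" naming convention for the config key
  const configKey = `${context.functionName}.config`;
  const configs = configClient.loadConfigs([ configKey ]);
  const config = JSON.parse(yield configs[configKey]);

  try {
    // pass the loaded config along so the handler no longer fetches it itself
    const response = yield Promise.resolve(f(event, context, config));
    callback(null, { statusCode: 200, body: JSON.stringify(response) });
  } catch (err) {
    console.error(err);
    callback(null, { statusCode: 500 });
  }
});

module.exports = apiHandler;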

Now, we can write our API function like the following.
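
For instance, something along these lines (a sketch; note the config is now handed to the handler by the factory):

const apiHandler = require('./api-handler');
const co = require('co');
const http = require('./http');

module.exports.handler = apiHandler(
  co.wrap(function* (event, context, config) {
    // no more config-fetching boilerplate; the factory has done it for us
    const chaosConfig = config.chaosConfig || {};
    yield http({
      method : 'GET',
      uri : config.internalApi,
      latencyInjectionConfig : chaosConfig.httpClientLatencyInjectionConfig
    });
    return { message : "everything is awesome" };
  }));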

Now that the apiHandler has access to the config for the function, it can access the chaosConfig object too.

Let's extend the definition of the chaosConfig object to add a functionLatencyInjectionConfig property.

"chaosConfig": {
  "functionLatencyInjectionConfig": {
    "isEnabled": true,
    "probability": 0.5,
    "minDelay": 100,
    "maxDelay": 5000
  "httpClientLatencyInjectionConfig": {
    "isEnabled": true,
    "probability": 0.5,
    "minDelay": 100,
    "maxDelay": 5000

With this additional configuration, we can modify the apiHandler factory function to inject latency into a function's invocation, much like what we did in the HTTP client.
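
In the simplified apiHandler sketch above, that modification could be as small as two lines, placed after the config is loaded and before the wrapped handler runs:

// inside the apiHandler factory, after loading the config and before
// invoking the wrapped handler function
const chaosConfig = config.chaosConfig || {};
yield injectLatency(chaosConfig.functionLatencyInjectionConfig);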

Just like that, we can now inject latency into function invocations via configuration. This will work for any API function that is created using the apiHandler factory.

With this change and both kinds of latency injection enabled, I can observe all the expected scenarios through X-Ray:

  • no latency was injected

  • latency was injected to the function invocation only

  • latency was injected to the HTTP client only

  • latency was injected to both the HTTP client and the function invocation, but the invocation did not time out as a result

  • latency was injected to both the HTTP client and the function invocation, and the invocation timed out as a result

I can get further confirmation of the expected behaviour through logs, and the metadata recorded in the X-Ray traces.

Recap, and future work

In this post we discussed:

  • why you should consider applying the practice of latency injection to APIs created with API Gateway and Lambda
  • additional considerations specific to API Gateway and Lambda
  • where you can inject latency, and why you should consider injecting latency at each of these places
  • how you can inject latency in HTTP clients, the AWS SDK, as well as the function invocation

The approach we have discussed here is driven by configuration, and the configuration is refreshed every 3 mins by default.

We can go much further with this.

Fine-grained configuration

The configurations can be more fine-grained, allowing you to control latency injection for specific resources.

For example, instead of a blanket httpClientLatencyInjectionConfig for all HTTP requests (including those requests to AWS services), the configuration can be specific to an API, or a DynamoDB table.
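
For instance, the configuration could be keyed by resource; the shape below is purely illustrative:

"chaosConfig": {
  "httpClientLatencyInjectionConfig": {
    "internal-api": {
      "isEnabled": true,
      "probability": 0.5,
      "minDelay": 100,
      "maxDelay": 5000
    },
    "dynamodb:user-preferences": {
      "isEnabled": false
    }
  }
}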


Automation

The configurations can be changed by an automated process to:

  • run routine validations daily
  • stop all latency injections during off hours, and holidays
  • forcefully stop all latency injections, e.g. during an actual outage
  • orchestrate complex scenarios that are difficult to manage by hand, e.g. enabling latency injection at several places at once

Again, we can look to Netflix for inspiration for such an automated platform.

Usually, you would want to enable one latency injection in a bounded context at a time. This helps contain the blast radius of unintended damage, and makes sure your experiments are actually controlled. Also, when latency is injected at several places, it is harder to understand the causality we observe as there are multiple variables to consider.

Unless, of course, you're validating against a specific hypothesis, such as:

The system can tolerate an outage of both the primary store (DynamoDB) and the backup store (S3) for user preferences, and would return a hardcoded default value in that case.

Better communication

Another good thing to do is to inform the caller that latency has been added to the invocation by design.

This might take the form of an HTTP header in the response that tells the caller how much latency was injected in total. If you're using an automated process to generate these experiments, then you should also include the id/tag/name for the specific instance of the experiment as an HTTP header as well.
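
In the apiHandler sketch from earlier, that could look something like this, assuming we have tracked the total injected delay in injectedMs and have an experiment id at hand (both header names below are made up):

// surface the injected latency (and the experiment that caused it) to the caller
callback(null, {
  statusCode : 200,
  headers : {
    'x-injected-latency' : `${injectedMs}ms`,   // hypothetical header name
    'x-chaos-experiment-id' : experimentId,     // hypothetical header name
  },
  body : JSON.stringify(response)
});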

What’s next?

As I mentioned in the previous post, you need to apply common sense when deciding when and where to apply chaos engineering practices.

Don't attempt an exercise that you know is beyond your abilities.

Before you even consider applying latency injection to your APIs in production, you need to think about how you can deal with these latency spikes given the inherent constraints of API Gateway and Lambda.

Unfortunately, we have run out of time to talk about this in this post, but come back in 2 weeks and we will talk about several strategies you can employ in part 3.

The code for the demo in this post is available on GitHub here. Feel free to play around with it and let me know if you have any suggestions for improvement!

