How can we apply the principles of chaos engineering to AWS Lambda?

This is the first of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions.

  • part 1: how can we apply principles of chaos engineering to Lambda? <- you’re here
  • part 2: latency injection for APIs
  • part 3: fault injection for Lambda functions

All the way back in 2011, Simon Wardley identified Chaos Engines as a practice that would be employed by the next generation of tech companies, along with continuous deployment, being data-driven, and organising around small, autonomous teams (think microservices and the inverse Conway manoeuvre).

There’s no question about it: Netflix has popularised the principles of chaos engineering. By open sourcing some of their tools — notably the Simian Army — they have also helped others build confidence in their system’s capability to withstand turbulent conditions in production.

There seems to be a renewed interest in chaos engineering recently. As Russ Miles noted in a recent post, perhaps many companies have finally come to understand that chaos engineering is not about “hurting production”, but about building a better understanding of, and confidence in, a system’s resilience through controlled experiments.

This trend has been helped by the valuable (and freely available) information that Netflix has published, such as the Chaos Engineering e-book and principlesofchaos.org.

Tools such as chaos-lambda by Shoreditch Ops (the folks behind the Artillery load testing tool) look to replicate Netflix’s Chaos Monkey, but execute from inside a Lambda function instead of an EC2 instance — hence bringing you the cost savings and convenience that Lambda offers.

I want to ask a different question, however: how can one apply the principles of chaos engineering, and some of the current practices, to a serverless architecture comprised of Lambda functions?

When your system runs on EC2 instances, you naturally build resilience by designing for the most likely failure mode: server crashes (due to both hardware and software issues). A controlled experiment to validate the resilience of your system would therefore artificially recreate that failure scenario by terminating EC2 instances, then AZs, then entire regions.
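At its simplest, such an experiment might look something like the sketch below. This is only a rough illustration (in Python, using boto3); it assumes instances have opted in via a hypothetical chaos-target tag, and a real tool would add safeguards, scheduling, and opt-out controls.

```python
import random
import boto3

# Minimal sketch of a Chaos Monkey-style experiment for EC2-based systems.
# Assumes target instances carry a hypothetical "chaos-target" tag; adjust
# the filter to match your own tagging scheme before running anything.
ec2 = boto3.client("ec2")

def terminate_random_instance():
    # Find running instances that have opted in to the experiment
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag-key", "Values": ["chaos-target"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]
    if not instance_ids:
        return None

    # Terminate one victim at random and let the auto scaling group replace it
    victim = random.choice(instance_ids)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```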

AWS Lambda, however, is a higher-level abstraction and has different failure modes from its EC2 counterpart. Hypotheses that focus on “what if we lose these EC2 instances” no longer apply, as the platform handles those failure modes for you out of the box.

We need to ask different questions in order to understand the weaknesses within our serverless architectures.

More inherent chaos, not less

We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.

— Principles of Chaos Engineering

Having built and operated a non-trivial serverless architecture, I have some understanding of the dangers awaiting you in this new world.

If anything, there is a lot more inherent chaos and complexity in these systems built around Lambda functions:

  • modularity (the unit of deployment) shifts from “services” to “functions”, and there are a lot more of them
  • it’s harder to harden around the boundaries, because you need to harden around each function, as opposed to a service that encapsulates a set of related functionality
  • there are more intermediary services (e.g. Kinesis, SNS, and API Gateway, to name a few), each with its own failure modes
  • there are more configurations overall (timeouts, IAM permissions, etc.), and therefore more opportunities for misconfiguration

Also, since we have traded away more control over our infrastructure*, we now face more unknown failure modes**, and often there’s little we can do when an outage does occur***.


* In exchange for better scalability, availability, cost efficiency, and convenience, which I, for one, think is a fair trade in most cases.

** Everything the platform does for you — scheduling containers, scaling, polling Kinesis, retrying failed invocations, etc. — has its own failure modes. These are often not obvious to us, since they’re implementation details that are typically undocumented and prone to change without notice.

*** For example, if an outage prevents Lambda functions from processing Kinesis events, we have no meaningful alternative but to wait for AWS to fix the problem. Since the current position on the shards is abstracted away and unavailable to us, we can’t even replace the Lambda functions with KCL processors running on EC2.


Applying chaos to AWS Lambda

A good exercise regime would continuously push you to your limits but never actually put you over the limit and cause injury. If an exercise is clearly beyond your current abilities, then surely you would not attempt it, as the only possible outcome is getting yourself hurt!

The same common sense should be applied when designing controlled experiments for your serverless architecture. To understand what the experiments tell us about the resilience of our system, we also need to decide which metrics to monitor — ideally client-side metrics, since the most important metric is the quality of service our users experience.
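As a rough illustration of what a client-side metric could look like, the sketch below measures latency from the caller’s perspective and publishes it as a custom CloudWatch metric. The endpoint, namespace, and metric name are made up for the example.

```python
import time
import boto3
import requests

cloudwatch = boto3.client("cloudwatch")

def measure_and_publish(url="https://api.example.com/health"):
    # Measure latency as the client actually experiences it, end to end
    start = time.time()
    response = requests.get(url, timeout=5)
    elapsed_ms = (time.time() - start) * 1000

    # Publish as a custom metric; namespace and dimensions are illustrative
    cloudwatch.put_metric_data(
        Namespace="MyApp/ClientSide",
        MetricData=[{
            "MetricName": "RequestLatency",
            "Dimensions": [{"Name": "Endpoint", "Value": url}],
            "Value": elapsed_ms,
            "Unit": "Milliseconds",
        }],
    )
    return response.status_code, elapsed_ms
```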

There are plenty of failure modes that we know about and can design for, and we can run simple experiments to validate our design. For example, since a serverless architecture is (almost always) also a microservice architecture, many of its inherent failure modes still apply (a sketch addressing them follows this list):

  • improperly tuned timeouts, especially for intermediate services, which can cause services at the edge to also time out. Intermediate services should have stricter timeout settings than services at the edge.
  • missing error handling, which allows exceptions from downstream services to escape

  • missing fallbacks for when a downstream service is unavailable or experiences an outage
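To make these concrete, here is a minimal sketch of a Lambda handler that addresses all three at once: a strict timeout on the intermediate call, explicit error handling, and a fallback response when the dependency misbehaves. The downstream URL, timeout value, and fallback payload are all placeholders for illustration, not a recommendation.

```python
import json
import requests

# Hypothetical downstream dependency and degraded default, for illustration only
DOWNSTREAM_URL = "https://internal.example.com/recommendations"
FALLBACK_RESPONSE = {"recommendations": []}

def handler(event, context):
    try:
        # Keep the intermediate call's timeout well below the edge's
        # (e.g. API Gateway's 29-second integration timeout), so a slow
        # dependency doesn't drag the caller into timing out as well.
        resp = requests.get(DOWNSTREAM_URL, timeout=3)
        resp.raise_for_status()
        return {"statusCode": 200, "body": resp.text}
    except requests.exceptions.RequestException:
        # Don't let downstream exceptions escape; degrade gracefully instead
        return {"statusCode": 200, "body": json.dumps(FALLBACK_RESPONSE)}
```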

Over the next couple of posts, we will explore how we can apply the practices of latency and fault injection to Lambda functions in order to simulate these failure modes and validate our design.
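To give a rough sense of the shape these practices can take, here is a minimal sketch of a handler wrapper that occasionally injects extra latency or an error before the real handler runs, driven by hypothetical environment variables. The later posts go into the details; this is only an illustration of the idea.

```python
import os
import random
import time
from functools import wraps

def inject_chaos(func):
    """Wrap a Lambda handler with probabilistic latency/fault injection.

    Controlled by hypothetical environment variables so it can be switched
    on per function, per environment, without a code change.
    """
    @wraps(func)
    def wrapper(event, context):
        rate = float(os.environ.get("CHAOS_INJECTION_RATE", "0"))
        if random.random() < rate:
            mode = os.environ.get("CHAOS_MODE", "latency")
            if mode == "latency":
                # Simulate a slow dependency or cold-start-like delay
                time.sleep(float(os.environ.get("CHAOS_DELAY_SECONDS", "1")))
            else:
                # Simulate a fault escaping from the handler
                raise RuntimeError("chaos: injected fault")
        return func(event, context)
    return wrapper

@inject_chaos
def handler(event, context):
    return {"statusCode": 200, "body": "hello"}
```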

Further reading:

Like what you’re reading? Check out my video course Production-Ready Serverless and learn the essentials of how to run a serverless application in production.

We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we’re in Manning’s Early Access Program (MEAP).