AWS Lambda — how best to manage shared code and shared infrastructure

In the last post I discussed the pros & cons of following the Single Responsibility Principle (SRP) when moving to the serverless paradigm.

One of the questions that popped up on both Twitter and Medium is “how do you deal with shared code?”. It is a FAQ whenever I speak at user groups or conferences about AWS Lambda, alongside “how do you deal with shared infrastructure that doesn’t clearly belong to any one particular service?”

So here are my thoughts on these two questions.

Again, I’m not looking to convince you one way or the other, and I don’t believe there’s a “right” answer that’d work for everyone. This is simply me thinking out loud and sharing my internal thought process with you, in the hope of getting you to ask the same questions of your own architecture.

As ever, if you disagree with my assessment or find flaws in my thinking, please let me know via the comments section below.


Managing shared code

As you build out your system with all these little Lambda functions, there is no doubt going to be business logic or utility code that you want to share and reuse amongst them.

When you have a group of functions that are highly cohesive and are organised into the same repo — like the functions that we created to implement the timeline feature in the Yubl app — then sharing code is easy: you just do it via a module inside the repo.

But to share code more generally between functions across service boundaries, it can be done through shared libraries, perhaps published as private NPM packages so that they’re only available to your team.

Or, you can share business logic by encapsulating it into a service, and there are a couple of considerations you should make in choosing which approach to use.

Shared library vs Service

visibility

When you depend on a shared library, that dependency is declared explicitly; in the case of Node.js, it is declared in the package.json.

When you depend on a service, that dependency is often not declared at all, and may be discovered only through logging, and perhaps explicit attempts at tracing, maybe using the AWS X-Ray service.
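If you’re using the Serverless framework, one small step towards making those runtime dependencies visible is to switch on X-Ray tracing for the whole service. A minimal sketch, where the service name and runtime are placeholders:

```yaml
# serverless.yml (sketch): opt every function, and API Gateway, into X-Ray
# so calls to other services at least show up on the trace map
service: my-service          # placeholder service name

provider:
  name: aws
  runtime: nodejs14.x        # placeholder runtime
  tracing:
    lambda: true             # enable active tracing on all functions
    apiGateway: true         # trace requests passing through API Gateway
```

Even then, the dependency only surfaces at runtime, after the fact; it’s nothing like the explicit declaration you get in a package.json.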

deployment

When it comes to deploying updates to this shared code, with a shared library you can publish a new version, but you still have to rely on the consumers of your shared library to update.

Whereas with a service, you as the owner of the service have the power to decide when to deploy the update, and you can even use techniques such as canary deployments or feature flags to roll out updates in a controlled and safe manner.

versioning

With libraries, you will have multiple active versions at the same time (as discussed above), depending on the upgrade and deployment schedules of the consumers. In fact, there’s really no way to avoid it entirely; even with the best efforts at a coordinated update, there will be a period of time where multiple versions are active at once.

With services, you have a lot more control, and you might choose to run multiple versions at the same time. This can be done via canary deployments, or by running multiple versions side by side, perhaps by putting the version of the API in the URL, as people often do.
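As an illustration of the URL approach (the function names and handler paths here are hypothetical), the two versions might sit side by side in a serverless.yml like this:

```yaml
functions:
  get-user-v1:
    handler: handlers/v1/get-user.handler   # hypothetical handler
    events:
      - http:
          path: v1/users/{id}
          method: get
  get-user-v2:
    handler: handlers/v2/get-user.handler   # hypothetical handler
    events:
      - http:
          path: v2/users/{id}
          method: get
```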

There are multiple ways to version an API, but I don’t find any of them satisfactory. Sebastien Lambla gave a good talk on this topic and went through several of these approaches and why they’re all bad, so check out his talk if you want to learn more about the perils of API versioning.

backward compatibility

With a shared library, you can communicate backward compatibility of updates using semantic versioning — where a MAJOR version update signifies a breaking change. If you follow semantic versioning with your releases, then backward compatibility can be broken in a controlled, well-communicated manner.

Most package managers support semantic versioning by letting the consumer decide how far automatic updates may go; with npm, for example, the default ^ range accepts new MINOR and PATCH versions, while a new MAJOR version requires an explicit upgrade.

With a service, if you roll out a breaking change then it’ll break everyone that depends on your service. This is where it ties back to versioning again, and as I already said, none of the commonly practised approaches feel satisfactory to me. I have had this discussion with my teams many times in the past, and they always ended with the decision to “always maintain backward compatibility” as a general rule, unless the circumstances dictate that we have to break the rule and do something special.

isolation

With a shared library, you generally expose more than you need, especially if it’s for internal use. And even if you have taken care to consider what should be part of the library’s public API, there’s always a way for the consumer to get at those internal APIs via reflection.

With a service, you are much more deliberate about what to expose via the service’s public API. You have to be, because anything you share via the service’s public API is an explicit design decision that requires effort.

The internal workings of that service are also hidden from the consumer, and there’s no direct (and easy) way for the consumers to access them, so there’s less risk of the consumers of our shared code accidentally depending on internal implementation details. The worst thing that can happen here is if the consumers of your API start to depend on those (accidentally leaked) implementation details as features…

failure

When a library fails, your code fails, and it’s often loud & clear: you get the stack trace of what went wrong.

With a service, it may fail, or maybe it just didn’t respond in time before you stopped waiting for the response. As a consumer, you often can’t distinguish between a service being down and it being slow. When that happens, retries can become tricky as well, if the action you’re trying to perform would modify state and is not idempotent.

Partial failures are also very difficult to deal with, and often require elaborate patterns like the Saga pattern in order to roll back state changes that have already been introduced in the transaction.

latency

Finally, and this is perhaps the most obvious: calling a service introduces network latency, which is significantly higher than calling a method or function in a library.

Managing shared infrastructure

Another question that I get a lot is “how do you manage shared AWS resources like DynamoDB tables and Kinesis streams?”.

If you’re using the Serverless framework, then you can manage these directly in your serverless.yml files and add them as additional CloudFormation resources, like below.
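Here’s a sketch of what that looks like; the service, function, table, and stream names are all hypothetical:

```yaml
# serverless.yml (sketch): shared resources declared alongside the functions
service: timeline                    # hypothetical service name

provider:
  name: aws
  runtime: nodejs14.x                # placeholder runtime

functions:
  get-timeline:
    handler: functions/get-timeline.handler   # hypothetical function

resources:
  Resources:
    UserTimelineTable:               # hypothetical DynamoDB table
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: user-timeline
        AttributeDefinitions:
          - AttributeName: userId
            AttributeType: S
        KeySchema:
          - AttributeName: userId
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 1
          WriteCapacityUnits: 1
    TimelineEventsStream:            # hypothetical Kinesis stream
      Type: AWS::Kinesis::Stream
      Properties:
        Name: timeline-events
        ShardCount: 1
```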

This is actually the approach I took in my video course AWS Lambda in Motion, but that is because the demo app I’m leading the students to build is a project with a well-defined start and end state.

But what works in a project like that won’t necessarily work when you’re building a product that will evolve continuously over time. In the context of building a product, there are some problems with this approach.

"sls remove" can delete user data

Since these resources are tied to the CloudFormation stack for the Serverless (as in, the framework) project, if you ever decide to delete the functions using the sls remove command, then you’ll delete those resources too, along with any user data that you have in those resources.

Even if you don’t intentionally run sls remove against production, the thought that someone might one day accidentally do it is worrisome enough.

It’s one thing to lose the functions if that happens and experience downtime in the system; it’s quite another to lose all the production user data along with the functions, and potentially find yourself in a situation that you can’t easily recover from…

You can — and you should — lock down IAM permissions so that developers can’t accidentally delete these resources in production, which goes a long way to mitigate against these accidents.
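For instance, a guardrail along these lines (shown here as a YAML policy document; the account ID, region, and prod- naming convention are placeholder assumptions) denies the destructive actions outright:

```yaml
# Sketch of an IAM policy that denies destructive actions against
# production resources; ARNs and naming convention are placeholders
Version: '2012-10-17'
Statement:
  - Effect: Deny
    Action:
      - dynamodb:DeleteTable
      - kinesis:DeleteStream
      - cloudformation:DeleteStack
    Resource:
      - arn:aws:dynamodb:us-east-1:123456789012:table/prod-*
      - arn:aws:kinesis:us-east-1:123456789012:stream/prod-*
      - arn:aws:cloudformation:us-east-1:123456789012:stack/prod-*/*
```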

You should also leverage the new backup capability that DynamoDB offers.
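Point-in-time recovery, for example, can be switched on in the table definition itself. A sketch of the relevant CloudFormation property, with the rest of the hypothetical table definition elided:

```yaml
UserTimelineTable:                   # hypothetical table from the earlier sketch
  Type: AWS::DynamoDB::Table
  Properties:
    # ...table name, attribute definitions and key schema as before...
    PointInTimeRecoverySpecification:
      PointInTimeRecoveryEnabled: true   # continuous backups for the table
```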

For Kinesis streams, you should back up the source events in S3 using Kinesis Firehose. That way, you don’t even have to write any backup code yourself!
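A rough sketch of what that backup pipeline might look like in CloudFormation; the IAM role and S3 bucket referenced here are hypothetical and would need to exist with the right permissions:

```yaml
TimelineEventsBackup:
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamType: KinesisStreamAsSource
    KinesisStreamSourceConfiguration:
      KinesisStreamARN:
        Fn::GetAtt: [TimelineEventsStream, Arn]   # the stream being backed up
      RoleARN:
        Fn::GetAtt: [FirehoseRole, Arn]           # hypothetical role with read access
    S3DestinationConfiguration:
      BucketARN:
        Fn::GetAtt: [BackupBucket, Arn]           # hypothetical backup bucket
      RoleARN:
        Fn::GetAtt: [FirehoseRole, Arn]           # hypothetical role with write access
      Prefix: timeline-events/
```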

But even with all the backup options available, I still feel uneasy at the thought of tying these resources that store user data to the creation and deletion of the compute layer (i.e. the Lambda functions) that utilises them.

when ownership is not clear cut

The second problem with managing shared infrastructure in the serverless.yml is: what do you do when the ownership of these resources is not clear cut?

By virtue of being shared, it’s not always clear which project should be responsible for managing these resources in its serverless.yml.

A Kinesis stream, for example, can receive events from Lambda functions, from applications running on EC2, or from applications in your own datacentre. And since Kinesis uses a polling model, those events can likewise be processed by Lambda functions, or by consumer applications running on EC2 or in your own datacentre.

Kinesis streams exist as a way for you to notify others of events that have occurred in your system, and modern distributed systems are heterogeneous by design, to allow for greater flexibility and the ability to choose the right tradeoff in different circumstances.

Even without the complex case of multi-producer and multi-consumer Kinesis streams, the basic question of “should the consumer or the producer own the stream?” is often enough to stop us in our tracks, as there doesn’t seem (at least not to me) to be a clear winner here.

Manage shared AWS resources separately

One of the better ways I have seen — and have adopted myself — is to manage these shared AWS resources in a separate repository, using either Terraform or CloudFormation templates depending on the expertise available in the team.
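To wire the two together, one approach (a sketch; the stack, resource, and output names are hypothetical) is to have the shared stack export the identifiers that the services need:

```yaml
# In the shared-infrastructure CloudFormation template (sketch)
Outputs:
  UserTimelineTableName:
    Value:
      Ref: UserTimelineTable         # Ref on a DynamoDB table returns its name
    Export:
      Name: shared-infra-user-timeline-table
```

Each consuming service can then look the value up in its serverless.yml, for example via the Serverless framework’s cf: variable syntax, instead of owning the resource itself:

```yaml
# In a consuming service's serverless.yml (sketch)
provider:
  environment:
    TIMELINE_TABLE: ${cf:shared-infra-prod.UserTimelineTableName}
```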

This seems to be the approach that many companies have adopted once their serverless architecture matured to the point where shared infrastructure starts to pop up.

But, on its own, it’s not a perfect answer either, as it introduces other problems around your workflow.

For example, if those shared resources are managed by a separate infrastructure team, then it can create bottlenecks and friction between your development and infrastructure teams.

That said, I still think it’s better than managing those shared AWS resources in your serverless.yml, for the reasons I mentioned.

If you know of other ways to manage shared infrastructure, then by all means let me know in the comments, or you can get in touch with me via Twitter.

Like what you’re reading? Check out my video course Production-Ready Serverless and learn the essentials of how to run a serverless application in production.

We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we’re in Manning’s Early Access Program (MEAP).