Serverless observability brings new challenges to current practices

This is the first post in a mini three-part series that accompanies my “the present and future of Serverless observability” talk at ServerlessConf Paris and QCon London this year.

part 1 : new challenges to observability <- you’re here

part 2 : the present of Serverless observability

part 3 : the future of Serverless observability

2017 was no doubt the year the concept of observability became mainstream, so much so that we now have an entire Observability track at a big industry event like QCon.

This is in no small part thanks to the excellent writing and talks by some really smart people such as Cindy Sridharan and Charity Majors.

As Cindy mentioned in her post though, the first murmurs of observability came from a post by Twitter way back in 2013, where they discussed many of the challenges they faced in debugging their complex, distributed system.

A few years later, Netflix started writing about the related idea of intuition engineering: how do we design tools that give us a holistic understanding of our complex systems? That is, how can we design our tools so that they present the most relevant information about our system at the right time, and minimize the amount of time and cognitive energy we need to invest to build a correct mental model of our system?

Challenges with Serverless observability

With Serverless technologies like AWS Lambda, we face a number of new challenges to the practices and tools we have slowly mastered as we learnt how to gain observability into services running inside virtual machines and containers.

As a start, we lose access to the underlying infrastructure that runs our code. The execution environment is locked down, and we have nowhere to install agents & daemons for collecting, batching and publishing data to our observability system.

These agents & daemons used to go about their job quietly in the background, far away from the critical paths of your code. For example, if you’re collecting metrics & logs for your REST API, you would collect and publish that observability data outside the request-handling code, where a human user is waiting for a response on the other side of the network.
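
For illustration, here’s a minimal sketch of that “before” picture, assuming a long-running Python service: a daemon thread drains an in-memory queue and publishes batches off the request path, so the user never waits on the publish. The publish_batch function and the 10-second interval are placeholders, not a real agent.

```python
import queue
import threading
import time

# in-memory buffer shared between the request path and the publisher thread
metric_queue = queue.Queue()

def publish_batch(batch):
    # placeholder for shipping a batch to your observability backend
    print(f"published {len(batch)} metrics")

def background_publisher(interval_seconds=10):
    # runs off the critical path, batching whatever has accumulated
    while True:
        time.sleep(interval_seconds)
        batch = []
        try:
            while True:
                batch.append(metric_queue.get_nowait())
        except queue.Empty:
            pass
        if batch:
            publish_batch(batch)

threading.Thread(target=background_publisher, daemon=True).start()

def handle_request(request):
    # the handler only enqueues a metric; the user never waits on the publish
    metric_queue.put({"name": "request_count", "value": 1})
    return "hello"
```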

But with Lambda, everything you do has to happen inside your function’s invocation, which means you lose the ability to perform background processing, except for what the platform does for you, such as:

  • collecting logs from stdout and sending them to CloudWatch Logs
  • collecting tracing data and sending it to X-Ray
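
In practice, that means leaning on what the platform gives you. Here is a minimal sketch, assuming a Python function: anything you print to stdout during the invocation is collected by CloudWatch Logs, so structured log lines become your cheapest transport. The field names here are arbitrary choices, not a required schema.

```python
import json
import time

def handler(event, context):
    start = time.time()
    result = {"ok": True}  # your business logic would go here

    # structured log line; CloudWatch Logs picks this up from stdout,
    # with no extra work inside the invocation
    print(json.dumps({
        "level": "INFO",
        "message": "invocation completed",
        "duration_ms": int((time.time() - start) * 1000),
        "aws_request_id": context.aws_request_id,
    }))
    return result
```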

Another aspect that has drastically changed is how the concurrency of our system is controlled.

Whereas before, we would write our REST API with a web framework and run it as an application on an EC2 server or in a container, and that application would handle many concurrent requests. In fact, one of the things we compare different web frameworks on is their ability to handle large numbers of concurrent requests.

Now we don’t need web frameworks to create a scalable REST API anymore; API Gateway and Lambda take care of all the hard work for us. Concurrency is now managed by the platform, and that’s great news!

However, this also means that any attempt to batch observability data becomes less effective (more on this later), and for the same volume of incoming traffic you’ll send a much higher volume of traffic to your observability system. This in turn can have non-trivial performance and cost implications at scale.

You might argue: “well, in that case, I’ll just use a bigger batch size for this observability data and publish it less frequently, so I don’t overwhelm the observability system”.

Except it’s not that simple. Enter the lifecycle of an AWS Lambda function.

One of the benefits of Lambda is that you don’t pay for it if you don’t use it. To achieve that, the Lambda service garbage collects containers (that is, concurrent executions of your function) that have not received a request for some time. I did some experiments to see how long that idle time is, which you can read about in this post.

And if you have observability data that hasn’t been published yet, you’ll lose that data when the container is GC’d.
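
Here’s a rough sketch of why that naive batching approach is fragile in Lambda, assuming a Python function with a module-level buffer: the buffer lives in the container, so anything that hasn’t been flushed when the container is reclaimed is simply gone. The flush function, BATCH_SIZE threshold and metric shape are all hypothetical.

```python
BUFFER = []
BATCH_SIZE = 50  # arbitrary threshold; bigger batches mean fewer publishes

def flush():
    global BUFFER
    batch, BUFFER = BUFFER, []
    # placeholder: publish the batch to your observability system
    print(f"flushing {len(batch)} metrics")

def record_metric(metric):
    BUFFER.append(metric)
    if len(BUFFER) >= BATCH_SIZE:
        flush()

def handler(event, context):
    record_metric({"name": "invocation", "value": 1})
    # if the container is garbage collected before BUFFER reaches BATCH_SIZE,
    # everything still sitting in BUFFER is lost
    return "ok"
```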

Even if the container is continuously receiving requests, maybe with the help of something like the warmup plugin for the Serverless framework, the Lambda service will still GC the container after it has been active for a few hours and replace it with a fresh container.

Again, this is a good thing, as it eliminates common problems with long-running code, such as memory fragmentation. But it also means you can still lose unpublished observability data when it happens.

Also, as I explained in a previous post on cold starts, those attempts to keep containers warm stop being effective once you have even a moderate amount of load on your system.

So you’re back to sending observability data eagerly. Maybe this time you’ll build an observability system that can handle the extra load; maybe you’ll build it using Lambda!

But wait, remember, you don’t have background processing time anymore…

So if you’re sending observability data eagerly as part of your function invocation, then you’re hurting the user-facing latency, and we know that latency affects business revenue directly (well, at least in any reasonably competitive market where there’s another provider the customer can easily switch to).
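
To make the trade-off concrete, here is a hedged sketch in Python: a synchronous publish at the end of the handler sits squarely on the caller’s critical path, so whatever it costs is added directly to the user-facing latency. The publish_metrics function and its simulated 50ms delay are stand-ins, not a real backend call.

```python
import time

def publish_metrics(metrics):
    # placeholder for a synchronous call to an observability backend,
    # e.g. an HTTP POST; pretend it takes ~50ms
    time.sleep(0.05)

def handler(event, context):
    start = time.time()
    response = {"statusCode": 200, "body": "hello"}

    # eager publish inside the invocation: the caller waits for this too,
    # so the ~50ms becomes user-facing latency
    publish_metrics([{"name": "request_count", "value": 1}])

    print(f"total invocation time: {(time.time() - start) * 1000:.0f}ms")
    return response
```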

Talk about being caught between a rock and a hard place…

Finally, one of the trends I see in the Serverless space, and one that I have experienced myself when I migrated a social network’s architecture to AWS Lambda, is how powerful and how simple it is to build an event-driven architecture. And Randy Shoup seems to think so too.

And in this event-driven, serverless world, function invocations are often chained through asynchronous event sources such as Kinesis Streams, SNS, S3, IoT, DynamoDB Streams, and so on.

In fact, of all the supported event sources for AWS Lambda, only a few are classified as synchronous, so by design the cards are stacked towards asynchrony here.

And guess what, tracing asynchronous invocations is hard.

I wrote a post on how you might do it yourself, in the interest of collecting and forwarding correlation IDs for distributed tracing. But even with the approach I outlined, it won’t be easy (or, in some cases, possible) to trace through every type of event source.
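
As a taste of what that looks like, here is a hedged sketch of forwarding a correlation ID over SNS, one of the friendlier event sources: the publisher attaches it as a message attribute, and the subscriber pulls it back out and includes it in its logs. The attribute name x-correlation-id and the topic ARN are my own placeholder choices, not a standard.

```python
import json
import os
import uuid

import boto3

sns = boto3.client("sns")
# hypothetical topic ARN, normally injected via environment variable
TOPIC_ARN = os.environ.get("TOPIC_ARN", "arn:aws:sns:eu-west-1:123456789012:my-topic")

def publisher_handler(event, context):
    # reuse an incoming correlation ID if we have one, otherwise mint a new one
    correlation_id = event.get("correlation_id") or str(uuid.uuid4())
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"hello": "world"}),
        MessageAttributes={
            "x-correlation-id": {"DataType": "String", "StringValue": correlation_id}
        },
    )

def subscriber_handler(event, context):
    for record in event["Records"]:
        attrs = record["Sns"].get("MessageAttributes", {})
        correlation_id = attrs.get("x-correlation-id", {}).get("Value", "unknown")
        # include the correlation ID in every log line so the two invocations
        # can be joined up later
        print(json.dumps({"correlation_id": correlation_id, "message": "processing record"}))
```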

X-Ray doesn’t help you here either, although it sounds like they’re at least looking at support for Kinesis. At the time of writing, X-Ray also doesn’t trace over API Gateway, but that too is on their list.

Until next time…

So, I hope I have painted a clear picture of what tool vendors are up against in this space, so you really gotta respect the work that folks like IOPipe, Dashbird and Thundra have done.

That said, there are also many things you have to consider yourself.

For example, given the lack of background processing, when you’re building a user-facing API where latency is important, you might want to avoid observability tools that don’t give you the option to send observability data asynchronously (for example, by leveraging CloudWatch Logs), or you’ll need to use a rather stringent sample rate.
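
For instance, a minimal sketch of those two levers in Python might look like this: metrics go out as structured log lines so CloudWatch Logs carries them asynchronously, and the more expensive, detailed telemetry is only captured for a sampled fraction of invocations. The 1% sample rate and the field names are arbitrary choices for illustration.

```python
import json
import random

SAMPLE_RATE = 0.01  # capture detailed telemetry for ~1% of invocations

def handler(event, context):
    sampled = random.random() < SAMPLE_RATE

    response = {"statusCode": 200, "body": "hello"}

    # cheap path: the metric goes out as a log line, shipped asynchronously
    # by CloudWatch Logs rather than inside the invocation
    print(json.dumps({"metric": "request_count", "value": 1}))

    if sampled:
        # expensive path: detailed debug information, only for sampled invocations
        print(json.dumps({"debug": True, "event": event}))

    return response
```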

At the same time, when you’re processing events asynchronously, you don’t have to worry about invocation time quite as much. But you might care about the cost of writing so much data to CloudWatch Logs, and about the subsequent Lambda invocations needed to process it. Or maybe you’re concerned that low-priority functions (the ones that process the observability data you send via CloudWatch Logs) will eat up your quota of concurrent executions and throttle high-priority functions (like the ones serving user-facing REST APIs!). In that case, you might choose to publish observability data eagerly at the end of each invocation.

In part 2, we’ll look at some of the existing observability tools available to us, and how the aforementioned challenges should affect the way we evaluate them and when we should use them.
