Serverless observability, what can you use out of the box?

part 1 : new chal­lenges to observ­abil­i­ty

part 2 : 1st par­ty observ­abil­i­ty tools from AWS <- you are here

part 3 : 3rd par­ty observ­abil­i­ty tools

part 4: the future of Server­less observ­abil­i­ty


In part 1 we talked about the chal­lenges server­less brings to the table. In this post, let’s look at 1st par­ty tools from AWS and see how they mea­sure up against these new chal­lenges:

  • nowhere to install agents/daemons
  • no back­ground pro­cess­ing
  • higher concurrency hitting the observability system
  • high chance of data loss when not send­ing eager­ly
  • hard to trace asynchronous invocations

But before we start, let’s remind our­selves of the 4 pil­lars of observ­abil­i­ty.

Out of the box we get a bunch of tools pro­vid­ed by AWS itself:

  • Cloud­Watch for mon­i­tor­ing, alert­ing and visu­al­iza­tion
  • Cloud­Watch Logs for logs
  • X-Ray for dis­trib­uted trac­ing
  • Amazon Elasticsearch for log aggregation

CloudWatch Logs

Whenever you write to stdout, those outputs are captured by the Lambda service and sent to CloudWatch Logs as logs. This is one of the few forms of background processing you get, as it’s provided by the platform.

All the log mes­sages (tech­ni­cal­ly they’re referred to as events) for a giv­en func­tion would appear in Cloud­Watch Logs under a sin­gle Log Group.

As part of a Log Group, you have many Log Streams. Each con­tains the logs from one con­cur­rent exe­cu­tion (or con­tain­er) of your func­tion, so there’s a one-to-one map­ping.
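Since anything written to stdout ends up in CloudWatch Logs, logging from a function can be as simple as printing. Here is a minimal sketch, assuming a Python function; the `log` helper, the `handler` name and the JSON field names are all illustrative choices of mine, not an AWS API.

```python
import json
import sys
import time


def log(level, message, **fields):
    """Write one JSON object to stdout and return it.

    On Lambda, output written to stdout is captured by the platform and
    sent to CloudWatch Logs, so no agent or log shipper is needed here.
    """
    record = {
        "level": level,
        "message": message,
        "timestamp_ms": int(time.time() * 1000),
    }
    record.update(fields)
    sys.stdout.write(json.dumps(record) + "\n")
    return record


def handler(event, context):
    # Hypothetical handler: logs one structured line per invocation.
    log("INFO", "order received", order_id=event.get("order_id"))
    return {"statusCode": 200}
```

Writing JSON rather than free text is a design choice: it tends to make the messages easier to filter and parse once they leave CloudWatch Logs.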

So that’s all well and good, but it’s not easy to search for log mes­sages in Cloud­Watch Logs. There’s cur­rent­ly no way to search the logs for mul­ti­ple func­tions at once. Whilst AWS has been improv­ing the ser­vice, it still pales in com­par­i­son to oth­er alter­na­tives on the mar­ket.

It might suf­fice as you start out, but you’ll prob­a­bly find your­self in need of some­thing more soon after.

For­tu­nate­ly, it’s straight­for­ward to get your logs out of Cloud­Watch Logs.

You can stream them to Amazon’s hosted Elasticsearch service. But don’t expect it to be a like-for-like experience with your self-hosted ELK stack. Liz Bennett wrote a detailed post on some of the problems they ran into when using Amazon Elasticsearch at scale. Please give that a read if you’re thinking about adopting Amazon Elasticsearch.

Alternatively, you can stream the logs to a Lambda function, and ship them to a log aggregation service of your choice. I won’t go into detail here, as I have written about it at length previously; go and read this post instead.

You can stream logs from Cloud­Watch Logs to just about any log aggre­ga­tion ser­vice, via Lamb­da.
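To give a rough idea of what such a log-shipping function looks like, here is a minimal sketch in Python. CloudWatch Logs delivers subscription data base64-encoded and gzip-compressed under `event["awslogs"]["data"]`. The `ship` function is a hypothetical placeholder for whatever client your log aggregation service provides, and the output field names are my own.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs sends to a subscribed function."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_shippable_records(payload):
    """Flatten the decoded payload into one dict per log event."""
    return [
        {
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        }
        for log_event in payload["logEvents"]
    ]


def ship(records):
    # Hypothetical placeholder: swap in a call to the log aggregation
    # service of your choice. Here the records are only printed.
    for record in records:
        print(json.dumps(record))


def handler(event, context):
    payload = decode_subscription_event(event)
    # CloudWatch Logs also sends CONTROL_MESSAGE payloads to check that the
    # destination is reachable; they carry no application logs.
    if payload.get("messageType") != "DATA_MESSAGE":
        return 0
    records = to_shippable_records(payload)
    ship(records)
    return len(records)
```

Keeping the log group and log stream on every record means you can still tell which function, and which container, a message came from after it has left CloudWatch Logs.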

CloudWatch Metrics

With CloudWatch, you get some basic metrics out of the box: invocation count, error count, invocation duration, and so on. It’s all the basic telemetry about the health of a function.

But Cloud­Watch is miss­ing some valu­able data points, such as:

  • esti­mat­ed costs
  • con­cur­rent exe­cu­tions : Cloud­Watch does report this met­ric, but only for func­tions with reserved con­cur­ren­cy
  • cold starts
  • billed duration : Lambda reports this in CloudWatch Logs, at the end of every invocation. Because Lambda invocations are billed in 100ms blocks, a 102ms invocation would be billed for 200ms. It would be a useful metric to see alongside Invocation Duration to identify cost optimizations
  • memory usage : Lambda reports this in CloudWatch Logs too, but it’s not recorded as a CloudWatch metric

You get 6 basic metrics about the health of a function.

There are ways to record and track these metrics yourself; see this post on how to do that. Other providers like IOPipe (more on them in the next post) also report these data points out of the box.
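As an illustration of one such way (not necessarily the approach taken in the linked post), billed duration and memory usage can be recovered by parsing the REPORT line that Lambda writes to CloudWatch Logs at the end of every invocation. The sketch below assumes the REPORT line layout shown in the sample; the function names are my own. It also reproduces the rounding rule for 100ms billing blocks described above.

```python
import math
import re

# Matches the REPORT line Lambda writes at the end of an invocation, e.g.
# "REPORT RequestId: abc\tDuration: 102.25 ms\tBilled Duration: 200 ms\t..."
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>\d+) ms\s+"
    r"Memory Size: (?P<memory_size_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_used_mb>\d+) MB"
)


def parse_report_line(message):
    """Return the numbers in a REPORT line, or None for any other message."""
    match = REPORT_PATTERN.search(message)
    if match is None:
        return None
    return {
        "request_id": match["request_id"],
        "duration_ms": float(match["duration_ms"]),
        "billed_duration_ms": int(match["billed_ms"]),
        "memory_size_mb": int(match["memory_size_mb"]),
        "max_memory_used_mb": int(match["max_memory_used_mb"]),
    }


def billed_duration_ms(duration_ms, block_ms=100):
    """Round a duration up to the next billing block (100ms by default)."""
    return math.ceil(duration_ms / block_ms) * block_ms


sample = (
    "REPORT RequestId: abc-123\tDuration: 102.25 ms\t"
    "Billed Duration: 200 ms\tMemory Size: 128 MB\tMax Memory Used: 20 MB\t"
)
report = parse_report_line(sample)
```

A function like this could run inside the log-shipping Lambda, turning every REPORT line into custom metrics for billed duration and memory usage.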

You can set up Alarms in CloudWatch against any of these metrics. Here are a few metrics that you should consider alerting on:

  • throt­tled invo­ca­tions
  • region­al con­cur­rent exe­cu­tions : set thresh­old based on % of your cur­rent region­al lim­it
  • tail (95th or 99th percentile) latency against some acceptable threshold
  • 4xx and 5xx errors on API Gate­way
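The first three alarms in this list could be described roughly as follows. This is a sketch, assuming Python: each function only builds the parameters for CloudWatch's `PutMetricAlarm` API. The alarm names, periods, evaluation periods and the SNS topic ARN are illustrative assumptions that you would tune for your own system.

```python
def throttle_alarm(function_name, sns_topic_arn):
    """Alarm as soon as a function is throttled at all within a minute."""
    return {
        "AlarmName": f"{function_name}-throttles",
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def regional_concurrency_alarm(regional_limit, percent, sns_topic_arn):
    """Alarm when regional concurrency exceeds a percentage of the limit."""
    return {
        "AlarmName": "lambda-regional-concurrency",
        "Namespace": "AWS/Lambda",
        # No dimensions: this is the aggregate across the whole region.
        "MetricName": "ConcurrentExecutions",
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": regional_limit * percent / 100,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def tail_latency_alarm(function_name, threshold_ms, sns_topic_arn, percentile="p99"):
    """Alarm when tail latency stays above an acceptable threshold."""
    return {
        "AlarmName": f"{function_name}-{percentile}-duration",
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        # Percentiles use ExtendedStatistic instead of Statistic.
        "ExtendedStatistic": percentile,
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

Assuming boto3 is installed and credentials are configured, each dictionary can be passed on as `boto3.client("cloudwatch").put_metric_alarm(**params)`.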

And you can set up basic dashboards in CloudWatch too, at a tiny cost of $3 per month per dashboard (first 3 are free).

X-Ray

For dis­trib­uted trac­ing, you have X-Ray. To make the most of trac­ing, you should instru­ment your code to gain even bet­ter vis­i­bil­i­ty.

Like CloudWatch Logs, collecting traces does not add additional time to your function’s invocation. It’s background processing that the platform provides for you.

From the trac­ing data, X-Ray can also show you a ser­vice map like this one.

X-Ray gives you a lot of insight into the runtime performance of a function. However, its focus is narrowly on one function, and the distributed aspect is severely undercooked. As it stands, X-Ray doesn’t trace over API Gateway, or asynchronous invocations such as SNS or Kinesis.

It’s good for hom­ing in on per­for­mance issues for a par­tic­u­lar func­tion. But it offers lit­tle to help you build intu­ition about how your sys­tem oper­ates as a whole. For that, I need to step away from what hap­pens inside one func­tion, and be able to look at the entire call chain.

After all, when the engi­neers at Twit­ter were talk­ing about the need for observ­abil­i­ty, it wasn’t so much to help them debug per­for­mance issues of any sin­gle end­point, but to help them make sense of the behav­iour and per­for­mance of their sys­tem. A sys­tem that is essen­tial­ly one big, com­plex and high­ly con­nect­ed graph of ser­vices.

With Lamb­da, this graph is going to become a lot more com­plex, more sparse and more con­nect­ed because:

  • instead of one ser­vice with 5 end­points, you now have 5 func­tions
  • func­tions are con­nect­ed through a greater vari­ety of medi­ums — SNS, Kine­sis, API Gate­way, IoT, you name it
  • event-dri­ven archi­tec­ture has become the norm

Our trac­ing tools need to help us make sense of this graph. They need to help us visu­al­ize the con­nec­tions between our func­tions. And they need to help us fol­low data as it enters our sys­tem as a user request, and reach­es out to far cor­ners of this graph through both syn­chro­nous and asyn­chro­nous events.

And of course, X-Ray does not span non-AWS services such as Auth0, Google BigQuery, or Azure Functions.

But those of us deep in the serverless mindset see the world through SaaS-tinted glasses. We want to use the services that best address our needs, and glue them together with Lambda.

At Yubl, we used a number of non-AWS services from Lambda: Auth0, Google BigQuery, GrapheneDB, MongoLab, and Twilio, to name a few. And it was great; we didn’t have to be bound by what AWS offers.

My good friend Raj also gave a good talk at NDC on how he uses services from both AWS and Azure in his wine startup. You can watch his talk here.

And final­ly, I think of our sys­tem like a brain. Like a brain, our sys­tem is made up of:

  • neu­rons (func­tions)
  • synaps­es (con­nec­tions between func­tions)
  • and elec­tri­cal sig­nals (data) that flow through them

Like a brain, our sys­tem is alive, it’s con­stant­ly chang­ing and evolv­ing and it’s con­stant­ly work­ing! And yet, when I look at my dash­boards and my X-Ray traces, that’s not what I see. Instead, I see a tab­u­lat­ed list that does not reflect the move­ment of data and areas of activ­i­ty. It doesn’t help me build up any intu­itive under­stand­ing of what’s going on in my sys­tem.

A brain sur­geon wouldn’t accept this as the pri­ma­ry source of infor­ma­tion. How can they pos­si­bly use it to build a men­tal pic­ture of the brain they need to cut open and oper­ate on?

Tab­u­lat­ed view of the traces, and you have to man­u­al­ly press a refresh but­ton to see new traces that came in.

I should add that this is not a crit­i­cism of X-Ray, it is built the same way most observ­abil­i­ty tools are built.

But maybe our tools need to evolve beyond human-computer interfaces (HCI) that wouldn’t look out of place on a clipboard (the physical kind, if you’re old enough to have seen one!). It reminds me of one of Bret Victor’s seminal talks, Stop Drawing Dead Fish.

Netflix made great strides towards this idea of a live dashboard with Vizceral, which they have also kindly open sourced.

Conclusions

AWS pro­vides us with some decent tools out of the box. Whilst they each have their short­com­ings, they’re good enough to get start­ed with.

As 1st par­ty tools, they also enjoy home field advan­tages over 3rd par­ty tools. For exam­ple, Lamb­da col­lects logs and traces with­out adding to your func­tion invo­ca­tion time. Since we can’t access the serv­er any­more, 3rd par­ty tools can­not per­form any back­ground pro­cess­ing. Instead they have to resort to workarounds or are forced to col­lect data syn­chro­nous­ly.

However, as our serverless applications become more complex, these tools need to either evolve with us or be replaced in our stack. CloudWatch Logs, for instance, cannot search across multiple functions. It’s often the first piece that needs to be replaced once you have more than a dozen functions.

In the next post, we will look at some 3rd par­ty tools such as IOPipe, Dash­bird and Thun­dra. We will dis­cuss their val­ue-add propo­si­tion as well as their short­com­ings.
