Yubl’s road to Serverless architecture — ops

part 1: overview

part 2: testing and CI/CD

part 3: ops <- you’re here

part 4: building a scalable notification system

part 5: building a better recommendation system


A couple of folks asked me about our strategy for monitoring, logging, etc. after part 2, and having watched Chris Swan talk about “Serverless Operations is not a Solved Problem” at the Serverless meetup, it’s a good time for us to talk about how we approached ops with AWS Lambda.

NoOps != No Ops

The notion of “NoOps” has often been mentioned alongside serverless technologies (I have done it myself), but it doesn’t mean that you no longer have to worry about ops.

To me, “ops” is the umbrella term for everything related to keeping my systems operational and performing within acceptable parameters, including (but not limited to) resource provisioning, configuration management, monitoring and being on hand to deal with any live issues. The responsibilities of keeping the systems up and running will always exist, regardless of whether your software is running on VMs in the cloud, on-premise hardware, or as small Lambda functions.

Within your organization, someone needs to fulfill these responsibilities. It might be that you have a dedicated ops team, or perhaps your developers will share those responsibilities.

NoOps, to me, means no ops specialization in my organization — i.e. no dedicated ops team — because the skills and effort required to fulfill the ops responsibilities do not justify the need for such specialization. As an organization it’s in your best interest to delay such specialization for as long as you can, both from a financial point of view and also, perhaps more importantly, because Conway’s law tells us that having an ops team is the surefire way to end up with a set of operational procedures/processes, tools and infrastructure whose complexity will in turn justify the existence of said ops team.

At Yubl, as we migrated to a serverless architecture our deployment pipeline became more streamlined and our toolchain became simpler; we found less and less need for a dedicated ops team and were in the process of disbanding it altogether.

Logging

Whenever you write to stdout from your Lambda function — eg. when you do console.log in your nodejs code — it ends up in the function’s Log Group in CloudWatch Logs.

Centralised Logging

However, CloudWatch Logs is not easily searchable, and once you have a dozen Lambda functions you will want to collect your logs in one central place. The ELK stack is the de facto standard for centralised logging these days; you can run your own ELK stack on EC2, and elastic.co also offers a hosted version of Elasticsearch and Kibana.

To ship your logs from CloudWatch Logs to ELK you can subscribe the Log Group to a cloudwatch-logs-to-elk function that is responsible for shipping the logs.

You can subscribe the Log Group manually via the AWS management console.

But you don’t want a manual step everyone needs to remember every time they create a new Lambda function. Instead, it’s better to set up a rule in CloudWatch Events to invoke a subscribe-log-group Lambda function that sets up the subscription for new Log Groups.

2 things to keep in mind (see the sketch after this list):

  • lots of services create logs in CloudWatch Logs, so you’d want to filter Log Groups by name; Lambda function logs have the prefix /aws/lambda/
  • don’t subscribe the Log Group for the cloudwatch-logs-to-elk function (or whatever you decide to call it), otherwise you create an infinite loop where its own logs trigger the function again and produce more logs, and so on
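
To make this concrete, here’s a minimal sketch (not our exact code) of what such a subscribe-log-group function might look like, assuming the CloudWatch Events rule is backed by CloudTrail and fires on the CreateLogGroup API call, and that the shipper function’s ARN is supplied via an environment variable (granting CloudWatch Logs permission to invoke the destination function is omitted here):

'use strict';

const AWS = require('aws-sdk');
const cloudWatchLogs = new AWS.CloudWatchLogs();

// ARN of the cloudwatch-logs-to-elk function (assumed to be passed in as an env var)
const destinationArn = process.env.LOGS_TO_ELK_ARN;

module.exports.handler = (event, context, callback) => {
  // the CloudTrail-backed CloudWatch Event carries the new log group's name
  const logGroupName = event.detail.requestParameters.logGroupName;

  // only subscribe Lambda function logs (they're prefixed with /aws/lambda/)...
  if (!logGroupName.startsWith('/aws/lambda/')) {
    return callback(null, 'skipped');
  }

  // ...and never the shipper function itself, to avoid an infinite loop
  if (logGroupName.endsWith('cloudwatch-logs-to-elk')) {
    return callback(null, 'skipped');
  }

  const params = {
    logGroupName: logGroupName,
    filterName: 'ship-to-elk',
    filterPattern: '', // an empty pattern matches every log event
    destinationArn: destinationArn
  };

  cloudWatchLogs.putSubscriptionFilter(params, callback);
};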

Distributed Tracing

Having all your logs in one easily searchable place is great, but as your architecture expands into more and more services that depend on one another, you will need to correlate logs from different services to understand all the events that occurred during one user request.

For instance, when a user creates a new post in the Yubl app we distribute the post to all of the user’s followers. Many things happen along this flow:

  1. user A’s client calls the legacy API to create the new post
  2. the legacy API fires a yubl-posted event into a Kinesis stream
  3. the distribute-yubl function is invoked to handle this event
  4. the distribute-yubl function calls the relationship-api to find user A’s followers
  5. the distribute-yubl function then performs some business logic, groups user A’s followers into batches and fires a message to an SNS topic for each batch
  6. the add-to-feed function is invoked for each SNS message and adds the new post to each follower’s feed

If one of user A’s followers didn’t receive the new post in their feed, the problem can lie in a number of different places. To make such investigations easier we need to be able to see all the relevant logs in chronological order, and that’s where correlation IDs (eg. initial request-id, user-id, yubl-id, etc.) come in.

Because the handling of the initial user request flows through API calls, Kinesis events and SNS messages, the correlation IDs also need to be captured and passed through API calls, Kinesis events and SNS messages.

Our approach was to roll our own client libraries which will pass the captured correlation IDs along.

Capturing Correlation IDs

All of our Lambda functions are created with wrappers that wrap your handler code with additional goodness, such as capturing the correlation IDs into a global.CONTEXT object (which works because nodejs is single-threaded).
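
As a rough sketch (our actual wrapper did more than this, and the header names here are illustrative), capturing the correlation IDs might look something like this:

// a simplified sketch of the wrapper; header names are illustrative
function wrap(handler) {
  return (event, context, callback) => {
    const headers = event.headers || {};

    // capture correlation IDs into a global object; this is safe because
    // nodejs is single-threaded and Lambda handles one invocation at a time
    global.CONTEXT = {
      'x-correlation-id': headers['x-correlation-id'] || context.awsRequestId,
      'x-correlation-user-id': headers['x-correlation-user-id']
    };

    return handler(event, context, callback);
  };
}

// usage: wrap your handler so correlation IDs are captured before it runs
module.exports.handler = wrap((event, context, callback) => {
  // ... your handler code, with correlation IDs available via global.CONTEXT
  callback(null, { statusCode: 200 });
});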

Forwarding Correlation IDs

Our HTTP client library is a thin wrapper around the superagent HTTP client and injects the captured correlation IDs into outgoing HTTP headers.
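
A simplified sketch of what that might look like (the real library handled more HTTP verbs and error cases):

const superagent = require('superagent');

// a thin GET wrapper that forwards the captured correlation IDs as HTTP headers
function get(url) {
  const request = superagent.get(url);
  const correlationIds = global.CONTEXT || {};

  Object.keys(correlationIds)
    .filter(key => correlationIds[key]) // skip IDs that weren't captured
    .forEach(key => request.set(key, correlationIds[key]));

  return request;
}

module.exports = { get };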

We also have a client library for publishing Kinesis events, which can inject the correlation IDs into the record payload.
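
Again, as a sketch, embedding the correlation IDs in the record payload might look like this (the payload field name is just an assumption):

const AWS = require('aws-sdk');
const kinesis = new AWS.Kinesis();

// publish an event to Kinesis with the captured correlation IDs embedded in the payload
function putEvent(streamName, partitionKey, event, callback) {
  const payload = Object.assign({}, event, {
    __correlation_ids: global.CONTEXT || {} // illustrative field name
  });

  const params = {
    StreamName: streamName,
    PartitionKey: partitionKey,
    Data: JSON.stringify(payload)
  };

  kinesis.putRecord(params, callback);
}

module.exports = { putEvent };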

For SNS, you can include the correlation IDs as message attributes when publishing a message.
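
For example (another sketch, with illustrative names):

const AWS = require('aws-sdk');
const sns = new AWS.SNS();

// publish a message to SNS with correlation IDs attached as message attributes
function publish(topicArn, message, callback) {
  const correlationIds = global.CONTEXT || {};
  const messageAttributes = {};

  Object.keys(correlationIds)
    .filter(key => correlationIds[key])
    .forEach(key => {
      messageAttributes[key] = {
        DataType: 'String',
        StringValue: String(correlationIds[key])
      };
    });

  const params = {
    TopicArn: topicArn,
    Message: JSON.stringify(message),
    MessageAttributes: messageAttributes
  };

  sns.publish(params, callback);
}

module.exports = { publish };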

Zipkin and Amazon X-Ray

Since then, AWS has announced X-Ray, but it’s still in preview so I have not had a chance to see how it works in practice, and it doesn’t support Lambda at the time of writing.

There is also Zipkin, but it requires you to run additional infrastructure on EC2, and whilst it has a wide range of support for instrumentation, the path to adoption in a serverless environment (where you don’t have or need traditional web frameworks) is not clear to me.

Monitoring

Out of the box you get a number of basic metrics from CloudWatch — invocation counts, durations, errors, etc.

You can also publish custom metrics to CloudWatch (eg. user created, post viewed) using the AWS SDK. However, since these are HTTP calls, you have to be conscious of the latencies they’ll add to user-facing functions (ie. those serving APIs). You can mitigate the added latencies by publishing the metrics in a fire-and-forget fashion, and/or budgeting the amount of time (say, a max of 50ms) you can spend publishing metrics at the end of a request.
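
Here’s a sketch of publishing a count metric with the AWS SDK (the namespace and metric names are illustrative); for user-facing functions you would fire this off without waiting for it, or give yourself a small time budget to wait at the end of the request:

const AWS = require('aws-sdk');
const cloudWatch = new AWS.CloudWatch();

// publish a single count metric to CloudWatch; returns a promise so the caller
// can decide whether to wait for it (background functions) or not (API functions)
function recordCount(metricName, count) {
  const params = {
    Namespace: 'Yubl', // illustrative namespace
    MetricData: [{
      MetricName: metricName,
      Unit: 'Count',
      Value: count
    }]
  };

  return cloudWatch.putMetricData(params).promise();
}

module.exports = { recordCount };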

Because you have to do everything during the invocation of a function, it forces you to make trade-offs.

Another approach is to take a leaf from Datadog’s book and use special log messages and process them after the fact. For instance, if you write logs in the format MONITORING|epoch_timestamp|metric_value|metric_type|metric_name like below:

console.log("MONITORING|1489795335|27.4|latency|user-api-latency");

console.log("MONITORING|1489795335|8|count|yubls-served");

then you can process these log messages (see the Logging section above) and publish them as metrics instead. With this approach you’ll be trading off liveness of metrics for less API latency overhead.
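
As a sketch, the function that receives the CloudWatch Logs subscription events (delivered as base64-encoded, gzipped JSON) might parse these messages and republish them as CloudWatch metrics like so:

const zlib = require('zlib');
const AWS = require('aws-sdk');
const cloudWatch = new AWS.CloudWatch();

module.exports.handler = (event, context, callback) => {
  // CloudWatch Logs delivers log events as base64-encoded, gzipped JSON
  const buffer = Buffer.from(event.awslogs.data, 'base64');
  const payload = JSON.parse(zlib.gunzipSync(buffer).toString('utf8'));

  const metricData = payload.logEvents
    .filter(e => e.message.includes('MONITORING|'))
    .map(e => {
      // console.log output is prefixed with a timestamp and request id, so
      // take everything from 'MONITORING|' onwards before splitting
      const parts = e.message
        .substr(e.message.indexOf('MONITORING|'))
        .trim()
        .split('|');

      return {
        MetricName: parts[4],
        Timestamp: new Date(parseInt(parts[1], 10) * 1000),
        Unit: parts[3] === 'latency' ? 'Milliseconds' : 'Count',
        Value: parseFloat(parts[2])
      };
    });

  if (metricData.length === 0) {
    return callback(null, 'no metrics found');
  }

  // illustrative namespace
  cloudWatch.putMetricData({ Namespace: 'Yubl', MetricData: metricData }, callback);
};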

Of course, you can employ both approaches in your architecture and use the appropriate one for each situation:

  • for functions on the critical path (that will directly impact the latency your users experience), choose the approach of publishing metrics as special log messages;
  • for other functions (cron jobs, kinesis processors, etc.) where invocation duration doesn’t significantly impact a user’s experience, publish metrics as part of the invocation

Dashboards + Alerts

We have a number of dashboards set up in CloudWatch as well as Graphite (using hostedgraphite, for our legacy stack running on EC2), and they’re displayed on large monitors near the server team area. We also set up alerts against various metrics such as API latencies and error counts, and have OpsGenie set up to alert whoever is on call that week.

Consider alternatives to CloudWatch

Whilst CloudWatch is a good, cost-effective solution for monitoring (and in some cases the only way to get metrics out of AWS services such as Kinesis and DynamoDB), it has its drawbacks.

Its UI and customization are not on par with competitors such as Graphite, Datadog and Sysdig, and it lacks advanced features such as anomaly detection and finding correlations, which you find in Stackdriver and Wavefront.

The biggest limitation, however, is that CloudWatch metrics are only granular to the minute. It means your time to discovery of issues is measured in minutes (you need a few data points to separate real issues that require manual intervention from temporary blips), and consequently your time to recovery is likely to be measured in tens of minutes. As you scale up and the cost of unavailability goes up, you need to invest effort to cut down both, which might mean you need more granular metrics than CloudWatch is able to give you.

Another good reason for not using CloudWatch is that you really don’t want your monitoring system to fail at the same time as the system it monitors. Over the years we have experienced a number of AWS outages that impacted both our core systems running on EC2 and CloudWatch itself. As our system failed and recovered, we didn’t have the visibility to see what was happening and how it was impacting our users.

Config Management

Whatever approach you use for config management, you should always ensure that:

  • sensitive data (eg. credentials, connection strings) is encrypted in flight and at rest
  • access to sensitive data is based on roles
  • you can easily and quickly propagate config changes

Nowadays, you can add environment variables to your Lambda functions directly and have them encrypted with KMS.

This was the approach we started with, albeit using environment variables in the Serverless framework since it wasn’t a feature of the Lambda service at the time. Once we had a dozen functions that shared config values (eg. MongoDB connection strings), this approach became cumbersome, and it was laborious and painfully slow to propagate config changes manually (by updating and re-deploying every function that required the updated config value).

It was at this point in our evolution that we moved to a centralised config service. Having considered Consul (which I know a lot of folks use), we decided to write our own using API Gateway, Lambda and DynamoDB because:

  • we don’t need many of Consul’s features, only the KV store
  • Consul is another thing we’d have to run and manage
  • Consul is another thing we’d have to learn
  • even running Consul with 2 nodes (you need some redundancy in production), it is still an order of magnitude more expensive

Sensitive data is encrypted (by a developer) using KMS and stored in the config API in its encrypted form. When a Lambda function starts up, it asks the config API for the config values it needs and uses KMS to decrypt the encrypted blobs.
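
To illustrate the startup flow (a sketch with made-up endpoint and environment variable names, not our actual client, and assuming every requested value is a KMS-encrypted blob):

const AWS = require('aws-sdk');
const superagent = require('superagent');
const kms = new AWS.KMS();

// fetch the named config values from the config API and decrypt the KMS-encrypted blobs
function loadConfig(keys) {
  return superagent
    .get(process.env.CONFIG_API_URL)              // hypothetical config API endpoint
    .set('x-api-key', process.env.CONFIG_API_KEY) // API Gateway API key
    .query({ keys: keys.join(',') })
    .then(res => {
      const decryptAll = keys.map(key => {
        const encrypted = res.body[key];
        return kms
          .decrypt({ CiphertextBlob: Buffer.from(encrypted, 'base64') })
          .promise()
          .then(data => ({ key: key, value: data.Plaintext.toString('utf8') }));
      });

      return Promise.all(decryptAll);
    })
    .then(pairs => {
      const config = {};
      pairs.forEach(pair => { config[pair.key] = pair.value; });
      return config;
    });
}

module.exports = { loadConfig };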

We secured access to the config API with API keys created in API Gateway; in the event these keys are compromised, attackers would be able to update config values via this API. You can take this a step further (which we didn’t get around to in the end) by securing the POST endpoint with IAM roles, which would require developers to make signed requests to update config values.

Attackers can still retrieve sensitive data in its encrypted form, but they will not be able to decrypt it, as KMS also requires role-based access.

Client Library

As most of our Lambda functions need to talk to the config API, we invested effort into making our client library really robust, baking in caching support and periodic polling to refresh config values from the source.
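
The caching and refresh behaviour can be sketched roughly like this (the refresh interval and module path are arbitrary choices for illustration):

// cache config values in the container and refresh them periodically from the config API
const loadConfig = require('./load-config').loadConfig; // the function from the previous sketch

const REFRESH_INTERVAL_MS = 3 * 60 * 1000; // arbitrary 3 minute refresh interval

let cachedConfig = null;
let lastRefreshed = 0;

function getConfig(keys) {
  const now = Date.now();

  // serve from cache until the refresh interval has elapsed
  if (cachedConfig && now - lastRefreshed < REFRESH_INTERVAL_MS) {
    return Promise.resolve(cachedConfig);
  }

  return loadConfig(keys).then(config => {
    cachedConfig = config;
    lastRefreshed = now;
    return config;
  });
}

module.exports = { getConfig };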

 

So, that’s it folks, hope you’ve enjoyed this post.

The emergence of AWS Lambda and other serverless technologies has significantly simplified the skills and tools required to fulfil the ops responsibilities inside an organization. However, this new paradigm has also introduced new limitations and challenges for existing toolchains, and requires us to come up with new answers. Things are changing at an incredibly fast pace, and I for one am excited to see what new practices and tools emerge from this space!

Like what you’re reading? Check out my video course Production-Ready Serverless and learn the essentials of how to run a serverless application in production.

We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we’re in Manning’s Early Access Program (MEAP).