Centralised logging for AWS Lambda, REVISED (2018)

First of all, I would like to thank all of you for following and reading my content. My post on centralised logging for AWS Lambda has been viewed more than 20K times by now, so it is clearly a challenge that many of you have run into.

In the post, I outlined an approach of using a Lambda function to ship all your Lambda logs from CloudWatch Logs to a log aggregation service such as Logz.io.

In the demo project, I also included functions to:

  • auto-subscribe new log groups to the log-shipping function
  • auto-update the retention policy of new log groups to X number of days (the default is Never Expire, which has a long-term cost impact)
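As a rough illustration of those two functions, here is a minimal Python sketch of a handler that reacts to a new log group. It assumes the function is triggered by the CloudTrail CreateLogGroup event (delivered through CloudWatch Events), that `logs_client` is a boto3 CloudWatch Logs client, and that the log-shipping function already allows CloudWatch Logs to invoke it. The names `on_log_group_created`, `shipper_arn` and the filter name are hypothetical, and the 7-day default is only a placeholder for "X".

```python
RETENTION_DAYS = 7  # assumption: stands in for "X number of days"


def on_log_group_created(event, logs_client, shipper_arn,
                         retention_days=RETENTION_DAYS):
    """Subscribe a newly created log group and set its retention policy.

    `event` is assumed to be the CloudTrail CreateLogGroup event, where the
    new log group's name sits under detail.requestParameters.
    """
    log_group = event["detail"]["requestParameters"]["logGroupName"]

    # Send every log event in the new group to the log-shipping function.
    logs_client.put_subscription_filter(
        logGroupName=log_group,
        filterName="ship-logs",      # hypothetical filter name
        filterPattern="",            # an empty pattern matches all events
        destinationArn=shipper_arn,
    )

    # Replace the default Never Expire retention.
    logs_client.put_retention_policy(
        logGroupName=log_group,
        retentionInDays=retention_days,
    )
    return log_group
```

In a real handler you would create the client once with `boto3.client("logs")` and pass it in. Injecting the client like this also makes the function easy to test with a stub.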

This approach works well when you start out. However, you can run into some serious problems at scale.

Mind the concurrency

When processing CloudWatch Logs with a Lambda function, you need to be mindful of the number of concurrent executions it creates, because CloudWatch Logs is an asynchronous event source for Lambda.

When you have 100 functions running concurrently, they will each push logs to CloudWatch Logs. This in turn can trigger 100 concurrent executions of the log shipping function, which can potentially double the number of functions that are concurrently running in your region. Remember, there is a soft, regional limit of 1000 concurrent executions for all functions!

This means your log shipping function can cause cascading failures throughout your entire application. Critical functions can be throttled because too many executions are used to push logs out of CloudWatch Logs, which is not a good way to go down ;-)

You can set the Reserved Concurrency for the log shipping function to limit its maximum number of concurrent executions. However, you risk losing logs when the log shipping function is throttled.
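If you go down the Reserved Concurrency route, a minimal Python sketch might look like the following. It assumes `lambda_client` is a boto3 Lambda client. The helper name `cap_log_shipper` and the 10% default are my own illustrative choices, not a recommendation from the demo project.

```python
def cap_log_shipper(lambda_client, function_name, percent=10):
    """Reserve a fixed share of the regional limit for the log shipper.

    Reads the account's regional concurrency limit and reserves `percent`
    of it (at least 1) for `function_name`, so the shipper can never use
    more than that many concurrent executions.
    """
    settings = lambda_client.get_account_settings()
    regional_limit = settings["AccountLimit"]["ConcurrentExecutions"]

    reserved = max(1, regional_limit * percent // 100)

    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
    return reserved
```

With the default regional limit of 1000, `cap_log_shipper(client, "ship-logs")` would reserve 100 concurrent executions. Keep in mind the trade-off: once the cap is reached, further invocations of the shipper are throttled, and logs can be lost.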

You can also request a raise to the regional limit and make it so high that you don’t have to worry about throttling.

A better approach at scale is to use Kinesis

However, I would suggest that a better approach is to stream the logs from CloudWatch Logs to a Kinesis stream first. From there, a Lambda function can process the logs and forward them on to a log aggregation service.
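To make the processing step more concrete, here is a minimal Python sketch of a function that reads CloudWatch Logs data from a Kinesis event. CloudWatch Logs gzip-compresses each payload before writing it to the stream, and Lambda hands the record data over base64-encoded, so both layers need to be undone. The `ship` callback is a hypothetical stand-in for whatever client you use to forward events to your log aggregation service.

```python
import base64
import gzip
import json


def parse_kinesis_event(event):
    """Yield (log_group, log_stream, log_event) for every log event."""
    for record in event["Records"]:
        compressed = base64.b64decode(record["kinesis"]["data"])
        payload = json.loads(gzip.decompress(compressed))

        # CloudWatch Logs also sends CONTROL_MESSAGE records to check that
        # the destination is reachable; they carry no log data.
        if payload.get("messageType") != "DATA_MESSAGE":
            continue

        for log_event in payload["logEvents"]:
            yield payload["logGroup"], payload["logStream"], log_event


def handler(event, context, ship=print):
    """Forward each log event via `ship` and return how many were sent."""
    shipped = 0
    for log_group, log_stream, log_event in parse_kinesis_event(event):
        ship({
            "logGroup": log_group,
            "logStream": log_stream,
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        })
        shipped += 1
    return shipped
```

Defaulting `ship` to `print` keeps the sketch runnable on its own. A real shipper would batch the events and send them to the aggregation service instead.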

With this approach, you have control over the concurrency of the log shipping function. As the number of log events increases, you can increase the number of shards in the Kinesis stream. This would also increase the number of concurrent executions of the log shipping function.
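As a back-of-the-envelope illustration of sizing the stream, the Python sketch below estimates how many shards a given log volume needs and resizes the stream accordingly. It assumes the documented per-shard write limits of roughly 1 MB and 1,000 records per second, and that a single UpdateShardCount call cannot more than double or halve the current shard count. `kinesis_client` is assumed to be a boto3 Kinesis client, and the helper names are hypothetical.

```python
SHARD_BYTES_PER_SEC = 1_000_000  # approx. 1 MB/s write limit per shard
SHARD_RECORDS_PER_SEC = 1_000    # 1,000 records/s write limit per shard


def shards_needed(bytes_per_sec, records_per_sec):
    """Smallest shard count that fits both write limits (integer inputs)."""
    by_bytes = -(-bytes_per_sec // SHARD_BYTES_PER_SEC)        # ceiling division
    by_records = -(-records_per_sec // SHARD_RECORDS_PER_SEC)  # ceiling division
    return max(1, by_bytes, by_records)


def resize_stream(kinesis_client, stream_name, bytes_per_sec, records_per_sec):
    """Move the stream towards the shard count the log volume needs."""
    summary = kinesis_client.describe_stream_summary(StreamName=stream_name)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]
    wanted = shards_needed(bytes_per_sec, records_per_sec)

    # Stay within [half, double] of the current count for a single call.
    lower = -(-current // 2)
    target = min(max(wanted, lower), current * 2)

    if target != current:
        kinesis_client.update_shard_count(
            StreamName=stream_name,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )
    return target
```

For example, `shards_needed(2_500_000, 500)` gives 3, because the byte rate is the binding limit there.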

Take a look at this repo to see how it works. It has a nearly identical setup to the demo project for the previous post:

  • a set-retention function that automatically updates the retention policy for new log groups to 7 days
  • a subscribe function that automatically subscribes new log groups to a Kinesis stream
  • a ship-logs-to-logzio function that processes the log events from the above Kinesis stream and ships them to Logz.io
  • a process_all script to subscribe all existing log groups to the same Kinesis stream
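The repo has its own implementation of these, but to give a feel for what a process_all style script does, here is a minimal Python sketch under a few assumptions. `logs_client` is a boto3 CloudWatch Logs client, `role_arn` is an IAM role that lets CloudWatch Logs put records into the stream, and only log groups under the `/aws/lambda/` prefix are of interest. The function and filter names are hypothetical.

```python
def subscribe_all(logs_client, stream_arn, role_arn, prefix="/aws/lambda/"):
    """Subscribe every existing log group under `prefix` to a Kinesis stream."""
    paginator = logs_client.get_paginator("describe_log_groups")
    subscribed = []

    for page in paginator.paginate(logGroupNamePrefix=prefix):
        for group in page["logGroups"]:
            name = group["logGroupName"]
            logs_client.put_subscription_filter(
                logGroupName=name,
                filterName="ship-logs-to-kinesis",  # hypothetical filter name
                filterPattern="",                   # match every log event
                destinationArn=stream_arn,
                roleArn=role_arn,  # required when the destination is Kinesis
            )
            subscribed.append(name)

    return subscribed
```

Unlike subscribing a log group directly to a Lambda function, a Kinesis destination needs the `roleArn` so that CloudWatch Logs has permission to write to the stream.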

You should also check out this post to see how you can autoscale Kinesis streams using CloudWatch and Lambda.
