AWS Lambda — 3 pro tips for working with Kinesis streams

At Yubl, we arrived at a non-trivial serverless architecture in which Lambda and Kinesis featured prominently.

Whilst our experience using Lambda with Kinesis was great in general, there were a couple of lessons we had to learn along the way. Here are 3 useful tips to help you avoid some of the pitfalls we fell into and accelerate your own adoption of Lambda and Kinesis.

Consider partial failures

From the Lambda documentation:

AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires…

Because of the way Lambda functions are retried, if you allow your function to err on partial failures then the default behavior is to retry the entire batch until it succeeds or the data expires from the stream.

To decide if this default behavior is right for you, you have to answer certain questions:

  • can events be processed more than once?
  • what if those partial failures are persistent? (perhaps due to a bug in the business logic that does not handle certain edge cases gracefully)
  • is it more important to process every event through to success than to keep the overall system real-time?

In the case of Yubl (which was a social networking app with a timeline feature similar to Twitter) we found that for most of our use cases it was more important to keep the system flowing than to halt processing for any failed events, even for a minute.

For instance, when you created a new post, we would distribute it to all of your followers by processing the yubl-posted event. The 2 basic choices we were presented with were:

  1. allow errors to bubble up and fail the invocation—we give every event every opportunity to be processed; but if some events fail persistently then no one will receive new posts in their feed and the system appears unavailable
  2. catch and swallow partial failures—failed events are discarded and some users will miss some posts, but the system appears to be running normally to users (even affected users might not realize that they had missed some posts)

(of course, it doesn’t have to be a binary choice; there’s plenty of room to add smarter handling for partial failures, which we will discuss shortly)

We encapsulated these 2 choices as part of our tooling so that we get the benefit of reusability and developers can make an explicit choice for every Kinesis processor they create (and the code makes that choice obvious to anyone reading it later on).
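
Below is a minimal sketch of what such a processor wrapper might look like, written here in Python purely for illustration (our own tooling was not necessarily structured this way); the process_all helper, the swallow_errors flag and distribute_to_followers are hypothetical names:

```python
import base64
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def process_all(records, handle_record, swallow_errors):
    """Process a batch of Kinesis records with an explicit partial-failure policy."""
    for record in records:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        try:
            handle_record(payload)
        except Exception:
            if swallow_errors:
                # choice 2: log and discard the failed event, keep the batch moving
                logger.exception("discarding failed event: %s", payload)
            else:
                # choice 1: let the error bubble up; Lambda retries the whole batch
                raise

def distribute_to_followers(payload):
    ...  # business logic for the yubl-posted event

def handler(event, context):
    # each processor makes its choice explicitly, so it is obvious to anyone reading it
    process_all(event["Records"], distribute_to_followers, swallow_errors=True)
```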

You would probably apply different choices depending on the problem you’re solving; the important thing is to always consider how partial failures would affect your system as a whole.

Use dead letter queues (DLQ)

AWS announced support for Dead Letter Queues (DLQ) at the end of 2016; however, at the time of writing this support only extends to asynchronous invocations (SNS, S3, IoT, etc.) and not to poll-based invocations such as Kinesis and DynamoDB streams.

That said, there’s nothing stopping you from applying the DLQ concept yourself.

First, let’s roll back the clock to a time when we didn’t have Lambda. Back then, we’d use long-running applications to poll Kinesis streams ourselves. Heck, I even wrote my own producer and consumer libraries because when AWS rolled out Kinesis they totally ignored anyone not running on the JVM!

Lambda has taken over a lot of the responsibilities—polling, tracking where you are in the stream, error handling, etc.—but as we have discussed above it doesn’t remove the need to think for yourself. Nor does it change what good looks like for a system that processes Kinesis events, which for me must have at least these 3 qualities:

  • it should be real-time (most domains consider real-time to mean “within a few seconds”)
  • it should retry failed events, but retries should not violate the real-time constraint on the system
  • it should be possible to retrieve events that could not be processed, so someone can investigate the root cause or intervene manually

Back then, my long-running application would:

  1. poll Kinesis for events
  2. process the events by passing them to a delegate function (your code)
  3. retry failed events 2 additional times
  4. after the 2 retries are exhausted, save them into an SQS queue
  5. record the last sequence number of the batch so that we don’t lose the current progress if the host VM dies or the application crashes
  6. another long-running application (perhaps on another VM) would poll the SQS queue for events that couldn’t be processed in real time
  7. process the failed events by passing them to the same delegate function as above (your code)
  8. after the max number of retries is exhausted, pass the events off to a DLQ
  9. this triggers CloudWatch alarms, and someone can manually retrieve the event from the DLQ to investigate

A Lambda function that processes Kinesis events should also:

  • retry failed events X times depending on processing time
  • send failed events to a DLQ after exhausting X retries

Since SNS already comes with DLQ support, you can simplify your setup by sending the failed events to an SNS topic instead—Lambda would then process them a further 3 times before passing them off to the designated DLQ.
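
Here is a minimal sketch of that hand-off in Python; the topic ARN environment variable, the retry count, and the try_process helper are all assumptions for illustration:

```python
import json
import os

import boto3

sns = boto3.client("sns")
FAILED_EVENTS_TOPIC = os.environ["FAILED_EVENTS_TOPIC_ARN"]  # assumed environment variable

def try_process(payload, handle_record, max_attempts=3):
    """Attempt a record a few times in-process, then hand it off to SNS."""
    for attempt in range(1, max_attempts + 1):
        try:
            handle_record(payload)
            return
        except Exception:
            if attempt == max_attempts:
                # retries exhausted: publish the event to SNS; the subscribing Lambda
                # gets the usual async retries and, if it still fails, the designated DLQ
                sns.publish(TopicArn=FAILED_EVENTS_TOPIC, Message=json.dumps(payload))
                # note: we do not re-raise, so the rest of the batch keeps moving
```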

Avoid “hot” streams

We found that when a Kinesis stream has 5 or more Lambda function subscribers we would start to see lots of ReadProvisionedThroughputExceeded errors in CloudWatch. Fortunately these errors are silent to us, as they happen to (and are handled by) the Lambda service polling the stream.

However, we occasionally saw spikes in the GetRecords.IteratorAge metric, which tells us that a Lambda function would sometimes lag behind. This did not happen frequently enough to present a problem, but the spikes were unpredictable and did not correlate with spikes in traffic or the number of incoming Kinesis events.
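
If you want to be told when a function starts lagging behind, you can alarm on the stream’s iterator age. Below is a minimal sketch using boto3; the alarm name, stream name, threshold, and SNS topic are all hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# alert if the oldest unread record in the stream is more than a minute old
cloudwatch.put_metric_alarm(
    AlarmName="yubl-posted-iterator-age",  # hypothetical alarm name
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "yubl-posted"}],  # hypothetical stream
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=60000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
)
```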

Increasing the number of shards in the stream made matters worse: the number of ReadProvisionedThroughputExceeded errors increased proportionally.

According to the Kinesis documentation:

Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second.

and the Lambda documentation:

If your stream has 100 active shards, there will be 100 Lambda functions running concurrently. Then, each Lambda function processes events on a shard in the order that they arrive.

One would assume that each of the aforementioned Lambda functions polls its shard independently, so with 5 or more subscribers the 5-reads-per-second-per-shard limit is quickly exhausted. Since the problem is having too many Lambda functions poll the same shard, and every subscriber polls every new shard too, it makes sense that adding shards will only escalate the problem further.

“All problems in computer science can be solved by another level of indirection.”

—David Wheeler

After speaking to the AWS support team about this, the only advice we received (and one that we had already considered) was to apply the fan-out pattern by adding another layer of Lambda functions to distribute the Kinesis events to the others.
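
A minimal sketch of such a relay function follows; the subscriber function names are hypothetical, and there are other ways to fan out (e.g. via per-subscriber streams):

```python
import json

import boto3

lambda_client = boto3.client("lambda")

# the real subscribers, now invoked by this relay instead of directly by Kinesis
SUBSCRIBERS = ["distribute-to-followers", "update-timeline", "index-for-search"]  # hypothetical

def handler(event, context):
    for function_name in SUBSCRIBERS:
        # InvocationType="Event" means asynchronous invocation, so each subscriber
        # gets its own retries and (optionally) its own DLQ; note the async payload
        # size limit applies to the whole batch being relayed
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",
            Payload=json.dumps(event),
        )
```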

Whilst this is simple to implement, it has some downsides:

  • it vastly complicates the logic for handling partial failures (see above)
  • all functions now process events at the rate of the slowest function, potentially compromising how real-time the system is

We also considered and discounted several other alternatives, including:

  • have one stream per subscriber—this has a significant cost implication, and more importantly it means publishers would need to publish the same event to multiple Kinesis streams in a “transaction” with no easy way to roll back on partial failures (since you can’t unpublish an event from Kinesis)
  • roll multiple subscribers’ logic into one—this erodes our service boundaries, as different subsystems are bundled together to artificially reduce the number of subscribers

In the end, we didn’t find a truly satisfying solution and decided to reconsider whether Kinesis was the right choice for our Lambda functions on a case-by-case basis:

  • for subsystems that do not have to be real-time, use S3 as the source instead—all our Kinesis events are persisted to S3 via Kinesis Firehose, and the resulting S3 files can then be processed by these subsystems, e.g. Lambda functions that stream events to Google BigQuery for BI (see the sketch after this list)
  • for work that is task-based (i.e. order is not important), use SNS/SQS as the source instead—SNS is natively supported by Lambda, and we implemented a proof-of-concept architecture for processing SQS events with recursive Lambda functions, with elastic scaling; now that SNS has DLQ support (it was not available at the time) it would definitely be the preferred option, provided that its degree of parallelism would not flood and overwhelm downstream systems such as databases
  • for everything else, continue to use Kinesis and apply the fan-out pattern only as an absolute last resort
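
As a sketch of the first option, here is roughly what a non-real-time subsystem reading the Firehose output from S3 might look like, assuming the producer wrote the records as newline-delimited JSON; the bucket wiring and forward_to_bigquery are hypothetical:

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # triggered by S3 object-created notifications on the Firehose delivery bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in body.splitlines():
            if line.strip():
                forward_to_bigquery(json.loads(line))

def forward_to_bigquery(kinesis_event):
    ...  # hypothetical: batch and stream the event to Google BigQuery for BI
```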

Wrapping up…

So there you have it, 3 pro tips from a group of developers who have had the pleasure of working extensively with Lambda and Kinesis.

I hope you find this post useful. If you have any interesting observations or lessons from your own experience working with Lambda and Kinesis, please share them in the comments section below.

Links

Yubl’s road to serverless — Part 1, Overview

Yubl’s road to serverless — Part 2, Testing and CI/CD

Yubl’s road to serverless — Part 3, Ops

AWS Lambda — use recursive functions to process SQS messages, Part 1

AWS Lambda — use recursive functions to process SQS messages, Part 2

Like what you’re reading? Check out my video course Production-Ready Serverless and learn the essentials of how to run a serverless application in production.

We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we’re in Manning’s Early Access Program (MEAP).