What is the best event source for doing pub-sub with AWS Lambda?

AWS offers a wealth of options for imple­ment­ing mes­sag­ing pat­terns such as pub-sub with Lamb­da, let’s com­pare and con­trast some of these options.

The pub-sub pattern

Publish-Subscribe (often short­ened to pub-sub) is a mes­sag­ing pat­tern where pub­lish­ers and sub­scribers are decou­pled through an inter­me­di­ary bro­ker (ZeroMQ, Rab­bit­MQ, SNS, etc.).

From Wikipedia, https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern

SNS + Lambda

In the AWS ecosys­tem, the obvi­ous can­di­date for the bro­ker role is SNS.

SNS will make 3 attempts for your func­tion to process a mes­sage before send­ing it to a Dead Let­ter Queue (DLQ) if a DLQ is spec­i­fied for the func­tion. How­ev­er, accord­ing to an analy­sis by the folks at Ops­Ge­nie, the no. of retries can be as many as 6.

Anoth­er thing to con­sid­er is the degree of par­al­lelism this set­up offers. For each mes­sage SNS will cre­ate a new invo­ca­tion of your func­tion. So if you pub­lish 100 mes­sages to SNS then you can have 100 con­cur­rent exe­cu­tions of the sub­scribed Lamb­da func­tion.

This is great if you’re opti­mis­ing for through­put.

How­ev­er, we’re often con­strained by the max through­put our down­stream depen­den­cies can han­dle — data­bas­es, S3, internal/external ser­vices, etc.

If the burst in through­put is short then there’s a good chance the retries would be suf­fi­cient (there’s a ran­domised, expo­nen­tial back off between retries too) and you won’t miss any mes­sages.

Erred mes­sages are retried 2 times with expo­nen­tial back off. If the burst is short-lived then the retry is like­ly to suc­ceed, result­ing in no mes­sage loss.

If the burst in through­put is sus­tained over a long peri­od of time, then you can exhaust the max no. of retries. At this point you’ll have to rely on the DLQ and pos­si­bly human inter­ven­tion in order to recov­er the mes­sages that couldn’t be processed the first time round.

Erred mes­sages are retried 2 times with expo­nen­tial back off. But the burst in mes­sage rate over­laps with the retries, fur­ther exas­per­at­ing the prob­lem and even­tu­al­ly the max no. of retries are exhaust­ed and erred mes­sages have to be deliv­ered to the DLQ instead (if one is spec­i­fied).

Sim­i­lar­ly, if the down­stream depen­den­cy expe­ri­ences an out­age then all mes­sages received and retried dur­ing the out­age are bound to fail.

Any mes­sage received or retried dur­ing the down­stream mes­sage will fail and be sent to the DLQ.

You can also run into Lamb­da lim­it on no. of con­cur­rent exe­cu­tions in a region. Since this is an account wide lim­it, it will also impact your oth­er sys­tems that rely on AWS Lamb­da — APIs, event pro­cess­ing, cron jobs, etc.

Kinesis Streams + Lambda

Kine­sis Streams dif­fer from SNS in many ways:

  • Lamb­da polls Kine­sis for records up to 5 times a sec­ond, where­as SNS would push mes­sages to Lamb­da
  • records are received in batch­es (up to your spec­i­fied max­i­mum), SNS invokes your func­tion with one mes­sage
  • if your func­tion returns an error or times out, then you’ll keep receiv­ing the same batch of records until you either suc­cess­ful­ly process them or the data are no longer avail­able in the stream
  • the degree of par­al­lelism is deter­mined by the no. of shards in the stream as there is one ded­i­cat­ed invo­ca­tion per shard
  • Kine­sis Streams are charged based on no. of records pushed to the stream; shard hours, and whether or not you enable extend­ed reten­tion

SNS is prone to suf­fer from tem­po­ral issues — bursts in traf­fic, down­stream out­age, etc. Kine­sis on the oth­er hand deals with these issues much bet­ter.

  • degree of par­al­lelism is con­strained by no. of shards, which can be used to amor­tise bursts in mes­sage rate
Bursts in mes­sage rate is amor­tised, as the max through­put is deter­mined by no. of shards * max batch size * 5 reads per sec­ond. Which gives you two levers to adjust the max through­put with.
  • records are retried until suc­cess, unless the out­age lasts longer than the reten­tion pol­i­cy you have on the stream (default is 24 hours) you will even­tu­al­ly be able to process the records
The impact of a down­stream out­age is absorbed by the retry-until-suc­cess invo­ca­tion pol­i­cy.

But Kine­sis Streams is not with­out its own prob­lems. In fact, from my expe­ri­ence using Kine­sis Streams with Lamb­da I have found a no. of caveats that we need­ed to under­stand in order to make effec­tive use of them.

You can read about these caveats here.

 There are also sev­er­al oper­a­tional con­sid­er­a­tions to take into account:

  • because Kine­sis Streams is charged (in part) based on shard hours, so a dor­mant stream would have a base­line cost of $0.015 per shard per hour (~$11 per shard per month)
  • there is no built-in auto-scal­ing capa­bil­i­ty for Kine­sis Streams nei­ther, so there is also addi­tion­al man­age­ment over­head for scal­ing them up based on uti­liza­tion

It is pos­si­ble to build auto-scal­ing capa­bil­i­ty your­self, which I had done at my pre­vi­ous (failed) start­up. Whilst I can’t share the code you can read about the approach and my design think­ing here.

Inter­est­ing­ly, Kine­sis Streams is not the only stream­ing option avail­able on AWS, there is also DynamoDB Streams.


DynamoDB Streams + Lambda

DynamoDB Streams can be used as a like-for-like replace­ment for Kine­sis Streams.

By and large, DynamoDB Streams + Lamb­da works the same way as Kine­sis Streams + Lamb­da. Oper­a­tional­ly, it does have some inter­est­ing twists:

  • DynamoDB Streams auto-scales the no. of shards
  • if you’re pro­cess­ing DynamoDB Streams with AWS Lamb­da then you don’t pay for the reads from DynamoDB Streams (but you still pay for the read & write capac­i­ty units for the DynamoDB table itself)

  • Kine­sis Streams offers the option to extend data reten­tion to 7 days; DynamoDB Streams doesn’t offer such option

The fact that DynamoDB Streams auto-scales the no. of shards can be a dou­ble-edged sword. On one hand it elim­i­nates the need for you to man­age and scale the stream (or come up with home baked auto-scal­ing solu­tion); on the oth­er hand, it can also dimin­ish the abil­i­ty to amor­tize spikes in load you pass on to down­stream sys­tems.

AFAIK there is no way to lim­it the no. of shards a DynamoDB stream can scale up to — some­thing you’d sure­ly con­sid­er when imple­ment­ing your own auto-scal­ing solu­tion.

Should I use Kinesis or DynamoDB Streams?

I think the most per­ti­nent ques­tion is “what is your source of truth?”

Does a row being writ­ten in DynamoDB make it canon to the state of your sys­tem? This is cer­tain­ly the case in most N-tier sys­tems that are built around a data­base, regard­less whether it’s RDBMS or NoSQL.

In an event sourced sys­tem where state is mod­elled as a sequence of events (as opposed to a snap­shot) the source of truth might well be the Kine­sis stream — as soon as an event is writ­ten to the stream it’s con­sid­ered canon to the state of the sys­tem.

Then, there’re oth­er con­sid­er­a­tions around cost, auto-scal­ing, etc.

From a devel­op­ment point of view, DynamoDB Streams also has some lim­i­ta­tions & short­com­ing:

  • each stream is lim­it­ed to events from one table
  • the records describe DynamoDB events and not events from your domain, which I always felt cre­ates a sense of dis­so­nance when I’m work­ing with these events

Cost Implication of your Broker choice

Exclud­ing the cost of Lamb­da invo­ca­tions for pro­cess­ing the mes­sages, here are some cost pro­jec­tions for using SNS vs Kine­sis Streams vs DynamoDB Streams as the bro­ker. I’m mak­ing the assump­tion that through­put is con­sis­tent, and that each mes­sage is 1KB in size.

month­ly cost at 1 msg/s 

month­ly cost at 1,000 msg/s 

These pro­jec­tions should not be tak­en at face val­ue. For starters, the assump­tions about a per­fect­ly con­sis­tent through­put and mes­sage size is unre­al­is­tic, and you’ll need some head­room with Kine­sis & DynamoDB Streams even if you’re not hit­ting the throt­tling lim­its.

That said, what these pro­jec­tions do tell me is that:

  1. you get an awful lot with each shard in Kine­sis Streams
  2. whilst there’s a base­line cost for using Kine­sis Streams, the cost grows much slow­er with scale com­pared to SNS and DynamoDB Streams, thanks to the sig­nif­i­cant­ly low­er cost per mil­lion requests

Stacking it up

Whilst SNS, Kine­sis & DynamoDB Streams are your basic choic­es for the bro­ker, the Lamb­da func­tions can also act as bro­kers in their own right and prop­a­gate events to oth­er ser­vices.

This is the approach used by the aws-lamb­da-fanout project from awslabs. It allows you to prop­a­gate events from Kine­sis and DynamoDB Streams to oth­er ser­vices that can­not direct­ly sub­scribe to the 3 basic choice of bro­kers either because account/region lim­i­ta­tions, or that they’re just not sup­port­ed.

The aws-lamb­da-fanout project from awslabs prop­a­gates events from Kine­sis and DynamoDB Streams to oth­er ser­vices across mul­ti­ple accounts and regions.

Whilst it’s a nice idea and def­i­nite­ly meets some spe­cif­ic needs, it’s worth bear­ing in mind the extra com­plex­i­ties it intro­duces — han­dling par­tial fail­ures, deal­ing with down­stream out­ages, mis­con­fig­u­ra­tions, etc.

Conclusions

So what is the best event source for doing pub-sub with AWS Lamb­da? Like most tech deci­sions, it depends on the prob­lem you’re try­ing to solve, and the con­straints you’re work­ing with.

In this post, we looked at SNS, Kine­sis Streams and DynamoDB Streams as can­di­dates for the bro­ker role. We walked through a num­ber of sce­nar­ios to see how the choice of event source affects scal­a­bil­i­ty, par­al­lelism, resilience against tem­po­ral issues and cost.

You should now have a much bet­ter under­stand­ing of the trade­offs between these event sources when work­ing with Lamb­da. In the next post, we will look at anoth­er pop­u­lar mes­sag­ing pat­tern, push-pull, and how we can imple­ment it using Lamb­da. Again, we will look at a num­ber of dif­fer­ent ser­vices you can use and com­pare them.

Until next time!