how to do fan-out and fan-in with AWS Lambda

In the last post, we looked at how you can implement pub-sub with AWS Lambda. We compared several event sources you can use (SNS, Kinesis streams and DynamoDB streams) and the tradeoffs available to you.

Let’s look at another messaging pattern today, push-pull, which is often referred to as fan-out/fan-in.

It’s really two separate patterns working in tandem.

Fan-out is often used on its own, where messages are delivered to a pool of workers in a round-robin fashion and each message is delivered to only one worker.

This is useful in at least two different ways:

  1. having a pool of workers to carry out the actual work allows for parallel processing and leads to increased throughput
  2. an expensive task can be broken down into smaller subtasks that are carried out in parallel by the workers

In the second case, where the original task (say, a batch job) is partitioned into many subtasks, you’ll need fan-in to collect results from the individual workers and aggregate them together.

fan-out with SNS

As discussed above, SNS’s invocation-per-message policy is a good fit here, as we’re optimizing for throughput and parallelism during the fan-out stage.

Here, a ventilator function would partition the expensive task into subtasks, and publish a message to the SNS topic for each subtask.
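To make this concrete, here’s a minimal sketch of what a ventilator function might look like in Python (the topic ARN, message shape and 1k-items-per-subtask partitioning are all my own assumptions, not code from the actual system):

import json
import uuid
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:subtasks"  # hypothetical topic

def partition(items, size=1000):
    # split the overall workload into chunks that can be processed in parallel
    return [items[i:i + size] for i in range(0, len(items), size)]

def handler(event, context):
    job_id = str(uuid.uuid4())
    subtasks = partition(event.get("items", []))

    # publish one SNS message per subtask so each one triggers its own worker invocation
    for task_id, subtask in enumerate(subtasks):
        sns.publish(
            TopicArn=TOPIC_ARN,
            Message=json.dumps({
                "job_id": job_id,
                "task_id": task_id,
                "payload": subtask,
            }),
        )

    return {"job_id": job_id, "subtask_count": len(subtasks)}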

This is essentially the approach we took when we implemented the timeline feature at Yubl (the last startup I worked at), which works the same way as Twitter’s timeline: when you publish a new post it is distributed to your followers’ timelines, and when you follow another user, their posts will show up in your timeline shortly after.

A real-world example of fan-out, whereby a user’s new post is distributed to their followers. Since the user can have tens of thousands of followers, the task is broken down into many subtasks: each subtask involves distributing the new post to 1k followers and can be performed in parallel.

fan-out with SQS

Before the advent of AWS Lambda, this type of workload was often carried out with SQS. Unfortunately SQS is not one of the supported event sources for Lambda, which puts it at a massive disadvantage here.

That said, SQS itself is still a good choice for distributing tasks, and if your subtasks take longer than 5 minutes to complete (the max execution time for Lambda) you might still have to find a way to make the SQS + Lambda setup work.

Let me explain what I mean.

First, it’s possible for a Lambda function to go beyond the 5 min execution time limit by writing it as a recursive function. However, the original invocation (triggered by SNS) has to signal whether or not the SNS message was successfully processed, but that information is only available at the end of the recursion!
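To illustrate the recursion itself, here’s a minimal sketch (the payload shape and helper functions are placeholders of mine) of a Lambda function that does a slice of work, then re-invokes itself asynchronously when it’s running low on time:

import json
import boto3

lambda_client = boto3.client("lambda")

def initial_state(event):
    # placeholder: derive the starting state from the triggering event
    return {"remaining": event.get("items", [])}

def done(state):
    return not state["remaining"]

def process_next_batch(state):
    # placeholder: do a small slice of the actual work
    state["remaining"] = state["remaining"][10:]
    return state

def handler(event, context):
    state = event.get("state") or initial_state(event)

    # keep working while there's comfortably more than 30s left
    while not done(state) and context.get_remaining_time_in_millis() > 30000:
        state = process_next_batch(state)

    if not done(state):
        # hand the rest of the work to a fresh invocation of this same function
        lambda_client.invoke(
            FunctionName=context.function_name,
            InvocationType="Event",  # async, so we don't wait for it
            Payload=json.dumps({"state": state}),
        )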

With SQS, you have a message handle that can be passed along during recursion. The recursed invocation can then use the handle to do two things (sketched below):

  • extend the visibility timeout for the message so another SQS poller does not receive it whilst we’re still processing the message
  • delete the message if we’re able to successfully process it
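Here’s a rough boto3 sketch of what that could look like (the queue URL and timeout value are placeholders of mine); the receipt handle itself would need to be included in the payload passed between the recursive invocations:

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/subtasks"  # placeholder

def still_working(receipt_handle):
    # keep the message invisible to other pollers whilst we're processing it
    sqs.change_message_visibility(
        QueueUrl=QUEUE_URL,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=300,  # give ourselves another 5 minutes
    )

def finished(receipt_handle):
    # the message has been processed successfully, remove it from the queue
    sqs.delete_message(
        QueueUrl=QUEUE_URL,
        ReceiptHandle=receipt_handle,
    )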

A while back, I prototyped an architecture for processing SQS messages using recursive Lambda functions. The architecture allows for elastically scaling up and down the no. of pollers based on the size of the backlog (or whatever CloudWatch metric you choose to scale on).

You can read all about it here.

I don’t believe it lowers the barrier to entry for the SQS + Lambda setup enough for regular use, not to mention the additional cost of running a Lambda function 24/7 to poll SQS.

The good news is that AWS has announced that an SQS event source is coming to Lambda! So hopefully, in the future, you won’t need workarounds like the one I created to use Lambda with SQS.

What about Kinesis or DynamoDB Streams?

Personally I don’t feel these are great options, because the degree of parallelism is constrained by the no. of shards. Whilst you can increase the no. of shards, it’s a really expensive way to get extra parallelism, especially given the way resharding works in Kinesis Streams: after splitting an existing shard, the old shard is still around for at least 24 hours (based on your retention policy) and you’ll continue to pay for it.

Therefore, dynamically adjusting the no. of shards to scale the degree of parallelism up and down can incur a lot of unnecessary cost.

With DynamoDB Streams, you don’t even have the option to reshard the stream, since it’s a managed stream that reshards as it sees fit.

fan-in: collecting results from workers

When the ventilator function partitions the original task into many subtasks, it can also include two identifiers with each subtask: one for the top-level job, and one for the subtask. When the subtasks are completed, the workers can record their results against these identifiers.

For example, you might use a DynamoDB table to store these results. But bear in mind that DynamoDB has a max item size of 400KB including attribute names.
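For instance, a worker could record its result like this (the table name and attribute names are my own assumptions):

import json
import boto3

dynamodb = boto3.resource("dynamodb")
results_table = dynamodb.Table("job-results")  # hypothetical table

def record_result(job_id, task_id, result):
    # job_id as the partition key, task_id as the sort key
    results_table.put_item(
        Item={
            "job_id": job_id,
            "task_id": task_id,
            "result": json.dumps(result),  # keep well under the 400KB item limit
        }
    )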

Alternatively, you may also consider storing the results in S3, which has a max object size of a whopping 5TB! For example, you can store the results as the following:

bucket/job_id/task_01.json
bucket/job_id/task_02.json
bucket/job_id/task_03.json
...
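A sketch of the S3 equivalent (bucket name and key format are placeholders of mine):

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-job-results"  # hypothetical bucket

def record_result(job_id, task_id, result):
    # one object per subtask, grouped under the job's prefix
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{job_id}/task_{task_id:02d}.json",
        Body=json.dumps(result).encode("utf-8"),
    )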

Note that in both cases we’re prone to experience hot partitions: a large no. of writes against the same DynamoDB hash key or S3 prefix.

To mitigate this negative effect, be sure to use a GUID for the job ID.

Depending on the volume of write operations you need to perform against S3, you might need to tweak the approach. For example:

  • partition the bucket with top-level folders and place results into the correct folder based on a hash value of the job ID (see the sketch after this list)
bucket/01/job_id_001/task_01.json
bucket/01/job_id_001/task_02.json
bucket/01/job_id_001/task_03.json
...
  • store the results in an easily hashable but unstructured way in S3, but also record references to them in a DynamoDB table
bucket/ffa7046a-105e-4a00-82e6-849cd36c303b.json
bucket/8fb59303-d379-44b0-8df6-7a479d58e387.json
bucket/ba6d48b6-bf63-46d1-8c15-10066a1ceaee.json
...
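As an illustration of the first option, the top-level folder could be derived from a hash of the job ID, so writes are spread across many prefixes (the two-character prefix length is just an assumption on my part):

import hashlib

def result_key(job_id, task_id):
    # use the first two characters of the hash as the top-level folder
    prefix = hashlib.md5(job_id.encode("utf-8")).hexdigest()[:2]
    return f"{prefix}/{job_id}/task_{task_id:02d}.json"

# e.g. result_key("job_id_001", 1) -> "<2-char hash>/job_id_001/task_01.json"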

fan-in: tracking overall progress

When the ventilator function runs and partitions the expensive task into lots of small subtasks, it should also record the total no. of subtasks. This allows each invocation of the worker function to atomically decrement the count, until it reaches 0.

The invocation that sees the count reach 0 is then responsible for signalling that all the subtasks are complete. It can do this in many ways, perhaps by publishing a message to another SNS topic, so that the worker function is decoupled from whatever post-processing steps need to happen to aggregate the individual results.

(wait, so are we back to the pub-sub pattern again?) maybe ;-)
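Here’s a minimal sketch of that atomic decrement using a DynamoDB counter (the table, attribute and topic names are all my own assumptions):

import json
import boto3

dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("jobs")  # hypothetical table tracking each job
sns = boto3.client("sns")
DONE_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:job-complete"  # placeholder

def subtask_finished(job_id):
    # atomically decrement the remaining-subtask counter for this job
    resp = jobs_table.update_item(
        Key={"job_id": job_id},
        UpdateExpression="SET remaining = remaining - :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    if resp["Attributes"]["remaining"] == 0:
        # this invocation saw the counter hit zero, so it signals completion
        sns.publish(
            TopicArn=DONE_TOPIC_ARN,
            Message=json.dumps({"job_id": job_id}),
        )

Because the update is atomic, only one worker invocation will ever observe the counter hitting zero, so the completion message is published exactly once.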

At this point, the sink function (or reducer, as it’s called in the context of a map-reduce job) would be invoked. Seeing as you’re likely to have a large no. of results to collect, it might be a good idea to write the sink function as a recursive function too.

Anyway, these are just a few of the ways I can think of to implement the push-pull pattern with AWS Lambda. Let me know in the comments if I have missed any obvious alternatives.

Like what you’re reading but want more help? I’m happy to offer my services as an independent consultant and help you with your serverless project: architecture reviews, code reviews, building proof-of-concepts, or offering advice on leading practices and tools.

I’m based in London, UK and currently the only UK-based AWS Serverless Hero. I have nearly 10 years of experience with running production workloads in AWS at scale. I operate predominantly in the UK but I’m open to travelling for engagements that are longer than a week. To see how we might be able to work together, tell me more about the problems you are trying to solve here.

I can also run an in-house workshop to help you get production-ready with your serverless architecture. You can find out more about the two-day workshop here, which takes you from the basics of AWS Lambda all the way through to common operational patterns for log aggregation, distributed tracing and security best practices.

If you prefer to study at your own pace, then you can also find all the same content of the workshop as a video course I have produced for Manning. We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we’re in Manning’s Early Access Program (MEAP).