AWS Lambda — use recursive function to process SQS messages (Part 1)

UPDATE 07/09/2016: read Part 2 on how to elastically scale Lambda functions based on the number of messages available in the queue.


It's been a year since the release of the AWS Lambda service, and here at Yubl we're at the start of an exciting journey to move our stack to Lambda (using the awesome Serverless framework).

One feature that's sadly missing, though, is support for SQS. Every AWS evangelist I have spoken to tells me that it's one of the most requested features from their customers and that it's coming. It's not the first time I have heard this kind of non-committal response from those involved with Amazon, and experience tells me not to expect it to happen anytime soon.

But, if you're itching to use AWS Lambda with SQS and don't wanna wait an unspecified amount of time to get there, you have some options right now:

  • use SNS or Kinesis instead
  • do-it-yourself with a recursive Lambda function that polls and processes SQS messages



Whilst you can use SNS and Kinesis with Lambda already, SQS's support for dead letter queues still makes it a better choice in situations where eventual consistency can be tolerated but outright data losses should be mitigated.

Whilst you can process SQS using EC2-hosted applications already, Lambda delivers a potential cost saving for low-traffic environments and more granular cost control as you scale out. It also provides an easy and fast deployment pipeline, with support for versioning and rollbacks.

Finally, we'll walk through a simple example and write a recursive Lambda function in Node.js using the Serverless framework.


Lambda + SNS/Kinesis vs. Lambda + SQS

As a compromise, both SNS and Kinesis can be used with Lambda to allow you to delay the execution of some work till later. But semantically there are some important differences to consider.



SQS has built-in support for dead letter queues: if a message is received N times and still not processed successfully, it is moved to a specified dead letter queue. Messages in the dead letter queue likely require some manual intervention, and you would typically set up CloudWatch alarms to alert you when messages start to pour into the dead letter queue.
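For example, a redrive policy that moves a message to a dead letter queue after 5 failed receives might be configured like this with the AWS SDK for Node.js (the DLQ name, account ID and queue URL below are placeholders, not values from this project):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'eu-west-1' });

// after maxReceiveCount unsuccessful receives, SQS moves the message to the
// dead letter queue identified by deadLetterTargetArn (placeholder ARN below)
const redrivePolicy = JSON.stringify({
  maxReceiveCount: '5',
  deadLetterTargetArn: 'arn:aws:sqs:eu-west-1:123456789012:HelloWorld-DLQ'
});

sqs.setQueueAttributes({
  QueueUrl: 'https://sqs.eu-west-1.amazonaws.com/123456789012/HelloWorld',
  Attributes: { RedrivePolicy: redrivePolicy }
}).promise();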

If the cause is temporal, for example when there are outages to downstream systems (the DB is unavailable, or other services are down or timing out), then the dead letter queue helps to:

  1. prevent the build-up of messages in the primary queue; and
  2. give you a chance to retry failed messages after the downstream systems have recovered.

With SNS, messages are retried 3 times and then lost forever. Whether or not this behaviour is a showstopper for using SNS needs to be judged against your requirements.

With Kinesis, the question becomes slightly more complicated. When using Kinesis with Lambda, your degree of parallelism (discussed in more detail below) is equal to the number of shards in the stream. When your Lambda function fails to process a batch of events, it'll be called again with the same batch of events, because AWS keeps track of your function's position in that shard.


In essence, this means the retry strategy is up to you, but your choices are limited to:

  1. fail and always retry the whole batch (even if some messages were processed successfully) until either the fault heals itself or the messages in the batch are no longer available (Kinesis only keeps messages for up to 24 hours)
  2. never retry a failed batch

If you choose option 1 (which is the default behaviour), then you also have to ensure that messages are processed in a way that's idempotent. If you choose option 2, then there's a significant chance of data loss.

Neither option is very appealing, which is why I would normally use Kinesis in conjunction with SQS:

  • process a batch of messages, and queue failed messages into SQS
  • allow the processing of the shard to move on in spite of the failed messages
  • SQS messages are processed by the same logic, which required me to decouple the processing logic from the delivery of payloads (SNS, SQS, Kinesis, tests, etc.); see the sketch below
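A rough sketch of that pattern (the processRecord function and the failed-message queue URL are hypothetical, not part of this project's code):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// queue that collects messages we failed to process (hypothetical)
const FAILED_MESSAGES_QUEUE_URL = process.env.FAILED_MESSAGES_QUEUE_URL;

// process every record in the Kinesis batch; a record that fails is queued
// into SQS so the shard can move on instead of retrying the whole batch
function processBatch(records, processRecord) {
  const tasks = records.map(record =>
    processRecord(record).catch(() =>
      sqs.sendMessage({
        QueueUrl: FAILED_MESSAGES_QUEUE_URL,
        MessageBody: JSON.stringify(record)
      }).promise()));

  return Promise.all(tasks);
}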



SNS executes your function for every notification, and as such has the highest degree of parallelism possible.


There's a soft limit of 100 concurrent Lambda executions per region (which you can increase by raising a support ticket), though according to the documentation, AWS might increase the concurrent execution limit on your behalf in order to execute your function at least once per notification. However, as a safety precaution you should still set up CloudWatch alarms on the Throttles metric for your Lambda functions.
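As a rough sketch, such an alarm could be created with the AWS SDK like this (the alarm name, function name and threshold are assumptions for illustration; the throttling metric lives in the AWS/Lambda namespace as Throttles):

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

// alarm whenever the function has been throttled at all over a 5-minute period
cloudwatch.putMetricAlarm({
  AlarmName: 'say-hello-throttled',
  Namespace: 'AWS/Lambda',
  MetricName: 'Throttles',
  Dimensions: [{ Name: 'FunctionName', Value: 'say-hello' }],
  Statistic: 'Sum',
  Period: 300,
  EvaluationPeriods: 1,
  Threshold: 0,
  ComparisonOperator: 'GreaterThanThreshold'
}).promise();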


With Kinesis (and DynamoDB Streams for that matter), the degree of parallelism is the same as the number of shards.


If you're working with SQS today, the degree of parallelism equals the number of poll loops you're running in your cluster.


For example, if you're running a tight poll loop for each core, and you have 3 quad-core EC2 instances in a cluster, then the degree of parallelism would be 3 instances * 4 cores = 12.

Moving forward, if you choose to use recursive Lambda functions to process SQS messages, then you can choose the degree of parallelism you want to use.


Lambda + SQS vs. EC2 + SQS

Which brings us to the next question: if you can use EC2 instances to process SQS messages already, why bother moving to Lambda? I think the cost saving potential, and the ease and speed of deployment, are the main benefits.



If you use the smallest production-ready EC2 class, a Linux t2.micro, it will cost you $10.25/month in the eu-west-1 region (Ireland).

Whilst the auto-scaling service is free to use, the default EC2 health checks cannot be relied upon to detect when your application has stopped working. To solve this problem, you'll typically set up an ELB and use ELB health checks to trigger application-level checks that ensure your application is still running and processing messages.

The ELB health checks also enable the auto-scaling service to automatically replace unhealthy instances.

A minimum production deployment would therefore cost you $30.75 a month.


A recursive Lambda function running non-stop 24/7 would run for 2678400 seconds in a 31-day month.

60 s * 60 m * 24 hr * 31 days = 2678400 s

If you assign 128MB to your function, then your Lambda cost would be roughly $5.61 a month.

Monthly Compute Charge

  = Total Compute (GB-seconds) * ($0.00001667 / GB-second)

  = (2678400 s * 128 MB / 1024 MB) * $0.00001667

  = 334800 GB-seconds * $0.00001667

  = $5.581116

Monthly Request Charge

  = Total Requests * ($0.20 / million requests)

  = (2678400 s / 20 s per request) / 1000000 * $0.20

  = 133920 requests / 1000000 * $0.20

  = $0.026784

Monthly Charge (Total)

  = Monthly Compute Charge + Monthly Request Charge

  = $5.581116 + $0.026784

  = $5.6079

Since Lambda's free tier does not expire 12 months after sign-up, this usage would fall within the free tier of 400000 GB-seconds (and 1 million requests) per month too.


However, there are other aspects to consider:

  • you can process several SQS queues on one EC2 instance
  • the cost of the ELB is amortised as the size of the cluster increases
  • larger EC2 instance classes jump up in cost but also offer more compute and networking capability
  • in order to auto-scale your SQS-processing Lambda functions, you'll need to provision other resources

The exact cost saving from using Lambda for SQS is not as clear-cut as I first thought. But at the very least, you have more granular control of your cost as you scale out.

A recursively executed, 128MB Lambda function would cost $5.61/month, whereas an auto-scaled cluster of t2.micro instances would go up in monthly cost $10.25 at a time.




In the IaaS world, the emergence of container technologies has vastly improved the deployment story.

But, as a developer, I now have another set of technologies to come to grips with. The challenge of navigating this fast-changing space and making sensible decisions about a set of overlapping technologies (Kubernetes, Mesos, Nomad, Docker, Rocket, to name a few) is not an easy one, and the consequences of these decisions will have a long-lasting impact on your organization.

Don't get me wrong, I think container technologies are amazing and I am excited to see the pace of innovation in that space. But they are a solution, not the goal; the goal is, and has always been, to deliver value to users quickly and safely.

Keeping up with everything that is happening in the container space is overwhelming, and at times I can't help but feel that I am trading one set of problems for another.

As a developer, I want to deliver value to my users first and foremost. I want all the benefits container technologies bring, but I want their complexities to be someone else's problem!


One of the best things about AWS Lambda, besides all the reactive programming goodness, is that deployment and scaling become Amazon's problem.

I don't have to think about provisioning VMs and clustering applications together; I don't have to think about scaling the cluster and deploying my code onto it. I don't have to rely on an ops team to monitor and manage my cluster and streamline our deployment. All of it is Amazon's problem!

All I have to do to deploy a new version of my code to AWS Lambda is upload a zip file and hook my Lambda function up to the relevant event sources, and the job is done!

Life becomes so much sim­pler 

With the Serverless framework, things get even easier!

Lambda supports the concept of versions and aliases, where an alias is a named pointer to a published version of your code and has its own ARN. Serverless uses aliases to implement the concept of stages (i.e. dev, staging, prod) to mirror the concept of stages in Amazon API Gateway.

To deploy a new version of your code to staging:

  • publish a new version with your new code
  • update the staging alias to point to that new version
  • and that's it! Instant deployment with no downtime!

Similarly, to roll back staging to a previous version:

  • update the staging alias to point to the previous version
  • sit back and admire the instant rollback, again with no downtime!

What's more, Serverless streamlines these processes into a single command!
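Under the hood, a deployment like this boils down to a couple of Lambda API calls. A rough sketch with the AWS SDK (the function name and alias are assumptions for illustration):

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

// publish the currently uploaded code as a new, immutable version, then point
// the 'staging' alias at it; pointing the alias back at an older version
// number is an instant rollback
lambda.publishVersion({ FunctionName: 'say-hello' }).promise()
  .then(res => lambda.updateAlias({
    FunctionName: 'say-hello',
    Name: 'staging',
    FunctionVersion: res.Version
  }).promise());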


Writing a recursive Lambda function to process SQS messages with Serverless

(p.s. you can find the source code for the experiment here.)

First, let's create a queue in SQS called HelloWorld.


Notice that although we have specified the default visibility timeout and receive message wait time (for long polling) values here, we'll override them in the ReceiveMessage request later.
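If you prefer to create the queue programmatically rather than through the console, a rough equivalent with the AWS SDK would be (the attribute values below are assumptions):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'eu-west-1' });

// create the HelloWorld queue; both attributes are overridden per-request later
sqs.createQueue({
  QueueName: 'HelloWorld',
  Attributes: {
    VisibilityTimeout: '30',            // default visibility timeout in seconds
    ReceiveMessageWaitTimeSeconds: '0'  // receive message wait time (long polling)
  }
}).promise()
  .then(res => console.log('queue URL:', res.QueueUrl));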

Then we'll create a new Serverless project, and add a function called say-hello.

Our project structure looks roughly like this:
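(Roughly speaking; the exact scaffolding depends on your Serverless version, the project directory name is illustrative, and s-project.json / event.json below are standard files I'm assuming are present.)

recursive-lambda-for-sqs
├── _meta
│   └── variables
│       └── s-variables-dev-euwest1.json
├── processors
│   └── say-hello
│       ├── event.json
│       ├── handler.js
│       └── s-function.json
├── s-project.json
└── s-resources-cf.json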


In the handler.js module, let's add the following.
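A minimal sketch of the module's setup, assuming the default aws-sdk clients (the exact code may differ, but the shape is the same):

'use strict';

const AWS    = require('aws-sdk');
const SQS    = new AWS.SQS();
const Lambda = new AWS.Lambda();

// both of these are populated by Serverless (see below)
const QUEUE_URL = process.env.QUEUE_URL;
const FUNC_NAME = process.env.FUNC_NAME;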


Notice we're relying on two environment variables here: QUEUE_URL and FUNC_NAME. Both will be populated by Serverless using values that we specify in s-function.json (to understand how this works, check out Serverless's documentation).


Next, we'll write the handler code for our Lambda function; a sketch follows the list below.

Here, we will:

  1. make a ReceiveMessage request to SQS using long polling (20s)
  2. for every message we receive, process it with a sayHello function (which we'll write next)
  3. the sayHello function will return a Promise
  4. when all the messages have been processed, recurse by invoking this Lambda function again
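A sketch of the handler along those lines (the batch size of 10 is an assumption):

module.exports.handler = (event, context, callback) => {
  const params = {
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10, // SQS returns at most 10 messages per request
    WaitTimeSeconds: 20      // long polling
  };

  SQS.receiveMessage(params).promise()
    .then(res => Promise.all((res.Messages || []).map(sayHello)))
    .then(() => recurse())   // when all messages are processed, recurse
    .then(() => callback(null, 'recursing...'))
    .catch(err => callback(err));
};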


In the sayHello function, we'll log a message and delete the message from SQS.
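A sketch of sayHello (the log format matches the output we'll see in CloudWatch Logs later):

function sayHello(message) {
  console.log(`Hello, ${message.Body} of message ID [${message.MessageId}]`);

  const delParams = {
    QueueUrl: QUEUE_URL,
    ReceiptHandle: message.ReceiptHandle
  };

  return SQS.deleteMessage(delParams).promise()
    // swallow delete errors so Promise.all doesn't reject the whole batch
    .catch(err => console.error(err, err.stack));
}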

One caveat to remember is that Promise.all will reject immediately if any of the Promises rejects, which is why I'm handling any error related to deleting the message in SQS here with .catch: it restores the chain rather than allowing the rejection to bubble up.


This implementation, however, doesn't handle errors arising from processing the message (i.e. logging a message in this case). To do that, you'll have to wrap the processing logic in a Promise, e.g.

return new Promise((resolve, reject) => {
    // processing logic goes here
    resolve();
  })
  .then(() => SQS.deleteMessage(delParams).promise())
  .catch(err => console.error(err, err.stack)); // restore the chain so Promise.all isn't rejected




Finally, let's add a recurse function to invoke this function again.
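A sketch of recurse (it invokes the function by the ARN we pass in through FUNC_NAME):

function recurse() {
  const params = {
    FunctionName: FUNC_NAME,
    InvokeArgs: JSON.stringify({}) // the payload is ignored by our handler
  };

  // note: no .catch here, so any error bubbles up to the main handler
  return Lambda.invokeAsync(params).promise()
    .then(() => console.log('Recursed.'));
}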


A couple of things to note about this function:

  1. unlike sayHello, it doesn't catch its own errors; this allows the rejection to bubble up to the main handler function, which will then fail the current execution
    • by failing the current execution this way, we can use the Errors metric in CloudWatch to re-trigger this Lambda function (more on this in Part 2)
  2. we're calling invokeAsync instead of invoke so we don't have to (in this case, we can't!) wait for the function to return


Deploying the Lambda function

You'll notice that we haven't provided any values for the two environment variables, QUEUE_URL and FUNC_NAME, that we mentioned earlier. That's because we don't have the ARN for the Lambda function yet!

So, let's deploy the function using Serverless. Run serverless dash deploy and follow the prompts.


Aha, there's the ARN for our Lambda function!

Go back to the project, open _meta/variables/s-variables-dev-euwest1.json and add the variables queueUrl and funcName.
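Something along these lines, using your own queue URL and function ARN (the values below are placeholders):

{
  ...
  "queueUrl": "https://sqs.eu-west-1.amazonaws.com/<account-id>/HelloWorld",
  "funcName": "arn:aws:lambda:eu-west-1:<account-id>:function:recursive-lambda-for-sqs-say-hello"
}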


However, there's another layer of abstraction we need to address to ensure our environment variables are populated correctly.

Open processors/say-hello/s-function.json.

In the environment section, add QUEUE_URL and FUNC_NAME like below:
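Roughly like this, referencing the variables we just added (check the Serverless documentation for the exact variable syntax in your version):

"environment": {
  "QUEUE_URL": "${queueUrl}",
  "FUNC_NAME": "${funcName}"
},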


Now do you see how things fit together?


Since we've made changes to environment variables, we'll need to redeploy our Lambda function.


Testing the deployed Lambda function

Once your Lambda function is deployed, you can test it from the management console.


See that blue Test button on the top left? Click it.

Since our function doesn't make use of the incoming payload, you can send it anything.


If you run the function as it stands right now, you'll get a permission denied error, because the execution role our Lambda function runs under doesn't have permission to use the SQS queue or to invoke a Lambda function.

That's great, because it tells me I have granular control over what my Lambda functions can do and I can control it using IAM roles.

But we do need to go back to the project and give our function the necessary permissions.


Configuring execution role permissions

Back in the Serverless project, take a look at s-resources-cf.json in the root directory. This is a templated CloudFormation file that's used to create the AWS resources for your function, including the execution role for your functions.

Wherever you see ${stage} or ${region}, these will be substituted with variables of the same name from the _meta/variables/xxx.json files. Have a read of the Serverless documentation if you wanna dig deeper into how the templating system works.

By default, Serverless creates an execution role that can log to CloudWatch Logs, with a policy statement along these lines:

  {
    "Effect": "Allow",
    "Action": [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ],
    "Resource": "arn:aws:logs:${region}:*:*"
  }

You can edit s-resources-cf.json to grant the execution role permission to use our queue and to invoke the function so it can recurse. Add statements along the lines of the following (at a minimum, the function needs to receive and delete SQS messages, and to invoke itself):

  {
    "Effect": "Allow",
    "Action": [
      "sqs:ReceiveMessage",
      "sqs:DeleteMessage"
    ],
    "Resource": "arn:aws:sqs:${region}:*:HelloWorld"
  },
  {
    "Effect": "Allow",
    "Action": "lambda:InvokeFunction",
    "Resource": "arn:aws:lambda:${region}:*:function:recursive-lambda-for-sqs-say-hello:${stage}"
  }

We can update the existing execution role by deploying the CloudFormation change. To do that with Serverless, run serverless resources deploy (you can also use the sls shorthand to save yourself some typing).


If you test the function again, you'll see in the logs (which you can find in the CloudWatch Logs management console) that the function is running and recursing.


Now, pop over to the SQS management console and select the HelloWorld queue we created earlier. Under the Queue Actions dropdown, select Send Message.

In the following popup, you can send a message into the queue.


Once the message is sent, you can check the CloudWatch Logs again and see that the message is received and a "Hello, Yan of message ID […]" message is logged.


And voila, our recursive Lambda function is processing SQS messages!



In Part 2, we'll look at a simple approach to auto-scale the number of concurrent executions of our function, and to restart failed functions.
