Takeaways from “Simplifying the Future” by Adrian Cockcroft

Simplifying things in our daily lives

“Life is complicated… but we use simple abstractions to deal with it.”

– Adrian Cockcroft

When people say “it’s too complicated”, what they usually mean is “there are too many moving parts and I can’t figure out what it’s going to do next, because I haven’t built an internal model of how it works and what it does”.

Which begs the question: “what’s the most complicated thing that you can deal with intuitively?”

Driving, for instance, is one of the most complicated things we have to do on a regular basis. It combines hand-eye-feet coordination, navigation skills, and the ability to react to unforeseeable, potentially life-or-death scenarios.

 

A good example of a simple abstraction is the touch-based interface you find on smartphones and tablets. Kids can work out how an iPad operates just by experimenting with it, without any formal training, because they can interact with it and get instant feedback, which helps them build a mental model of how it works.

As engineers, we should aspire to build things that can be given to 2-year-olds and they can intuitively understand how they operate. This last point reminds me of what Bret Victor has been saying for years, with inspirational talks such as Inventing on Principle and Stop Drawing Dead Fish.

Netflix, for instance, has invested much effort in intuition engineering and is building tools to help people get a better intuitive understanding of how their complex, distributed systems are operating at any moment in time.

Another example of how you can take complex things and give them simple descriptions is Randall Munroe’s (of xkcd fame) Thing Explainer, which uses simple words to explain otherwise complex things such as the International Space Station, nuclear reactors and data centres.

sidebar: with regard to complexity in code, here are two talks that you might also find interesting

 

Simplifying work

Adrian mentioned Netflix’s slide deck on their culture and values:

Intentional culture is becoming an important thing, and other companies have followed suit, e.g.

It sets expectations for people joining the company about what they’ll find once they’re onboarded, and helps frame & standardise the recruitment process so that everyone knows what a ‘good’ hire looks like.

If you’re creating a startup, you can set the culture from the start. Don’t wait until you have an accidental culture; be intentional and early about what you want to have.

 

This creates a purpose-driven culture.

Be clear and explicit about the purpose and let people work out how best to implement that purpose.

Purposes are simple statements, whereas setting out all the individual processes you need to ensure people build the right things is much harder. It’s simpler to have a purpose-driven culture and let people self-organise around those purposes.

Netflix also found that if you impose processes on people then you drive talent away, which is a big problem. Time and again, Netflix found that people produce a fraction of what they’re capable of at other companies because they’re held back by processes, rules and other things that slow them down.

On Reverse Conway’s Law, Adrian said that you should start with an organisational structure that’s cellular in nature, with clear responsibilities and ownership for a no. of small, co-located teams – high trust & high cohesion within the team, and low trust across the teams.

The moral here is that if you build a company around a purpose-driven, systems-thinking approach, then you are building an organisation that is flexible and can evolve as the technology moves on.

The more rules you put in, and the more complex and rigid it gets, then you end up with the opposite.

“You build it, you run it”

– Werner Vogels, Amazon CTO

 

Simplifying the things we build

First, you should shift your thinking from projects to products. The key difference is that whereas a project has a start and an end, a product will continue to evolve for as long as it still serves a purpose. On this point, see also:

“I’m sorry, but there really isn’t any project managers in this new model”

– Adrian Cockcroft

As a result, the overhead & ratio of developers to people doing management, releases & ops has to change.

 

Second, the most important metric to optimise for is time to value. (see also “beyond features” by Dan North)

“The lead time to someone saying thank you is the only reputation metric that matters”

– Dan North

Looking at the customer values and working out how to improve the time-to-value is an interesting challenge. (see Simon Wardley’s value-chain-mapping)

And lastly, a subtle point: optimise for the customers that you want to have rather than the customers you have now. Which is an interesting twist on how we often think about retention and monetisation.

For Netflix, their optimisation is almost always around converting free trials to paying customers, which means they’re always optimising for people who haven’t seen the product before. Interestingly, this feedback loop also has the side-effect of forcing the product to be simple.

On the other hand, if you optimise for power users, then you’re likely to introduce more and more features that make the product too complicated for new users. You can potentially build yourself into a corner where you struggle to attract new users and become vulnerable to newcomers entering the market with simpler products that new users can understand.

 

Monolithic apps only look simple from the outside (at the architecture-diagram level), but if you look under the covers at the object dependencies, the true scale of their complexity starts to become apparent. And they often are complicated, because it takes constant discipline to enforce clear separation.

“If you require constant diligence, then you’re setting everyone up for failure and hurt.”

– Brian Hunter

Microservices enforce a separation that makes them less complicated, and make the connections between components explicit. They are also better for on-boarding, as new joiners don’t have to understand all the interdependencies (inside a monolith) that encompass your entire system in order to make even small changes.

Each microservice should have a clear, well-defined set of responsibilities, and there’s a cap on the level of complexity it can reach.

sidebar: the best answers I have heard for “how small should a microservice be?” are:

  • “one that can be completely rewritten in 2 weeks”
  • “what can fit inside an engineer’s head” – which, psychology tells us, isn’t a lot ;-)

 

Monitoring used to be one of the things that made microservices complicated, but the tooling has caught up in this space and nowadays many vendors (such as NewRelic) offer tools that support this style of architecture out of the box.

 

Simplifying microservices architecture

If your system is deployed globally, then having the same, automated deployment for every region gives you symmetry. Having this commonality (same AMI, auto-scaling settings, deployment pipeline, etc.) is important, as is automation, because they give you known states in your system that allows you to make assertions.

It’s also important to apply systems thinking: try to come up with feedback loops that drive people and machines to do the right thing.

Adrian then referenced Simon Wardley’s post on ecosystems in which he talked about the ILC model, or, a cycle of Innovate, Leverage, and Commoditize.

He touched on Serverless technologies such as AWS Lambda (which we’re using heavily at Yubl). At the moment it’s at the Innovate stage where it’s still a poorly defined concept and even those involved are still working out how best to utilise it.

If AWS Lambda functions are your nano-services, then on the other end of the scale both AWS and Azure are going to release VMs with terabytes of memory to the general public soon – which will have a massive impact on systems such as in-memory graph databases (eg. Neo4j).

When we move to the Leverage stage, the concepts have been clearly defined and terminologies are widely understood. However, the implementations are not yet standardised, and the challenge at this stage is that you can end up with too many choices as new vendors and solutions compete for market share as mainstream adoption gathers pace.

This is where we’re at with container schedulers – Docker Swarm, Kubernetes, Nomad, Mesos, CloudFoundry and whatever else pops up tomorrow.

As the technology matures and people work out the core set of features that matter to them, it’ll start to become a Commodity – this is where we’re at with running containers – where there are multiple compatible implementations that are offered as services.

This new form of commodity then becomes the base for the next wave of innovations by providing a platform that you can build on top of.

Simon Wardley also talks about this as the cycle of Peace, War and Wonder.

AWS Lambda – janitor-lambda function to clean up old deployment packages

When working with AWS Lambda, one of the things to keep in mind is that there’s a per-region limit of 75GB total size for all deployment packages. Whilst that sounds like a lot at first glance, our small team of server engineers managed to rack up nearly 20GB of deployment packages in just over 3 months!

Whilst we have been mindful of deployment package size (because it affects cold start time) and heavily use Serverless’s built-in mechanism to exclude npm packages that are not used by each of the functions, the simple fact that deployment is simple and fast means we’re doing A LOT OF DEPLOYMENTS.

Individually, most of our functions are sub-2MB, but many functions are deployed so often that in some cases there are more than 300 deployed versions! This is down to how the Serverless framework deploys functions – by publishing a new version each time. On its own this isn’t a problem, but unless you clean up the old deployment packages you’ll eventually run into the 75GB limit.

 

Some readers might have heard of Netflix’s Janitor Monkey, which cleans up unused resources in your environment – instances, ASGs, EBS volumes, EBS snapshots, etc.

Taking a leaf out of Netflix’s book, we wrote a Lambda function which finds and deletes old versions of your functions that are not referenced by an alias – remember, Serverless uses aliases to implement the concept of stages in Lambda, so not being referenced by an alias essentially equates to an orphaned version.

janitor-lambda

At the time of writing, we have just over 100 Lambda functions in our development environment and around 50 running in production. After deploying the janitor-lambda function, we have cut the code storage size down to 1.1GB, which includes only the current version of deployments for all our stages (we have 4 non-prod stages in this account).

lambda-console

sidebar: if you’d like to hear more about our experience with Lambda thus far and what we have been doing, then check out the slides from my talk on the matter. I’d be happy to write them up in more detail too when I have more free time.

 

Janitor-Lambda

Without further ado, here’s the bulk of our janitor function:
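The original source isn’t reproduced here, but a minimal sketch of the idea looks roughly like this (the clean/cleanFunction names are mine, and pagination and edge cases are omitted for brevity):

'use strict';

const AWS    = require('aws-sdk');
const lambda = new AWS.Lambda();

// module-level, so the list survives across invocations of a warm container
let functions = [];

function listFunctions() {
  return lambda.listFunctions({}).promise()
    .then(resp => resp.Functions.map(f => f.FunctionName));
}

function cleanFunction(funcName) {
  // versions referenced by an alias are live (Serverless stages) and must be kept
  let aliasedVersions = lambda.listAliases({ FunctionName: funcName }).promise()
    .then(resp => resp.Aliases.map(a => a.FunctionVersion));

  let allVersions = lambda.listVersionsByFunction({ FunctionName: funcName }).promise()
    .then(resp => resp.Versions.map(v => v.Version).filter(v => v !== '$LATEST'));

  return Promise.all([aliasedVersions, allVersions])
    .then(results => {
      let aliased = results[0];
      let orphans = results[1].filter(v => aliased.indexOf(v) < 0);
      // delete each orphaned version individually
      return Promise.all(orphans.map(version =>
        lambda.deleteFunction({ FunctionName: funcName, Qualifier: version }).promise()));
    });
}

function clean() {
  let init = functions.length > 0
    ? Promise.resolve()
    : listFunctions().then(fs => { functions = fs; });

  // work through the list one function at a time; whatever we don't get to
  // before being throttled is picked up again on the next scheduled run
  return init.then(() => {
    let next = functions.shift();
    return next ? cleanFunction(next).then(clean) : Promise.resolve();
  });
}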

Since AWS Lambda throttles you on the no. of API calls per minute, we had to store the list of functions in the functions variable so that it carries over multiple invocations.

When we hit the (almost) inevitable throttle exception, the current invocation will end, and any functions that haven’t been completely cleaned will be cleaned the next time the function is invoked.

Another thing to keep in mind is that, when using a CloudWatch Event as the source of your function, Amazon will retry your function up to 2 more times on failure. In this case, if the function is retried straight away, it’ll just get throttled again. Hence, in the handler we log and swallow any exceptions:
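Something along these lines (a sketch, reusing the clean function from the sketch above):

module.exports.handler = (event, context, callback) => {
  clean()
    .catch(err => console.error(err))   // log and swallow, so the retried invocation isn't throttled again
    .then(() => callback(null));
};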

I hope you have found this post useful, let me know in the comments if you have any Lambda/Serverless related questions.

Slides for my AWS Lambda talk tonight

AWS Lambda – constant timeout when using bluebird Promise

Hello! Sorry for the lack of posts recently, it’s been a pretty hectic time here at Yubl, with plenty of exciting work happening and even more on the way. Hopefully I will be able to share with you some of the cool things we have done and valuable lessons we have learnt from working with AWS Lambda and Serverless in production.

Today’s post is one such lesson, a slightly baffling and painful one at that.

 

The Symptoms

We noticed that the Lambda function behind one of our APIs in Amazon API Gateway was timing out consistently (the function is configured with a 6s timeout, which is what you see in the diagram below).

lambda-bluebird-latency-spike

Looking in the logs it appears that one instance of our function (based on the frequency of the timeouts I could deduce that AWS Lambda had 3 instances of my function running at the same time) was constantly timing out.

What’s even more baffling is that, after the first timeout, the subsequent Lambda invocations never even enter the body of my handler function!

Considering that this is a Node.js function (running on the Node.js 4.3 runtime), this symptom is similar to what one would expect if a synchronous operation were blocking the event queue so that nothing else gets a chance to run. (oh, how I miss the Erlang VM’s pre-emptive scheduling at this point!)

So, as a summary, here are the symptoms that we observed:

  1. the function times out the first time
  2. all subsequent invocations time out without executing the handler function
  3. it continues to time out until Lambda recycles the underlying resource that runs your function

which, as you can imagine, is pretty scary – one strike, and you’re out

Oh, and I managed to reproduce the symptoms with Lambda functions with other event source types too, so it’s not specific to API Gateway endpoints.

 

Bluebird – the likely Culprit

After investigating the issue some more, I was able to isolate the problem to the use of bluebird Promises.

I was able to replicate the issue with a simple example below, where the function itself is configured to time out after 1s.

lambda-bluebird-timeout-example
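The screenshot isn’t reproduced here, but a hypothetical reconstruction of the example looks like this (the function is configured with a 1s timeout, so the 5s bluebird delay guarantees the invocation is forcibly timed out):

'use strict';

const Promise = require('bluebird');

module.exports.handler = (event, context, callback) => {
  console.log('hello~~~');

  // outlives the 1s function timeout, so the Lambda runtime kills the invocation
  Promise.delay(5000)
    .then(() => callback(null, { message: 'done' }));
};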

As you can see from the log messages below, as I repeatedly hit the API Gateway endpoint, the invocations continue to time out without printing the hello~~~ message.

lambda-bluebird-timeout-example-log

At this point, your options are:

a) wait it out, or

b) do a dummy update with no code change

 

On the other hand, a hand-rolled delay function using vanilla Promise works as expected with regards to timeouts.

lambda-bluebird-timeout-example-2

lambda-bluebird-timeout-example-log-2

 

Workarounds

The obvious workaround is not to use bluebird, nor any library that uses bluebird under the hood (e.g. promised-mongo).

Which sucks, because:

  1. bluebird is actually quite useful, and we use both bluebird and co quite heavily in our codebase
  2. you now have to check every dependency to make sure it’s not using bluebird under the hood
  3. you can’t use other useful libraries that use bluebird internally

However, I did find that if you specify an explicit timeout using bluebird’s own timeout function then it’s able to recover correctly. Presumably bluebird timing itself out gives it a clean timeout, whereas being forcibly timed out by the Lambda runtime screws with the internal state of its Promises.

The following example works as expected:

lambda-bluebird-timeout-example-3

lambda-bluebird-timeout-example-log-3
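Again as a rough reconstruction rather than the exact code in the screenshots, the working version looks something like this – the key difference is the explicit bluebird timeout that fires before the Lambda runtime’s own 1s timeout:

'use strict';

const Promise = require('bluebird');

module.exports.handler = (event, context, callback) => {
  console.log('hello~~~');

  Promise.delay(5000)
    .timeout(900, 'timed out')            // bluebird's own timeout fires first
    .then(() => callback(null, { message: 'done' }))
    .catch(err => callback(err));         // the extra error you now have to handle
};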

But it wouldn’t be a workaround if it didn’t have its own caveats.

It means you now have one more error that you need to handle in a graceful way (e.g. mapping the response in API Gateway to a 5XX HTTP status code), otherwise you’ll end up sending this kind of unhelpful response back to your callers.

lambda-bluebird-timeout-example-log-4

 

So there, a painful lesson we learnt whilst running Node.js Lambda functions in production. Hopefully you have found this post in time before running into the issue yourself!

AWS Lambda – use recursive function to process SQS messages (Part 1)

UPDATE 07/09/2016: read Part 2 on how to elastically scale the Lambda function based on the no. of messages available in the queue.

 

It’s been a year since the release of the AWS Lambda service and here at Yubl we’re at the start of an exciting journey to move our stack to Lambda (using the awesome Serverless framework).

One feature that’s sadly missing though, is support for SQS. Every AWS evangelist I have spoken to tells me that it’s one of the most requested features from their customers and that it’s coming. It’s not the first time I have heard this kind of non-committal response from those involved with Amazon and experience tells me not to expect it to happen anytime soon.

But, if you’re itching to use AWS Lambda with SQS and don’t wanna wait an unspecified amount of time to get there, you have some options right now:

  • use SNS or Kinesis instead
  • do-it-yourself with recursive Lambda function that polls and processes SQS messages

 

TL;DR

Whilst you can use SNS and Kinesis with Lambda already, SQS’s support for dead letter queues still makes it a better choice in situations where eventual consistency can be tolerated but outright data loss should be mitigated.

Whilst you can process SQS using EC2-hosted applications already, Lambda delivers a potential cost saving for low-traffic environments and more granular cost control as you scale out. It also provides an easy and fast deployment pipeline, with support for versioning and rollbacks.

Finally, we’ll walk through a simple example and write a recursive Lambda function in Node.js using the Serverless framework.

 

Lambda + SNS/Kinesis vs. Lambda + SQS

As a compromise, both SNS and Kinesis can be used with Lambda to allow you to delay the execution of some work till later. But semantically there are some important differences to consider.

 

Retries

SQS has built-in support for dead letter queues – i.e. if a message is received N times and still not processed successfully then it is moved to a specified dead letter queue. Messages in the dead letter queue likely require some manual intervention and you typically would set up CloudWatch alarms to alert you when messages start to pour into the dead letter queue.

If the cause is temporary, for example outages to downstream systems (the DB is unavailable, or other services are down or timing out), then the dead letter queue helps:

  1. prevent the build-up of messages in the primary queue; and
  2. give you a chance to retry failed messages after the downstream systems have recovered.

With SNS, messages are retried 3 times and then lost forever. Whether or not this behaviour is a showstopper for using SNS needs to be judged against your requirements.

With Kinesis, the question becomes slightly more complicated. When using Kinesis with Lambda, your degree of parallelism (discussed in more detail below) is equal to the no. of shards in the stream. When your Lambda function fails to process a batch of events, it’ll be called again with the same batch of events because AWS keeps track of the position of your function in that shard.

lambda+kinesis-v2

In essence, this means the retry strategy is up to you, but your choices are limited to:

  1. fail and always retry the whole batch (even if some messages were processed successfully) until either the fault heals itself or the messages in the batch are no longer available (Kinesis only keeps messages for up to 24 hours)
  2. never retry a failed batch

If you choose option 1 (which is the default behaviour), then you also have to ensure that messages are processed in a way that’s idempotent. If you choose option 2, then there’s a significant chance for data loss.

Neither option is very appealing, which is why I would normally use Kinesis in conjunction with SQS:

  • process a batch of messages, and queue failed messages into SQS
  • allow the processing of the shard to move on in spite of the failed messages
  • SQS messages are processed by the same logic, which required me to decouple the processing logic from the delivery of payloads (SNS, SQS, Kinesis, tests, etc.)

 

Parallelism

SNS executes your function for every notification, and as such has the highest degree of parallelism possible.

lambda-parallelism-sns-01

There’s a soft limit of 100 concurrent Lambda executions per region (which you can increase by raising a support ticket), though according to the documentation, AWS might increase the concurrent execution limit on your behalf in order to execute your function at least once per notification. However, as a safety precaution you should still set up CloudWatch alarms on the Throttled metric for your Lambda functions.

 

With Kinesis (and DynamoDB Streams for that matter), the degree of parallelism is the same as the no. of shards.

lambda-parallelism-kinesis

If you’re working with SQS today, the degree of parallelism would equal the no. of poll loops you’re running in your cluster.

sqs-parallelism

For example, if you’re running a tight poll loop for each core, and you have 3 quad-core EC2 instances in a cluster, then the degree of parallelism would be 3 instances * 4 cores = 12.

Moving forward, if you choose to use recursive Lambda functions to process SQS messages then you can choose the degree of parallelism you want to use.

 

Lambda + SQS vs. EC2 + SQS

Which brings us to the next question: if you can use EC2 instances to process SQS messages already, why bother moving to Lambda? I think the cost-saving potential, and the ease and speed of deployment, are the main benefits.

 

Cost

If you use the smallest production-ready EC2 class – Linux t2.micro – it will cost you $10.25/month in the eu-west-1 region (Ireland).

Whilst the auto-scaling service is free to use, the default EC2 health checks cannot be relied upon to detect when your application has stopped working. To solve this problem, you’ll typically set up an ELB and use ELB health checks to trigger application-level checks to ensure it is still running and processing messages.

The ELB health checks also enable the auto-scaling service to automatically replace unhealthy instances.

A minimum production deployment would therefore cost you $30.75 a month.

 

A recursive Lambda function running non-stop 24/7 would run for 2678400 seconds a month.

60 s * 60 m * 24 hr * 31 days = 2678400 s

If you assign 128MB to your function, then Lambda would cost you $5.61 a month.

Monthly Compute Charge

  = Total Compute (GB-seconds) * ($0.00001667 /GB-second)

  = (2678400 s * 128 MB / 1024 MB) * $0.00001667

  = 334800 GB-seconds * $0.00001667

  = $5.581116

Monthly Request Charge

  = Total Requests * ($ 0.20/Million Reqs)

  = (2678400 s  / 20 s) / 1000000 * $ 0.20

  = 133920 Req / 1000000 * $0.20

  = $0.026784

Monthly Charge (Total)

  = Monthly Compute Charge + Monthly Request Charge

  = $5.581116 + $0.026784

  = $5.6079

Since Lambda’s free tier does not expire 12 months after sign up, this would fall within the free tier of 400000 GB-seconds per month too.

 

However, there are other aspects to consider:

  • you can process several SQS queues in one EC2 instance
  • the cost of ELB is amortised as the size of cluster increases
  • EC2 instance class jumps in cost but also offers more compute and networking capability
  • in order to auto-scale your SQS-processing Lambda functions, you’ll need to provision other resources

The exact cost saving from using Lambda for SQS is not as clear cut as I first thought. But at the very least, you have a more granular control of your cost as you scale out.

A recursively executed, 128MB Lambda function would cost $5.61/month, whereas an autoscaled cluster of t2.micro instances would go up in monthly cost in increments of $10.25.

lambda-vs-ec2-cost

 

Deployment

In the IaaS world, the emergence of container technologies has vastly improved the deployment story.

But, as a developer, I now have another set of technologies which I need to come to grips with. The challenge of navigating this fast-changing space and making sensible decisions on a set of overlapping technologies (Kubernetes, Mesos, Nomad, Docker, Rocket, just to name a few) is not an easy one, and the consequence of these decisions will have long lasting impact on your organization.

Don’t get me wrong, I think container technologies are amazing and am excited to see the pace of innovation in that space. But they’re a solution, not the goal; the goal is, and has always been, to deliver value to users quickly and safely.

Keeping up with all that is happening in the container space is overwhelming, and at times I can’t help but feel that I am trading one set of problems for another.

As a developer, I want to deliver value to my users first and foremost. I want all the benefits container technologies bring, but make their complexities someone else’s problem!

 

One of the best things about AWS Lambda – besides all the reactive programming goodness – is that deployment and scaling become Amazon’s problem.

I don’t have to think about provisioning VMs and clustering applications together; I don’t have to think about scaling the cluster and deploying my code onto it. I don’t have to rely on an ops team to monitor and manage my cluster and streamline our deployment. All of that is Amazon’s problem!

All I have to do to deploy a new version of my code to AWS Lambda is upload a zip file and hook up my Lambda function to the relevant event sources and it’s job done!

Life becomes so much simpler 

With the Serverless framework, things get even easier!

Lambda supports the concept of versions and aliases, where an alias is a named pointer to a published version of your code and has its own ARN. Serverless uses aliases to implement the concept of stages – i.e. dev, staging, prod – to mirror the concept of stages in Amazon API Gateway.

To deploy a new version of your code to staging:

  • you publish a new version with your new code
  • update the staging alias to point to that new version
  • and that’s it! Instant deployment with no downtime!

Similarly, to rollback staging to a previous version:

  • update the staging alias to point to the previous version
  • sit back and admire the instant rollback, again with no downtime!
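In terms of the plain Lambda API, both of these boil down to a publishVersion and/or an updateAlias call; a hedged sketch (the function and alias names are illustrative):

const AWS    = require('aws-sdk');
const lambda = new AWS.Lambda();

// deploy: publish the uploaded code as an immutable version, then repoint the stage alias
lambda.publishVersion({ FunctionName: 'say-hello' }).promise()
  .then(res => lambda.updateAlias({
    FunctionName    : 'say-hello',
    Name            : 'staging',          // the stage alias
    FunctionVersion : res.Version         // the newly published version
  }).promise());

// rollback: just point the alias back at a previous version, e.g.
// lambda.updateAlias({ FunctionName: 'say-hello', Name: 'staging', FunctionVersion: '41' })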

What’s more? Serverless streamlines these processes into a single command!

 

Writing a recursive Lambda function to process SQS messages with Serverless

(p.s. you can find the source code for the experiment here.)

First, let’s create a queue in SQS called HelloWorld.

recursive_sqs_01

Notice that although we have specified the default visibility timeout and receive message wait time (for long polling) values here, we’ll override them in the ReceiveMessage request later.

Then we’ll create a new Serverless project, and add a function called say-hello.

Our project structure looks roughly like this:

recursive_sqs_004

In the handler.js module, let’s add the following.

recursive_sqs_000
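The screenshot isn’t reproduced here, but the gist of it is along these lines (a sketch; I’m assuming the variables are surfaced via process.env):

'use strict';

const AWS    = require('aws-sdk');
const SQS    = new AWS.SQS();
const Lambda = new AWS.Lambda();

const QUEUE_URL = process.env.QUEUE_URL;   // URL of the HelloWorld queue
const FUNC_NAME = process.env.FUNC_NAME;   // name/ARN of this function, used to recurse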

Notice we’re relying on two environment variables here – QUEUE_URL and FUNC_NAME. Both will be populated by Serverless using values that we specify in s-function.json (to understand how this works, check out Serverless’s documentation).

 

Next, we’ll write the handler code for our Lambda function.

Here, we will:

  1. make a ReceiveMessage request to SQS using long polling (20s)
  2. for every message we receive, we’ll process it with a sayHello function (which we’ll write next)
  3. the sayHello function will return a Promise
  4. when all the messages have been processed, we’ll recurse by invoking this Lambda function again

recursive_sqs_001
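A sketch of that handler, reusing the SQS client and environment variables from the setup above (not the original code verbatim):

module.exports.handler = (event, context, callback) => {
  const params = {
    QueueUrl            : QUEUE_URL,
    MaxNumberOfMessages : 10,
    VisibilityTimeout   : 6,               // illustrative; overrides the queue default
    WaitTimeSeconds     : 20               // long polling, overriding the queue default
  };

  SQS.receiveMessage(params).promise()
    .then(resp => {
      let messages = resp.Messages || [];
      return Promise.all(messages.map(msg => sayHello(msg)));   // one Promise per message
    })
    .then(() => recurse())                 // invoke ourselves again to keep polling
    .then(() => callback(null, 'recursing'))
    .catch(err => callback(err));
};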

In the sayHello function, we’ll log a message and delete the message in SQS.

One caveat to remember is that Promise.all will reject immediately if any of the Promises rejects. That is why I’m handling any error related to deleting the message in SQS here with .catch – it restores the chain rather than allowing the rejection to bubble up.

recursive_sqs_002
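Roughly speaking, sayHello does this (a sketch):

function sayHello(message) {
  console.log(`Hello, ${message.Body} of message ID [${message.MessageId}]`);

  const delParams = {
    QueueUrl      : QUEUE_URL,
    ReceiptHandle : message.ReceiptHandle
  };

  return SQS.deleteMessage(delParams).promise()
    // catch here, so one failed delete doesn't reject the whole Promise.all
    .catch(err => console.error('failed to delete message', err));
}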

This implementation, however, doesn’t handle errors arising from processing the message (i.e. logging a message in this case). To do that, you’ll have to wrap the processing logic in a Promise, e.g.

return new Promise((resolve, reject) => {
    // processing logic goes here
    resolve();
})
.then(() => SQS.deleteMessage(delParams).promise())
.then(…)
.catch(…);

 

Finally, let’s add a recurse function to invoke this function again.

recursive_sqs_003

A couple of things to note about this function:

  1. unlike sayHello, it doesn’t catch its own errors; this allows the rejection to bubble up to the main handler function, which will then fail the current execution
    • by failing the current execution this way we can use the Errors metric in CloudWatch to re-trigger this Lambda function (more on this in Part 2)
  2. we’re calling invokeAsync instead of invoke so we don’t have to (in this case we can’t!) wait for the function to return
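For reference, a sketch of the recurse function (again, not the original code verbatim):

function recurse() {
  const params = {
    FunctionName : FUNC_NAME,
    InvokeArgs   : JSON.stringify({})
  };

  // fire-and-forget: we can't wait for the next invocation to return, and we
  // deliberately don't catch errors here so they fail the current execution
  return Lambda.invokeAsync(params).promise();
}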

 

Deploying the Lambda function

You’ll notice that we haven’t provided any values for the two environment variables – QUEUE_URL and FUNC_NAME – we mentioned earlier. That’s because we don’t have the ARN for the Lambda function yet!

So, let’s deploy the function using Serverless. Run serverless dash deploy and follow the prompt.

recursive_sqs_02

Aha, there’s the ARN for our Lambda function!

Go back to the project, open _meta/variables/s-variables-dev-euwest1.json and add the variables queueUrl and funcName.

recursive_sqs_03

However, there’s another layer of abstraction we need to address to ensure our environment variables are populated correctly.

Open processors/say-hello/s-function.json.

In the environment section, add QUEUE_URL and FUNC_NAME like below:

recursive_sqs_14
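To make this concrete, here’s a hedged reconstruction of the two files (the account id and exact values are placeholders).

In _meta/variables/s-variables-dev-euwest1.json:

{
  "queueUrl": "https://sqs.eu-west-1.amazonaws.com/<account-id>/HelloWorld",
  "funcName": "arn:aws:lambda:eu-west-1:<account-id>:function:recursive-lambda-for-sqs-say-hello"
}

and in the environment section of processors/say-hello/s-function.json:

"environment": {
  "QUEUE_URL": "${queueUrl}",
  "FUNC_NAME": "${funcName}"
}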

Now do you see how things fit together?

serverless-env-vars

Since we’ve made changes to environment variables, we’ll need to redeploy our Lambda function.

 

Testing the deployed Lambda function

Once your Lambda function is deployed, you can test it from the management console.

recursive_sqs_04

See that blue Test button on the top left? Click that.

Since our function doesn’t make use of the incoming payload, you can send it anything.

Oops.

If you run the function as it stands right now, you’ll get a permission denied error, because the IAM execution role our function runs with doesn’t have permission to use the SQS queue or to invoke a Lambda function.

That’s great, because it tells me I have granular control over what my Lambda functions can do and I can control it using IAM roles.

But we do need to go back to the project and give our function the necessary permissions.

 

Configuring execution role permissions

Back in the Serverless project, take a look at s-resources-cf.json in the root directory. This is a templated CloudFormation file that’s used to create the AWS resources for your function, including the execution role for your functions.

Wherever you see ${stage} or ${region}, these will be substituted by variables of the same name under _meta/variables/xxx.json files. Have a read of the Serverless documentation if you wanna dig deeper into how the templating system works.

By default, Serverless creates an execution role that can log to CloudWatch Logs.

{
  "Effect": "Allow",
  "Action": [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents"
  ],
  "Resource": "arn:aws:logs:${region}:*:*"
}

You can edit s-resources-cf.json to grant the execution role permissions to use our queue and invoke the function to recurse. Add the following statements:

{
  "Effect": "Allow",
  "Action": [
    "sqs:ReceiveMessage",
    "sqs:DeleteMessage"
  ],
  "Resource": "arn:aws:sqs:${region}:*:HelloWorld"
},
{
  "Effect": "Allow",
  "Action": "lambda:InvokeFunction",
  "Resource": "arn:aws:lambda:${region}:*:function:recursive-lambda-for-sqs-say-hello:${stage}"
}

We can update the existing execution role by deploying the CloudFormation change. To do that with Serverless, run serverless resources deploy (you can also use sls shorthand to save yourself some typing).

recursive_sqs_06

If you test the function again, you’ll see in the logs (which you can find in the CloudWatch Logs management console) that the function is running and recursing.

 

Now, pop over to the SQS management console and select the HelloWorld queue we created earlier. Under the Queue Actions drop down, select Send Message.

In the following popup, you can send a message into the queue.

recursive_sqs_07

Once the message is sent, you can check the CloudWatch Logs again, and see that message is received and a “Hello, Yan of message ID […]” message was logged.

recursive_sqs_08

and voila, our recursive Lambda function is processing SQS messages! 

recurse-lambda-for-sqs-v0

 

In Part 2, we’ll look at a simple approach to autoscale the no. of concurrent executions of our function, and restart failed functions.