Yubl’s road to Serverless architecture – Part 1

Since Yubl’s closure quite a few people have asked about the serverless architecture we ended up with and some of the things we have learnt along the way.

As such, this is the first of a series of posts where I’d share some of the lessons we learnt. However, bear in mind the pace of change in this particular space so some of the challenges/problems we encountered might have been solved by the time you read this.

ps. many aspects of this series is already covered in a talk I gave on Amazon Lambda at Leetspeak this year, you can find the slides and recording of the talk here.

 

From A Monolithic Beginning

Back when I joined Yubl in April I inherited a monolithic Node.js backend running on EC2 instances, with MongoLab (hosted MongoDB) and CloudAMQP (hosted RabbitMQ) thrown into the mix.

yubl-monolith

There were numerous problems with the legacy system, some could be rectified with incremental changes (eg. blue-green deployment) but others required a rethink at an architectural level. Although things look really simple on paper (at the architecture diagram level), all the complexities are hidden inside each of these 3 services and boy, there were complexities!

My first tasks were to work with the ops team to improve the existing deployment pipeline and to draw up a list of characteristics we’d want from our architecture:

  • able to do small, incremental deployments
  • deployments should be fast, and requires no downtime
  • no lock-step deployments
  • features can be deployed independently
  • features are loosely coupled through messages
  • minimise cost for unused resources
  • minimise ops effort

From here we decided on a service-oriented architecture, and Amazon Lambda seemed the perfect tool for the job given the workloads we had:

  • lots of APIs, all HTTPS, no ultra-low latency requirement
  • lots of background tasks, many of which has soft-realtime requirement (eg. distributing post to follower’s timeline)

 

To a Serverless End

It’s suffice to say that we knew the migration was going to be a long road with many challenges along the way, and we wanted to do it incrementally and gradually increase the speed of delivery as we go.

“The lead time to someone saying thank you is the only reputation metric that matters”

– Dan North

The first step of the migration was to make the legacy systems publish state changes in the system (eg. user joined, user A followed user B, etc.) so that we can start building new features on top of the legacy systems.

To do this, we updated the legacy systems to publish events to Kinesis streams.

 

Our general strategy is:

  • build new features on top of these events, which usually have their own data stores (eg. DynamoDB, CloudSearch, S3, BigQuery, etc.) together with background processing pipelines and APIs
  • extract existing features/concepts from the legacy system into services that will run side-by-side
    • these new services will initially be backed by the same shared MongoLab database
    • other services (including the legacy ones) are updated to use hand-crafted API clients to access the encapsulated resources via the new APIs rather than hitting the shared MongoLab database directly
    • once all access to these resources are done via the new APIs, data migration (usually to DynamoDB tables) will commence behind the scenes
  • wherever possible, requests to existing API endpoints are forwarded to the new APIs so that we don’t have to wait for the iOS and Android apps to be updated (which can take weeks) and can start reaping the benefits earlier

 

After 6 months of hard work, my team of 6 backend engineers (including myself) have drastically transformed our backend infrastructure. Amazon was very impressed by the work we were doing with Lambda and in the process of writing up a case study of our work when Yubl was shut down at the whim of our major shareholder.

Here’s an almost complete picture of the architecture we ended up with (some details are omitted for brevity and clarity).

overall

Some interesting stats:

  • 170 Lambda functions running in production
  • roughly 1GB of total deployment package size (after Janitor Lambda cleans up unreferenced versions)
  • Lambda cost was around 5% of what we pay for EC2 for a comparable amount of compute
  • the no. of production deployments increased from 9/month in April to 155 in September

 

For the rest of the series I’ll drill down into specific features, how we utilised various AWS services, and how we tackled the challenges of:

  • centralised logging
  • centralised configuration management
  • distributed tracing with correlation IDs for Lambda functions
  • keeping Lambda functions warm to avoid coldstart penalty
  • auto-scaling AWS resources that do not scale dynamically
  • automatically clean up old Lambda function versions
  • securing sensitive data (eg. mongodb connection string, service credentials, etc.)

I can also explain our strategy for testing, and running/debugging functions locally, and so on. If there’s anything you’d like me to cover in particular, please leave a comment and let me know.

 

Links

Slides and recording of my Lambda talk at LeetSpeak 2016

AWS Lambda – use recursive function to process SQS messages (Part 2)

First of all, apologies for taking months to write this up since part 1, I have been extremely busy since joining Yubl. We have done a lot of work with AWS Lambda and I hope to share more of the lessons we have learnt soon, but for now take a look at this slidedeck to get a flavour.

 

At the end of part 1 we have a recursive Lambda function that will:

  1. fetch messages from SQS using long-polling
  2. process any received messages
  3. recurse by invoking itself asynchronously

however, on its own there are a number of problems:

  • any unhandled exceptions will kill the loop
  • there’s no way to scale the no. of processing loops elastically

 

We can address these concerns with a simple mechanism where:

  • a DynamoDB table stores a row for each loop that should be running
  • when scaling up, a new row is added to the table, and a new loop is started with the Hash Key (let’s call it a token) as input;
  • on each recursion, the looping function will check if its token is still valid
    • if yes, then the function will upsert a last_used timestamp against its token
    • if not (ie, the token is deleted) then terminates the loop
  • when scaling down, one of the tokens is chosen at random and deleted from the table, which causes the corresponding loop to terminate at the end of its current recursion (see above)
  • another Lambda function is triggered every X mins to look for tokens whose last_used timestamp is older than some threshold, and starts a new loop in its place

Depending on how long it takes the processing function to process all its SQS messages – limited by the max 5 min execution time – you can either adjust the threshold accordingly. Alternatively, you can provide another GUID when starting each loop (say, a loop_id?) and extend the loop’s token check to make sure the loop_id associated with its token matches its own.

 

High Level Overview

To implement this mechanism, we will need a few things:

  • a DynamoDB table to store the tokens
  • a function to look for dead loops restart them
  • a CloudWatch event to trigger the restart function every X mins
  • a function to scale_up the no. of processing loops
  • a function to scale_down the no. of processing loops
  • CloudWatch alarm(s) to trigger scaling activity based on no. of messages in the queue
  • SNS topics for the alarms to publish to, which in turn triggers the scale_up and scale_down functions accordingly*

* the SNS topics are required purely because there’s no support for CloudWatch alarms to trigger Lambda functions directly yet.

architecture

A working code sample is available on github which includes:

  • handler code for each of the aforementioned functions
  • s-resources-cf.json that will create the DynamoDB table when deployed
  • _meta/variables/s-variables-*.json files which also specifies the name of the DynamoDB table (and referenced by s-resources-cf.json above)

I’ll leave you to explore the code sample at your own leisure, and instead touch on a few of the intricacies.

 

Setting up Alarms

Assuming that your queue is mostly empty as your processing is keeping pace with the rate of new items, then a simple staggered set up should suffice here, for instance:

cw-alarm-2

Each of the alarms would send notification to the relevant SNS topic on ALARM and OK.

cw-alarm-1

 

Verifying Tokens

Since there’s no way to message a running Lambda function directly, we need a way to signal it to stop recursing somehow, and this is where the DynamoDB table comes in.

At the start of each recursion, the processing-msg function will check against the table to see if its token still exists. And since the table is also used to record a heartbeat from the function (for the restart function to identify failed recursions), it also needs to upsert the last_used timestamp against the token.

We can combine the two ops in one conditional write, which will fail if the token doesn’t exist, and that error will terminate the loop.

ps. I made the assumption in the sample code that processing the SQS messages is quick, hence why the timeout setting for the process-msg function is 30s (20s long polling + 10s processing) and the restart function’s threshold is a very conservative 2 mins (it can be much shorter even after you take into account the effect of eventual consistency).

If your processing logic can take some time to execute – bear in mind that the max timeout allowed for a Lambda function is 5 mins – then here’s a few options that spring to mind:

  1. adjust the restart function’s threshold to be more than 5 mins : which might not be great as it lengthens your time to recovery when things do go wrong;
  2. periodically update the last_used timestamp during processing : which also needs to be conditional writes, whilst swallowing any errors;
  3. add an loop_id in to the DynamoDB table and include it in the ConditionExpression : that way, you keep the restart function’s threshold low and allow the process-msg function to occasionally overrun; when it does, a new loop is started in its place and takes over its token with a new loop_id so that when the overrunning instance finishes it’ll be stopped when it recurses (and fails to verify its token because the loop_id no longer match)

Both option 2 & 3 strike me as reasonable approaches, depending on whether your processing logic are expected to always run for some time (eg, involves some batch processing) or only in unlikely scenarios (occasionally slow third-party API calls).

 

Scanning for Failed Loops

The restart function performs a table scan against the DynamoDB table to look for tokens whose last_used timestamp is either:

  • not set : the process-msg function never managed to set it during the first recursion, perhaps DynamoDB throttling issue or temporary network issue?
  • older than threshold : the process-msg function has stopped for whatever reason

By default, a Scan operation in DynamoDB uses eventually consistent read, which can fetch data that are a few seconds old. You can set the ConsistentRead parameter to true; or, be more conservative with your thresholds.

Also, there’s a size limit of 1mb of scanned data per Scan request. So you’ll need to perform the Scan operation recursively, see here.

 

Find out QueueURL from SNS Message

In the s-resources-cf.json file, you might have noticed that the DynamoDB table has the queue URL of the SQS queue as hash key. This is so that we could use the same mechanism to scale up/down & restart processing functions for many queues (see the section below).

But this brings up a new question: “when the scale-up/scale-down functions are called (by CloudWatch Alarms, via SNS), how do we work out which queue we’re dealing with?”

When SNS calls our function, the payload looks something like this:

which contains all the information we need to workout the queue URL for the SQS queue in question.

We’re making an assumption here that the triggered CloudWatch Alarm exist in the same account & region as the scale-up and scale-down functions (which is a pretty safe bet I’d say).

 

Knowing when NOT to Scale Down

Finally – and these only apply to the scale-down function – there are two things to keep in mind when scaling down:

  1. leave at least 1 processing loop per queue
  2. ignore OK messages whose old state is not ALARM

The first point is easy, just check if the query result contains more than 1 token.

The second point is necessary because CloudWatch Alarms have 3 states – OK, INSUFFICIENT_DATA and ALARM – and we only want to scale down when transitioned from ALARM => OK.

 

Taking It Further

As you can see, there is a fair bit of set up involved. It’d be a waste if we have to do the same for every SQS queue we want to process.

Fortunately, given the current set up there’s nothing stopping you from using the same infrastructure to manage multiple queues:

  • the DynamoDB table is keyed to the queue URL already, and
  • the scale-up and scale-down functions can already work with CloudWatch Alarms for any SQS queues

Firstly, you’d need to implement additional configuration so that given a queue URL you can work out which Lambda function to invoke to process messages from that queue.

Secondly, you need to decide between:

  1. use the same recursive function to poll SQS, but forward received messages to relevant Lambda functions based on the queue URL, or
  2. duplicate the recursive polling logic in each Lambda function (maybe put them in a npm package so to avoid code duplication)

Depending on the volume of messages you’re dealing with, option 1 has a cost consideration given there’s a Lambda invocation per message (in addition to the polling function which is running non-stop).

 

So that’s it folks, it’s a pretty long post and I hope you find the ideas here useful. If you’ve got any feedbacks or suggestions do feel free to leave them in the comments section below.

AWS Lambda – janitor-lambda function to clean up old deployment packages

When working with AWS Lambda, one of the things to keep in mind is that there’s a per region limit of 75GB total size for all deployment packages. Whilst that sounds a lot at first glance, our small team of server engineers managed to rack up nearly 20GB of deployment packages in just over 3 months!

Whilst we have been mindful of deployment package size (because it affects cold start time) and heavily using Serverless‘s built-in mechanism to exclude npm packages that are not used by each of the functions, the simple fact that deployment is simple and fast means we’re doing A LOT OF DEPLOYMENTS.

Individually, most of our functions are sub-2MB, but many functions are deployed so often that in some cases there are more than 300 deployed versions! This is down to how the Serverless framework deploy functions – by publishing a new version each time. On its own, it’s not a problem, but unless you clean up the old deployment packages you’ll eventually run into the 75GB limit.

 

Some readers might have heard of Netflix’s Janitor Monkey, which cleans up unused resources in your environment – instance, ASG,  EBS volumes, EBS snapshots, etc.

Taking a leaf out of Netflix’s book, we wrote a Lambda function which finds and deletes old versions of your functions that are not referenced by an alias – remember, Serverless uses aliases to implement the concept of stages in Lambda, so not being referenced by an alias essentially equates to an orphaned version.

janitor-lambda

At the time of writing, we have just over 100 Lambda functions in our development environment and around 50 running in production. After deploying the janitor-lambda function, we have cut the code storage size down to 1.1GB, which include only the current version of deployments for all our stages (we have 4 non-prod stages in this account).

lambda-console

sidebar: if you’d like to hear more about our experience with Lambda thus far and what we have been doing, then check out the slides from my talk on the matter, I’d be happy to write them up in more details too when I have more free time.

 

Janitor-Lambda

Without further ado, here’s the bulk of our janitor function:

Since AWS Lambda throttles you on the no. of APIs calls per minute, we had to store the list of functions in the functions variable so that it carries over multiple invocations.

When we hit the (almost) inevitable throttle exception, the current invocation will end, and any functions that haven’t been completely cleaned will be cleaned the next time the function is invoked.

Another thing to keep in mind is that, when using a CloudWatch Event as the source of your function, Amazon will retry your function up to 2 more times on failure. In this case, if the function is retried straight away, it’ll just get throttled again. Hence why in the handler we log and swallow any exceptions:

I hope you have found this post useful, let me know in the comments if you have any Lambda/Serverless related questions.

Slides for my AWS Lambda talk tonight