The problems with DynamoDB Auto Scaling and how it might be improved

TL;DR – AWS announced the long awaited auto scaling capability for DynamoDB, but we found it takes too long to scale up and doesn’t scale up aggressively enough as it’s held back by using consumed capacity as scaling metric rather than actual request count.

Here at Space Ape Games we developed an in-house tech to auto scale DynamoDB throughput and have used it successfully in production for a few years. It’s even integrated with our LiveOps tooling and scales up our DynamoDB tables according to the schedule of live events. This way, our tables are always provisioned just ahead of that inevitable spike in traffic at the start of an event.

Auto scaling DynamoDB is a common problem for AWS customers, I have personally implemented similar tech to deal with this problem at two previous companies. Whilst at Yubl, I even applied the same technique to auto scale Kinesis streams too.

When AWS announced DynamoDB Auto Scaling we were excited. However, the blog post that accompanied the announcement illustrated two problems:

  • the reaction time to scaling up is slow (10–15 mins)
  • it did not scale sufficiently to maintain the 70% utilization level
Notice the high no. of throttled operations despite the scaling activity. If you were scaling the table manually, would you have settled for this result?

Notice the high no. of throttled operations despite the scaling activity. If you were scaling the table manually, would you have settled for this result?

It looks as though the author’s test did not match the kind of workload that DynamoDB Auto Scaling is designed to accommodate:

In our case, we also have a high write-to-read ratio (typically around 1:1) because every action the players perform in a game changes their state in some way. So unfortunately we can’t use DAX as a get-out-of-jail free card.

How DynamoDB auto scaling works

When you modify the auto scaling settings on a table’s read or write throughput, it automatically creates/updates CloudWatch alarms for that table – four for writes and four for reads.

As you can see from the screenshot below, DynamoDB auto scaling uses CloudWatch alarms to trigger scaling actions. When the consumed capacity units breaches the utilization level on the table (which defaults to 70%) for 5 mins consecutively it will then scale up the corresponding provisioned capacity units.

Problems with the current system, and how it might be improved

From our own tests we found DynamoDB’s lacklustre performance at scaling up is rooted in 2 problems:

  1. The CloudWatch alarms requires 5 consecutive threshold breaches. When you take into account the latency in CloudWatch metrics (which typically are a few mins behind) it means scaling actions occur up to 10 mins after the specified utilization level is first breached. This reaction time is too slow.
  2. The new provisioned capacity unit is calculated based on consumed capacity units rather than the actual request count. The consumed capacity units is itself constrained by the provisioned capacity units even though it’s possible to temporarily exceed the provisioned capacity units with burst capacity. What this means is that once you’ve exhausted the saved burst capacity, the actual request count can start to outpace the consumed capacity units and scaling up is not able to keep pace with the increase in actual request count. We will see the effect of this in the results from the control group later.

Based on these observations, we hypothesize that you can make two modifications to the system to improve its effectiveness:

  1. trigger scaling up after 1 threshold breach instead of 5, which is in-line with the mantra of “scale up early, scale down slowly”.
  2. trigger scaling activity based on actual request count instead of consumed capacity units, and calculate the new provisioned capacity units using actual request count as well.

As part of this experiment, we also prototyped these changes (by hijacking the CloudWatch alarms) to demonstrate their improvement.

Testing Methodology

The most important thing for this test is a reliable and reproducible way of generating the desired traffic patterns.

To do that, we have a recursive function that will make BatchPut requests against the DynamoDB table under test every second. The items per second rate is calculated based on the elapsed time (t) in seconds so it gives us a lot of flexibility to shape the traffic pattern we want.

Since a Lambda function can only run for a max of 5 mins, when context.getRemainingTimeInMillis() is less than 2000 the function will recurse and pass the last recorded elapsed time (t) in the payload for the next invocation.

The result is a continuous, smooth traffic pattern you see below.

We tested with 2 traffic patterns we see regularly.

Bell Curve

This should be a familiar traffic pattern for most – a slow & steady buildup of traffic from the trough to the peak, followed by a faster drop off as users go to sleep. After a period of steady traffic throughout the night things start to pick up again the next day.

For many of us whose user base is concentrated in the North America region, the peak is usually around 3–4am UK time – the more reason we need DynamoDB Auto Scaling to do its job and not wake us up!

This traffic pattern is characterised by a) steady traffic at the trough, b) slow & steady build up towards the peak, c) fast drop off towards the trough, and repeat.

This traffic pattern is characterised by a) steady traffic at the trough, b) slow & steady build up towards the peak, c) fast drop off towards the trough, and repeat.

Top Heavy

This sudden burst of traffic is usually precipitated by an event – a marketing campaign, a promotion by the app store, or in our case a scheduled LiveOpsevent.

In most cases these events are predictable and we scale up DynamoDB tables ahead of time via our automated tooling. However, in the unlikely event of an unplanned burst of traffic (and it has happened to us a few times) a goodauto scaling system should scale up quickly and aggressively to minimise the disruption to our players.

This pattern is characterised by a) sharp climb in traffic, b) a slow & steady decline, c) stay at a stead level until the anomaly finishes and it goes back to the Bell Curve again.

This pattern is characterised by a) sharp climb in traffic, b) a slow & steady decline, c) stay at a stead level until the anomaly finishes and it goes back to the Bell Curve again.

We tested these traffic patterns against several utilization level settings (default is 70%) to see how it handles them. We measured the performance of the system by:

  • the % of successful requests (ie. consumed capacity / request count)
  • the total no. of throttled requests during the test

These results will act as our control group.

We then tested the same traffic patterns against the 2 hypothetical auto scaling changes we proposed above.

To prototype the proposed changes we hijacked the CloudWatch alarms created by DynamoDB auto scaling using CloudWatch events.

When a PutMetricAlarm API call is made, our change_cw_alarm function is invoked and replaces the existing CloudWatch alarms with the relevant changes – ie. set the EvaluationPeriods to 1 minute for hypothesis 1.

To avoid an invocation loop, the Lambda function will only make changes to the CloudWatch alarm if the EvaluationPeriod has not been changed to 1 min already.

To avoid an invocation loop, the Lambda function will only make changes to the CloudWatch alarm if the EvaluationPeriod has not been changed to 1 min already.

The change_cw_alarm function changed the breach threshold for the CloudWatch alarms to 1 min.

The change_cw_alarm function changed the breach threshold for the CloudWatch alarms to 1 min.

For hypothesis 2, we have to take over the responsibility of scaling up the table as we need to calculate the new provisioned capacity units using a custom metric that tracks the actual request count. Hence why the AlarmActions for the CloudWatch alarm is also overridden here.

The SNS topic is subscribed to a Lambda function which scales up the throughput of the table.

The SNS topic is subscribed to a Lambda function which scales up the throughput of the table.

Result (Bell Curve)

The test is setup as following:

  1. table starts off with 50 write capacity unit
  2. traffic holds steady for 15 mins at 25 writes/s
  3. traffic then increases to peak level (300 writes/s) at a steady rate over the next 45 mins
  4. traffic drops off back to 25 writes/s at a steady rate over the next 15 mins
  5. traffic holds steady at 25 writes/s

All the units in the diagrams are of SUM/min, which is how CloudWatch tracks ConsumedWriteCapacityUnits and WriteThrottleEvents, but I had to normalise the ProvisionedWriteCapacityUnits (which is tracked as per second unit) to make them consistent.

Let’s start by seeing how the control group (vanilla DynamoDB auto scaling) performed at different utilization levels from 30% to 80%.

I’m not sure why the total consumed units and total request count metrics didn’t match exactly when the utilization is between 30% and 50%, but seeing as there were no throttled events I’m going to put that difference down to inaccuracies in CloudWatch.

I’m not sure why the total consumed units and total request count metrics didn’t match exactly when the utilization is between 30% and 50%, but seeing as there were no throttled events I’m going to put that difference down to inaccuracies in CloudWatch.

I make several observations from these results:

  1. At 30%-50% utilization levels, write ops are never throttled – this is what we want to see in production.
  2. At 60% utilization level, the slow reaction time (problem 1) caused writes to be throttled early on as the system adjust to the steady increase in load but it was eventually able to adapt.
  3. At 70% and 80% utilization level, things really fell apart. The growth in the actual request count outpaced the growth of consumed capacity units, more and more write ops were throttled as the system failed to adapt to the new level of actual utilization (as opposed to “allowed”utilization measured by consumed capacity units, ie problem 2).

Hypothesis 1 : scaling after 1 min breach

Some observations:

  1. At 30%-50% utilization level, there’s no difference to performance.
  2. At 60% utilization level, the early throttled writes we saw in the control group is now addressed as we decreased the reaction time of the system.
  3. At 70%-80% utilization levels, there is negligible difference in performance. This is to be expected as the poor performance in the control group is caused by problem 2, so improving reaction time alone is unlikely to significantly improve performances in these cases.

Hypothesis 2 : scaling after 1 min breach on actual request count

Scaling on actual request count and using actual request count to calculate the new provisioned capacity units yields amazing results. There were no throttled events at 30%-70% utilization levels.

Even at 80% utilization level both the success rate and total no. of throttled events have improved significantly.

This is an acceptable level of performance for an autoscaling system, one that I’ll be happy to use in a production environment. Although, I’ll still lean on the side of caution and choose a utilization level at or below 70% to give the table enough headroom to deal with sudden spikes in traffic.

Results (Top Heavy)

The test is setup as following:

  1. table starts off with 50 write capacity unit
  2. traffic holds steady for 15 mins at 25 writes/s
  3. traffic then jumps to peak level (300 writes/s) at a steady rate over the next 5 mins
  4. traffic then decreases at a rate of 3 writes/s per minute

Once again, let’s start by looking at the performance of the control group (vanilla DynamoDB auto scaling) at various utilization levels.

Some observations from the results above:

  1. At 30%-60% utilization levels, most of the throttled writes can be attributed to the slow reaction time (problem 1). Once the table started to scale up the no. of throttled writes quickly decreased.
  2. At 70%-80% utilization levels, the system also didn’t scale up aggressively enough (problem 2). Hence we experienced throttled writes for much longer, resulting in a much worse performance overall.

Hypothesis 1 : scaling after 1 min breach

Some observations:

  1. Across the board the performance has improved, especially at the 30%-60% utilization levels.
  2. At 70%-80% utilization levels we’re still seeing the effect of problem 2 – not scaling up aggressively enough. As a result, there’s still a long tail to the throttled write ops.

Hypothesis 2 : scaling after 1 min breach on actual request count

Similar to what we observed with the Bell Curve traffic pattern, this implementation is significantly better at coping with sudden spikes in traffic at all utilization levels tested.

Even at 80% utilization level (which really doesn’t leave you with a lot of head room) an impressive 94% of write operations succeeded (compared with 73% recorded by the control group). Whilst there is still a significant no. of throttled events, it compares favourably against the 500k+ count recorded by the vanilla DynamoDB auto scaling.

Conclusions

I like DynamoDB, and I would like to use its auto scaling capability out of the box but it just doesn’t quite match my expectations at the moment. I hope someone from AWS is reading this, and that this post provides sufficient proof (as you can see from the data below) that it can be vastly improved with relatively small changes.

Feel free to play around with the demo, all the code is available here.

Running and debugging AWS Lambda functions locally with the Serverless framework and VS Code

One of the complaints developers often have for AWS Lambda is the inability to run and debug functions locally. For Node.js at least, the Serverless framework and VS Code provides a good solution for doing just that.

An often underused feature of the Serverless framework is the invoke local command, which runs your code locally by emulating the AWS Lambda environment. Granted, it’s not a perfect simulation and only works with Node.js and Python, but it has been good enough for most of local development needs.

With VS Code, you have the ability to debug Node.js applications, including the option to launch an external program.

Put the two together and you have the ability to locally run and debug your Lambda functions.

Step 1 : install Serverless framework as dev dependency

In general, it’s a good idea to install Serverless framework as a dev dependency in a project because:

  1. it allows other developers (and the CI server) to use the Serverless framework for deployment without having to install it themselves
  2. it prevents incompatibility issues when you have an incompatible version of Serverless framework installed to that used by the serverless.yml file in the project
  3. since Serverless v1.16.0 dev dependencies are excluded from the deployment package so it wouldn’t add to your deployment size (this is broken in the current version v1.18.0 but should be fixed shortly)

Step 2 : add debug configuration

Invoke the “sls invoke local” CLI command against the “hello” function with an empty object {} as input. It’s also possible to invoke the function with a JSON file, see doc here.

Step 3 : enjoy!

There, nice and easy :-)

Couple of things to note:

  • if your function depends on environment variables, then you can set those up in the launch.json config file in step 2
  • if your function needs to access other AWS resources, then you also need to setup the relevant environment variables (eg. AWS_PROFILE) for the aws-sdk to access those resources in the correct AWS account
  • this approach will not work for recursive functions (well, the recursion will happen on the deployed Lambda function, so you won’t be able to debug it)

I’m running a live course on designing serverless architecture with AWS Lambda

Hi everyone, just a quick note to let you know that I’m running a live online course with O’Reilly on designing serverless architectures with AWS Lambda.

It’s a 2-day course on September 11-12th with 6 hours in total, and it’s available for free if you have a subscription with SafariBooksOnline. Registration for the course is open till September 7th, so if there are still spaces available then you can even sign up for a free 10-day trial on SafariBooksOnline just before registration closes and get the course for free.

Sign up here.

The course will cover a variety of topics over the two days:

  • AWS Lambda basics
  • the Serverless framework
  • testing strategies
  • CI/CD
  • centralised logging
  • distributed tracing
  • monitoring
  • performance considerations including cold starts
  • config management
  • Lambda in VPC
  • security
  • best practices with API Gateway and Kinesis
  • step functions
  • explore several design patterns with Lambda

Slides for my serverless security talk

Applying the Saga pattern with AWS Lambda and Step Functions

The Saga pattern is a pattern for managing failures, where each action has a compensating action for rollback.

In Hector Garcia-Molina’s 1987 paper, it is described as an approach to handling system failures in a long-running transactions.

It has become increasingly relevant in the world of microservices as application logic often needs to transact across multiple bounded contexts – each encapsulated by its own microservice with independent databases. Caitie McCaffrey gave a good talk on using the Saga pattern in distributed systems, which you can watch here.

Using Caitie’s example from her talk, suppose we have a transaction that goes something like this:

Begin transaction
    Start book hotel request
    End book hotel request
    Start book flight request
    End book flight request
    Start book car rental request
    End book car rental request
End transaction

We can model each of the actions (and their compensating actions) with a Lambda function, and use a state machine in Step Function as the coordinator for the saga.

Because the compensating actions can also fail so we need to be able to retry them until success, which means they have to be idempotent.

In the example below, we’ll implement backward recovery in the event of a failure.

Each Lambda function expects the input to be in the following shape.

{
  "trip_id": "5c12d94a-ee6a-40d9-889b-1d49142248b7",
  "depart": "London",
  "depart_at": "2017-07-10T06:00:00.000Z",
  "arrive": "Dublin",
  "arrive_at": "2017-07-12T08:00:00.000Z",
  "hotel": "holiday inn",
  "check_in": "2017-07-10T12:00:00.000Z",
  "check_out": "2017-07-12T14:00:00.000Z",
  "rental": "Volvo",
  "rental_from": "2017-07-10T00:00:00.000Z",
  "rental_to": "2017-07-12T00:00:00.000Z"
}

Inside each of the functions is a simple PutItem request against a different DynamoDB table. The corresponding compensating function will perform a DeleteItem against the corresponding table to rollback the PutItem action.

The state machine pass the same input to each action in turn:

  1. BookHotel
  2. BookFlight
  3. BookRental

and record their results at a specific path (so to avoid overriding the input $ that will be passed to the next function).

In this naive implementation, we’ll apply the compensating action for any failure – hence the State.ALL below. In practice, you should consider giving certain error types a retry – eg. temporal errors such as DynamoDB’s provision throughput exceeded exceptions.

Success Case

Following the happy path, each of the actions are performed in turn and the state machine will end successfully.

Failure Cases

When failures strike, depending on where the failure occurs we need to apply the corresponding compensating actions in turn.

In the examples below, if the failure happened at BookFlight, then both CancelFlight and CancelHotel will be executed to rollback any changes performed thus far.

Similar, if the failure happened at BookRental, then all three compensating actions – CancelRental, CancelFlight and CancelHotel – will be executed in that order to rollback all the state changes from the transaction.

Each compensating action also have an infinite retry loop! In practice, there should be a reasonable upper limit on the no. of retries before you alert for human intervention.

You can find the source code for this demo here.