Understanding the scaling behaviour of DynamoDB OnDemand tables

You can become a serverless blackbelt. Enrol to my 4-week online workshop Production-Ready Serverless and gain hands-on experience building something from scratch using serverless technologies. At the end of the workshop, you should have a broader view of the challenges you will face as your serverless architecture matures and expands. You should also have a firm grasp on when serverless is a good fit for your system as well as common pitfalls you need to avoid. Sign up now and get 15% discount with the code yanprs15!

Update 15/03/2019: Thanks to Zac Charles who pointed me to this new page in the DynamoDB docs. It explains how the OnDemand capacity mode works. Turns out you DON’T need to pre-warm a table. You just need to create the table with the desired peak throughput (Provisioned), and then change it to OnDemand. After you change the table to OnDemand it’ll still have the same throughput.

To confirm this behaviour, I ran a few more experiments:

  1. create a table with 20k/30k/40k provisioned write throughput
  2. change the table to OnDemand
  3. apply 40k writes/s traffic to the table right away

The result confirms the aforementioned behaviour.

Back when AWS announced DynamoDB AutoScaling in 2017, I took it for a spin and found a number of problems with how it works.

At re:invent 2018, AWS also announced DynamoDB OnDemand. DynamoDB OnDemand tables have different scaling behaviour, which promises to be far superior to DynamoDB AutoScaling.

Talking with a few of my friends at AWS, the consensus is that “AWS customers no longer need to worry about scaling DynamoDB”! So I ran a number of experiments to help me understand how OnDemand tables deal with sudden spikes in traffic.

From 0 to 4000, no problem!

Let’s assume that a newly created table can handle X amount of read/write throughput before it needs to scale. We can find out the value of X by:

  1. create a new table
  2. generate a traffic that ramps up to Y writes/s throughput in one minute (using UUIDs as keys to avoid hotspots)
  3. observe the throughput at which writes are throttled

To help me quickly find the threshold, I used Step Functions to test different values of Y (i.e. peak throughput): 100, 200, 300, 500, 700, 900, 1100, 1300, 1500, 1750, 2000, 2500, 3000, 5000, 7000, and 9000.

For every value of Y up to 3000, there were no throttled writes. But I was not able to perform more than 4000 writes/s.

The conclusion is that, a newly created table can handle 4000 writes/s without any throttling.

Table handles previously CONSUMED peak & more

From the announcement post by Danilo, one paragraph piqued my curiosity in particular.

Is the adaptive behaviour similar to DynamoDB AutoScaling and calculate the next threshold based on the previous peak Consumed Capacity? I explained the problem with this approach in my previous post – the threshold should be based on the throughput you wanted to execute (consumed + throttled), not just what you succeeded with (consumed).

With this question in mind, I reran the experiment for peak traffic of 9000 writes/s. As you can see from the diagrams below, the table’s threshold has been raised to around 7000 writes/s.

The good news is that, the table scaled to support throughput higher than the previous peak Consumed Capacity. The bad news is that it didn’t adapt to our desired throughput of 9000 writes/s so we still experienced some throttling.

The road to 40k writes/s

I ran another experiment to see how quickly a new table can scale to 40,000 writes/s and the steps it took.

For this experiment, I had to use a different Step Functions state machine. Using DynamoDB BatchWrites, I was able to achieve around 10,000 writes/s from a 1GB Lambda function. So to generate 40,000 writes/s I needed to run multiple instance of this function in parallel.

The new state machine looks like this.

As you can see from the result, it took roughly one hour to scale the table to meet our throughput demand of 40,000 writes/s.

Afterwards, we’ll be able to achieve 40,000 writes/s against this table without any problems.

This behaviour opens up an interesting possibility.

Most DynamoDB table would never experience a spike of more than 4000 writes/s. For those that do, you can also “pre-warm” the table to your target peak throughput. After pre-warming the table, it would be able to handle your desired peak throughput.

The road to 40k writes/s, slowly…

What if the traffic ramps up to 40,000 writes/s steadily instead? I ran a series of experiments to find out.

0 to 40k in 60 mins (1 hour)

In this case, the traffic increases steady over the course of an hour until it reached 40,000 writes/s. As you can see from the results below, the rate of progression is too fast for the OnDemand table to adapt to on the fly. As a result we incurred a lot of throttled write events along the way.

0 to 40k in 90 mins (1:30 hours)

Rerunning the experiment and slowing up the rate of progression shows improved results. But we still experienced a lot of throttling.

0 to 40k in 120 mins (2 hours)

0 to 40k in 150 mins (2:30 hours)

As we slowed down the rate the throughput goes up, we experienced less and less throttling along the way. One would assume that there is a rate of progression, Z, where the throughput increases slowly enough that there will be no throttling.

What about costs?

A lot have been made about the fact that OnDemand tables are more expensive per request. A rough estimate suggests it’s around 5–6 times more expensive per request compared to a provisioned table.

While that is the case, my experience with DynamoDB is that most tables are poorly utilized. A quick look at our DynamoDB tables in production paints a familiar picture. Despite the occasional spikes, tables are using less than 10% of the provisioned throughput on average.

And it’s not hard to find even worse offenders.

The bottom line is, we are terrible at managing DynamoDB throughputs! There are so much waste in reserved throughputs that, even with a hefty premium per request for OnDemand tables we will still end up with a huge saving overall.

Experiments are good!

As a side, all these experiments cost us a total of $271.61 for over 217 million write requests.

In the grand scheme of things, this is insignificant. The cost of the engineering time (my time!) to conduct, analyze and summarise the results of these experiments is far greater.

But even beyond that, the value of lessons we learnt is far greater. We now have the confidence to use OnDemand tables for anything that has a peak throughput of less than 4000 per second. Which means:

  • less work for our engineers from an infrastructure-as-code perspective
  • no need for capacity planning (in most cases)

The productivity saving this simple experiment creates for us makes the AWS cost look like peanuts! The moral of the story? Be curious, ask questions, experiment, and share your learnings with the community.


In summary:

  • You should default to DynamoDB OnDemand tables unless you have a stable, predictable traffic.
  • OnDemand tables can handle up to 4000 Consumed Capacity out of the box, after which your operations will be throttled.
  • OnDemand tables would auto-scale based on the Consumed Capacity.
  • Once scaled, you can reach the same throughput again in the future without throttling.
  • You should consider pre-warming OnDemand tables that need to handle spikes of more than 4000 Consumed Capacity to avoid initial throttling.


Liked this article? Support me on Patreon and get direct help from me via a private Slack channel or 1-2-1 mentoring.
Subscribe to my newsletter

Hi, I’m Yan. I’m an AWS Serverless Hero and I help companies go faster for less by adopting serverless technologies successfully.

Are you struggling with serverless or need guidance on best practices? Do you want someone to review your architecture and help you avoid costly mistakes down the line? Whatever the case, I’m here to help.

Hire me.

Skill up your serverless game with this hands-on workshop.

My 4-week Production-Ready Serverless online workshop is back!

This course takes you through building a production-ready serverless web application from testing, deployment, security, all the way through to observability. The motivation for this course is to give you hands-on experience building something with serverless technologies while giving you a broader view of the challenges you will face as the architecture matures and expands.

We will start at the basics and give you a firm introduction to Lambda and all the relevant concepts and service features (including the latest announcements in 2020). And then gradually ramping up and cover a wide array of topics such as API security, testing strategies, CI/CD, secret management, and operational best practices for monitoring and troubleshooting.

If you enrol now you can also get 15% OFF with the promo code “yanprs15”.

Enrol now and SAVE 15%.

Check out my new podcast Real-World Serverless where I talk with engineers who are building amazing things with serverless technologies and discuss the real-world use cases and challenges they face. If you’re interested in what people are actually doing with serverless and what it’s really like to be working with serverless day-to-day, then this is the podcast for you.

Check out my new course, Learn you some Lambda best practice for great good! In this course, you will learn best practices for working with AWS Lambda in terms of performance, cost, security, scalability, resilience and observability. We will also cover latest features from re:Invent 2019 such as Provisioned Concurrency and Lambda Destinations. Enrol now and start learning!

Check out my video course, Complete Guide to AWS Step Functions. In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. There is something for everyone from beginners to more advanced users looking for design patterns and best practices. Enrol now and start learning!