The problems with DynamoDB Auto Scaling and how it might be improved

TL;DR — AWS announced the long awaited auto scaling capability for DynamoDB, but we found it takes too long to scale up and doesn’t scale up aggressively enough, as it’s held back by using consumed capacity as the scaling metric rather than the actual request count.

Here at Space Ape Games we developed an in-house tech to auto scale DynamoDB throughput and have used it successfully in production for a few years. It’s even integrated with our LiveOps tooling and scales up our DynamoDB tables according to the schedule of live events. This way, our tables are always provisioned just ahead of that inevitable spike in traffic at the start of an event.

Auto scaling DynamoDB is a common problem for AWS customers; I have personally implemented similar tech to deal with this problem at two previous companies. Whilst at Yubl, I even applied the same technique to auto scale Kinesis streams too.

When AWS announced DynamoDB Auto Scaling we were excited. However, the blog post that accompanied the announcement illustrated two problems:

  • the reaction time to scaling up is slow (10–15 mins)
  • it did not scale sufficiently to maintain the 70% utilization level
Notice the high no. of throttled operations despite the scaling activity. If you were scaling the table manually, would you have settled for this result?

It looks as though the author’s test did not match the kind of workload that DynamoDB Auto Scaling is designed to accommodate:

In our case, we also have a high write-to-read ratio (typically around 1:1) because every action the players perform in a game changes their state in some way. So unfortunately we can’t use DAX as a get-out-of-jail-free card.

How DynamoDB auto scaling works

When you modify the auto scaling settings on a table’s read or write throughput, DynamoDB auto scaling automatically creates/updates CloudWatch alarms for that table — four for writes and four for reads.

As you can see from the screenshot below, DynamoDB auto scaling uses CloudWatch alarms to trigger scaling actions. When the consumed capacity units breach the utilization level on the table (which defaults to 70%) for 5 consecutive minutes, it scales up the corresponding provisioned capacity units.
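For illustration, here is a minimal sketch of roughly what one of these scale-up alarms looks like, expressed as a PutMetricAlarm call in Node.js; the alarm name, table name and the exact threshold arithmetic are assumptions based on the alarms we observed, not the exact values DynamoDB creates.

```javascript
// Sketch of the kind of scale-up alarm DynamoDB auto scaling appears to create:
// Sum of consumed capacity over 60s periods, 5 consecutive breaches of
// (provisioned capacity * target utilization). Names and values are illustrative.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

const tableName = 'my-table';          // assumed table name
const provisionedWCU = 50;             // current provisioned write capacity units
const targetUtilization = 0.7;         // the default 70% utilization level

cloudwatch.putMetricAlarm({
  AlarmName: `TargetTracking-table/${tableName}-AlarmHigh`,   // name format is an assumption
  Namespace: 'AWS/DynamoDB',
  MetricName: 'ConsumedWriteCapacityUnits',
  Dimensions: [{ Name: 'TableName', Value: tableName }],
  Statistic: 'Sum',
  Period: 60,                          // 1 minute periods
  EvaluationPeriods: 5,                // 5 consecutive breaches before scaling kicks in
  Threshold: provisionedWCU * targetUtilization * 60,  // per-minute sum equivalent of 70%
  ComparisonOperator: 'GreaterThanOrEqualToThreshold',
  AlarmActions: []                     // the scaling policy ARN would go here
}).promise();
```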

Problems with the current system, and how it might be improved

From our own tests we found DynamoDB’s lacklustre performance at scaling up is rooted in 2 problems:

  1. The CloudWatch alarms require 5 consecutive threshold breaches. When you take into account the latency in CloudWatch metrics (which typically are a few mins behind) it means scaling actions occur up to 10 mins after the specified utilization level is first breached. This reaction time is too slow.
  2. The new provisioned capacity units are calculated based on consumed capacity units rather than the actual request count. The consumed capacity units are themselves constrained by the provisioned capacity units, even though it’s possible to temporarily exceed the provisioned capacity units with burst capacity. What this means is that once you’ve exhausted the saved burst capacity, the actual request count can start to outpace the consumed capacity units, and scaling up is not able to keep pace with the increase in actual request count. We will see the effect of this in the results from the control group later.

Based on these observations, we hypothesize that you can make two modifications to the system to improve its effectiveness:

  1. trigger scaling up after 1 threshold breach instead of 5, which is in line with the mantra of “scale up early, scale down slowly”.
  2. trigger scaling activity based on actual request count instead of consumed capacity units, and calculate the new provisioned capacity units using actual request count as well.

As part of this experiment, we also prototyped these changes (by hijacking the CloudWatch alarms) to demonstrate their improvement.

Testing Methodology

The most important thing for this test is a reliable and reproducible way of generating the desired traffic patterns.

To do that, we have a recursive function that will make BatchPut requests against the DynamoDB table under test every second. The items per second rate is calculated based on the elapsed time (t) in seconds so it gives us a lot of flexibility to shape the traffic pattern we want.

Since a Lambda function can only run for a max of 5 mins, when context.getRemainingTimeInMillis() is less than 2000 the function will recurse and pass the last recorded elapsed time (t) in the payload for the next invocation.
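A minimal sketch of that recursive generator is shown below; the table name, item shape and rateFn are illustrative placeholders rather than the exact code we ran.

```javascript
// Sketch of the recursive traffic generator: write rateFn(t) items per second,
// then recurse into a fresh invocation just before the 5 min Lambda limit.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();
const lambda = new AWS.Lambda();

const TABLE_NAME = 'auto-scaling-test';                // assumed table name
const rateFn = t => 25;                                // placeholder: swap in the Bell Curve / Top Heavy shapes
const sleep = ms => new Promise(res => setTimeout(res, ms));

module.exports.handler = async (event, context) => {
  let t = event.elapsed || 0;                          // elapsed seconds carried over from the last invocation

  while (context.getRemainingTimeInMillis() > 2000) {
    const items = Array.from({ length: Math.round(rateFn(t)) }, (_, i) => ({
      PutRequest: { Item: { id: `${t}-${i}`, payload: 'x' } }
    }));

    // BatchWrite accepts at most 25 items per call, so chunk the requests;
    // throttled writes are expected during the test, so don't fail the run on them
    for (let i = 0; i < items.length; i += 25) {
      await dynamodb
        .batchWrite({ RequestItems: { [TABLE_NAME]: items.slice(i, i + 25) } })
        .promise()
        .catch(() => {});
    }

    t += 1;
    await sleep(1000);
  }

  // recurse: pass the last recorded elapsed time (t) to the next invocation
  await lambda.invoke({
    FunctionName: context.functionName,
    InvocationType: 'Event',
    Payload: JSON.stringify({ elapsed: t })
  }).promise();
};
```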

The result is the continuous, smooth traffic pattern you see below.

We tested with 2 traffic patterns we see regularly.

Bell Curve

This should be a familiar traffic pattern for most — a slow & steady buildup of traffic from the trough to the peak, followed by a faster drop off as users go to sleep. After a period of steady traffic throughout the night, things start to pick up again the next day.

For many of us whose user base is concentrated in the North America region, the peak is usually around 3–4am UK time — all the more reason we need DynamoDB Auto Scaling to do its job and not wake us up!

This traffic pattern is characterised by a) steady traffic at the trough, b) slow & steady build up towards the peak, c) fast drop off towards the trough, and repeat.

Top Heavy

This sudden burst of traffic is usually precipitated by an event — a marketing campaign, a promotion by the app store, or in our case a scheduled LiveOps event.

In most cases these events are predictable and we scale up DynamoDB tables ahead of time via our automated tooling. However, in the unlikely event of an unplanned burst of traffic (and it has happened to us a few times) a good auto scaling system should scale up quickly and aggressively to minimise the disruption to our players.

This pattern is characterised by a) a sharp climb in traffic, b) a slow & steady decline, c) a steady level until the anomaly finishes and traffic goes back to the Bell Curve again.

We tested these traffic patterns against several utilization level settings (the default is 70%) to see how the system handles them. We measured the performance of the system by:

  • the % of successful requests (ie. consumed capacity / request count)
  • the total no. of throttled requests during the test

These results will act as our control group.

We then tested the same traffic patterns against the 2 hypothetical auto scaling changes we proposed above.

To prototype the proposed changes we hijacked the CloudWatch alarms created by DynamoDB auto scaling using CloudWatch Events.

When a PutMetricAlarm API call is made, our change_cw_alarm function is invoked and replaces the existing CloudWatch alarms with the relevant changes — ie. setting EvaluationPeriods to 1 (a single 1-minute breach) for hypothesis 1.

To avoid an invocation loop, the Lambda function only makes changes to the CloudWatch alarm if the EvaluationPeriods has not already been changed to 1.
The change_cw_alarm function changed the breach evaluation period for the CloudWatch alarms to 1 min.
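A minimal sketch of the change_cw_alarm idea, assuming a CloudWatch Events rule that fires on the PutMetricAlarm API call (via CloudTrail); the event shape is simplified and the field names are illustrative.

```javascript
// Sketch of change_cw_alarm: re-put the alarm that DynamoDB auto scaling just
// created/updated, but with EvaluationPeriods set to 1 (hypothesis 1).
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

module.exports.handler = async (event) => {
  // the CloudTrail-backed event carries the parameters of the PutMetricAlarm call
  const alarmName = event.detail.requestParameters.alarmName;

  const { MetricAlarms } = await cloudwatch
    .describeAlarms({ AlarmNames: [alarmName] })
    .promise();
  const alarm = MetricAlarms[0];

  // guard against an invocation loop: our own putMetricAlarm call below fires
  // this function again, so only act if the alarm hasn't already been changed
  if (!alarm || alarm.EvaluationPeriods === 1) return;

  await cloudwatch.putMetricAlarm({
    AlarmName: alarm.AlarmName,
    Namespace: alarm.Namespace,
    MetricName: alarm.MetricName,
    Dimensions: alarm.Dimensions,
    Statistic: alarm.Statistic,
    Period: alarm.Period,
    Threshold: alarm.Threshold,
    ComparisonOperator: alarm.ComparisonOperator,
    EvaluationPeriods: 1,               // hypothesis 1: scale after a single breach
    AlarmActions: alarm.AlarmActions    // hypothesis 2 swaps this for our own SNS topic
  }).promise();
};
```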

For hypothesis 2, we have to take over the responsibility of scaling up the table, as we need to calculate the new provisioned capacity units using a custom metric that tracks the actual request count. This is why the AlarmActions for the CloudWatch alarms are also overridden here.

The SNS topic is subscribed to a Lambda function which scales up the throughput of the table.
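A minimal sketch of what that scale-up Lambda might look like, assuming a custom metric (here called ActualWriteRequestCount, published by the traffic generator) and a fixed target utilization; the namespace, metric name and table name are assumptions for illustration.

```javascript
// Sketch of the hypothesis 2 scale-up Lambda: size the new write capacity off
// the actual request count rather than the consumed capacity units.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();
const dynamodb = new AWS.DynamoDB();

const TABLE_NAME = 'auto-scaling-test';     // assumed table name
const TARGET_UTILIZATION = 0.7;             // keep actual usage at ~70% of provisioned

module.exports.handler = async () => {
  // look at the actual request count over the last few minutes
  const end = new Date();
  const start = new Date(end.getTime() - 5 * 60 * 1000);

  const stats = await cloudwatch.getMetricStatistics({
    Namespace: 'SpaceApe/DynamoDB',                 // assumed custom namespace
    MetricName: 'ActualWriteRequestCount',          // assumed custom metric
    Dimensions: [{ Name: 'TableName', Value: TABLE_NAME }],
    StartTime: start,
    EndTime: end,
    Period: 60,
    Statistics: ['Sum']
  }).promise();

  // take the busiest minute and convert it to a per-second rate
  const peakPerMin = Math.max(...stats.Datapoints.map(d => d.Sum), 0);
  const requestsPerSec = peakPerMin / 60;

  // new provisioned capacity derived from actual request count
  const newWCU = Math.ceil(requestsPerSec / TARGET_UTILIZATION);

  const { Table } = await dynamodb.describeTable({ TableName: TABLE_NAME }).promise();
  const current = Table.ProvisionedThroughput;

  // only ever scale up, and only if the new value is actually higher
  if (newWCU <= current.WriteCapacityUnits) return;

  await dynamodb.updateTable({
    TableName: TABLE_NAME,
    ProvisionedThroughput: {
      ReadCapacityUnits: current.ReadCapacityUnits,  // leave reads unchanged
      WriteCapacityUnits: newWCU
    }
  }).promise();
};
```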

Results (Bell Curve)

The test is set up as follows (sketched in code after the list):

  1. table starts off with 50 write capacity units
  2. traffic holds steady for 15 mins at 25 writes/s
  3. traffic then increases to peak level (300 writes/s) at a steady rate over the next 45 mins
  4. traffic drops off back to 25 writes/s at a steady rate over the next 15 mins
  5. traffic holds steady at 25 writes/s
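A minimal sketch of a rate function matching this schedule (t is elapsed seconds), which could stand in for the placeholder rateFn in the generator sketch above; the real traffic shape we generated was smoother, so treat this piecewise-linear version as illustrative only.

```javascript
// Bell Curve schedule: 15 min trough, 45 min ramp to peak, 15 min drop, then steady
const bellCurveRateFn = (t) => {
  if (t < 900) return 25;                                    // 15 mins steady at 25 writes/s
  if (t < 3600) return 25 + (300 - 25) * (t - 900) / 2700;   // 45 min ramp up to 300 writes/s
  if (t < 4500) return 300 - (300 - 25) * (t - 3600) / 900;  // 15 min drop back to 25 writes/s
  return 25;                                                 // hold steady at 25 writes/s
};
```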

All the units in the diagrams are of SUM/min, which is how CloudWatch tracks ConsumedWriteCapacityUnits and WriteThrottleEvents, but I had to normalise the ProvisionedWriteCapacityUnits (which is tracked as a per-second unit) to make them consistent.
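The normalisation itself is just a matter of multiplying the per-second figure by 60, along these lines:

```javascript
// ProvisionedWriteCapacityUnits is reported as a per-second value, so convert
// it to the same SUM/min basis as ConsumedWriteCapacityUnits and WriteThrottleEvents
const toPerMinute = perSecondCapacityUnits => perSecondCapacityUnits * 60;
```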

Let’s start by seeing how the control group (vanilla DynamoDB auto scaling) performed at different utilization levels from 30% to 80%.

I’m not sure why the total consumed units and total request count metrics didn’t match exactly when the utilization is between 30% and 50%, but seeing as there were no throttled events I’m going to put that difference down to inaccuracies in CloudWatch.

I make several observations from these results:

  1. At 30%-50% utilization levels, write ops are never throttled — this is what we want to see in production.
  2. At the 60% utilization level, the slow reaction time (problem 1) caused writes to be throttled early on as the system adjusted to the steady increase in load, but it was eventually able to adapt.
  3. At the 70% and 80% utilization levels, things really fell apart. The growth in the actual request count outpaced the growth of consumed capacity units, and more and more write ops were throttled as the system failed to adapt to the new level of actual utilization (as opposed to the “allowed” utilization measured by consumed capacity units, ie problem 2).

Hypothesis 1: scaling after 1 min breach

Some observations:

  1. At 30%-50% utilization levels, there’s no difference in performance.
  2. At the 60% utilization level, the early throttled writes we saw in the control group are now addressed as we decreased the reaction time of the system.
  3. At 70%-80% utilization levels, there is negligible difference in performance. This is to be expected as the poor performance in the control group is caused by problem 2, so improving reaction time alone is unlikely to significantly improve performance in these cases.

Hypothesis 2: scaling after 1 min breach on actual request count

Scaling on actual request count and using actual request count to calculate the new provisioned capacity units yields amazing results. There were no throttled events at 30%-70% utilization levels.

Even at the 80% utilization level, both the success rate and the total no. of throttled events have improved significantly.

This is an acceptable level of performance for an auto scaling system, one that I’ll be happy to use in a production environment. That said, I’d still err on the side of caution and choose a utilization level at or below 70% to give the table enough headroom to deal with sudden spikes in traffic.

Results (Top Heavy)

The test is set up as follows (sketched in code after the list):

  1. table starts off with 50 write capacity units
  2. traffic holds steady for 15 mins at 25 writes/s
  3. traffic then climbs to peak level (300 writes/s) at a steady rate over the next 5 mins
  4. traffic then decreases at a rate of 3 writes/s per minute
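And a minimal sketch of a rate function for this schedule (t is elapsed seconds); the 25 writes/s floor is an assumption to keep the tail consistent with the trough level used in the Bell Curve test.

```javascript
// Top Heavy schedule: 15 min trough, 5 min spike to peak, then a slow decay
const topHeavyRateFn = (t) => {
  if (t < 900) return 25;                                   // 15 mins steady at 25 writes/s
  if (t < 1200) return 25 + (300 - 25) * (t - 900) / 300;   // 5 min climb to 300 writes/s
  return Math.max(25, 300 - 3 * (t - 1200) / 60);           // decay at 3 writes/s per minute
};
```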

Once again, let’s start by looking at the performance of the control group (vanilla DynamoDB auto scaling) at various utilization levels.

Some observations from the results above:

  1. At 30%-60% utilization levels, most of the throttled writes can be attributed to the slow reaction time (problem 1). Once the table started to scale up, the no. of throttled writes quickly decreased.
  2. At 70%-80% utilization levels, the system also didn’t scale up aggressively enough (problem 2). Hence we experienced throttled writes for much longer, resulting in much worse performance overall.

Hypothesis 1: scaling after 1 min breach

Some observations:

  1. Across the board the performance has improved, especially at the 30%-60% utilization levels.
  2. At 70%-80% utilization levels we’re still seeing the effect of problem 2 — not scaling up aggressively enough. As a result, there’s still a long tail to the throttled write ops.

Hypothesis 2: scaling after 1 min breach on actual request count

Similar to what we observed with the Bell Curve traffic pattern, this implementation is significantly better at coping with sudden spikes in traffic at all utilization levels tested.

Even at the 80% utilization level (which really doesn’t leave you with a lot of headroom) an impressive 94% of write operations succeeded (compared with 73% recorded by the control group). Whilst there is still a significant no. of throttled events, it compares favourably against the 500k+ count recorded by the vanilla DynamoDB auto scaling.

Conclusions

I like DynamoDB, and I would like to use its auto scaling capability out of the box, but it just doesn’t quite match my expectations at the moment. I hope someone from AWS is reading this, and that this post provides sufficient proof (as you can see from the data below) that it can be vastly improved with relatively small changes.

Feel free to play around with the demo; all the code is available here.