Finding coldstarts: how long does AWS Lambda keep your idle functions around?

In the last post I compared the coldstart time for Lambda functions with different languages, memory sizes, and code sizes. One of the things I learnt was that idle functions are no longer terminated after 5 minutes of inactivity.


This is fantastic news, and something that Amazon has quietly changed behind the scenes. However, it led me to ask some follow-up questions:

  1. what's the new idle time that would trigger a coldstart?
  2. does it differ by memory allocation?
  3. are functions still recycled 4 hours after the creation of the host VM?

To answer the first 2 questions, I devised an experiment.

First, here are my hypotheses going into the experiment.

WARNING: this experiment is intended to give us a glimpse into the implementation details of the AWS Lambda platform. Experiments like this are fun and satisfy my curiosity, but you shouldn't build your application with the results in mind, as AWS can change these implementation details without notice!


Hypothesis 1: there is an upper bound to how long Lambda allows your function to stay idle before reclaiming the associated resources

This should be a given. Idle functions occupy resources that could be used to help other AWS customers scale up to meet their needs (not to mention that customers aren't paying for their idle functions!). It simply wouldn't make sense for AWS to keep idle functions around forever.

Hypothesis 2: the idle timeout is not a constant

From an implementer's point of view, it might be simpler to keep this timeout constant, i.e. functions are always terminated after X mins of inactivity. However, I'm sure AWS will vary this timeout to optimise for higher utilisation and to keep utilisation levels more evenly distributed across its fleet of physical servers.

For example, if there's an elevated level of resource contention in a region, why not terminate idle functions earlier to free up space?

Hypothesis 3: the upper bound for inactivity varies by memory allocation

An idle function with 1536 MB of memory allocation is wasting a lot more resources than an idle function with 128 MB, so it makes sense for AWS to terminate idle functions with higher memory allocations earlier.

Experiment: find the upper bound for inactivity

To find the upper bound for inactivity, we need a Lambda function to act as the system-under-test and report when it has experienced a coldstart. We also need a mechanism to progressively increase the interval between invocations until we arrive at an interval where every invocation is guaranteed to be a coldstart: the upper bound. We'll consider the upper bound found once we see 10 consecutive coldstarts when the function is invoked X minutes apart.
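The original post doesn't show the system-under-test's code, but the idea can be sketched with a module-level flag that is only initialised when a new container is created (the names here are illustrative):

```python
# Hypothetical sketch of the system-under-test handler.
# A module-level flag is initialised once per container, so it is True
# only on the first invocation after a coldstart.
is_coldstart = True

def handler(event, context):
    global is_coldstart
    was_coldstart = is_coldstart
    is_coldstart = False  # later invocations in this container are warm
    return {"coldstart": was_coldstart}
```

The first invocation in a fresh container reports a coldstart; every subsequent invocation in the same container reports warm.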

To test hypothesis 3, we will also replicate the system-under-test function with different memory allocations.

This experiment is a time-consuming process; it requires discipline and a degree of precision in timing. Suffice to say, I won't be doing this by hand!

My first approach was to use a CloudWatch Schedule to trigger the system-under-test function, and let the function dynamically adjust the schedule based on whether it had experienced a coldstart. It failed miserably: whenever the system-under-test function updated the schedule, the schedule would fire shortly after rather than wait for the newly specified interval…

Instead, I turned to Step Functions for help.

AWS Step Functions allows you to create a state machine where you can invoke Lambda functions, wait for a specified amount of time, execute parallel tasks, retry, catch errors, etc.

A Wait state allows you to drive the number of seconds to wait using data (see the SecondsPath param in the documentation). This means I can start the state machine with an input like this:

    {
        "target": "when-will-i-coldstart-dev-system-under-test-128",
        "interval": 600,
        "coldstarts": 0
    }

The input is passed to another find-idle-timeout function as the invocation event. The function will invoke the target (one of the variants of the system-under-test function with a different memory allocation) and increase the interval if the system-under-test function doesn't report a coldstart. The find-idle-timeout function then returns a new piece of data for the Step Functions execution:
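The core of the find-idle-timeout function can be sketched as a pure state transition (the post doesn't show its code; the 60-second increment and the shape of the state are my assumptions based on the inputs shown):

```python
# Hypothetical sketch of the find-idle-timeout function's core logic.
# `state` is the execution data; `report` is the response from invoking
# the target system-under-test function.
def next_state(state, report):
    new_state = dict(state)
    if report["coldstart"]:
        # Same interval produced a coldstart - count it towards the
        # 10 consecutive coldstarts that mark the upper bound.
        new_state["coldstarts"] = state["coldstarts"] + 1
    else:
        # Still warm at this interval - try a longer one and reset
        # the consecutive-coldstart streak.
        new_state["interval"] = state["interval"] + 60
        new_state["coldstarts"] = 0
    return new_state
```

Starting from the input above, a warm invocation bumps the interval from 600 to 660 seconds, which is exactly the second piece of data shown below.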

    {
        "target": "when-will-i-coldstart-dev-system-under-test-128",
        "interval": 660,
        "coldstarts": 0
    }

Now the Wait state will use the interval value and wait 660 seconds before transitioning back to the FindIdleTimeout state, where it'll invoke our system-under-test function again (with the previous output as input).

"Wait": {
    "Type": "Wait",
    "SecondsPath": "$.interval",
    "Next": "FindIdleTimeout"
}

With this setup I'm able to kick off multiple executions, one for each memory setting.
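Putting the pieces together, the whole state machine might look something like this in Amazon States Language. Only the Wait state comes from the post; the task ARN, the Choice condition on 10 consecutive coldstarts, and the state names are my assumptions:

```json
{
  "StartAt": "FindIdleTimeout",
  "States": {
    "FindIdleTimeout": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:find-idle-timeout",
      "Next": "CheckDone"
    },
    "CheckDone": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.coldstarts",
          "NumericGreaterThanEquals": 10,
          "Next": "Done"
        }
      ],
      "Default": "Wait"
    },
    "Wait": {
      "Type": "Wait",
      "SecondsPath": "$.interval",
      "Next": "FindIdleTimeout"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}
```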

Along the way I have plenty of visibility into what's happening, all from the comfort of the Step Functions management console.
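Kicking off one execution per memory setting can be scripted with boto3 along the lines below. The target naming convention comes from the inputs shown earlier; the list of memory sizes, the state machine ARN, and the helper names are illustrative:

```python
import json

# Memory settings under test - one state machine execution per variant.
# These particular sizes are an assumption, not from the post.
MEMORY_SIZES = [128, 256, 512, 1024, 1536]

def build_input(memory_size, interval=600):
    """Build the initial execution input for one memory variant."""
    return {
        "target": f"when-will-i-coldstart-dev-system-under-test-{memory_size}",
        "interval": interval,
        "coldstarts": 0,
    }

def start_all(sfn_client, state_machine_arn):
    """Start one Step Functions execution per memory setting."""
    for size in MEMORY_SIZES:
        sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            name=f"find-idle-timeout-{size}",
            input=json.dumps(build_input(size)),
        )
```

In practice you'd pass `boto3.client("stepfunctions")` as `sfn_client`.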

Here are the results of the experiment:

From the data, it's clear that AWS Lambda shuts down idle functions around the hour mark. It's also interesting to note that the function with 1536 MB of memory is terminated over 10 mins earlier, which supports hypothesis 3.

I also collected data on all the idle intervals where we saw a coldstart and categorised them into 5-minute brackets.
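The bucketing itself is straightforward; here's a sketch of how intervals could be grouped into 5-minute brackets (the label format is my choice, not the post's):

```python
def bracket(minutes):
    """Map an idle interval (in minutes) to its 5-minute bracket label."""
    lower = (int(minutes) // 5) * 5
    return f"{lower}-{lower + 5} mins"

def categorise(intervals):
    """Count coldstart intervals per 5-minute bracket."""
    counts = {}
    for m in intervals:
        label = bracket(m)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

For example, an idle interval of 47 minutes falls into the "45-50 mins" bracket.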

Even though the data is seriously lacking, from what little I managed to collect you can still spot some high-level trends:

  • over 60% of coldstarts (prior to hitting the upper bound) happened after 45 mins of inactivity
  • the function with 1536 MB of memory sees significantly fewer coldstarts prior to hitting the upper bound (worth noting that it also has a lower upper bound (48 mins) than the other functions in this test)

The data supports hypothesis 2, though there's no way for us to figure out the reason behind these coldstarts or whether there's significance to the 45 mins barrier.


To summarise the findings from our little experiment in one line:

AWS Lambda will generally terminate functions after 45–60 mins of inactivity, although idle functions can sometimes be terminated a lot earlier to free up resources needed by other customers.

I hope you find this experiment interesting, but please do not build applications on the assumptions that:

    a) these results are valid, and

    b) they will remain valid for the foreseeable future

I cannot stress enough that this experiment is meant for fun and to satisfy a curious mind, and nothing more!

The results from this experiment also deserve further investigation. For instance, the 1536 MB function exhibited very different behaviour to the other functions, but is it a special case, or would all functions with more than 1024 MB of memory share these traits? I'd love to find out; maybe I'll write a follow-up to this experiment in the future.

Watch this space ;-)

Like what you're reading? Check out my video course Production-Ready Serverless and learn the essentials of how to run a serverless application in production.

We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we're in Manning's Early Access Program (MEAP).