Amazon ELB – Some caveats around health check pings

We recently found out about an interesting, undocumented behaviour of Amazon's Elastic Load Balancing (ELB) service – that health check pings are performed by each and every instance running your ELB service at every health check interval.

Intro to ELB

But first, let me fill in some background information for readers who are not familiar with the various services that are part of the AWS ecosystem.

ELB is a managed load balancing service which automatically scales up and down based on traffic. You can easily set up periodic health check pings against your EC2 instances to ensure that requests are only routed to availability zones and instances that are deemed healthy based on the results of those pings.

In addition, you can use ELB's health checks in conjunction with Amazon's Auto Scaling service to ensure that instances that repeatedly fail health checks are terminated and replaced with fresh, new instances.

Generally speaking, the ELB health check should ping an endpoint on your web service (and not the default IIS page…) so that a successful ping can at least tell you that your service is still running. However, given the number of external services our game servers typically depend on (distributed cache cluster, other AWS services, Facebook, etc.) we made the decision to have our ping handlers do more extensive checks, to make sure the server can still communicate with those services (which might not be the case if the instance is experiencing networking/hardware failures).
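As a rough sketch of what such a ping handler might look like (the dependency names and check callables here are hypothetical, not our actual implementation, which ran on .NET/IIS):

```python
# Hypothetical sketch of a "deep" ping handler: each dependency is
# probed by a cheap callable, and any failure turns the ping into a 500.
def ping_status(dependency_checks):
    """Return (http_status, failed_dependencies) for the ELB ping endpoint.

    dependency_checks maps a dependency name to a zero-argument callable
    that returns True if the dependency is reachable.
    """
    failures = []
    for name, check in dependency_checks.items():
        try:
            if not check():
                failures.append(name)
        except Exception:
            # Treat a raised exception (timeout, connection error) as a failure.
            failures.append(name)
    return (200 if not failures else 500), failures

# Example usage with stubbed checks (real ones might do a no-op cache
# read against Couchbase or a lightweight Facebook API call):
status, failed = ping_status({
    "couchbase": lambda: True,
    "facebook": lambda: True,
})
# status == 200, failed == []
```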

Some ELB caveats

However, we noticed that at peak times our cluster of Couchbase nodes was hit with a thousand pings at exactly the same time from our 100+ game servers at every ELB health check interval, and the number of hits just didn't make sense to us! Working with the AWS support team, who were very helpful, we got hold of the ELB logs, which revealed that 8 health check pings were issued against each game server at each interval.

It turns out that each instance that runs your ELB service (the number of instances varies depending on the current load) will ping each of your instances behind the load balancer once per health check interval, at exactly the same time.

A further caveat is that, for an instance to be considered unhealthy, it needs to fail the required number of consecutive pings from an individual ELB instance's perspective. This means it's possible for your instance to be considered unhealthy from the ELB's point of view without it having failed the required number of consecutive pings from your instance's perspective.

To help visualize this problem, suppose there are currently 4 instances running the ELB service for your environment, labelled ELB inst 1–4 below. Let's assume that the instance behind the ELB always receives pings from the ELB instances in the same order, from ELB inst 1 to 4. So from our instance's point of view, it receives pings from ELB inst 1, then ELB inst 2, then ELB inst 3, and so on.

Each of these instances will ping your instances once per health check interval. In the last 2 intervals, our instance failed to respond in a timely fashion to the ping from ELB inst 3, but responded successfully to all other pings. So from our instance's point of view it has failed 2 out of 8 pings, but not consecutively. From ELB inst 3's perspective, however, the instance has failed 2 consecutive pings and should therefore be considered unhealthy and stop receiving requests until it passes the required number of consecutive pings.
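The scenario above can be sketched with a small simulation. The unhealthy threshold of 2 and the ELB instance IDs are illustrative; the point is that per-instance failure streaks can trip the threshold even when the failures are not consecutive globally:

```python
# Sketch of per-ELB-instance consecutive-failure counting, as described
# above. Each ELB instance tracks its own failure streak; a streak
# reaching the unhealthy threshold flags the backend as unhealthy from
# that ELB instance's perspective.
from collections import defaultdict

def elbs_that_flag_us(ping_results, unhealthy_threshold):
    """ping_results: list of (elb_id, succeeded) in arrival order.
    Returns the set of ELB instance IDs that saw `unhealthy_threshold`
    consecutive failures from their own perspective."""
    streak = defaultdict(int)
    flagged = set()
    for elb_id, succeeded in ping_results:
        if succeeded:
            streak[elb_id] = 0
        else:
            streak[elb_id] += 1
            if streak[elb_id] >= unhealthy_threshold:
                flagged.add(elb_id)
    return flagged

# Two intervals of pings from ELB inst 1-4; only ELB inst 3's pings fail.
# Globally that is 2 failures out of 8, and not consecutive...
pings = [(1, True), (2, True), (3, False), (4, True),
         (1, True), (2, True), (3, False), (4, True)]
# ...but ELB inst 3 sees 2 consecutive failures and flags the instance.
elbs_that_flag_us(pings, unhealthy_threshold=2)  # → {3}
```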


Since the implementation details of the ELB are abstracted away from us (and rightly so!) it's difficult for us to test what happens in this situation – whether all the other ELB instances will immediately stop routing traffic to that instance, or whether only the routing choices made by ELB inst 3 are affected.

From an implementation point of view, I can understand why it was built this way: simplicity is the likely reason, and it does cover all but the most unlikely of events. However, from the end user's point of view of a service that's essentially a black box, the promised (or at the very least the expected) behaviour of the service differs, albeit subtly, from what happens in reality, and some implicit assumptions were made about what we would be doing in the ping handler.

The behaviours we expected from the ELB were:

  • ELB will ping our servers at every health check interval
  • ELB will mark an instance as unhealthy if it fails x number of consecutive health checks

What actually happens is:

  • ELB will ping our servers numerous times at every health check interval, depending on the number of ELB instances
  • ELB will mark an instance as unhealthy if it fails x number of consecutive health checks from a particular ELB instance


If there are expensive health checks you would like to perform on your service, but you still want to use the ELB health check mechanism to stop traffic from being routed to bad instances and have the Auto Scaling service replace them, one simple workaround is to perform the expensive operations in a timer event which you control, and let your ping handler simply respond with an HTTP 200 or 500 status code depending on the result of the last internal health check.
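A minimal sketch of that workaround (class and method names are illustrative; our actual implementation was on .NET/IIS, but the shape is the same in any language):

```python
# Run the expensive checks on a background timer and cache the result;
# the ELB ping handler just reports the cached status, so every ping
# is cheap no matter how many ELB instances are pinging you.
import threading

class HealthCache:
    def __init__(self, expensive_check, interval_seconds):
        self._check = expensive_check      # zero-arg callable -> bool
        self._interval = interval_seconds
        self._healthy = True               # optimistic until first check
        self._lock = threading.Lock()

    def _run(self):
        try:
            result = self._check()
        except Exception:
            result = False
        with self._lock:
            self._healthy = result
        # Reschedule the next check; daemon so it won't block shutdown.
        timer = threading.Timer(self._interval, self._run)
        timer.daemon = True
        timer.start()

    def start(self):
        self._run()  # run the first check immediately, then on a timer

    def ping_status(self):
        """Cheap handler for the ELB ping endpoint."""
        with self._lock:
            return 200 if self._healthy else 500
```

The ELB's health check settings (interval, timeout, thresholds) stay exactly as before; only the cost of answering each ping changes.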