Design for Latency Issues

The most common issues I have encountered in production are latency/performance related. They can be symptoms of a whole host of underlying causes, ranging from AWS network issues (which can also manifest themselves as latency/error-rate spikes in any of the AWS services), to overloaded servers, to simple GC pauses.

Latency issues are inevitable – as much as you can improve the performance of your application, things will go wrong, eventually, and often they’re out of your control.

So you must design for them, and degrade the quality of your application gracefully to minimize the impact on your users’ experiences.

As backend developers, one of the fallacies that we often fall into is to allow our dev environments to be too lenient. Servers and databases are never under load during development, so we lure client developers into a false sense of comfort and set them up to fail when the application runs into a slow-responding server in production for the first time.

Latency Injection

To program my fellow client developers to always be mindful of latency spikes, we decided to inject random latency delays on every request:

  1. check if we should inject a random delay;
  2. if yes, then work out how much latency to inject and sleep the thread;
  3. and finally invoke the original method to process the request

This is an implementation pattern that can be automated. I wrote a simple PostSharp attribute to do this, whilst piggybacking on existing configuration mechanisms to control its behaviour at runtime.
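
A minimal sketch of what such an aspect might look like (the probability and delay range below are illustrative defaults; in practice they would be driven by your runtime configuration):

    using System;
    using System.Threading;
    using PostSharp.Aspects;

    // Hedged sketch of a latency-injection aspect. Probability and delay
    // bounds are placeholders for values read from configuration.
    [Serializable]
    public class InjectLatencyAttribute : MethodInterceptionAspect
    {
        private static readonly Random Rng = new Random();

        public double Probability { get; set; } = 0.1;   // inject on ~10% of calls
        public int MinDelayMs { get; set; } = 100;
        public int MaxDelayMs { get; set; } = 3000;

        public override void OnInvoke(MethodInterceptionArgs args)
        {
            // 1. check if we should inject a random delay
            if (Rng.NextDouble() < Probability)
            {
                // 2. work out how much latency to inject and sleep the thread
                Thread.Sleep(Rng.Next(MinDelayMs, MaxDelayMs));
            }

            // 3. invoke the original method to process the request
            args.Proceed();
        }
    }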

Then I multicast the attribute to all our service endpoints and my work was done!

We ran latency injection in our dev environment; it helped identify numerous bugs in the client application and proved to be a worthwhile exercise.

But we didn’t stop there.

Error Injection

We throttle user requests to all of our services to stop mischievous players from spamming our servers using proxy tools such as Charles and Fiddler, or handcrafted bots.

But, occasionally, legitimate requests can also be throttled as a result of client bugs, an over-zealous retry strategy or an incorrectly configured throttling threshold.

Once again, we decided to make these errors much more visible in the dev environment so that client developers expect them and handle them gracefully.

To do that, we:

  1. set the threshold very low in dev
  2. used a PostSharp attribute to randomly inject throttling errors on operations where it makes sense

The attribute that injects throttling errors is very simple, and looks something along the lines of:
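
(A sketch only; ThrottledException and the injection probability here are illustrative stand-ins.)

    using System;
    using PostSharp.Aspects;

    // Sketch of a throttling-error injection aspect. ThrottledException
    // stands in for whatever error the real service returns when a
    // request is throttled.
    [Serializable]
    public class InjectThrottlingErrorAttribute : MethodInterceptionAspect
    {
        private static readonly Random Rng = new Random();

        public double Probability { get; set; } = 0.05;  // fail ~5% of calls in dev

        public override void OnInvoke(MethodInterceptionArgs args)
        {
            if (Rng.NextDouble() < Probability)
            {
                throw new ThrottledException("Request throttled (injected)");
            }

            args.Proceed();
        }
    }

    public class ThrottledException : Exception
    {
        public ThrottledException(string message) : base(message) { }
    }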

The same approach can be taken to include any service-specific errors that the client should be able to gracefully recover from – session expiration, state out-of-sync, etc.

Design for Failure

Simulating latency issues and other errors falls under the practice of Design for Failure, which Simon Wardley identifies as one of the characteristics of a next-generation tech company.

p.s. you should check out Simon’s work on value chain mapping if you haven’t already; it’s inspiring.

Chaos Engines

Netflix’s use of Chaos Monkey and Chaos Gorilla is a shining example of Design for Failure at scale.

Chaos Monkey randomly terminates instances to simulate hardware failures and test your system’s ability to withstand such failures.

Chaos Gorilla takes this exercise to the next level and simulates outages to entire Amazon availability zones to test the system’s ability to automatically re-balance to other availability zones without user-visible impact or manual intervention.

Netflix has taken a lot of inspiration from Release It! by Michael Nygard and Drift into Failure by Sidney Dekker. Both books are awesome and I highly recommend them.

Global redundancy, or not

Based on reactions to AWS outages on social media, it’s clear that many (ourselves included) do not take full advantage of the cloud for global redundancy.

You might scoff at that, but for many the decision not to have a globally redundant infrastructure is a conscious one, because the cost of such redundancy is not always justifiable.

It’s possible to raise your single-point-of-failure (SPOF) from individual resources/instances, to AZs, to regions, all the way to cloud providers.

But you’re incurring additional costs at each turn:

  • your infrastructure becomes more complex and difficult to reason about;
  • you might need more engineers to manage that complexity;
  • you will need to invest in better tooling for monitoring and automation;
  • you might need more engineers to build those tools;
  • you incur more wastage in CPU/memory/bandwidth/etc. (it is called redundancy for a reason);
  • you have higher network latency for cross-AZ/region communications.

Global redundancy at Uber

On the other hand, for many organizations the cost of downtime outweighs the cost of global redundancy.

For instance, for Uber’s customers the cost of switching to a competitor is low, which means availability is of paramount importance for Uber.

Uber devised a rather simple, elegant mechanism for their client applications to fail over seamlessly in the event of a datacentre outage. See this post for more details.

Latency Propagation

Finally, as more and more companies adopt a microservices approach, a whole host of challenges will become evident (many of which have been discussed in Michael Nygard’s Release It!).

One of these challenges is the propagation of latency through inter-service communications.

If each of your services has a 99th-percentile latency of 1s, then only 1% of calls will take longer than 1s when you depend on just one service. But if you depend on 100 such services, roughly 63% of calls will take longer than 1s!
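
To make the arithmetic explicit (assuming the 100 downstream calls are independent and each stays under 1s with probability 0.99):

    P(at least one call exceeds 1s) = 1 - 0.99^100 ≈ 1 - 0.366 ≈ 0.63

so roughly 63% of requests will breach the 1s mark.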

In this regard, Google fellow Jeff Dean’s paper Achieving Rapid Response Times in Large Online Services presents an elegant solution to this problem.

I haven’t put this into practice myself, but I imagine this can be easily implemented using Rx’s amb operator.
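
For instance, a "backup request" along those lines might look something like this (a sketch only; callService is a placeholder for the actual service call and the hedging delay is arbitrary):

    using System;
    using System.Reactive.Linq;

    public static class HedgedRequests
    {
        // Fire the primary request immediately; if it hasn't produced a
        // result within the hedging delay, fire a backup request to another
        // replica. Amb returns whichever sequence reacts first and disposes
        // of the other, so a fast primary response cancels the pending
        // backup subscription before the second call is ever made.
        public static IObservable<T> WithBackup<T>(
            Func<IObservable<T>> callService, TimeSpan hedgeDelay)
        {
            var primary = callService();
            var backup = Observable.Defer(callService).DelaySubscription(hedgeDelay);
            return Observable.Amb(primary, backup);
        }
    }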


Links