QCon London 2015 – Takeaways from “Service Architectures at Scale, Lessons from Google and eBay”

Day three of QCon London was a treat: with full-day tracks on both architecture and microservices, choosing what to see was a nice challenge in itself.

My favourite talk of the day was Randy Shoup’s Service Architectures at Scale, Lessons from Google and eBay.

Randy kicked off the session by identifying a common trend in the architecture evolution at some of the biggest internet companies.

An ecosystem of microservices also differs from its monolithic counterpart in that services tend to organically form many layers of dependencies, rather than falling into the strict tiers of a hierarchy.

At Google, there has never been a top-down design approach to building systems; rather, it’s an evolutionary process using natural selection – services survive by justifying their existence through usage, or they are deprecated. What appears to be clean layering by design turned out to be an emergent property of this approach.

Services are built bottom-up, but you can still end up with a clean, clear separation of concerns.

At Google, there are no “architect” roles, nor is there a central approval process for technology decisions. Most technology decisions are made within the team, so teams are empowered to make the choices that are best for them and their service.

This is in direct contrast to how eBay operated early on, where an architecture review board acted as a central approval body.

Even without a centralized control body, Google has proved that it’s still possible to achieve standardization across the organization.

Within Google, communication methods (e.g. network protocols, data formats, a structured way of expressing interfaces) as well as common infrastructure (source control, monitoring, alerting, etc.) are standardized by encouragement rather than enforcement.

By the sound of it, best practices and standardization are achieved through a consensus-based approach within teams, which then spreads throughout the organization through:

  • encapsulation in shared/reusable libraries (see the sketch below);
  • support for these standards in underlying services;
  • code reviews (word of mouth);
  • and, most importantly, the ability to search all of Google’s code to find existing examples.
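
As an illustration of the first point, here is a minimal sketch (in Python, with entirely hypothetical names; none of this is Google’s actual internal API) of how a standard can be encapsulated in a shared library: any team that adopts the client gets the organization’s retry, timeout and monitoring conventions for free.

    import time
    import urllib.request

    def record_latency(endpoint: str, seconds: float) -> None:
        # Stand-in for an org-wide monitoring hook (hypothetical).
        print(f"latency {endpoint}: {seconds * 1000:.1f} ms")

    def org_get(url: str, timeout_s: float = 2.0, retries: int = 3) -> bytes:
        """Org-standard HTTP GET: bounded timeout, retries with backoff, metrics."""
        for attempt in range(retries):
            start = time.monotonic()
            try:
                with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                    body = resp.read()
                record_latency(url, time.monotonic() - start)
                return body
            except OSError:
                time.sleep(0.1 * 2 ** attempt)  # exponential backoff before retrying
        raise RuntimeError(f"GET {url} failed after {retries} attempts")

Anyone importing org_get automatically follows the house conventions, and changing the standard (say, the backoff policy) becomes a library change rather than an organization-wide migration.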

One drawback of following existing examples is the possibility of random anchoring: someone at one point made a decision to do things a certain way, and that becomes the anchor for everyone else who finds the example thereafter.

Whilst the surface areas of services are standardized, the internals of the services are not, leaving developers to choose:

  • programming language (C++, Go, Python or Java)
  • frameworks
  • persistence mechanisms

Rather than deciding on the split of microservices up front, capabilities tend to be implemented in existing services first, to solve specific problems.

If a capability proves successful, it’s extracted out and generalized as a service of its own, with a new team formed around it. Many popular services today started life this way – Gmail, App Engine and BigTable, to name a few.

On the other hand, a failed service (e.g. Google Wave) is deprecated, but its reusable technology is repurposed and the people on the team are redeployed to other teams.

Randy also showed a fairly self-explanatory slide with an apt description of what a microservice should look like.

As the owner of a service, your primary focus should be the needs of your clients, and meeting those needs at minimum cost and effort. This includes leveraging common tools, infrastructure and existing services, as well as automating as much as possible.

The service owner should have end-to-end ownership, and the mantra should be “You build it, you run it”.

Teams should have the autonomy to choose the right technology, and be held responsible for the results of those choices.

Your service should have a bounded context: its primary focus should be its clients and the services that depend on it.

You should not have to worry about the complete ecosystem or the underlying infrastructure, and this reduced cognitive load also means teams can be extremely small (usually 3–5 people) and nimble. Having a small team also bounds the amount of complexity that can be created (i.e. using Conway’s law to your advantage).

Treat service-to-service relationships as vendor-client relationships, with clear ownership and a clear division of responsibility.

To give people the right incentives, you should charge for usage of the service; this aligns the economic incentives of both sides to optimize for efficiency.

With a vendor-client relationship (with SLAs and all) you’re incentivized to reduce the risk that comes with making changes, which pushes you towards small, incremental changes and solid development practices (code reviews, automated tests, etc.).

You should never break your clients’ code, hence it’s important to keep backward/forward compatibility of interfaces.

You should provide an explicit deprecation policy and give your clients strong incentives to move off old versions.
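
To make this concrete, here’s a minimal sketch of tolerant message handling (the field names and versioning story are illustrative, not from the talk): readers ignore fields they don’t recognize (forward compatibility) and supply defaults for fields that older writers omit (backward compatibility).

    import json

    # Fields this version of the service understands, with defaults for
    # anything an older client doesn't send. "locale" was added in v2.
    KNOWN_FIELDS = {"user_id": None, "display_name": "", "locale": "en"}

    def decode_user(payload: str) -> dict:
        raw = json.loads(payload)
        # Unknown fields are ignored; missing fields get their defaults.
        return {k: raw.get(k, default) for k, default in KNOWN_FIELDS.items()}

    # A v1 message (no "locale") and a v3 message (unknown "theme") both decode:
    print(decode_user('{"user_id": 42, "display_name": "Ann"}'))
    print(decode_user('{"user_id": 7, "display_name": "Bo", "locale": "fr", "theme": "dark"}'))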

Services at scale are highly exposed to performance variability.

Tail latencies (e.g. the 95th and 99th percentiles) are much more important than average latencies; it’s easier for your clients to program against consistent performance.
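
A quick, made-up illustration of why the tail matters: a service whose mean latency looks healthy can still be painfully slow for one request in fifty.

    import random
    import statistics

    random.seed(1)
    # 98% of requests take ~10 ms; 2% hit a slow path around 500 ms.
    samples = [random.gauss(10, 2) if random.random() < 0.98 else random.gauss(500, 50)
               for _ in range(100_000)]

    cuts = statistics.quantiles(samples, n=100)     # cut points p1 .. p99
    print(f"mean: {statistics.mean(samples):6.1f} ms")  # ~20 ms, looks fine
    print(f"p95:  {cuts[94]:6.1f} ms")                  # still fast
    print(f"p99:  {cuts[98]:6.1f} ms")                  # ~500 ms: what unlucky clients see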

Services at scale are also highly exposed to failures.

(disruptions are 10x more likely to come from human error than from software/hardware failures)

You should have resilience in depth, with redundancy for hardware failures, and the capability for incremental deployments:

  • Canary releases
  • Staged rollouts
  • Rapid rollbacks

eBay also uses ‘feature flags’ to decouple code deployment from feature deployment; a minimal sketch of the idea follows below.
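
Here’s that sketch (the flag name and percentage rollout are illustrative; in practice the flags would live in a config service rather than a constant): the new code path ships dark, and a runtime flag decides who actually sees it.

    # In production this would live in a config service, not a constant.
    FLAGS = {"new_lobby": {"enabled": True, "rollout_percent": 10}}

    def is_enabled(flag: str, user_id: int) -> bool:
        cfg = FLAGS.get(flag)
        if not cfg or not cfg["enabled"]:
            return False
        # Stable bucketing: a given user always lands in the same cohort.
        return user_id % 100 < cfg["rollout_percent"]

    def render_lobby(user_id: int) -> str:
        # Both code paths are deployed; the flag controls the release.
        # Rolling back means flipping the flag, not redeploying.
        return "new lobby" if is_enabled("new_lobby", user_id) else "old lobby"

    print(sum(render_lobby(u) == "new lobby" for u in range(1000)),
          "of 1000 users see the new lobby")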

And, of course, monitoring.

Finally, here are some anti-patterns to look out for:

Mega-service – a service that does too much; a mini-monolith.

Shared persistence – breaks encapsulation, encourages ‘backdoor’ violations, and can lead to hidden coupling between services (think integration via databases…).
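
To make that last anti-pattern concrete, here’s a hedged sketch (hypothetical names; both services share a single in-memory SQLite database purely to keep the example self-contained): the backdoor query couples the orders code to the users team’s schema, whereas the published interface keeps the storage private.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    db.execute("INSERT INTO users VALUES (1, 'ann@example.com')")

    # Anti-pattern: the orders service reaches straight into the users
    # service's table. Any schema change by the users team breaks this.
    def confirmation_email_backdoor(user_id: int) -> str:
        return db.execute("SELECT email FROM users WHERE id = ?", (user_id,)).fetchone()[0]

    # Better: the owning service exposes a contract; its storage stays
    # private and can change freely behind that stable interface.
    class UsersService:
        def get_email(self, user_id: int) -> str:  # the published contract
            return db.execute("SELECT email FROM users WHERE id = ?", (user_id,)).fetchone()[0]

    def confirmation_email(users: UsersService, user_id: int) -> str:
        return users.get_email(user_id)

    print(confirmation_email(UsersService(), 1))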

Gamesys Social

As I sat through Randy’s session, I was surprised and proud to find that my team (the backend team at Gamesys Social) already employs many similar practices – a seal of approval, if you like:

  • no architect roles; instead, a consensus-based approach to technology decisions
  • standardization via encouragement
  • the freedom to experiment with approaches/tech without being penalized when things don’t pan out (the learning is itself a valuable output of the experiment)
  • organic growth of microservices (proving them in existing services first, before splitting them out and generalizing)
  • a high value placed on automation
  • team autonomy, and the DevOps philosophy of “you build it, you run it”
  • deployment practices – canary releases, staged rollouts, feature flags and our own twist on blue-green deployments

I’m currently looking for some functional programmers to join the team, so if this sounds like the sort of environment you would like to work in, then have a look at our job spec and apply!

Links

Slides for the talk

We’re hiring a Functional Programmer!