CraftConf 15–Takeaways from “Scaling micro-services at Gilt”

There were a cou­ple of micro-ser­vices relat­ed talks at this year’s edi­tion of Craft­Conf. The first was by Adri­an Tre­na­man of Gilt, who talked about their jour­ney from a mono­lith­ic archi­tec­ture to micro-ser­vices, and from self-man­aged dat­a­cen­tres to the cloud.

 

From Monolith to Micro-Services

They start­ed off with a Ruby on Rails mono­lith­ic archi­tec­ture in 2007, but quick­ly grew to a $400M busi­ness with­in 4 years.

With that suc­cess came the scal­a­bil­i­ty chal­lenge, one that they couldn’t meet with their exist­ing archi­tec­ture.

It was at this point that they start­ed to split up their mono­lith.

2011

A num­ber of small­er, but still some­what mono­lith­ic ser­vices were cre­at­ed – prod­uct ser­vice, user ser­vice, etc. But most of the ser­vices were still shar­ing the same Post­gres data­base.

The cart service was the only one that had its own dedicated datastore, and this proved to be a valuable lesson to them. In order to evolve services independently, you need to sever the hidden coupling that occurs when services share the same database.

Adrian also pointed out that, whilst some of the backend services had been split out, the UI applications (JSP pages) were still monoliths. Lots of business logic, such as promos and offers, was hardcoded and hard/slow to change.

Another pervasive issue was that the services were all loosely typed: everything was a map. This introduced plenty of confusion around what was in the map. This was because lots of their developers still hadn’t made the mental transition to working in a statically typed language.
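
To illustrate the "everything's a map" problem, here is a minimal sketch in Python. It is not Gilt's code, and the `Product` type, its fields, and `parse_product` are hypothetical names chosen for illustration. A loosely typed map only reveals a wrong or missing key when some consumer happens to read it, whereas an explicit type can reject a bad payload at the service boundary.

```python
from dataclasses import dataclass

# Loosely typed: nothing says which keys exist or what types they hold.
# Here the price has been sent as a string by mistake, and nobody notices.
product_map = {"id": 42, "name": "Leather bag", "price_cents": "12900"}


@dataclass(frozen=True)
class Product:
    """Hypothetical typed contract for a product payload."""
    id: int
    name: str
    price_cents: int

    def __post_init__(self):
        # Dataclasses do not enforce annotations at runtime, so validate explicitly.
        for field_name, expected in (("id", int), ("name", str), ("price_cents", int)):
            value = getattr(self, field_name)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{field_name} must be {expected.__name__}, got {type(value).__name__}"
                )


def parse_product(payload: dict) -> Product:
    """Convert an untyped map into a Product; unknown, missing, or mistyped keys raise TypeError."""
    return Product(**payload)
```

With this in place, `parse_product(product_map)` fails immediately with a `TypeError` naming `price_cents`, instead of the bad value travelling silently through several services.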

2015

Fast forward to 2015, and they’re now creating micro-services at a much faster rate using primarily Scala and the Play framework.

They have left the orig­i­nal ser­vices alone as they’re well under­stood and haven’t had to change fre­quent­ly.

The front end appli­ca­tion has been bro­ken up into a set of gran­u­lar, page-spe­cif­ic appli­ca­tions.

There have also been significant cultural changes:

  • emer­gent archi­tec­ture design rather than top-down
  • technology decisions are driven by KPIs where appropriate, or by simple goals where there are no measurable KPIs
  • fail fast, and be open about it so that lessons can be shared amongst the orga­ni­za­tion; they even hold reg­u­lar meet­ings to allow peo­ple to open­ly dis­cuss their fail­ures

To the Cloud

Along­side the tran­si­tion to micro-ser­vices, Gilt also migrat­ed to AWS using a hybrid approach via Ama­zon VPC.

Every team has its own AWS account as well as a budget. However, the organization of teams can change from time to time, and with this setup it’s difficult to move services between AWS accounts.

Incremental Approach

One thing to note about Gilt’s migra­tion to micro-ser­vices is that it took a long time and they have tak­en an incre­men­tal approach.

Adrian explained that this was down to them prioritizing getting to market and extracting real business value over technical evolution.

Why Micro-Services

  • it lessens inter-depen­den­cies between teams
  • faster code-to-pro­duc­tion
  • allows lots of ini­tia­tives in par­al­lel
  • allows different languages/frameworks to be used by each team
  • allows grace­ful degra­da­tion of ser­vices
  • allows code to be eas­i­ly dis­pos­able – easy to inno­vate, fail and move on; Greg Young also touched on this in his recent talk on opti­miz­ing for deletabil­i­ty

Challenges with Micro-Services

Adri­an list­ed 7 chal­lenges his team came across in their jour­ney.

Staging

They found it hard to main­tain stag­ing envi­ron­ments across mul­ti­ple teams and ser­vices. Instead, they have come to believe that test­ing in pro­duc­tion is the best way to go, and it’s there­fore nec­es­sary to invest in automa­tion tools to aid with doing canary releas­es.

I once heard Ben Chris­tensen of Net­flix talk about the same thing, that Net­flix too has come to real­ize that the only way to test a com­plex micro-ser­vices archi­tec­ture is to test it in pro­duc­tion.

That said, I’m sure both Net­flix and Gilt still have basic tests to catch the obvi­ous bugs before they release any­thing into the wild. But these tests would not suf­fi­cient­ly test the inter­ac­tion between the ser­vices (some­thing Dan North and Jes­si­ca Kerr cov­ered in their open­ing keynote Com­plex­i­ty is Out­side the Code).

To reduce the risk involved with test­ing in pro­duc­tion, you should at least have:

  • a canary release mechanism to limit the impact of a bad release;
  • a way to minimize the time to roll back; and
  • a way to minimize the time to discovery for problems by having granular metrics (see ex-Netflix architect Adrian Cockcroft’s talk at Monitorama 14 on this)
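
The safeguards listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not how Gilt or Netflix actually do it: the 5% canary share, the 2% error threshold, the minimum sample size, and all function names are hypothetical. Users are hashed into stable buckets so that each user consistently sees the same version, and a simple per-release error rate decides whether the canary should be rolled back.

```python
import hashlib

CANARY_PERCENT = 5      # hypothetical: share of users routed to the new release
MAX_ERROR_RATE = 0.02   # hypothetical: roll back above a 2% error rate
MIN_REQUESTS = 100      # hypothetical: do not judge the canary on too little traffic


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so they always get the same version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str) -> str:
    """Route a small, stable slice of users to the canary release."""
    return "canary" if bucket(user_id) < CANARY_PERCENT else "stable"


def should_roll_back(requests: int, errors: int) -> bool:
    """Decide from per-release metrics whether the canary should be pulled."""
    if requests < MIN_REQUESTS:
        return False  # not enough data yet to make a call
    return errors / requests > MAX_ERROR_RATE
```

The finer the metrics fed into something like `should_roll_back` (per release, per endpoint), the shorter the time to discovery; keeping the stable version running alongside the canary is what keeps the rollback itself fast.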

Ownership

Who owns the ser­vice, and what hap­pens if the per­son who cre­at­ed the ser­vice moves onto some­thing else?

Gilt’s deci­sion is to have teams and depart­ments own the ser­vice, rather than indi­vid­ual staff.

Deployment

Gilt is building tools over Docker to give them elastic and fast provisioning. This is kinda interesting, because there are already a number of such tools/services available.

It’s not clear what is missing from these that is driving Gilt to build their own tools.

Lightweight API

They have set­tled on a REST-style API, although I don’t get the sense that Adri­an was talk­ing about REST in the sense that Roy Field­ing described – i.e. dri­ven by hyper­me­dia.

They also make sure that the clients are total­ly decou­pled from the ser­vice, and are dumb and have zero depen­den­cies.

Adrian also gave a shout-out to apidoc. For our part, we have been using Mulesoft’s Anypoint Platform and it has been a very useful tool for us to document our APIs.

Audit & Alerting

In order to give your engi­neers full auton­o­my in pro­duc­tion and still stay com­pli­ant, you need good audit­ing and alert­ing capa­bil­i­ties.

Gilt built a smart alert­ing tool called CAVE, which alerts you when sys­tem invari­ants – e.g. num­ber of orders shipped to the US in 5 min inter­val should be greater than 0 – have been vio­lat­ed.
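
To make the idea of alerting on a system invariant concrete, here is a minimal sketch in Python. It does not use or reproduce CAVE's actual API, which was not shown in the talk; the `Invariant` class, the event shape, and the `violated` helper are all hypothetical. It encodes the example above: the number of orders shipped to the US in a 5 minute window should be greater than 0.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical event shape: (unix timestamp in seconds, destination country code)
Event = Tuple[float, str]


@dataclass
class Invariant:
    """Hypothetical invariant: a named predicate over a metric aggregated in a time window."""
    name: str
    window_seconds: float
    metric: Callable[[List[Event]], float]
    holds: Callable[[float], bool]

    def check(self, events: List[Event], now: float) -> bool:
        """Return True if the invariant holds for events in the trailing window."""
        recent = [e for e in events if now - self.window_seconds < e[0] <= now]
        return self.holds(self.metric(recent))


us_orders_shipped = Invariant(
    name="orders shipped to US in the last 5 minutes > 0",
    window_seconds=5 * 60,
    metric=lambda events: sum(1 for _, country in events if country == "US"),
    holds=lambda count: count > 0,
)


def violated(invariants: List[Invariant], events: List[Event], now: float) -> List[str]:
    """Names of invariants that are currently violated; these would trigger alerts."""
    return [inv.name for inv in invariants if not inv.check(events, now)]
```

The point of the design is that the alert is phrased in business terms rather than in terms of CPU or latency, so it fires regardless of which service in the chain caused the problem.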

Monitoring of micro-services is an interesting topic because once again, the traditional approach to monitoring – alert me when CPU/network or latencies go above threshold – is no longer sufficient because of the complex effects of causality that run through your inter-dependent services.

As Richard Rodger pointed out in his Measuring Micro-Services talk at Codemotion Rome this year, it’s far more effective to identify and monitor emerging properties instead.

If you live in Portland then it might be worth going to Monitorama in June; it’s a conference that focuses on monitoring. Last year’s conference had some pretty awesome talks.

IO Explosions

I didn’t get a clear sense of how it works, but Adrian mentioned that Gilt has looked to the lambda architecture for their critical path code, and is doing pre-computation and real-time push updates to reduce the number of IO calls that occur between services.
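
Since the talk did not go into detail, the following is only a guess at the general shape of the idea, not a description of Gilt's system. All class and method names are hypothetical. An upstream service pushes changes to its subscribers, and the consumer keeps a precomputed local view, so that serving a request needs no cross-service call at all.

```python
from typing import Callable, Dict, List


class PriceService:
    """Hypothetical upstream service that pushes changes to subscribers."""

    def __init__(self) -> None:
        self._prices: Dict[str, int] = {}
        self._subscribers: List[Callable[[str, int], None]] = []
        self.remote_reads = 0  # counts simulated cross-service read calls

    def subscribe(self, callback: Callable[[str, int], None]) -> None:
        self._subscribers.append(callback)

    def set_price(self, sku: str, cents: int) -> None:
        self._prices[sku] = cents
        for notify in self._subscribers:
            notify(sku, cents)  # push the update instead of waiting to be polled

    def get_price(self, sku: str) -> int:
        self.remote_reads += 1  # each call stands in for a network round trip
        return self._prices[sku]


class ProductPageView:
    """Hypothetical consumer that keeps a precomputed local copy of prices."""

    def __init__(self, prices: PriceService) -> None:
        self._local: Dict[str, int] = {}
        prices.subscribe(self._on_price_changed)

    def _on_price_changed(self, sku: str, cents: int) -> None:
        self._local[sku] = cents

    def render(self, sku: str) -> str:
        # Served entirely from the precomputed view: no call to PriceService.
        return f"{sku}: ${self._local[sku] / 100:.2f}"
```

The trade-off in a design like this is that reads become cheap and local, at the cost of the consumer's view being only as fresh as the last pushed update.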

Reporting

Since they have many databases, their data analysts would need access to all of them in order to run their reports. To simplify things, they put their analytics data into a queue which is then written to S3.
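
A rough sketch of that pipeline, under assumptions of my own: the talk did not say which queue Gilt uses or how the data is laid out in S3, so the in-process queue, the newline-delimited JSON batches, the object key format, and the in-memory `fake_bucket` below are all hypothetical stand-ins. Services publish events to a queue, and a separate writer drains the queue into batched objects.

```python
import json
import queue
from typing import Callable, Dict, List

# In-process stand-in for whatever message queue sits between services and storage.
analytics_queue = queue.Queue()


def publish(event: dict) -> None:
    """Called by any service; analysts never need access to that service's database."""
    analytics_queue.put(event)


def drain(batch_size: int, write_object: Callable[[str, bytes], None],
          prefix: str = "events") -> int:
    """Write queued events as newline-delimited JSON objects of up to batch_size events.

    Single-threaded sketch: with concurrent producers, empty() would only be a hint.
    Returns the number of objects written.
    """
    written = 0
    while not analytics_queue.empty():
        batch: List[dict] = []
        while len(batch) < batch_size and not analytics_queue.empty():
            batch.append(analytics_queue.get_nowait())
        body = "\n".join(json.dumps(e, sort_keys=True) for e in batch).encode("utf-8")
        write_object(f"{prefix}/batch-{written:05d}.jsonl", body)
        written += 1
    return written


# In-memory stand-in for an S3 bucket; a real writer would upload the bytes instead.
fake_bucket: Dict[str, bytes] = {}
```

For example, after a few calls to `publish(...)`, calling `drain(2, fake_bucket.__setitem__)` leaves one entry in `fake_bucket` per batch of up to two events, which analytics tooling can then read from a single place.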

Whilst Adrian didn’t divulge what technology Gilt is using for analytics, there are quite a number of tools/services available on the market.

 

Based on our expe­ri­ence at Gamesys Social where we have been using Google Big­Query for a num­ber of years, I strong­ly rec­om­mend that you take it into con­sid­er­a­tion. It’s a man­aged ser­vice that allows you to run SQL-like, ad-hoc queries on Exabyte-size datasets and get results back in sec­onds.

At this point we have around 100TB of analytics data at Gamesys Social and our data analysts are able to run their queries every day and analyse tens of TBs of data without any help from the developers.

Big­Query is just one of many inter­est­ing tech­nolo­gies that we use at Gamesys Social, so if you’re inter­est­ed in work­ing with us, we’re hir­ing for a func­tion­al pro­gram­mer right now.

 

 
