Red-White Push – Continuous Delivery at Gamesys Social

Nowa­days you see plen­ty of sto­ries about Con­tin­u­ous Inte­gra­tion, Con­tin­u­ous Deliv­ery and Con­tin­u­ous Deploy­ment on the web, and it’s great to see that the indus­try is mov­ing in this direc­tion, with more and more focus on automa­tion rather than hir­ing humans to do a job that machines are so much bet­ter at.

But most of these stories are also not very interesting, because they tend to revolve around MVC-based web sites that control both the server and the client (since the client is just the server-generated HTML), so there are really no synchronization or backward compatibility issues between the server and the client. It’s a great place to be to not have those problems, but they are real concerns for us for reasons we’ll go into shortly.

 

The Netflix Way

One notable excep­tion is the con­tin­u­ous deploy­ment sto­ry from Net­flix, which Carl Quinn also talked about as part of an overview of the Net­flix archi­tec­ture in this pre­sen­ta­tion.

For me, there are a num­ber of things that make the Net­flix con­tin­u­ous deploy­ment sto­ry inter­est­ing and worth study­ing:

  • Scale – more than 1000 dif­fer­ent client devices and over a quar­ter of the inter­net traf­fic
  • Aminator – whilst most of us try to avoid creating new AMIs when we need to deploy new versions of our code, Netflix has decided to go the other way and instead automate away the painful, manual steps involved in creating new AMIs, and in return get better start-up time as their VMs come pre-baked

  • Use of Canary Deploy­ment – dip­ping your toe in the water by rout­ing a small frac­tion of your traf­fic to a canary clus­ter to test it out in the wild (it’s worth men­tion­ing that this facil­i­ty is also pro­vid­ed out-of-the-box by Google AppEngine)
  • Red/Black push – a clever word play (and ref­er­ence to the Net­flix colour I pre­sume?) on the clas­sic blue-green deploy­ment, but also mak­ing use of AWS’s auto-scal­ing ser­vice as well as Netflix’s very own Zuul and Asgard ser­vices for rout­ing and deploy­ment.
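
To make the canary idea above a little more concrete, here is a minimal sketch of routing a stable fraction of users to a canary cluster. It is not how Netflix or Google AppEngine implement it; the function name `pick_cluster`, the cluster names and the bucketing scheme are all assumptions made for illustration.

```python
import hashlib


def pick_cluster(user_id: str, canary_percent: int) -> str:
    """Route a stable fraction of users to the canary cluster.

    Hypothetical helper: the cluster names and the bucketing scheme
    are illustrative, not taken from any real routing service.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the user id into one of 100 buckets so that the same user
    # always lands on the same cluster for a given percentage.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "production"
```

Hashing the user id rather than picking at random per request keeps each user on one cluster, which makes any problems in the canary easier to trace back.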

I’ve not heard any updates yet, but I’m very interested to see how the Netflix deployment pipeline has changed over the last 12 months, especially now that Docker has become widely accepted in the DevOps community. I wonder if it’s a viable alternative to baking AMIs: Aminator could be adapted (and renamed, since it would no longer be baking AMIs) to bake Docker images instead, which can then be fetched and deployed from a private repository.

If you have seen any recent talks/posts that provide more up-to-date information, please feel free to share in the comments.

 

Need for Backward Compatibility

One inter­est­ing omis­sion from all the Net­flix arti­cles and talks I have found so far has been how they man­age back­ward com­pat­i­bil­i­ty issues between their serv­er and client. One would assume that it must be an issue that comes up reg­u­lar­ly when­ev­er you intro­duce a big new fea­ture or break­ing changes to your API and you are not able to do a syn­chro­nous, con­trolled update to all your clients.

To illus­trate a sim­ple sce­nario that we run into reg­u­lar­ly, let’s sup­pose that in a client-serv­er set­up:

  • we have an iPhone/iPad client for our ser­vice which is cur­rent­ly ver­sion 1.0
  • we want to release a new ver­sion 1.1 with brand spank­ing new fea­tures
  • ver­sion 1.1 requires break­ing changes to the ser­vice API

AppStore-Update

In the sce­nario out­lined above, the serv­er changes must be deployed before review­ers from Apple open up the sub­mit­ted build or else they will find an unusable/unstable appli­ca­tion that they’ll no doubt fail and put you back to square one.

Addi­tion­al­ly, after the new ver­sion has been approved and you have marked it as avail­able in the App­Store, it takes up to a fur­ther 4 hours before the change is prop­a­gat­ed through the App­Store glob­al­ly.

This means your new serv­er code has to be back­ward com­pat­i­ble with the exist­ing (ver­sion 1.0) client.

 

In our case, we cur­rent­ly oper­ate a num­ber of social games on Face­book and mobile (both iOS and Android devices) and each game has a com­plete and inde­pen­dent ecosys­tem of back­end ser­vices that sup­port all its client plat­forms.

Back­ward com­pat­i­bil­i­ty is an impor­tant issue for us because of sce­nar­ios such as the one above, which is fur­ther com­pli­cat­ed by the involve­ment of oth­er app stores and plat­forms such as Google Play and Ama­zon App Store.

We also found through experience that every time we force our players to update the game on their mobile devices we alienate and anger a fair chunk of our player base, who will leave the game for good and occasionally leave harsh reviews along the way. This is why, even though we have the capability to force players to update, we use it only as a last resort. The implication is that in practice you can have many versions of clients all accessing the same backend service, which has to maintain backward compatibility all the way through.

 

Deployment at Gamesys Social

Cur­rent­ly, most of our games fol­low this basic deploy­ment flow:

Current-Blue-Green

Blue-Green-Deploy

The steps involved in releas­ing to pro­duc­tion fol­low the basic prin­ci­ples of Blue-Green Deploy­ment and although it helps elim­i­nate down­time (since we are push­ing out changes in the back­ground whilst keep­ing the ser­vice run­ning so there is no vis­i­ble dis­rup­tion from the client’s point-of-view) it does noth­ing to elim­i­nate or reduce the need for main­tain­ing back­ward com­pat­i­bil­i­ty.
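
As a rough illustration of the blue-green principle (not our actual tooling), the sketch below models two environments and a pointer to whichever one is live. The class name `BlueGreen` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreen:
    """Toy model of a blue-green setup; all names are hypothetical."""

    versions: dict = field(default_factory=lambda: {"blue": "1.0", "green": None})
    live: str = "blue"

    @property
    def idle(self) -> str:
        # The environment that is not currently serving traffic.
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # New code goes to the idle environment; live traffic is untouched.
        self.versions[self.idle] = version

    def switch(self) -> None:
        # Flip all traffic to the freshly deployed environment in one step.
        if self.versions[self.idle] is None:
            raise RuntimeError("nothing deployed to the idle environment")
        self.live = self.idle
```

Note that after `switch()` every client, old or new, talks to the same new server code, which is exactly why the backward compatibility requirement remains.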

Instead, we diligently manage backward compatibility via a combination of careful planning, communication, domain expertise and testing. Whilst it has served us well enough so far, it’s hardly fool-proof, not to mention the amount of coordinated effort required and the extra complexity it introduces to our codebase.

 

Having considered the API versioning route and its maintainability implications, we decided to look for a different way, which is how we ended up with a variant of Netflix’s Red-Black deployment approach that we internally refer to as...

 

Red-White Push

Our Red-White Push approach takes advan­tage of our exist­ing dis­cov­ery mech­a­nism where­by the client authen­ti­cates itself against a client-spe­cif­ic end­point along with the client build ver­sion.

Based on the client type and ver­sion the dis­cov­ery ser­vice routes the client to the cor­re­spond­ing clus­ter of game servers.

red-white-push
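
The routing step described above can be sketched as a simple lookup keyed on client type and build version. This is only a sketch under assumptions: the `ROUTES` table, the `discover` function and the example.com endpoints are made up for illustration, and authentication is left out entirely.

```python
from typing import Dict, Tuple

# Hypothetical routing table: (client type, build version) -> cluster endpoint.
# "white" runs the old server code, "red" runs the new server code.
ROUTES: Dict[Tuple[str, str], str] = {
    ("ios", "1.0"): "https://white.example.com",
    ("ios", "1.1"): "https://red.example.com",
}


def discover(client_type: str, build_version: str) -> str:
    """Return the game server cluster for this client, or raise if unsupported."""
    try:
        return ROUTES[(client_type, build_version)]
    except KeyError:
        raise LookupError(
            f"no cluster for {client_type} client version {build_version}"
        ) from None
```

Because each client version is pinned to a cluster running server code it is known to work with, neither cluster needs to understand the other version's API.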

With this new flow, the ear­li­er exam­ple might look some­thing like this instead:

AppStore-Update-RWP

The key dif­fer­ences are:

  • instead of deploying over the existing service whilst maintaining backward compatibility, we deploy to a new cluster of nodes which will only be accessed by v1.1 clients, hence no need to support backward compatibility
  • existing v1.0 clients will continue to operate and will access the cluster of nodes running old (but compatible) server code
  • we scale down the white cluster gradually as players update to the v1.1 client
  • once we decide to no longer support v1.0 clients, we can safely terminate the white cluster
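
The gradual scale-down in the last two points can be sketched as a small capacity calculation. This is a minimal sketch rather than our actual process; the helper names, the capacity-per-node figure and the player counts are all made up.

```python
import math


def desired_nodes(active_clients: int, clients_per_node: int, min_nodes: int = 1) -> int:
    """Nodes needed for a cluster, never below min_nodes while it is supported."""
    if clients_per_node <= 0:
        raise ValueError("clients_per_node must be positive")
    return max(min_nodes, math.ceil(active_clients / clients_per_node))


def plan(active_by_version: dict, supported: set, clients_per_node: int) -> dict:
    """Map each client version to a node count; unsupported versions get 0."""
    return {
        version: desired_nodes(count, clients_per_node) if version in supported else 0
        for version, count in active_by_version.items()
    }
```

For example, with a hypothetical capacity of 1000 clients per node, `plan({"1.0": 500, "1.1": 9500}, {"1.0", "1.1"}, 1000)` keeps a single node for the white (v1.0) cluster, and removing `"1.0"` from the supported set drops it to zero.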

 

Despite what the name suggests, you are not actually limited to only red and white clusters. Furthermore, you can still use the aforementioned Blue-Green Deployment for releases that don’t introduce breaking changes (and therefore don’t require synchronized updates to both client and server).

 

We’re still a long way from where we want to be and there are still lots of things in our release process that need to be improved and auto­mat­ed, but we have come a long way from even 12 months ago.

As one of my ex-col­leagues said:

“Releases are not exciting anymore”

- Will Knox-Walk­er

and that is the point – mak­ing releas­es non-events through automa­tion.

 

References

Net­flix – Deploy­ing the Net­flix API

Net­flix – Prepar­ing the Net­flix API for Deploy­ment

Net­flix – Announc­ing Zuul : Edge Ser­vice in the Cloud

Net­flix – How we use Zuul at Net­flix

Net­flix OSS Cloud Archi­tec­ture (Par­leys pre­sen­ta­tion)

Con­tin­u­ous Deliv­ery at Net­flix – From Code to the Mon­keys

Con­tin­u­ous Deliv­ery vs Con­tin­u­ous Deploy­ment

Mar­tin Fowler – Blue-Green Deploy­ment

Thought­Works – Imple­ment­ing Blue-Green Deploy­ments with AWS

Mar­tin Fowler – Microser­vices