DISCLAIMER: as always, you should benchmark against your payload and use case; the benchmark numbers I have produced here are unlikely to be representative of your use cases, and neither are anybody else's benchmark numbers.

You can use the simple test harness I created and the example code to benchmark against your particular payload.

 

Binary

[charts: binary serializer benchmark results]

 

Json

A number of serializers were updated to the latest version – FastJson, Jil, NetJSON, Json.Net and ServiceStack.Text – and FsPickler.Json was added to the mix.

[charts: JSON serializer benchmark results]


On application monitoring

In the Gamesys social team, our view on application monitoring is that anything running in production needs to be monitored extensively all the time – every service entry point, IO operation or CPU-intensive task. Sure, it comes at the cost of a few CPU cycles, which might mean you have to run a few more servers at scale, but that's a small cost to pay compared to:

  • lack of visibility of how your application is performing in production; or
  • inability to spot issues occurring on individual nodes amongst a large number of servers; or
  • longer time to discovery of production issues, which results in
    • longer time to recovery (i.e. longer downtime)
    • loss of customers (an immediate result of downtime)
    • loss of customer confidence (a longer-term impact)

Services such as StackDriver and Amazon CloudWatch also allow you to set up alarms around your metrics, so that you can be notified, or automated actions triggered, in response to changing conditions in production.

In Michael T. Nygard's Release It!: Design and Deploy Production-Ready Software (a great read, by the way) he discusses at length how unfavourable conditions in production can cause cracks to appear in your system, and how, through tight coupling and other anti-patterns, these cracks can accelerate and spread across your entire application and eventually bring it to its knees.

 

In applying extensive monitoring to our services we are able to:

  • see cracks appearing in production early; and
  • collect an extensive amount of data for the post-mortem; and
  • use knowledge gained during post-mortems to identify early warning signs and set up alarms accordingly

 

Introducing Metricano

With our emphasis on monitoring, it should come as no surprise that we have long sought to make it easy for our developers to monitor different aspects of their service.

Now, we've made it easy for you to do the same with Metricano, an agent-based framework for collecting, aggregating and publishing metrics from your application. At a high level, the MetricsAgent class collects metrics and aggregates them into second-by-second summaries. These summaries are then published to all the publishers you have configured.

 

Recording Metrics

There are a number of options for you to record metrics with MetricsAgent:

Manually

You can call the IncrementCountMetric or RecordTimeMetric methods on an instance of IMetricsAgent (you can use MetricsAgent.Default or create a new instance with MetricsAgent.Create), for example:
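A minimal sketch of what that might look like – the exact method signatures are my assumption here, so check the project page for the real ones:

    // minimal sketch: the argument lists of IncrementCountMetric and
    // RecordTimeMetric are assumed (metric name + count / metric name + TimeSpan)
    open System.Diagnostics
    open Metricano

    let agent = MetricsAgent.Default

    // count how many times something happens
    agent.IncrementCountMetric("OrderPlaced", 1L)

    // record how long a piece of work took
    let stopwatch = Stopwatch.StartNew()
    // ... do the work ...
    stopwatch.Stop()
    agent.RecordTimeMetric("ProcessOrder", stopwatch.Elapsed)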

 

F# Workflows

From F#, you can also use the built-in timeMetric and countMetric workflows:
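A sketch of how these might be used – the exact shape of the workflow builders is an assumption on my part, so treat this as illustrative rather than the definitive syntax:

    // sketch only: whether (and how) the builders take the metric name
    // as a parameter is assumed here
    open Metricano.FSharp   // assumed namespace

    // a builder that records execution time against the "ProcessOrder" time metric
    let processOrderTimer = timeMetric "ProcessOrder"

    let processOrder order =
        processOrderTimer {
            // ... do the actual work ...
            return order
        }

    // a builder that bumps the "OrderPlaced" count metric each time it runs
    let orderCounter = countMetric "OrderPlaced"

    let placeOrder order =
        orderCounter {
            // ... do the actual work ...
            return order
        }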

 

PostSharp Aspects

Lastly, you can also use the CountExecution and LogExecutionTime attributes from the Metricano.PostSharpAspects nuget package, which can be applied at method, class and even assembly level.

The CountExecution attribute records a count metric with the fully qualified name of the method, whereas the LogExecutionTime attribute records execution times into a time metric with the fully qualified name of the method. When applied at class or assembly level, the attributes are multicast to all encompassed methods – private, public, instance and static. It's also possible to target specific methods by name, visibility, etc. – please refer to PostSharp's documentation for details.
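Applying them at method level might look something like this (the attribute names come from the package; the namespace open is an assumption):

    // applying the aspects at method level; the namespace here is assumed
    open Metricano.PostSharpAspects

    type OrderService() =
        // records a count metric named after the fully qualified method name
        [<CountExecution>]
        member this.PlaceOrder (orderId : int) =
            // ... place the order ...
            ()

        // records execution times into a time metric named after the method
        [<LogExecutionTime>]
        member this.ProcessOrder (orderId : int) =
            // ... process the order ...
            ()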

 

Publishing Metrics

All the metrics you record won't do you much good if they just stay inside the memory of your application server.

To get metrics out of your application server and into a monitoring service or dashboard, you need to tell Metricano to publish metrics with a set of publishers. There is a ready-made publisher for the Amazon CloudWatch service in the Metricano.CloudWatch nuget package.

To add a publisher to the pipeline, use the Publish.With static method, see example here.
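A sketch of what that wiring might look like at application start-up – the CloudWatchPublisher constructor arguments and the Publisher namespace are assumptions:

    // wiring up a publisher at start-up; the constructor arguments and the
    // namespace are assumed
    open Metricano
    open Metricano.Publisher

    // publish the aggregated metrics to Amazon CloudWatch
    Publish.With(CloudWatchPublisher("MyApp.Production"))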

Since the lowest granularity on Amazon CloudWatch is 1 minute, as an optimization to cut down on the number of web requests (which also has a cost impact), CloudWatchPublisher aggregates metrics locally and only publishes the aggregates on a per-minute basis.

If you want to publish your metric data to another service (StackDriver or New Relic, for instance), you can create your own publisher by implementing the very simple IMetricsPublisher interface. This simple ConsolePublisher, for instance, will calculate the 95th percentile execution times and print them:

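The core of such a publisher might look something like the sketch below, where the shape of the per-second summaries handed to the publisher is an assumption (the real IMetricsPublisher interface may well differ):

    // sketch only: assumes each summary is a (metric name, execution times in ms)
    // pair; the real IMetricsPublisher interface may look different
    let percentile (p : float) (values : float[]) =
        let sorted = Array.sort values
        let idx =
            int (ceil (p / 100.0 * float sorted.Length)) - 1
            |> max 0
            |> min (sorted.Length - 1)
        sorted.[idx]

    // this would sit inside the ConsolePublisher's Publish implementation
    let publish (summaries : (string * float[]) seq) =
        summaries
        |> Seq.iter (fun (name, timesMs) ->
            printfn "%s : 95th percentile = %.2f ms" name (percentile 95.0 timesMs))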

In general I find the 95th/97th/99th percentile time metrics much more informative than simple averages, since averages are so easily skewed by even a small number of outlying data points.
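A quick illustration of the point (the numbers are made up):

    // 97 requests at ~10ms plus 3 outliers at 2000ms: the average tells a very
    // different story from the 95th percentile
    let timesMs = Array.append (Array.create 97 10.0) (Array.create 3 2000.0)

    let mean = Array.average timesMs
    let p95  = (Array.sort timesMs).[int (ceil (0.95 * float timesMs.Length)) - 1]

    printfn "mean = %.1f ms, 95th percentile = %.1f ms" mean p95
    // mean ≈ 69.7 ms (7x the typical request), 95th percentile = 10.0 ms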

 

Summary

I hope you have enjoyed this post and that you'll find Metricano a useful addition to your application.

I highly recommend reading Release It!; many of the patterns and anti-patterns discussed in the book are becoming more and more relevant in today's world, where we're building smaller, more granular microservices. Even the simplest of applications have multiple integration points – social networks, cloud services, etc. – and these are the places where cracks are likely to occur before they spread out to the rest of your application, unless you have taken measures to guard against such cascading failures.

If you decide to buy the book from Amazon, please use the link I provide below or add the query string parameter tag=theburningmon-21 to the URL, so that I can get a small referral fee and use it towards buying more books and finding more interesting things to write about here.

 

Links

Metricano project page

Release It!: Design and Deploy Production-Ready Software

Metricano nuget package

Metricano.CloudWatch nuget package

Metricano.PostSharpAspects nuget package

Red-White Push – Continuous Delivery at Gamesys Social


The monster trapping mechanics in Here Be Monsters are fairly straightforward:

  • Monsters have a type and a set of stats – Strength, Speed and IQ
  • They have a rarity value which determines the likelihood of an encounter
  • They have a set of baits they like, which can increase the likelihood of an encounter
  • Traps can catch monsters of matching types
  • Traps also have a set of stats – Strength, Speed and Tech
  • The chance of catching a monster is determined by the trap's stats vs the monster's stats
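In rough F# terms the domain might be sketched like this (the field names are illustrative, not the actual game model):

    // illustrative only -- not the actual Here Be Monsters data model
    type MonsterStats = { Strength : int; Speed : int; IQ : int }
    type TrapStats    = { Strength : int; Speed : int; Tech : int }

    type Monster =
        { Type          : string
          Stats         : MonsterStats
          Rarity        : float          // determines the likelihood of an encounter
          FavouredBaits : string list }  // baits that increase the encounter likelihood

    type Trap =
        { Type  : string                 // can only catch monsters of a matching type
          Stats : TrapStats }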


It's as simple as it sounds. Unless, of course, you're the game designer responsible for setting the stats for the trap so that:

a. you achieve the intended catch rate % against each of the monsters, and

b. the distribution of the stats 'makes sense', i.e. a low-tech box trap should have higher Strength than Tech

 

The naive approach would be to start with a guesstimate and then use trial-and-error until you converge upon an answer, or an approximation that is considered good enough. This would be laborious and error-prone, and unlikely to yield the optimal result (barring the Herculean effort of a persistent game designer…).

 

To automate this process and aid our game designers, we designed and implemented a simple genetic algorithm in F# that would search for the optimal solution based on:

  • the intended % catch rate for each monster
  • an error margin
  • an initial set of stats that defines the ideal distribution of stats

The game designers can use our custom web tool to run the algorithm, for example:

[image: the custom web tool for running the algorithm]

 

In simple terms, a genetic algorithm starts with a set of potential solutions and iteratively generates new generations of solutions using a selection and a mutation process, such that:

  • the selection process chooses which of the solutions survive (survival of the fittest and all that), based on a fitness function
  • the mutation process generates new solutions using the surviving solutions

The iteration continues until one of the termination conditions has been met – for example, when a solution is found, or when we've reached the maximum number of generations allowed.

[image: the genetic algorithm's selection/mutation loop]
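Stripped of the domain specifics, the skeleton of such an algorithm might look something like this (a generic sketch, not our actual implementation):

    // generic skeleton only -- not the actual Here Be Monsters implementation.
    // 'fitness' scores a solution, 'mutate' derives new candidates from a survivor,
    // and the loop stops when a good-enough solution is found, the population dies
    // out, or we run out of generations.
    let rec evolve (fitness        : 'a -> float)
                   (mutate         : 'a -> 'a list)
                   (isGoodEnough   : 'a -> bool)
                   (maxGenerations : int)
                   (population     : 'a list) =
        match population |> List.tryFind isGoodEnough with
        | Some solution -> Some solution
        | None when maxGenerations = 0 || List.isEmpty population -> None
        | None ->
            // selection: keep a mutated solution only if it beats the solution
            // it was mutated from
            let survivors =
                population
                |> List.collect (fun parent ->
                    mutate parent
                    |> List.filter (fun child -> fitness child > fitness parent))
            evolve fitness mutate isGoodEnough (maxGenerations - 1) survivors

In our case a solution is a set of trap stats, the fitness function compares the catch rates a solution produces against the intended targets, and the mutation process nudges the stats as described below.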

 

In our algorithm, each solution is a set of stats for the trap; the selection process calculates the catch rate for each of the monsters using the solution, and keeps the solution if it's better than the solution it was mutated from.

The mutation process then takes each of the surviving solutions and generates new solutions by tweaking the stats in a number of ways:

  • +/- a small amount on each of Strength/Speed/Tech (generates better solutions when we're close to an optimal solution)
  • +/- a large amount on each of Strength/Speed/Tech (generates noticeably different solutions when we're far from an optimal solution)

So from an initial solution of Strength:100, Speed:100 and Tech:200, you can end up with a number of different solutions for the next generation:

[image: example mutations of the initial solution]

This process continues until either:

  • the max number of generations has been exhausted, or
  • none of the new solutions survive the selection process

The final survivors are then filtered using the error margin specified by the game designer, and sorted by how close they are to the specified target catch rates, e.g.

[image: example output of filtered and sorted solutions]

 

We have also applied the same technique and implemented genetic algorithms to:

  • find stats for a monster that will give it the intended catch rate for a number of traps (the inverse of the above)
  • find configurations for baits so that we can achieve the desired encounter rate with a monster when using that bait (see below for an example)

[image: an example bait configuration run]

 

 

So there you go, I hope you enjoyed reading about another place where a bit of F# magic has come in handy.

The code for the genetic algorithms is not very complicated (or very long), but it is incredibly specific to our domain, hence I haven't included the actual code in this post. But hopefully I've managed to give you at least a flavour of what genetic algorithms are and how you might be able to apply them (with F# if possible!) in your own solution.


Nowadays you see plenty of stories about Continuous Integration, Continuous Delivery and Continuous Deployment on the web, and it's great to see that the industry is moving in this direction, with more and more focus on automation rather than hiring humans to do a job that machines are so much better at.

But most of these stories are also not very interesting, because they tend to revolve around MVC-based web sites that control both the server and the client (since the client is just the server-generated HTML), so there are really no synchronization or backward compatibility issues between the server and the client. It's a great place to be, not having those problems, but they are real concerns for us, for reasons we'll go into shortly.

 

The Netflix Way

One notable exception is the continuous deployment story from Netflix, which Carl Quinn also talked about as part of an overview of the Netflix architecture in this presentation.

For me, there are a number of things that make the Netflix continuous deployment story interesting and worth studying:

  • Scale – more than 1000 different client devices and over a quarter of internet traffic
  • Aminator – whilst most of us try to avoid creating new AMIs when we need to deploy new versions of our code, Netflix has decided to go the other way and instead automate away the painful, manual steps involved in creating new AMIs, and in return get better start-up times as their VMs come pre-baked

[image: Aminator]

  • Use of Canary Deployment – dipping your toe in the water by routing a small fraction of your traffic to a canary cluster to test it out in the wild (it's worth mentioning that this facility is also provided out-of-the-box by Google AppEngine)
  • Red/Black push – a clever word play (and a reference to the Netflix colour, I presume?) on the classic blue-green deployment, but also making use of AWS's auto-scaling service as well as Netflix's very own Zuul and Asgard services for routing and deployment.

[image: Red/Black push]

I've not heard any updates yet, but I'm very interested to see how the Netflix deployment pipeline has changed over the last 12 months, especially now that Docker has become widely accepted in the DevOps community. I wonder whether baking Docker images is a viable alternative to baking AMIs – Aminator could be adapted (and renamed, since it would no longer be baking AMIs) to bake Docker images instead, which could then be fetched and deployed from a private repository.

If you have seen any recent talks/posts that provide more up-to-date information, please feel free to share them in the comments.

 

Need for Backward Compatibility

One interesting omission from all the Netflix articles and talks I have found so far has been how they manage backward compatibility issues between their server and client. One would assume that it must be an issue that comes up regularly whenever you introduce a big new feature or breaking changes to your API and you are not able to do a synchronous, controlled update to all your clients.

To illustrate a simple scenario that we run into regularly, let's suppose that in a client-server setup:

  • we have an iPhone/iPad client for our service which is currently version 1.0
  • we want to release a new version 1.1 with brand spanking new features
  • version 1.1 requires breaking changes to the service API

[image: AppStore-Update]

In the scenario outlined above, the server changes must be deployed before reviewers from Apple open up the submitted build, or else they will find an unusable/unstable application which they'll no doubt fail, putting you back to square one.

Additionally, after the new version has been approved and you have marked it as available in the AppStore, it takes up to a further 4 hours before the change is propagated through the AppStore globally.

This means your new server code has to be backward compatible with the existing (version 1.0) client.

 

In our case, we currently operate a number of social games on Facebook and mobile (both iOS and Android devices), and each game has a complete and independent ecosystem of backend services that support all of its client platforms.

Backward compatibility is an important issue for us because of scenarios such as the one above, which is further complicated by the involvement of other app stores and platforms such as Google Play and the Amazon App Store.

We also found through experience that every time we force our players to update the game on their mobile devices, we alienate and anger a fair chunk of our player base, who will leave the game for good and occasionally leave harsh reviews along the way. That is why, even though we have the capability to force players to update, it's a capability we use only as a last resort. The implication is that, in practice, you can have many versions of clients all accessing the same backend service, which has to maintain backward compatibility all the way through.

 

Deployment at Gamesys Social

Currently, most of our games follow this basic deployment flow:

[image: Current-Blue-Green]

[image: Blue-Green-Deploy]

The steps involved in releasing to production follow the basic principles of Blue-Green Deployment, and although it helps eliminate downtime (since we are pushing out changes in the background whilst keeping the service running, so there is no visible disruption from the client's point of view), it does nothing to eliminate or reduce the need for maintaining backward compatibility.

Instead, we diligently manage backward compatibility via a combination of careful planning, communication, domain expertise and testing. Whilst it has served us well enough so far, it's hardly fool-proof, not to mention the amount of coordinated effort required and the extra complexity it introduces to our codebase.

 

Having considered going down the API versioning route, and its maintainability implications, we decided to look for a different way – which is how we ended up with a variant of Netflix's Red-Black deployment approach that we internally refer to as…

 

Red-White Push

Our Red-White Push approach takes advantage of our existing discovery mechanism, whereby the client authenticates itself against a client-specific endpoint along with its build version.

Based on the client type and version, the discovery service routes the client to the corresponding cluster of game servers.
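Conceptually the routing decision is little more than a lookup on client type and build version; a deliberately simplified sketch (the real discovery service is of course far more involved, and the endpoints here are made up):

    // deliberately simplified: the real discovery service is far more involved,
    // and these cluster endpoints are made-up examples
    type ClientType = IOS | Android | Facebook

    // decide which cluster of game servers a client should talk to
    let clusterFor (clientType : ClientType) (buildVersion : string) =
        match clientType, buildVersion with
        | IOS, "1.1" -> "https://red-cluster.example.com"    // new cluster with breaking API changes
        | IOS, _     -> "https://white-cluster.example.com"  // old cluster with compatible server code
        | _          -> "https://default-cluster.example.com"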

[image: red-white-push]

With this new flow, the earlier example might look something like this instead:

[image: AppStore-Update-RWP]

The key differences are:

  • instead of deploying over the existing service whilst maintaining backward compatibility, we deploy to a new cluster of nodes which will only be accessed by v1.1 clients, hence there is no need to support backward compatibility
  • existing v1.0 clients will continue to operate, and will access the cluster of nodes running the old (but compatible) server code
  • we scale down the white cluster gradually as players update to the v1.1 client
  • once we decide to no longer support v1.0 clients, we can safely terminate the white cluster

 

Despite what the name suggests, you are not actually limited to only red and white clusters. Furthermore, you can still use the aforementioned Blue-Green Deployment for releases that don't introduce breaking changes (and therefore don't require synchronized updates to both client and server).

 

We're still a long way from where we want to be, and there are still lots of things in our release process that need to be improved and automated, but we have come a long way from even 12 months ago.

As one of my ex-colleagues said:

"Releases are not exciting anymore"

- Will Knox-Walker

and that is the point – making releases non-events through automation.

 

References

Netflix – Deploying the Netflix API

Netflix – Preparing the Netflix API for Deployment

Netflix – Announcing Zuul: Edge Service in the Cloud

Netflix – How we use Zuul at Netflix

Netflix OSS Cloud Architecture (Parleys presentation)

Continuous Delivery at Netflix – From Code to the Monkeys

Continuous Delivery vs Continuous Deployment

Martin Fowler – Blue-Green Deployment

ThoughtWorks – Implementing Blue-Green Deployments with AWS

Martin Fowler – Microservices


DISCLAIMER: as always, you should benchmark against your payload and use case; the benchmark numbers I have produced here are unlikely to be representative of your use cases, and neither are anybody else's benchmark numbers.

You can use the simple test harness I created and the example code to benchmark against your particular payload.

 

Json.Net, ServiceStack.Text, the MongoDB Driver and Jil were all updated to the latest version.

RpgMaker's NetJSON serializer has also been added to the mix, and the results are really impressive – a level of performance that's almost identical to protobuf-net!

[charts: JSON serializer benchmark results]

 

Versions tested:

Jil 1.7.0
ServiceStack.Text 4.0.24
Json.Net 6.0.4
fastJson 2.1.1.0
MongoDB Driver 1.9.2
System.Json 4.0.20126.16343
System.Text.Json 1.9.9.1
JsonFx 2.0.1209.2802
JayRock 0.9.16530