A look at Microsoft Orleans through Erlang-tinted glasses

Some time ago, Microsoft announced Orleans, an implementation of the actor model in .NET designed for cloud environments where instances are ephemeral.

We're currently working on a number of projects in Erlang and have run into some assumptions in distributed Erlang which don't hold true in a cloud-hosted environment where nodes are ephemeral and entire topologies are constantly in flux. Also, as most of our backend code for Gamesys Social is in .NET, being able to work with languages that we're already familiar with is a win for the team (more people being able to work on the codebase, for instance).

As such I have been taking an interest in Orleans to see if it represents a good fit, and whether or not it lives up to some of its lofty promises around scalability, performance and reliability. Below is an account of my personal views having read the paper, downloaded the SDK, looked through the samples and followed Richard Astbury's Pluralsight course.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Update 2014/12/08:

Since I posted this the other day, there has been some great feedback from the Orleans team which clarified several places where I had previously misunderstood things based on the information available at the time of writing. Some of my other concerns still remain, but at least two of the biggest sticking points – single point of failure and at-least-once message delivery – have been disproved.

As such, I've updated this post in several places to incorporate the new information that the Orleans team have provided via the comments.

I've left what was previously written untouched, but look out for the impacted sections (* followed by an underlined paragraph) throughout the post to see the relevant new information. In these callout sections I have focused on the correct behaviour you should expect based on corrections from Sergey and Gabriel; if you're interested in the background and rationale behind these decisions, please read their comments in full.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TL;DR

When I first read about Orleans, I was hesitant because of the use of code-gen (reminiscent of WCF there), and because the underlying message passing mechanism is hidden from you so you end up with an RPC mechanism (again, reminiscent of WCF…).

However, after spending some time with Orleans, I can definitely see its appeal – convenience and productivity. I was able to get something up and running quickly and with ease. My original concerns about code-gen and RPC didn't get in the way of me getting stuff done.

As I dug deeper into how Orleans works, though, a number of more worrying concerns surfaced regarding some of its core design decisions.

For starters, *1 it's not tolerant of partitions between the cluster and the data store used for its Silo management. Should that data store be partitioned off or suffer an outage, the result is a full outage of your service. These are not the traits of a masterless, partition-tolerant system that you want when you have strict uptime requirements.

When everything is working, Orleans guarantees that there is only one instance of a virtual actor running in the system, but when a node is lost the cluster's knowledge of nodes will diverge and during this time the single-activation guarantee becomes eventually consistent. However, you can provide stronger guarantees yourself (see the Silo Membership section below).

*2 Orleans uses at-least-once message delivery, which means it's possible for the same message to be sent twice when the receiving node is under load or simply fails to acknowledge the first message in a timely fashion. This, again, is something you can mitigate yourself (see the Message Delivery Guarantees section below).

Finally, its task scheduling mechanism appears to be identical to that of a naive event loop and exhibits all the weaknesses of an event loop (see the Task Scheduling section below).

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*1 As Gabriel and Sergey both explained in the comments, the membership management works quite differently from what I first thought. Once connected, all heartbeats are sent between pairs of nodes using a linear algorithm, and the backend data store is only used for reaching agreement on which nodes are dead and for disseminating the new membership view to all other nodes.

In this case, losing the connection to the backend data store would not impact existing, connected clusters, making it partition tolerant. If the backend data store becomes unavailable at the same time as your cluster topology is changing, then membership updates will be hindered and new nodes will be unable to join the cluster.

Hopefully the implementation details of the membership management will be discussed in more detail in the Orleans team's future posts. Also, since Orleans will be open sourced in early 2015, we'll be able to take a closer look at exactly how this behaves once the source code is available.

*2 Gabriel pointed out that by default Orleans does not resend messages that have timed out. So by default it uses at-most-once delivery, but it can be configured to automatically resend upon timeout if you want at-least-once delivery instead.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Terminology

In Orleans, a Grain represents an actor, and each node has a Silo which manages the lifetime of the grains running inside it. A Grain is activated when it receives a request, and can be deactivated after it has been idle for a while. The Silo will also remove deactivated Grains from memory to free up resources.
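To make the terminology concrete, here is roughly what the programming model looks like. This is a sketch based on the preview-era SDK; the ITournamentGrain interface, its methods and the code-generated TournamentGrainFactory are illustrative assumptions, not taken from the Orleans samples.

```csharp
using System;
using System.Threading.Tasks;
using Orleans;

// a grain is defined by an interface whose methods all return Task or Task<T>
public interface ITournamentGrain : IGrain
{
    Task RegisterPlayer(Guid playerId);
    Task<int> GetPlayerCount();
}

public class TournamentClient
{
    public async Task Run(Guid tournamentId, Guid playerId)
    {
        // the code-gen'd factory hands you a grain reference; calling a method on
        // it sends a message to the grain's activation, wherever the Silo has
        // placed it – to the caller it just looks like an async RPC call
        var tournament = TournamentGrainFactory.GetGrain(tournamentId);
        await tournament.RegisterPlayer(playerId);

        var count = await tournament.GetPlayerCount();
        Console.WriteLine("{0} players registered", count);
    }
}
```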

Orleans' Grains are referred to as virtual actors. They are virtual because the state of a Grain can be persisted to a storage service, and then reinstated as the Grain is reactivated (after being deactivated due to idleness) on another node. This is a nice abstraction from a developer's point of view, and to enable this level of indirection, Orleans introduces the mechanism of storage providers.

Storage Providers

To use storage providers, you first need to define an interface that represents the state of your Grain, and have it inherit from Orleans' IGrainState interface. For example:

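A minimal sketch of what that might look like; the members shown here are illustrative, and the only important part is inheriting from IGrainState:

```csharp
using System;
using System.Collections.Generic;
using Orleans;

public interface ITournamentGrainState : IGrainState
{
    string Name { get; set; }
    DateTime StartTime { get; set; }
    List<Guid> Players { get; set; }
}
```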

Then in your Grain implementation class, you provide this ITournamentGrainState interface as a generic type parameter to the Orleans grain base class, as below. You also need to specify the storage provider you want to use via the [StorageProvider] attribute. The ProviderName specified here points to a storage provider you define in the configuration file for Orleans.

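Roughly like this (a sketch; the exact name and namespace of the grain base class has varied between Orleans releases, and ITournamentGrain is an assumed grain interface):

```csharp
using Orleans;
using Orleans.Providers;

// ProviderName must match a storage provider declared in the Orleans configuration file
[StorageProvider(ProviderName = "DynamoDBStorage")]
public class TournamentGrain : GrainBase<ITournamentGrainState>, ITournamentGrain
{
    // grain methods go here – the runtime hands us the persisted state via this.State
}
```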

When a Grain is activated in a Silo, the Orleans runtime will go and fetch the state of the Grain for us and put it in an instance member called State. For instance, when the ActivateAsync method is called, the state of our TournamentGrain would have been populated from the DynamoDBStorage provider we created:

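A sketch of what that might look like (the console logging is just for illustration):

```csharp
// inside TournamentGrain
public override Task ActivateAsync()
{
    // by the time this runs, this.State has already been populated from the
    // DynamoDBStorage provider configured on the grain
    Console.WriteLine("Activated tournament '{0}' with {1} players",
                      State.Name, State.Players.Count);
    return Task.FromResult(0);
}
```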

You can modify the state by modifying its members, but the changes will not be persisted to the backend storage service until you call the State.WriteStateAsync method. Un-persisted changes will be lost when the Grain is deactivated by the Silo, or if the node itself is lost.

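For instance, a sketch (RegisterPlayer is an illustrative grain method):

```csharp
// inside TournamentGrain
public async Task RegisterPlayer(Guid playerId)
{
    State.Players.Add(playerId);     // only mutates the in-memory copy of the state
    await State.WriteStateAsync();   // nothing is persisted until this call
}
```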

Finally, there are a number of built-in storage providers, such as Azure Table Storage, but it's trivial to implement your own. To implement a custom storage provider, you just need to implement the IStorageProvider interface.

Storage providers make it very easy to create actors that can be resumed after deactivation, but you need to be mindful of a number of things:

  • how often you persist state is a trade-off between durability on one hand and performance and cost on the other
  • if multiple parts of the state need to be modified in one call, you need a rollback strategy in place in case of exceptions, or you risk leaving dirty writes in your state (see the Not letting it crash section below, and the sketch after this list)
  • you need to handle the case where persistence fails – since you've mutated the in-memory state, if persistence fails do you roll back, or do you continue and hope you get another chance to persist the state before the actor is deactivated through idleness or the node crashes?
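Here is a minimal sketch of one way to handle the rollback concern from the second point, assuming the ITournamentGrainState shape used earlier: snapshot the part of the state you are about to change, and restore it if persistence fails.

```csharp
// inside TournamentGrain
public async Task RegisterPlayers(IEnumerable<Guid> playerIds)
{
    // snapshot the collection we're about to mutate
    var snapshot = new List<Guid>(State.Players);

    foreach (var id in playerIds)
    {
        State.Players.Add(id);
    }

    try
    {
        await State.WriteStateAsync();
    }
    catch (Exception)
    {
        State.Players = snapshot;   // roll back the in-memory mutation
        throw;                      // surface the failure to the caller
    }
}
```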

In Erlang, there is no built-in mechanism for storage providers, but there is also nothing stopping you from implementing this yourself. Have a look at Bryan Hunter's CQRS with Erlang talk at NDC Oslo 2014 for inspiration.

Silo Membership

Silos use a backend store to manage Silo membership; by default this is Azure Table Storage. From the MSR paper, this is what it has to say about Silo membership:

“Servers automatically detect failures via periodic heartbeats and reach an agreement on the membership view. For a short period of time after a failure, membership views on different servers may diverge, but it is guaranteed that eventually all servers will learn about the failed server and have identical membership views… if a server was declared dead by the membership service, it will shut itself down even if the failure was just a temporary network issue.”

Furthermore, on the guarantee that an actor (or a Grain with a specific ID) is only activated on one node:

“In failure-free times, Orleans guarantees that an actor only has a single activation. However, when failures occur, this is only guaranteed eventually.

“Membership is in flux after a server has failed but before its failure has been communicated to all survivors. During this period, a register-activation request may be misrouted if the sender has a stale membership view… However, it may be that two activations of the same actor are registered in two different directory partitions, resulting in two activations of a single-activation actor. In this case, once the membership has settled, one of the activations is dropped from the directory and a message is sent to its server to deactivate it.”

There are a couple of things to note about Silo membership management from the above:

  • *3 the way servers detonate when they lose connectivity to the storage service means it's not partition-tolerant, because if the storage service is partitioned from the cluster, even for a relatively short amount of time, there's a chance for every node running a Silo to self-detonate;
  • *3 there is a single point of failure at the storage service used to track Silo membership; any outage of the storage service results in an outage of your Orleans service too (this happened to Halo 4);
  • it offers strong consistency during the good times, but falls back to eventual consistency during failures;
  • whilst it's not mentioned, I speculate that depending on the size of the cluster and the time it takes to converge on Silo membership views across the cluster, it's possible to have more than two activations of the same Grain in the cluster;
  • the conflict resolution approach above suggests that one activation is chosen at random and the rest are discarded; this seems rather naive and means losing intermediate changes recorded on the discarded Grain activations;
  • since each activation can be persisting its state independently, it's possible for the surviving Grain activation's internal state to be out of sync with what had been persisted;
  • these failure times can happen a lot more often than you think: nodes can be lost due to failure, but also as a result of planned/automatic scale-down events throughout the day as traffic patterns change (Orleans is designed for the cloud and all its elastic scalability, after all).

During failures, it should be possible to provide a stronger guarantee on single activation using optimistic concurrency around the Grain's state. For instance:

1. node A fails; the cluster's view of Silos has now diverged

2a. the Grain receives a request on node B, and is activated with state v1

2b. the Grain receives a request on node C, and is activated with state v1

3a. the Grain on node B finishes processing its request, and succeeds in saving state v2

3b. the Grain on node C finishes processing its request, but fails to save state v2 (optimistic concurrency at work here)

4. the Grain on node C fails the request and triggers deactivation

5. the cluster only has one activation of the Grain, on node B

Enforcing a stronger single-activation guarantee in this fashion should also remove the need for better conflict resolution. For this approach to work, you will need to be able to detect persistence failures due to optimistic concurrency. In DynamoDB, this can be identified by a conditional check error.
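Here's a sketch of what this could look like at the grain level. It assumes a custom storage provider that does a conditional (versioned) write and surfaces a lost write as a dedicated exception type; StateConflictException is hypothetical, as are the RecordScore method and the Scores member.

```csharp
// inside TournamentGrain
public async Task RecordScore(Guid playerId, int score)
{
    State.Scores[playerId] = score;

    try
    {
        // the custom provider performs a conditional write on a version number;
        // in DynamoDB a lost race surfaces as a conditional check failure
        await State.WriteStateAsync();
    }
    catch (StateConflictException)   // hypothetical exception thrown by the provider
    {
        // another activation of this grain won the write, so step aside and let
        // the cluster converge back to a single activation
        DeactivateOnIdle();
        throw;
    }
}
```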

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*3 Again, as per Gabriel and Sergey's comments below, this is not true and there's no single point of failure in this case. See *1 above, or read the comments for more detail.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Distributed Erlang employs a different approach for forming clusters. Nodes form a mesh network where every node is connected to every other node. They use a gossip protocol to inform each other when nodes join or leave the cluster.

[diagram: a fully connected mesh of Erlang nodes]

This approach has a scalability limitation and doesn't scale well to thousands, or even hundreds, of nodes depending on the capabilities of each node. This is because the overhead of forming and maintaining the cluster increases quadratically with the number of nodes. The effect is particularly evident if you require frequent inter-node communications, or need to use functions in the global built-in module.

In this particular space, SD (scalable distributed) Erlang is attempting to address this shortcoming by allowing you to create groups of sub-clusters amongst your nodes, so that the size of the mesh network is limited to the size of the sub-clusters.

Random Actor Placement

Another interesting choice Orleans made is that, instead of using consistent hashing for actor placement (a technique that has been used successfully in a number of key-value stores such as Couchbase and Riak), Orleans adds another layer of indirection here by introducing the Orleans directory.

“The Orleans directory is implemented as a one-hop distributed hash table (DHT). Each server in the cluster holds a partition of the directory, and actors are assigned to the partitions using consistent hashing. Each record in the directory maps an actor id to the location(s) of its activations.”

[diagram: the Orleans directory as a one-hop DHT, with a directory partition on each server]

The rationale for this decision is that *4 random placement of actors helps avoid the creation of hotspots in your cluster which might result from poorly chosen IDs, or bad luck. But it means that, to retain correctness, every request to an actor now requires an additional hop to the directory partition first. To address this performance concern, each node uses a local cache to store where each actor is.

I think this is a well-meaning attempt at solving a problem, but I'm not convinced it's a problem that deserves:

  1. the additional layer of indirection, and
  2. the subsequent problem of performance, and
  3. the subsequent use of a local cache, and
  4. *5 the problem of cache invalidation that comes with it (which, as we know, is one of the two hard problems in CS)

Is it really worth it? Especially when the IDs are GUIDs by default, which hash well. Would it not be better to solve it with a better hashing algorithm?

From my personal experience of working with a number of different key-value stores, actor placement has never been an issue significant enough to deserve the special treatment Orleans has given it. I'd really like to see the results of any empirical study that shows this to be a big enough issue in real-world key-value store usage.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*4 As Sergey mentioned in the comments, you can do a few more things, such as using a PreferLocalPlacement strategy to “instruct the runtime to place an activation of a grain of that type local to the first caller (another grain) to it. That is how the app can hint about optimizing placement.” This appears to be the same as when you spawn a new process in Erlang. It would require further clarification from Sergey or Gabriel, but I'd imagine the placement strategy probably applies at the type level for each type of grain.

The additional layer of abstraction does buy you some more flexibility, and not having to move grains around when the topology changes probably simplifies things from the implementation point of view too (I imagine moving the directory information around is cheaper and easier than moving the grains themselves).

In the Erlang space, Riak Core provides a set of tools to help you build distributed systems, and its approach gives you more control over the behaviour of your system. You do however have to implement a few more things yourself, such as how to move data around when the cluster topology changes (though the vnode behaviour gives you the basic template for doing this) and how to deal with collisions, etc.

*5 Hitting a stale cache is not much of a problem in this case: Orleans would do a new lookup, forward the message to the right destination and update the cache.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As an aside, here's how Riak does consistent hashing and read/write replication:

[image: Riak NRW – the consistent hashing ring and N/R/W replication]

Message Delivery Guarantees

“Orleans provides at-least-once message delivery, by resending messages that were not acknowledged after a configurable timeout. Exactly-once semantics could be added by persisting the identifiers of delivered messages, but we felt that the cost would be prohibitive and most applications do not need it. This can still be implemented at the application level.”

The decision to use at-least-once message delivery as the default is a contentious one in my view. *6 A slow node will cause messages to be sent twice, and handled twice, which is probably not what you want most of the time.

Whilst the rationale regarding cost is valid, it seems to me that allowing the message delivery to time out and letting the caller handle timeout cases is the more sensible choice here. It'd make the handling of timeouts an explicit decision on the application developer's part, probably on a per-call basis since some calls are more important than others.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*6 The default behaviour is to not resend on timeout, so by default Orleans actually uses at-most-once delivery. But you can configure it to automatically resend upon timeout, i.e. at-least-once.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Not letting it crash

Erlang's mantra has always been to Let It Crash (except when you shouldn't). When a process crashes, it can be restarted by its supervisor. Using techniques such as event sourcing, it's easy to return to the previous state just before the crash.

When an exception is thrown inside an Orleans Grain, the exception does not crash the Grain itself and is simply reported back to the caller instead. This simplifies things, but runs the risk of leaving behind dirty writes (hence corrupting the state) in the wake of an exception. For example:

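Something along these lines (an illustrative method, not the original screenshot): the first mutation has already been applied when the exception is thrown, the Grain keeps running, and its in-memory state is now corrupted.

```csharp
// inside TournamentGrain
public Task TransferPoints(Guid fromPlayer, Guid toPlayer, int points)
{
    State.Points[fromPlayer] -= points;   // mutation #1 is applied

    // if toPlayer isn't in the dictionary this throws KeyNotFoundException; the
    // exception is reported back to the caller, the grain carries on running,
    // and mutation #2 never happens – the deducted points have simply vanished
    State.Points[toPlayer] += points;

    return State.WriteStateAsync();
}
```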

The choice of not crashing the grain in the event of an exception offers convenience at the cost of breaking the atomicity of operations, and personally it's not a choice that I agree with.

Reentrant Grains

In a discussion on the Actor model with Erik Meijer and Clemens Szyperski, Carl Hewitt (father of the Actor model) said:

“Conceptually, messages are processed one at a time, but the implementation can allow for concurrent processing of messages.”

In Orleans, grains normally process messages one at a time. To allow concurrent processing of messages, you can mark your grain implementation with the [Reentrant] attribute.

Reentrant grains can be used as an optimization technique to remove bottlenecks in your network of grains.
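For example, a minimal sketch of a reentrant grain (the base class name is from the preview-era SDK, and ILeaderboardGrain is an assumed grain interface declaring the same two methods):

```csharp
[Reentrant]
public class LeaderboardGrain : GrainBase, ILeaderboardGrain
{
    private readonly Dictionary<Guid, int> scores = new Dictionary<Guid, int>();

    public Task SubmitScore(Guid playerId, int score)
    {
        scores[playerId] = score;
        return Task.FromResult(0);
    }

    public async Task<int> GetPlayerCount()
    {
        // because the grain is reentrant, other requests (e.g. SubmitScore) can be
        // interleaved at this await, so 'scores' may change before the next line runs
        await Task.Delay(100);   // stand-in for some asynchronous work
        return scores.Count;
    }
}
```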

“One actor is no actor, they come in systems, and they have to have addresses so that one actor can send messages to another actor.”

– Carl Hewitt

However, using reentrant grains means you lose the guarantee that the state is accessed sequentially, and you open yourself up to potential race conditions. You should use reentrant grains with great care and consideration.

In Erlang, concurrent processing of messages is not allowed. But you don't have to block your actor whilst it waits for some expensive computation to complete. Instead, you can spawn another actor and ask the child actor to carry on with the expensive work whilst the parent actor processes the next message. This is possible because the overhead and cost of spawning a process in Erlang is very low, and the runtime can easily handle tens of thousands of concurrent processes and load balance across the available CPU resources via its schedulers.

If necessary, once the child actor has finished its work it can send a message back to the parent actor, which can then perform any subsequent computation as required.

Using this simple technique, you acquire the same capability that reentrant grains offer. You can also control which messages can be processed concurrently, rather than the all-or-nothing approach that reentrant grains use.

Immutable Messages

*7 Messages sent between Grains are usually serialized and then deserialized, which is an expensive overhead when both Grains are running on the same machine. You can optionally mark the messages as immutable so that they won't be serialized when passed between Grains on the same machine, but this immutability promise is not enforced at all and it's entirely down to you to apply the due diligence of not mutating the messages.
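For example, a sketch assuming the [Immutable] marker attribute (Orleans also provides an Immutable<T> wrapper for the same purpose); the message type itself is illustrative:

```csharp
[Immutable]
public class TournamentResult
{
    public Guid TournamentId { get; private set; }
    public IReadOnlyList<Guid> Winners { get; private set; }

    public TournamentResult(Guid tournamentId, IReadOnlyList<Guid> winners)
    {
        TournamentId = tournamentId;
        Winners = winners;
    }

    // the attribute only tells Orleans it can skip the copying/serialization step;
    // nothing stops a caller from mutating a list it still holds a reference to,
    // so the type should be designed to be genuinely immutable as well
}
```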

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*7 A clarification here: messages are only serialized and deserialized when they are sent across nodes; messages sent between grains on the same node are deep-copied instead, which is cheaper than serialization. Marking the type as immutable skips the deep-copying process too.

But it's still your responsibility to enforce the immutability guarantee.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In Erlang, variables are immutable so there is no need to do anything explicit.

Task Scheduling

“Orleans schedules application turns using cooperative multitasking. That means that once started, an application turn runs to completion, without interruption. The Orleans scheduler uses a small number of compute threads that it controls, usually equal to the number of CPU cores, to execute all application actor code.”

This is another point of concern for me.

Here we're exhibiting the same vulnerabilities as an event-loop system (e.g. Node.js, Tornado), where a single poisoned message can cripple your entire system. Even without poisoned messages, you are still left with the problem of not distributing CPU resources evenly across actors, and of allowing slow/misbehaving actors to badly impact your latency for other pending requests.

Even having multiple cores and one thread per core (which is a sane choice) is not going to save you here. All you need is one slow-running actor on each processor-affined execution thread to halt the entire system.

Erlang's approach to scheduling makes much more sense – one scheduler per core, and an actor is allowed to execute 2000 reductions (think of one reduction as one function call to do something) before it has to yield the CPU so that another actor can get a slice of the CPU time. The original actor will then wait for its turn to run again.

This CPU-sharing policy is no different from what the OS does with threads, and there's a good reason for that.

Ease of use

I think this is the big winner for Orleans and the focus of its design goals.

I have to admit, I was pleasantly surprised by how easily and quickly I was able to put a service together and have it running locally. Based on what I have seen of the samples and Richard's Pluralsight course, deploying to the cloud is pretty straightforward too.

Cloud Friendliness

Another win for Orleans here, as it's designed from the start to deal with cluster topologies that can change dynamically with ephemeral instances. Distributed Erlang, on the other hand – at least distributed OTP – assumes a fixed topology where nodes have well-defined roles at the start. There are also challenges around getting the built-in distributed database – Mnesia – to work well in a dynamically changing topology.

Conclusion

In many ways, I think Orleans is true to its original design goals of optimizing for developer productivity. But by shielding developers from the decisions and considerations that usually come with building distributed systems, it has also deprived them of the opportunity to build systems that need to be resilient to failures and meet stringent uptime requirements.

But not every distributed system is critical, and not every distributed system needs five to nine nines of uptime. As long as you're informed about the trade-offs that Orleans has made and what they mean to you as a developer, you can at least make an informed choice about if and when to adopt Orleans.

I hope this post will help you make those informed decisions. If I have been misinformed or am incorrect about any part of how Orleans works, please do leave a comment below.

Links