The recording of my Neo4j talk at CodeMesh 2014 is now available.

 


Some time ago, Microsoft announced Orleans, an implementation of the actor model in .Net which is designed for the cloud environment, where instances are ephemeral.

We’re currently working on a number of projects in Erlang and have run into some assumptions in distributed Erlang which don’t hold true in a cloud-hosted environment where nodes are ephemeral and entire topologies are constantly in flux. Also, as most of our backend code for Gamesys Social is in .Net, being able to work with languages that we’re already familiar with is a win for the team (more people being able to work on the codebase, for instance).

As such I have been taking an interest in Orleans to see if it represents a good fit, and whether or not it holds up to some of its lofty promises around scalability, performance and reliability. Below is an account of my personal views having read the paper, downloaded the SDK, looked through the samples and followed Richard Astbury’s Pluralsight course.

 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Update 2014/12/08:

Since I posted this the other day, there has been some great feedback from the Orleans team which clarified several places where I had previously misunderstood things based on the information available at the time of writing. Some of my other concerns still remain, but at least two of the biggest sticking points – single point of failure and at-least-once message delivery – have been disproved.

As such, I have updated this post in several places to incorporate the new information that the Orleans team have provided via the comments.

I’ve left what was previously written untouched, but look out for the impacted sections (* followed by a paragraph that is underlined) throughout the post for the relevant new information. In these callout sections I have focused on the correct behaviour you should expect based on the corrections from Sergey and Gabriel; if you’re interested in the background and rationale behind these decisions, please read their comments in full.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 

 

TL;DR

When I first read about Orleans, I was hesitant because of the use of code-gen (reminiscent of WCF there), and because the underlying message passing mechanism is hidden from you, so you end up with an RPC mechanism (again, reminiscent of WCF…).

However, after spending some time with Orleans, I can definitely see its appeal – convenience and productivity. I was able to get something up and running quickly and with ease. My original concerns about code-gen and RPC didn’t get in the way of me getting stuff done.

As I dug deeper into how Orleans works though, a number of more worrying concerns surfaced regarding some of its core design decisions.

For starters, *1 it’s not tolerant of partitions between the cluster and the data store used for its Silo membership management. Should that data store be partitioned or suffer an outage, it’ll result in a full outage of your service. These are not the traits of a masterless, partition-tolerant system, which is what you want when you have strict uptime requirements.

When everything is working, Orleans guarantees that there is only one instance of a virtual actor running in the system, but when a node is lost the cluster’s knowledge of nodes will diverge, and during this time the single-activation guarantee becomes eventually consistent. However, you can provide stronger guarantees yourself (see the Silo Membership section below).

*2 Orleans uses at-least-once message delivery, which means it’s possible for the same message to be sent twice when the receiving node is under load or simply fails to acknowledge the first message in a timely fashion. This, again, is something that you can mitigate yourself (see the Message Delivery Guarantees section below).

Finally, its task scheduling mechanism appears to be identical to that of a naive event loop and exhibits all the weaknesses of an event loop (see the Task Scheduling section below).

 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*1 As Gabriel and Sergey both explained in the comments, the membership management works quite differently from what I first thought. Once connected, all heartbeats are sent between pairs of nodes using a linear algorithm, and the backend data store is only used for reaching agreement on which nodes are dead and to disseminate the new membership view to all other nodes.

In this case, losing connection to the backend data store would not impact existing, connected clusters, making it partition tolerant. If the backend data store becomes unavailable at the same time as your cluster topology is changing, then it will hinder updates to the membership and stop new nodes from being able to join the cluster.

Hopefully the implementation details of the membership management will be discussed in more detail in the Orleans team’s future posts. Also, since Orleans will be open sourced in early 2015, we’ll be able to get a closer look at exactly how this behaves when the source code is available.

 

*2 Gabriel pointed out that by default Orleans does not resend messages that have timed out. So by default, it uses at-most-once delivery, but can be configured to automatically resend upon timeout if you want at-least-once delivery instead.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 

Terminology

In Orleans, a Grain represents an actor, and each node has a Silo which manages the lifetime of the Grains running inside it. A Grain is activated when it receives a request, and can be deactivated after it has been idle for a while. The Silo will also remove deactivated Grains from memory to free up resources.

Orleans’ Grains are referred to as virtual actors. They are virtual because the state of a Grain can be persisted to a storage service, and then reinstated when the Grain is reactivated (after being deactivated due to idleness) on another node. This is a nice abstraction from a developer’s point of view, and to enable this level of indirection, Orleans introduces the mechanism of storage providers.

 

Storage Providers

To use storage providers, you first need to define an interface that represents the state of your Grain, and have it inherit from Orleans’ IGrainState interface. For example:

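Something along these lines (a minimal sketch – the actual members of the tournament state are my own invention, not from the original example):

    // assumes: using Orleans; using System.Collections.Generic;
    public interface ITournamentGrainState : IGrainState
    {
        string TournamentName { get; set; }
        List<string> PlayerIds { get; set; }
        Dictionary<string, int> Scores { get; set; }
    }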

Then in your Grain implementation class, you provide this ITournamentGrainState interface as a generic type parameter to the Orleans grain base class, as below. You also need to specify the storage provider you want to use via the [StorageProvider] attribute. The ProviderName specified here points to a storage provider you define in the configuration file for Orleans.

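Something like this (a sketch – I’m assuming the preview SDK’s GrainBase&lt;TState&gt; base class; the class would also implement its code-generated communication interface, e.g. an ITournamentGrain):

    // assumes: using Orleans; using Orleans.Providers;
    [StorageProvider(ProviderName = "DynamoDBStorage")]  // points at a provider defined in the Orleans config file
    public class TournamentGrain : GrainBase<ITournamentGrainState>
    {
        // grain methods omitted here – see the snippets below
    }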

When a Grain is activated in a Silo, the Orleans runtime will go and fetch the state of the Grain for us and put it in an instance member called State. For instance, when the ActivateAsync method is called, the state of our TournamentGrain would have been populated from the DynamoDBStorage provider we created:

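For instance, something like this inside the TournamentGrain class (a sketch – the logging is only for illustration, and I’m assuming the parameterless ActivateAsync override from the preview SDK):

    // assumes: using System; using System.Threading.Tasks;
    public override Task ActivateAsync()
    {
        // by the time we get here, the runtime has already loaded the grain's
        // state from the DynamoDBStorage provider into this.State
        Console.WriteLine(
            "Tournament '{0}' activated with {1} players",
            State.TournamentName,
            State.PlayerIds.Count);

        return Task.FromResult(0);   // nothing else to do on activation
    }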

You can modify the state by modifying its members, but the changes will not be persisted to the backend storage service until you call the State.WriteStateAsync method. Un-persisted changes will be lost when the Grain is deactivated by the Silo, or if the node itself is lost.

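For example (a sketch – RegisterPlayer is a made-up method on the hypothetical TournamentGrain):

    public async Task RegisterPlayer(string playerId)
    {
        // mutate the in-memory state; nothing is durable yet, and these changes
        // would be lost if the grain were deactivated (or the node died) now
        State.PlayerIds.Add(playerId);
        State.Scores[playerId] = 0;

        // explicitly write the state back to the backend storage service
        await State.WriteStateAsync();
    }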

Finally, there are a number of built-in storage providers, such as Azure Table Storage, but it’s trivial to implement your own. To implement a custom storage provider, you just need to implement the IStorageProvider interface.

 

Storage providers make it very easy to create actors that can be resumed after deactivation, but you need to be mindful of a number of things:

  • how often you persist state is a trade-off between durability on one hand and performance and cost on the other
  • if multiple parts of the state need to be modified in one call, you need to have a rollback strategy in place in case of exceptions, or risk leaving dirty writes in your state (see the Not letting it crash section below)
  • you need to handle the case where persistence fails – since you’ve mutated the in-memory state, do you roll back, or continue and hope that you get another chance at persisting the state before the actor is deactivated through idleness or the node crashes? (a sketch of one approach follows this list)
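On that last point, here is a rough sketch of one option (entirely my own illustration, and it assumes the state object also exposes a ReadStateAsync method for re-loading the last persisted version):

    public async Task RegisterPlayer(string playerId)
    {
        State.PlayerIds.Add(playerId);
        State.Scores[playerId] = 0;

        try
        {
            await State.WriteStateAsync();
        }
        catch (Exception)
        {
            // 'roll back' the in-memory mutation by re-loading the last state
            // that was successfully persisted, then surface the failure
            await State.ReadStateAsync();
            throw;
        }
    }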

 

In Erlang, there is no built-in mechanism for storage providers, but there is also nothing stopping you from implementing this yourself. Have a look at Bryan Hunter’s CQRS with Erlang talk at NDC Oslo 2014 for inspiration.

 

Silo Membership

Silos use a backend store to manage Silo membership; by default this is Azure Table Storage. Here is what the MSR paper has to say about Silo membership:

“Servers automatically detect failures via periodic heartbeats and reach an agreement on the membership view. For a short period of time after a failure, membership views on different servers may diverge, but it is guaranteed that eventually all servers will learn about the failed server and have identical membership views… if a server was declared dead by the membership service, it will shut itself down even if the failure was just a temporary network issue.”

Furthermore, on the guarantee that an actor (or a Grain with a specific ID) is only activated on one node:

“In failure-free times, Orleans guarantees that an actor only has a single activation. However, when failures occur, this is only guaranteed eventually.

“Membership is in flux after a server has failed but before its failure has been communicated to all survivors. During this period, a register-activation request may be misrouted if the sender has a stale membership view… However, it may be that two activations of the same actor are registered in two different directory partitions, resulting in two activations of a single-activation actor. In this case, once the membership has settled, one of the activations is dropped from the directory and a message is sent to its server to deactivate it.”

There are a couple of things to note about Silo membership management from the above:

  • *3 the way servers detonate when they lose connectivity to the storage service means it’s not partition-tolerant, because if the storage service is partitioned from the cluster, even for a relatively short amount of time, then there’s a chance for every node that is running a Silo to self-detonate;
  • *3 there is a single point of failure at the storage service used to track Silo membership – any outage to the storage service results in an outage of your Orleans service too (this happened to Halo 4);
  • it offers strong consistency during the good times, but falls back to eventual consistency during failures;
  • whilst it’s not mentioned, I speculate that depending on the size of the cluster and the time it takes to converge on the Silo membership view across the cluster, it’s possible to have more than two activations of the same Grain in the cluster;
  • the conflict resolution approach above suggests that one activation is chosen at random and the rest are discarded, which seems rather naive and means losing intermediate changes recorded on the discarded Grain activations;
  • since each activation can be persisting its state independently, it’s possible for the surviving Grain activation’s internal state to be out-of-sync with what has been persisted;
  • these failure periods can happen a lot more often than you think: nodes can be lost due to failure, but also as a result of planned/automatic scale-down events throughout the day as traffic patterns change (Orleans is designed for the cloud and all its elastic scalability, after all).

 

During failures, it should be possible to provide a stronger guarantee on single activation by using optimistic concurrency around the Grain’s state. For instance:

1. node A fails; the cluster’s view of Silos has now diverged

2a. the Grain receives a request on node B, and is activated with state v1

2b. the Grain receives a request on node C, and is activated with state v1

3a. the Grain on node B finishes processing its request, and succeeds in saving state v2

3b. the Grain on node C finishes processing its request, but fails to save state v2 (optimistic concurrency at work here)

4. the Grain on node C fails the request and triggers deactivation

5. the cluster has only one activation of the Grain, on node B

Enforcing a stronger single-activation guarantee in this fashion should also remove the need for better conflict resolution. For this approach to work, you will need to be able to detect persistence failures caused by optimistic concurrency; in DynamoDB, this shows up as a conditional check error.
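A rough sketch of what this could look like inside the grain (entirely my own illustration – the Version property is something you would add to your own state interface, the conditional write is something a custom storage provider would have to implement, StateConflictException is a hypothetical exception such a provider might surface, and I’m assuming a DeactivateOnIdle method to trigger deactivation):

    public async Task SubmitScore(string playerId, int score)
    {
        State.Scores[playerId] = score;
        State.Version += 1;   // version number defined on your own state interface

        try
        {
            // a custom storage provider can turn this into a conditional write,
            // e.g. a DynamoDB conditional PUT that only succeeds if the stored
            // version is still Version - 1
            await State.WriteStateAsync();
        }
        catch (StateConflictException)   // hypothetical exception from the custom provider
        {
            // another activation has already written a newer version of the state;
            // fail this request and trigger deactivation of this activation
            DeactivateOnIdle();
            throw;
        }
    }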

 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*3 Again, as per Gabriel and Sergey’s comments below, this is not true and there’s no single point of failure in this case. See *1 above, or read the comments for more detail.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 

Distributed Erlang employs a different approach to forming clusters. Nodes form a mesh network where every node is connected to every other node, and they use a gossip protocol to inform each other when nodes join or leave the cluster.

[diagram: Erlang nodes forming a fully connected mesh]

This approach has a scalability limitation and doesn’t scale well to thousands, or even hundreds, of nodes, depending on the capabilities of each node. This is because the overhead of forming and maintaining the cluster increases quadratically with the number of nodes. The effect is particularly evident if you require frequent inter-node communications, or need to use functions in the global built-in module.
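To put a rough number on it: a fully connected mesh of n nodes has to maintain n(n-1)/2 links, so a 100-node cluster is maintaining 4,950 node-to-node connections, and doubling the cluster size roughly quadruples that number.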

In this particular space, SD (scalable distributed) Erlang is attempting to address this shortcoming by allowing you to create groups of sub-clusters amongst your nodes, so that the size of the mesh network is limited to the size of the sub-clusters.

 

Random Actor Placement

Another interesting choice Orleans has made is that, instead of using consistent hashing for actor placement (a technique that has been used successfully in a number of key-value stores such as Couchbase and Riak), it adds another layer of indirection here: the Orleans directory.

“The Orleans directory is implemented as a one-hop distributed hash table (DHT). Each server in the cluster holds a partition of the directory, and actors are assigned to the partitions using consistent hashing. Each record in the directory maps an actor id to the location(s) of its activations.”

[diagram: the Orleans directory as a one-hop distributed hash table]

The rationale for this decision is that *4 random placement of actors helps avoid the creation of hotspots in your cluster, which might result from poorly chosen IDs, or simply bad luck. But it means that, to retain correctness, every request to an actor now requires an additional hop to the directory partition first. To address this performance concern, each node uses a local cache to store where each actor is.

I think this is a well-meaning attempt to solve a problem, but I’m not convinced that it’s a problem that deserves:

  1. the additional layer of indirection,
  2. the subsequent performance problem,
  3. the subsequent use of a local cache, and
  4. *5 the problem of cache invalidation that comes with it (which, as we know, is one of the two hard problems in CS)

Is it really worth it? Especially when the IDs are GUIDs by default, which hash well. Would it not be better to solve it with a better hashing algorithm?

From my personal experience of working with a number of different key-value stores, actor placement has never been an issue significant enough to deserve the special treatment Orleans has given it. I’d really like to see the results of any empirical study that shows this to be a big enough issue in real-world key-value store usage.

 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*4 As Sergey mentioned in the comments, you can do a few more things, such as using a PreferLocalPlacement strategy to “instruct the runtime to place an activation of a grain of that type local to the first caller (another grain) to it. That is how the app can hint about optimizing placement.” This appears to be the same as when you spawn a new process in Erlang. It would require further clarification from Sergey or Gabriel, but I’d imagine the placement strategy probably applies at the type level for each type of grain.

The additional layer of abstraction does buy you some more flexibility, and not having to move grains around when the topology changes probably simplifies things from the implementation point of view too (I imagine moving the directory information around is cheaper and easier than moving the grains themselves).

In the Erlang space, Riak Core provides a set of tools to help you build distributed systems, and its approach gives you more control over the behaviour of your system. You do, however, have to implement a few more things yourself, such as how to move data around when the cluster topology changes (though the vnode behaviour gives you the basic template for doing this) and how to deal with collisions, etc.

 

*5 Hitting a stale cache is not much of a problem in this case: Orleans would do a new lookup, forward the message to the right destination and update the cache.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 

As an aside, here’s how Riak does consistent hashing and read/write replication:

[diagram: Riak’s consistent hashing ring and N/R/W replication]

 

Message Delivery Guarantees

“Orleans provides at-least-once message delivery, by resending messages that were not acknowledged after a configurable timeout. Exactly-once semantics could be added by persisting the identifiers of delivered messages, but we felt that the cost would be prohibitive and most applications do not need it. This can still be implemented at the application level.”

The decision to use at-least-once message delivery as the default is a contentious one in my view. *6 A slow node will cause messages to be sent twice, and handled twice, which is probably not what you want most of the time.

Whilst the rationale regarding cost is valid, it seems to me that allowing the message delivery to time out and letting the caller handle the timeout is the more sensible choice here. It’d make the handling of timeouts an explicit decision on the application developer’s part, probably on a per-call basis, since some calls are more important than others.
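For example, something along these lines on the calling side (a sketch – the factory and method names carry over from my hypothetical TournamentGrain, and I’m assuming a timed-out call surfaces as a TimeoutException):

    // assumes: using System; using System.Threading.Tasks;
    var tournament = TournamentGrainFactory.GetGrain(tournamentId);   // code-generated factory (name assumed)

    try
    {
        await tournament.RegisterPlayer(playerId);
    }
    catch (TimeoutException)
    {
        // the call didn't complete within the configured timeout; decide here,
        // per call, whether to retry, fall back or surface the error – rather
        // than having the runtime blindly resend and risk a duplicate
    }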

 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*6 The default behaviour is not to resend on timeout, so by default Orleans actually uses at-most-once delivery. But you can configure it to automatically resend upon timeout, i.e. at-least-once.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 

Not letting it crash

Erlang’s mantra has always been to Let It Crash (except when you shouldn’t). When a process crashes, it can be restarted by its supervisor, and using techniques such as event sourcing it’s easy to return to the state it was in just before the crash.

When an exception is thrown inside an Orleans Grain, it does not crash the Grain itself; it is simply reported back to the caller instead. This simplifies things, but runs the risk of leaving behind dirty writes (and hence corrupting the state) in the wake of an exception. For example:

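For example (a contrived sketch of the kind of thing that can go wrong – the method, the Rankings member and the RecalculateRankings helper are all made up):

    public async Task SubmitResult(string playerId, int score)
    {
        // the first part of the update goes through...
        State.Scores[playerId] = score;

        // ...then the second part throws (a hypothetical helper that can fail)
        State.Rankings = await RecalculateRankings();

        // if we never get here, the exception is simply returned to the caller;
        // the grain is not crashed or restarted, so the half-applied update
        // lives on in the grain's in-memory state for the next request to see
        await State.WriteStateAsync();
    }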

The choice of not crashing the grain in the event of an exception offers convenience at the cost of breaking the atomicity of operations, and personally it’s not a choice that I agree with.

 

Reentrant Grains

In a discussion on the Actor model with Erik Meijer and Clemens Szyperski, Carl Hewitt (father of the Actor model) said:

“Conceptually, messages are processed one at a time, but the implementation can allow for concurrent processing of messages.”

In Orleans, grains normally process messages one at a time. To allow concurrent processing of messages, you can mark your grain implementation with the [Reentrant] attribute.
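For example (a sketch, reusing my hypothetical TournamentGrain):

    [Reentrant]
    public class TournamentGrain : GrainBase<ITournamentGrainState>
    {
        // whilst one request is awaiting (e.g. on a storage call or on a call to
        // another grain), the scheduler is free to interleave other requests on
        // this same activation – so the state can change under you across any
        // await point
    }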

Reentrant grains can be used as an optimization technique to remove bottlenecks in your network of grains.

“One actor is no actor, they come in systems, and they have to have addresses so that one actor can send messages to another actor.”

– Carl Hewitt

However, using reentrant grains means you lose the guarantee that the state is accessed sequentially, and you open yourself up to potential race conditions. You should use reentrant grains with great care and consideration.

 

In Erlang, concurrent processing of messages is not allowed. But you don’t have to block your actor whilst it waits for some expensive computation to complete: you can spawn another actor and ask the child actor to carry on with the expensive work whilst the parent actor processes the next message. This is possible because the overhead and cost of spawning a process in Erlang is very low, and the runtime can easily handle tens of thousands of concurrent processes and load balance across the available CPU resources via its schedulers.

If necessary, once the child actor has finished its work it can send a message back to the parent actor, which can then perform any subsequent computation as required.

Using this simple technique, you acquire the same capability that reentrant grains offer. You can also control which messages can be processed concurrently, rather than the all-or-nothing approach that reentrant grains use.

 

Immutable Messages

*7 Messages sent between Grains are usually serialized and then deserialized, which is an expensive overhead when both Grains are running on the same machine. You can optionally mark messages as immutable so that they won’t be serialized when passed between Grains on the same machine, but this immutability promise is not enforced at all – it’s entirely down to you to apply the due diligence of not mutating the messages.
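For example (a sketch – I’m assuming the marker is the [Immutable] attribute, and the message type is made up):

    // assumes: using Orleans.Concurrency; using System.Collections.Generic;
    [Immutable]
    public class TournamentResult
    {
        // the runtime takes our word for it that instances are never mutated
        // after construction – nothing actually stops a grain from doing so
        public string TournamentId { get; set; }
        public Dictionary<string, int> FinalScores { get; set; }
    }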

 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*7 A clarification here: messages are only serialized and deserialized when they are sent across nodes; messages sent between grains on the same node are deep-copied instead, which is cheaper than serialization. Marking the type as immutable skips the deep-copying process too.

But it’s still your responsibility to enforce the immutability guarantee.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 

In Erlang, variables are immutable, so there is no need to do anything explicit.

 

Task Scheduling

“Orleans schedules application turns using cooperative multitasking. That means that once started, an application turn runs to completion, without interruption. The Orleans scheduler uses a small number of compute threads that it controls, usually equal to the number of CPU cores, to execute all application actor code.”

This is another point of concern for me.

Here we see the same vulnerabilities as in an event-loop system (e.g. Node.js, Tornado), where a single poisoned message can cripple your entire system. Even without poisoned messages, you are still left with the problem of not distributing CPU resources evenly across actors, allowing slow or misbehaving actors to badly impact the latency of other pending requests.

Even having multiple cores, with one thread per core (which is a sane choice), is not going to save you here. All you need is one slow-running actor on each processor-affined execution thread to halt the entire system.

 

Erlang’s approach to scheduling makes much more sense – one scheduler per core, and an actor is allowed to execute 2000 reductions (think of one reduction as one function call to do something) before it has to yield the CPU so that another actor can get a slice of the CPU time. The original actor will then wait for its turn to run again.

This CPU-sharing policy is no different to what the OS does with threads, and there’s a good reason for that.

 

Ease of use

I think this is the big winner for Orleans and the focus of its design goals.

I have to admit, I was pleasantly surprised by how easily and quickly I was able to put a service together and have it running locally. Based on what I have seen of the samples and Richard’s Pluralsight course, deploying to the cloud is pretty straightforward too.

 

Cloud Friendliness

Another win for Orleans here, as it’s designed from the start to deal with cluster topologies that change dynamically with ephemeral instances. Distributed Erlang, on the other hand – at least distributed OTP – assumes a fixed topology where nodes have well-defined roles at the start. There are also challenges around getting the built-in distributed database – Mnesia – to work well in a dynamically changing topology.

 

Conclusion

In many ways, I think Orleans is true to its original design goal of optimizing for developer productivity. But by shielding developers from the decisions and considerations that usually come with building distributed systems, it has also deprived them of the opportunity to build systems that need to be resilient to failure and meet stringent uptime requirements.

But not every distributed system is critical, and not every distributed system needs five to nine nines of uptime. As long as you’re informed about the trade-offs that Orleans has made and what they mean for you as a developer, you can at least make an informed choice about if and when to adopt Orleans.

I hope this post will help you make those informed decisions. If I have been misinformed or am incorrect about any aspect of how Orleans works, please do leave a comment below.

 


It’s been a busy month: some top quality conferences – Code Mesh, Build Stuff, FuncBy and NDC London – all crammed into the space of 4 weeks. It has been a blast, with lots of talks and valuable takeaways, and it was great to hang out with old friends and meet new ones. As soon as I find time I’ll put together some posts with my key takeaways from the conferences.

During these conferences, Kevlin Henney’s numerous talks left a lasting impression on me. In his “Seven Ineffective Coding Habits of Many Programmers” talk at Build Stuff, he described the lack of visual honesty in code such as this:

public int howNotToLayoutMethodHeader(int firstArgument,
    string secondArgument)

and on what visual honesty means, he presented a number of quotes from Daniel Higginbotham’s excellent Clean Up Your Mess website:

“To answer the question ‘What is clean design?’ most succinctly: a clean design is one that supports visual thinking so people can meet their information needs with a minimum of conscious effort.”

“You convey information by the way you arrange a design’s elements in relation to each other. This information is understood immediately, if not consciously, by the people viewing your design.”

“This is great if the visual relationships are obvious and accurate, but if they’re not, your audience is going to get confused. They’ll have to examine your work carefully, going back and forth between the different parts to make sure they understand.”

These quotes talk about laying out information so that its visual relationships are obvious and accurate.

So if you lay out your method arguments in such a way that their visual relationships are not accurate, and you do so purposefully, then you are in fact being dishonest.


 

As I sat there, I finally understood why F# pipes are so awesome. I always knew they made for cleaner and more readable code, and that they felt intuitive, but I hadn’t been able to find the words to explain why – the trouble with being able to understand something with minimum conscious effort is that your conscious mind can’t explain how it understood it.

Not anymore – now I finally understand it.

 

When we’re reading a piece of regular English text, we read from left-to-right, then top-to-bottom. This convention controls the flow of information we receive as we read, so when we’re laying out information for people to consume, we lay it out in the same order – left-to-right, then top-to-bottom.

[diagram: text being read left-to-right, then top-to-bottom]

But what about code?

When it comes to writing nested function calls, somehow this flow of information has been reversed!

[diagram: nested function calls being read inside-out, right-to-left]

With F#’s pipes (which have been adopted in both Elm and Elixir, by the way), we have finally managed to restore some sanity and present a sequence of function calls in a way that matches the way we consume any other textual information.

[diagram: piped function calls being read left-to-right]

Visual honesty right before your eyes!

 

Links

Clean Up Your Mess – A guide to Visual Design for everyone

NDC Oslo 2014 – Takeaways from keynote “it’s a write/read web”

NDC Oslo 2014 – Takeaways from “career reboot for the developer mind”

Takeaways from Theo Schlossnagle’s talk on Scalable Internet Architecture

Takeaways from Hewitt, Meijer and Szyperski’s talk on the Actor model

Takeaways from Gael Fraiteur’s multithreading talk


 

Video should be available on NDC’s vimeo channel in the coming weeks.


Here are the slides for my talks at CodeMesh, BuildStuff and Fby this year, enjoy!
