After watch­ing Gael’s recent Skills­Mat­ter talk on mul­ti­thread­ing I’ve put together some notes from a very edu­ca­tional talk:

 

Hard­ware Cache Hierarchy

image

Four lev­els of cache

  • L1 (per core) – typ­i­cally used for instructions
  • L2 (per core)
  • L3 (per die)
  • DRAM (all processors)

Data can be cached in mul­ti­ple caches, and syn­chro­niza­tion hap­pens through an asyn­chro­nous mes­sage bus.

The latency increases as you go down the dif­fer­ent lev­els of cache:

image 

 

Mem­ory Reordering

Cache oper­a­tions are in gen­eral opti­mized for per­for­mance as opposed to log­i­cal behav­iour, hence depend­ing on the archi­tec­ture (x86, AMD, ARM7, etc.) cache loads and store oper­a­tions can be reordered and exe­cuted out-of-order:

image

To add to this mem­ory reorder­ing behav­iour at a hard­ware level, the CLR can also:

  • cache data into register
  • reorder
  • coa­lesce writes

The volatile key­word stops the com­piler opti­miza­tions, that’s all, it does not stop the hard­ware level optimizations.

This is where mem­ory bar­rier comes in, to ensure ser­ial access to mem­ory and to force data to be flushed and syn­chro­nized across all the local cache, this is done via the Thread.MemoryBarrier method in .Net.

 

Atom­ic­ity

Oper­a­tions on longs can­not be per­formed in an atomic way on a 32-bit archi­tec­ture, it’s pos­si­ble to get par­tially mod­i­fied value.

 

Inter­locked

Inter­locks pro­vides the only lock­ing mech­a­nism at hard­ware level, the .Net frame­work pro­vides access to these instruc­tions via the Inter­locked class.

On the Intel archi­tec­ture, inter­locks are typ­i­cally imple­mented on the L3 cache, a fact that’s reflected by the latency asso­ci­ated with using Inter­locked incre­ments com­pared with non-interlocked:

image

Com­pa­re­Ex­change is the most impor­tant tool when it comes to imple­mented lock-free algo­rithms, but since it’s imple­mented on the L3 cache, in a multi-processor envi­ron­ment it would require one of the proces­sor to take out a global lock, hence why the con­tented case above takes much longer.

You can analyse the per­for­mance of your appli­ca­tion at a CPU level using Intel’s vTune Ampli­fier XE tool.

 

Mul­ti­task­ing

Threads do not exist at a hard­ware level, CPU only under­stands tasks and it has no con­cept of ‘wait’. Syn­chro­niza­tion con­structs such as sem­a­phores and mutex are built on top of inter­locked operations.

One core can never do more than 1 ‘thing’ at the same time, unless it’s hyper-threaded in which case the core can do some­thing else whilst wait­ing on some resource to con­tinue exe­cut­ing the orig­i­nal task.

A task runs until inter­rupted by hard­ware (I/O inter­rupt) or OS.

 

Win­dows Kernel

A process has:

  • pri­vate vir­tual address space
  • resources
  • at least 1 thread

A thread is:

  • a pro­gram (sequence of instructions)
  • CPU state
  • wait depen­den­cies

Threads can wait for dis­patcher objects (Wait­Handle) – Mutex, Sem­a­phore, Event, Timer or another thread, when they’re not wait­ing for any­thing they’re placed in the wait­ing queue by the thread sched­uler until it is their turn to be exe­cuted on the CPU.

After a thread has been exe­cuted for some time, it is then moved back to the wait­ing queue (via a ker­nel inter­rupt) to give some other thread a slice of the avail­able CPU time. Alter­na­tively, if the thread needs to wait for a dis­patcher object then it goes back to the wait­ing state.

image

Dis­patcher objects reside in the ker­nel and can be shared among dif­fer­ent processes, they’re very expen­sive!

image

Which is why you don’t want to use ker­nel objects for waits that are typ­i­cally very short, instead they’re best used when wait­ing for some­thing that takes longer to return, e.g. I/O.

Com­pared to other wait meth­ods (e.g. Thread.Sleep, Thread.Yield, WaitHandle.Wait, etc.) Thread.SpinWait is an odd ball because it’s not a ker­nel method, it resem­bles a con­tin­u­ous loop (it keeps ‘spin­ning’) but it tells a hyper-threaded CPU that it’s ok to do some­thing else. It’s gen­er­ally use­ful when you know the inter­rupt will hap­pen very quickly and hence sav­ing you from an unnec­es­sary con­text switch. If the inter­rupt does not hap­pen quickly as expected, the Spin­Wait will be trans­formed into a nor­mal thread wait (Thread.Sleep) to avoid wast­ing CPU cycles.

 

.Net Frame­work Thread Synchronization

image

 

The lock Keyword

  1. start with inter­locked oper­a­tions (no contention)
  2. con­tinue with ‘spin wait’
  3. cre­ate ker­nel event and wait

Good per­for­mance if low contention.

 

Design Pat­terns

  • Thread unsafe
  • Actor
  • Reader-Writer Syn­chro­nized

This is where the Post­Sharp mul­ti­thread­ing toolkit comes to the res­cue! It can help you imple­ment each of these pat­terns auto­mat­i­cally, Gael has talked more about the toolkit in this blog post.

Share

I have heard a few peo­ple argue that when it comes to per­for­mance crit­i­cal code you should pre­fer arrays over other col­lec­tions (such as F#’s lists) as it ben­e­fits from sequen­tial reads (which is faster than seeks) and offers bet­ter mem­ory locality.

To test that the­ory some­what, I wanted to see if there is any dif­fer­ence in how fast you can iter­ate through an array ver­sus a list in F#, and how much faster you can map over an array com­pared to a list:

The result is a lit­tle sur­pris­ing, whilst I wasn’t expect­ing there to be a mas­sive dif­fer­ence in the iter­at­ing through the two types of col­lec­tions, I didn’t think map­ping over a list would be quite as slow in com­par­i­son. I knew that con­struct­ing a list is much heav­ier than con­struct­ing an array, but I didn’t think it’d take 22x as long in this case.

What was even more sur­pris­ing was how much slower the Seq.iter and Seq.map func­tions are com­pared to the Array and List mod­ule equiv­a­lents! This is, accord­ing to John Palmer:

Once you call in to Seq you lose the type infor­ma­tion — mov­ing to the next ele­ment in the list requires a call to IEnumerator.MoveNext. Com­pare to for Array you just incre­ment an index and for List you can just deref­er­ence a pointer. Essen­tially, you are get­ting an extra func­tion call for each ele­ment in the list.

The con­ver­sions back to List and Array also slow the code down for sim­i­lar reasons

Update 2012/06/04:

As a work around, you COULD shadow the Seq mod­ule with iter and map func­tions that adds sim­ple type check­ing and in the case of an array or list sim­ply call the cor­re­spond­ing func­tion in the Array or List mod­ule instead:

Whilst this approach will work to a cer­tain extend, you should be care­ful with which func­tions you shadow. For instance, it’s not safe to shadow Seq.map because it can be used in con­junc­tion with other func­tions such as Seq.takeWhile or Seq.take. In the base imple­men­ta­tion, a line of code such as:

arr |> Seq.map incr |> Seq.take 3

will not map over every ele­ment in the source array.

With the shad­owed ver­sion (see above) of Seq.map, how­ever, this would first cre­ate a new array by apply­ing the map­per func­tion against every ele­ment in the source array before dis­card­ing all but the first three ele­ments in the new array. This, as you can imag­ine, is far less effi­cient and requires much more mem­ory space (for the new array) and defeats the pur­pose of using Seq mod­ule func­tions in most cases.

Share

I’m proud to announce that we have released another new slot – Wealthy Whale – to Jack­potJoy Slots, it is also the first All Pays slots we have imple­mented on our F# slots engine!

image

In an All Pays slot, there are no ‘pay lines’, instead, any match­ing sym­bol (or Wild sym­bol) that appears on adja­cent reels will be matched, which means there’s an aston­ish­ing 243 dif­fer­ent ways you can win in the slot!

image

You can also mul­ti­ply your win­nings by hav­ing more than one match­ing sym­bol on a reel, for exam­ple, the fol­low­ing screen­shot rep­re­sents four sep­a­rate instances of a 5-symbol win!

image

And finally, no slot is with­out the all impor­tant bonus game, and Wealthy Whale comes with a 5-stage pick bonus:

image image

image image

image image

Hope you enjoy our new slot!

Share