Takeaways from Gael Fraiteur’s multithreading talk

After watching Gael’s recent SkillsMatter talk on multithreading, I’ve put together some notes from a very educational talk:


Hardware Cache Hierarchy

[Image: hardware cache hierarchy]

Four levels in the hierarchy:

  • L1 (per core) – typically split into separate instruction and data caches
  • L2 (per core)
  • L3 (per die)
  • DRAM (shared by all processors)

Data can be cached in multiple caches, and synchronization happens through an asynchronous message bus.

The latency increases as you go down the different levels of cache:

[Image: latency at each cache level]


Memory Reordering

Cache operations are generally optimized for performance rather than for logical behaviour, so depending on the architecture (x86, AMD, ARM7, etc.) cache load and store operations can be reordered and executed out of order:

[Image: examples of reordered load/store operations]

To add to this memory reordering behaviour at the hardware level, the CLR can also:

  • cache data in registers
  • reorder reads and writes
  • coalesce writes

The volatile keyword only stops these compiler optimizations; it does not stop the hardware-level optimizations.

This is where memory barriers come in: they ensure serialized access to memory and force data to be flushed and synchronized across the local caches. In .NET this is done via the Thread.MemoryBarrier method.
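
As a minimal sketch of the problem (my own example, not from the talk): without the barriers (or volatile), the JIT is free to cache _stop in a register and the worker may spin forever.

```csharp
using System;
using System.Threading;

class MemoryBarrierDemo
{
    static bool _stop;   // deliberately not volatile

    static void Main()
    {
        var worker = new Thread(() =>
        {
            while (!_stop)
            {
                // forces a fresh read of _stop on each iteration,
                // instead of a value cached in a register
                Thread.MemoryBarrier();
            }
            Console.WriteLine("Worker observed the stop flag.");
        });

        worker.Start();
        Thread.Sleep(100);

        _stop = true;
        Thread.MemoryBarrier(); // publish the write to other threads
        worker.Join();
    }
}
```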


Atomicity

Operations on longs cannot be performed atomically on a 32-bit architecture, so it’s possible to read a partially modified value.
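
On a 32-bit runtime, reading or writing a 64-bit field takes two separate 32-bit operations; Interlocked.Read gives you an atomic read. A minimal sketch (my own, with hypothetical names):

```csharp
using System.Threading;

class TornReadDemo
{
    static long _value;

    // this assignment is two 32-bit writes on a 32-bit runtime, so a
    // concurrent reader can see half of the old value and half of the
    // new one – a 'torn' read
    static void Writer() => _value = unchecked((long)0xFFFFFFFF00000000);

    // Interlocked.Read performs the 64-bit read as a single atomic
    // operation, even on 32-bit architectures
    static long SafeRead() => Interlocked.Read(ref _value);
}
```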


Interlocked

Interlocked operations provide the only locking mechanism at the hardware level; the .NET framework provides access to these instructions via the Interlocked class.

On the Intel architecture, interlocked operations are typically implemented on the L3 cache, a fact that’s reflected in the latency of interlocked increments compared with non-interlocked ones:

[Image: latency of interlocked vs. non-interlocked increments]

CompareExchange is the most important tool when it comes to implementing lock-free algorithms, but since it’s implemented on the L3 cache, in a multi-processor environment it requires one of the processors to take out a global lock, which is why the contended case above takes much longer.
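
The compare-and-swap retry loop at the heart of most lock-free code looks roughly like this (a sketch with hypothetical names, not code from the talk):

```csharp
using System.Threading;

class LockFreeCounter
{
    private int _count;

    public int Increment()
    {
        int current, next;
        do
        {
            current = _count;   // read the current value
            next = current + 1; // compute the desired value
        }
        // commit only if no other thread changed _count in between;
        // otherwise CompareExchange returns the newer value and we retry
        while (Interlocked.CompareExchange(ref _count, next, current) != current);
        return next;
    }
}
```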

You can analyse the performance of your application at the CPU level using Intel’s VTune Amplifier XE tool.


Multitasking

Threads do not exist at the hardware level; the CPU only understands tasks and has no concept of ‘wait’. Synchronization constructs such as semaphores and mutexes are built on top of interlocked operations.

One core can never do more than one ‘thing’ at a time, unless it’s hyper-threaded, in which case the core can do something else whilst waiting on some resource before continuing to execute the original task.

A task runs until it is interrupted by the hardware (an I/O interrupt) or by the OS.


Windows Kernel

A process has:

  • private virtual address space
  • resources
  • at least 1 thread

A thread is:

  • a program (sequence of instructions)
  • CPU state
  • wait dependencies

Threads can wait for dispatcher objects (WaitHandle) – Mutex, Semaphore, Event, Timer or another thread. When they’re not waiting for anything, they’re placed in the ready queue by the thread scheduler until it is their turn to be executed on the CPU.

After a thread has executed for some time, it is moved back to the ready queue (via a kernel interrupt) to give other threads a slice of the available CPU time. Alternatively, if the thread needs to wait for a dispatcher object, it goes back into the waiting state.

[Image: thread state transitions (running, ready, waiting)]

Dispatcher objects reside in the kernel and can be shared among different processes; they’re very expensive!

[Image: cost of kernel dispatcher objects]

This is why you don’t want to use kernel objects for waits that are typically very short; instead, they’re best used when waiting for something that takes longer to return, e.g. I/O.
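
For example, blocking on an event – one of the kernel dispatcher objects – looks like this minimal sketch (my example, not from the talk):

```csharp
using System;
using System.Threading;

class KernelWaitDemo
{
    // ManualResetEvent wraps a kernel event object, so WaitOne below
    // parks the thread in the waiting state until the event is signalled
    static readonly ManualResetEvent Signal = new ManualResetEvent(false);

    static void Main()
    {
        var worker = new Thread(() =>
        {
            Console.WriteLine("Waiting on the kernel event...");
            Signal.WaitOne(); // thread enters the waiting state
            Console.WriteLine("Released.");
        });

        worker.Start();
        Thread.Sleep(100);    // simulate a long-running operation
        Signal.Set();         // waiter moves back to the ready queue
        worker.Join();
    }
}
```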

Compared to other wait methods (e.g. Thread.Sleep, Thread.Yield, WaitHandle.Wait, etc.), Thread.SpinWait is an oddball because it’s not a kernel method. It resembles a continuous loop (it keeps ‘spinning’), but it tells a hyper-threaded CPU that it’s OK to do something else. It’s generally useful when you know the wait will be over very quickly, saving you an unnecessary context switch. If the wait does not end as quickly as expected, the spin wait is transformed into a normal thread wait (Thread.Sleep) to avoid wasting CPU cycles.
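
The same spin-first, back-off-later idea is packaged in the SpinWait struct; a minimal sketch (my own, assuming _ready is set by another thread):

```csharp
using System.Threading;

class SpinThenYield
{
    private volatile bool _ready;

    public void WaitUntilReady()
    {
        var spinner = new SpinWait();
        while (!_ready)
        {
            // busy-spins at first (cheap for very short waits), then
            // starts yielding the thread to avoid burning CPU cycles
            spinner.SpinOnce();
        }
    }

    public void SignalReady() => _ready = true;
}
```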


.NET Framework Thread Synchronization

[Image: overview of .NET Framework thread synchronization constructs]


The lock Keyword

  1. start with interlocked operations (no contention)
  2. continue with ‘spin wait’
  3. create kernel event and wait

Good performance when contention is low.
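
To make the mechanics concrete, the lock keyword expands to roughly this Monitor pattern (a sketch with hypothetical names; Monitor is what escalates through the three stages above):

```csharp
using System.Threading;

class Account
{
    private readonly object _sync = new object();
    private decimal _balance;

    // 'lock (_sync) { _balance += amount; }' compiles to roughly this
    public void Deposit(decimal amount)
    {
        bool lockTaken = false;
        try
        {
            Monitor.Enter(_sync, ref lockTaken);
            _balance += amount;
        }
        finally
        {
            if (lockTaken) Monitor.Exit(_sync);
        }
    }
}
```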


Design Patterns

  • Thread unsafe
  • Actor
  • Reader-Writer Synchronized

This is where the PostSharp multithreading toolkit comes to the rescue! It can help you implement each of these patterns automatically; Gael has talked more about the toolkit in this blog post.
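
As a rough illustration of the Reader-Writer Synchronized pattern, here is what it looks like hand-rolled with .NET’s ReaderWriterLockSlim (my sketch; the toolkit automates this kind of plumbing):

```csharp
using System.Collections.Generic;
using System.Threading;

class ReadMostlyCache
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly Dictionary<string, string> _data = new Dictionary<string, string>();

    // any number of readers can hold the read lock concurrently
    public string Get(string key)
    {
        _lock.EnterReadLock();
        try { return _data.TryGetValue(key, out var value) ? value : null; }
        finally { _lock.ExitReadLock(); }
    }

    // writers get exclusive access
    public void Set(string key, string value)
    {
        _lock.EnterWriteLock();
        try { _data[key] = value; }
        finally { _lock.ExitWriteLock(); }
    }
}
```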