After watching Gael’s recent SkillsMatter talk on multithreading, I’ve put together some notes from what was a very educational session:

 

Hard­ware Cache Hierarchy

image

Four levels in the memory hierarchy

  • L1 (per core) – typically split into separate instruction and data caches
  • L2 (per core)
  • L3 (per die)
  • DRAM (all processors)

Data can be cached in mul­ti­ple caches, and syn­chro­niza­tion hap­pens through an asyn­chro­nous mes­sage bus.

The latency increases as you go down the dif­fer­ent lev­els of cache:

image 

 

Mem­ory Reordering

Cache operations are generally optimized for performance rather than for logical behaviour. Depending on the architecture (x86, x64, ARM, etc.), load and store operations can therefore be reordered and executed out of order:

image

On top of this memory reordering at the hardware level, the CLR (through its JIT compiler) can also:

  • cache data in registers
  • reorder instructions
  • coalesce writes

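The volatile keyword mainly restricts these compiler optimizations: a volatile field is re-read from memory on every access, and its reads and writes get acquire and release semantics. It is not a full fence, so a write followed by a read of a different location can still be reordered by the hardware.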

This is where memory barriers come in. A full barrier stops loads and stores from being reordered across it, which forces pending writes to become visible to the other cores’ caches before execution continues. In .NET this is done via the Thread.MemoryBarrier method.
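To make the barrier idea concrete, here is a minimal sketch of a writer publishing a value to a reader through a flag. The class and member names (BarrierSketch, Publish, TryConsume) are made up for illustration and are not from the talk; the only real API used is Thread.MemoryBarrier.

```csharp
using System.Threading;

public static class BarrierSketch
{
    private static int _payload;
    private static bool _ready;

    // Writer: store the data first, then raise the flag.
    public static void Publish(int value)
    {
        _payload = value;
        Thread.MemoryBarrier(); // the payload store cannot move below this line
        _ready = true;
    }

    // Reader: check the flag first, then load the data.
    public static bool TryConsume(out int value)
    {
        bool ready = _ready;
        Thread.MemoryBarrier(); // the payload load cannot move above this line
        value = ready ? _payload : 0;
        return ready;
    }
}
```

Without the two barriers, the flag could in principle become visible before the payload, and a reader could see the flag set but still read a stale payload.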

 

Atom­ic­ity

On a 32-bit architecture, reads and writes of a long (64 bits) are not atomic: each takes two 32-bit operations, so another thread can observe a partially modified (‘torn’) value.
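A small sketch of how you might guard against torn values, assuming you route every access through Interlocked.Exchange and Interlocked.Read. The class AtomicLongSketch and its method are hypothetical names of my own; the writer flips between 0 and -1 (all bits clear vs. all bits set), so any other value seen by the reader would be a torn read.

```csharp
using System.Threading;

public static class AtomicLongSketch
{
    private static long _shared;

    // Returns how many torn values the reader observed (expected to be none,
    // because both sides use Interlocked for the 64-bit access).
    public static int CountTornReads(int iterations)
    {
        var writer = new Thread(() =>
        {
            for (int i = 0; i < iterations; i++)
            {
                // Atomic 64-bit write, even in a 32-bit process.
                Interlocked.Exchange(ref _shared, (i & 1) == 0 ? 0L : -1L);
            }
        });
        writer.Start();

        int torn = 0;
        while (writer.IsAlive)
        {
            // Atomic 64-bit read, even in a 32-bit process.
            long seen = Interlocked.Read(ref _shared);
            if (seen != 0L && seen != -1L) torn++;
        }
        writer.Join();
        return torn;
    }
}
```

Replacing the two Interlocked calls with plain assignments and reads is what would make torn values possible in a 32-bit process.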

 

Inter­locked

Interlocked operations provide the only locking mechanism at the hardware level. The .NET Framework exposes these instructions through the Interlocked class.
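As a quick illustration, here is a sketch of several threads bumping a shared counter with Interlocked.Increment. CounterSketch and its parameters are names I made up for the example; with a plain counter++ in place of the interlocked call, increments could be lost.

```csharp
using System.Threading;

public static class CounterSketch
{
    public static int CountWithInterlocked(int threads, int incrementsPerThread)
    {
        int counter = 0;
        var workers = new Thread[threads];

        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < incrementsPerThread; i++)
                {
                    // Atomic read-modify-write; no update can be lost.
                    Interlocked.Increment(ref counter);
                }
            });
            workers[t].Start();
        }

        foreach (var worker in workers) worker.Join();
        return counter;
    }
}
```

Calling CounterSketch.CountWithInterlocked(4, 100000) returns 400000 every time, since each increment is applied atomically.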

On the Intel architecture, interlocked operations are typically implemented on the L3 cache, a fact that’s reflected in the latency of an Interlocked increment compared with a non-interlocked one:

image

CompareExchange is the most important tool when it comes to implementing lock-free algorithms. But since it’s implemented on the L3 cache, in a multi-processor environment it requires one of the processors to take out a global lock, which is why the contended case above takes much longer.
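The usual shape of a CompareExchange-based algorithm is a retry loop: read the current value, compute the new one, and only write it if nobody else changed the value in the meantime. Here is a minimal sketch that keeps track of a running maximum without a lock; CasSketch and UpdateMax are hypothetical names, not something shown in the talk.

```csharp
using System.Threading;

public static class CasSketch
{
    // Raises 'target' to 'candidate' if candidate is larger, without a lock.
    public static void UpdateMax(ref int target, int candidate)
    {
        while (true)
        {
            int current = Volatile.Read(ref target);
            if (candidate <= current)
            {
                return; // nothing to do, the stored value is already at least as big
            }

            // Write 'candidate' only if 'target' still holds 'current'.
            // The call returns the value it actually found in 'target'.
            int found = Interlocked.CompareExchange(ref target, candidate, current);
            if (found == current)
            {
                return; // our write won
            }
            // Another thread changed 'target' first: loop and try again.
        }
    }
}
```

Under heavy contention the loop retries more often, which is one way to picture why the contended case costs more.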

You can analyse the performance of your application at the CPU level using Intel’s VTune Amplifier XE tool.

 

Mul­ti­task­ing

Threads do not exist at the hardware level. The CPU only understands tasks, and it has no concept of ‘wait’. Synchronization constructs such as semaphores and mutexes are built on top of interlocked operations.

One core can never do more than one ‘thing’ at a time, unless it’s hyper-threaded, in which case the core can do something else whilst the original task is waiting on some resource.

A task runs until it is interrupted by the hardware (I/O interrupt) or by the OS.

 

Win­dows Kernel

A process has:

  • pri­vate vir­tual address space
  • resources
  • at least 1 thread

A thread is:

  • a pro­gram (sequence of instructions)
  • CPU state
  • wait depen­den­cies

Threads can wait for dispatcher objects (WaitHandle) – a Mutex, Semaphore, Event, Timer or another thread. When they’re not waiting for anything, the thread scheduler places them in the ready queue until it is their turn to be executed on the CPU.

After a thread has executed for some time, a kernel interrupt moves it back to the ready queue to give some other thread a slice of the available CPU time. Alternatively, if the thread needs to wait for a dispatcher object, it goes into the waiting state.

image

Dispatcher objects reside in the kernel and can be shared among different processes. They’re very expensive!

image

This is why you don’t want to use kernel objects for waits that are typically very short. They’re best used when waiting for something that takes longer to return, e.g. I/O.
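For reference, this is roughly what waiting on a dispatcher object looks like from .NET: a sketch using two AutoResetEvent instances, where the worker blocks in the kernel until it is signalled. WaitHandleSketch and RunOnce are illustrative names of mine, and the value 42 is just a stand-in for the result of some slow operation.

```csharp
using System.Threading;

public static class WaitHandleSketch
{
    public static int RunOnce()
    {
        int result = 0;

        using (var workReady = new AutoResetEvent(false))
        using (var workDone = new AutoResetEvent(false))
        {
            var worker = new Thread(() =>
            {
                workReady.WaitOne(); // blocks on a kernel event; no CPU is used while waiting
                result = 42;         // stand-in for the real (slow) work
                workDone.Set();      // wake up whoever waits for the result
            });
            worker.Start();

            workReady.Set();    // signal the worker to begin
            workDone.WaitOne(); // block until the worker has finished
            worker.Join();
        }

        return result;
    }
}
```

Each WaitOne and Set here goes through a kernel object, which is acceptable for slow operations but costly for waits that last only a few cycles.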

Compared to other wait methods (e.g. Thread.Sleep, Thread.Yield, WaitHandle.WaitOne, etc.), Thread.SpinWait is an oddball because it’s not a kernel method. It resembles a continuous loop (it keeps ‘spinning’), but it tells a hyper-threaded CPU that it’s OK to do something else. It’s generally useful when you know the thing you’re waiting for will happen very quickly, which saves you an unnecessary context switch. If it does not happen as quickly as expected, the SpinWait struct (as opposed to the Thread.SpinWait method, which only spins) falls back to yielding and sleeping (Thread.Yield, Thread.Sleep) to avoid wasting CPU cycles.
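The spin-then-back-off behaviour is what the System.Threading.SpinWait struct gives you through SpinOnce. Below is a minimal sketch that spins until another thread sets a flag; SpinSketch and WaitForFlag are hypothetical names used only for this example.

```csharp
using System.Threading;

public static class SpinSketch
{
    private static int _flag;

    // Spins until another thread sets the flag; returns how many times SpinOnce ran.
    public static int WaitForFlag()
    {
        Volatile.Write(ref _flag, 0);

        var setter = new Thread(() => Volatile.Write(ref _flag, 1));
        setter.Start();

        var spinner = new SpinWait();
        while (Volatile.Read(ref _flag) == 0)
        {
            // Busy-spins for the first iterations, then starts yielding
            // and sleeping so a long wait does not burn a whole core.
            spinner.SpinOnce();
        }

        setter.Join();
        return spinner.Count;
    }
}
```

The spinner.NextSpinWillYield property tells you whether the next SpinOnce call will give up the time slice instead of spinning.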

 

.NET Framework Thread Synchronization

image

 

The lock Keyword

  1. start with inter­locked oper­a­tions (no contention)
  2. con­tinue with ‘spin wait’
  3. cre­ate ker­nel event and wait

Performs well when contention is low.
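For completeness, here is what using lock looks like in code: a small sketch of a counter protected by a private lock object. LockedCounter and its Run helper are names I made up; the three escalation steps listed above happen inside the runtime, so none of them show up in your own code.

```csharp
using System.Threading;

public sealed class LockedCounter
{
    private readonly object _gate = new object();
    private int _count;

    public void Increment()
    {
        // The compiler expands lock into Monitor.Enter / Monitor.Exit in a try/finally.
        lock (_gate)
        {
            _count++;
        }
    }

    public int Count
    {
        get { lock (_gate) { return _count; } }
    }

    // Hypothetical driver: several threads incrementing the same counter.
    public static int Run(int threads, int incrementsPerThread)
    {
        var counter = new LockedCounter();
        var workers = new Thread[threads];

        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < incrementsPerThread; i++) counter.Increment();
            });
            workers[t].Start();
        }

        foreach (var worker in workers) worker.Join();
        return counter.Count;
    }
}
```

Locking on a private object rather than on this or on a type keeps outside code from taking the same lock by accident.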

 

Design Pat­terns

  • Thread unsafe
  • Actor
  • Reader-Writer Syn­chro­nized

This is where the PostSharp multithreading toolkit comes to the rescue! It can help you implement each of these patterns automatically. Gael has talked more about the toolkit in this blog post.
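To give a feel for what the reader-writer synchronized pattern involves when written by hand, here is a sketch using ReaderWriterLockSlim from the base class library. This is not the PostSharp API, and SynchronizedCache is a made-up class; it only shows the kind of boilerplate that a toolkit could generate for you.

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class SynchronizedCache
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly Dictionary<string, int> _items = new Dictionary<string, int>();

    // Many readers may hold the read lock at the same time.
    public bool TryGet(string key, out int value)
    {
        _lock.EnterReadLock();
        try
        {
            return _items.TryGetValue(key, out value);
        }
        finally
        {
            _lock.ExitReadLock();
        }
    }

    // A writer gets exclusive access; readers and other writers wait.
    public void Set(string key, int value)
    {
        _lock.EnterWriteLock();
        try
        {
            _items[key] = value;
        }
        finally
        {
            _lock.ExitWriteLock();
        }
    }
}
```

Every public member has to repeat the same enter/try/finally/exit dance, which is easy to get wrong once a class grows.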
