.NET 4 introduced the Lazy<T> type, which lets you create an object that is initialized lazily so that you can, for instance, delay the creation of large objects until they are actually needed.

However, if your initialization logic can throw at runtime (e.g. timeout exceptions when reading from some external data source), you should pay close attention to which constructor you use to create a new instance of the Lazy<T> type. Depending on the selected LazyThreadSafetyMode, exceptions thrown by the initialization code might be cached and rethrown on all subsequent attempts to fetch the lazily initialized value. Whilst this ensures that threads will always get the same result, hence removing ambiguity, it does mean that you’ve got only one shot at initializing that value…

 

LazyThreadSafetyMode

In cases where you need to tolerate occasional initialization errors (e.g. reading a large object from S3 can fail from time to time for a number of reasons) and be able to try again on a later access, the rule of thumb is to instantiate the Lazy<T> type with LazyThreadSafetyMode set to PublicationOnly. In PublicationOnly mode, multiple threads can invoke the initialization logic concurrently, but the first thread to complete the initialization successfully sets the value of the Lazy<T> instance, and exceptions are not cached.

For example, the following only works under the PublicationOnly mode:
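A minimal sketch of such a scenario, under stated assumptions: `FlakyLoad` and `LoadWithRetry` are hypothetical names introduced for illustration, and the initializer is rigged to fail on its first call only, standing in for a transient error.

```csharp
using System;
using System.Threading;

public static class LazyRetryDemo
{
    private static int _attempts;

    // Hypothetical initializer: throws on the first call, succeeds afterwards.
    private static int FlakyLoad()
    {
        if (Interlocked.Increment(ref _attempts) == 1)
            throw new TimeoutException("simulated transient failure");
        return 42;
    }

    // Accesses Value twice. The second access succeeds only if the chosen
    // mode does not cache the exception thrown by the first one.
    public static int LoadWithRetry(LazyThreadSafetyMode mode)
    {
        _attempts = 0;
        var lazy = new Lazy<int>(FlakyLoad, mode);
        try
        {
            lazy.Value.ToString();   // first attempt: the initializer throws
        }
        catch (TimeoutException)
        {
            // swallow the simulated failure and try again below
        }
        return lazy.Value;           // second attempt
    }
}
```

With LazyThreadSafetyMode.PublicationOnly the second access re-runs the initializer and returns the value; with ExecutionAndPublication the cached TimeoutException is rethrown instead.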

 

F#

F# provides a slightly nicer syntax for defining a lazy computation:

[image: F# lazy expression]

The Control.Lazy<T> type is an abbreviation of the BCL Lazy<T> type, with a Force extension method which, under the hood, just calls Lazy<T>.Value.

Presumably the above translates roughly to the following C# code:

var x = 10;

var result = new Lazy<int>(() => x + 10);

The thread safety mode used by the Lazy(Func<T>) constructor is LazyThreadSafetyMode.ExecutionAndPublication, which caches and rethrows any exception thrown during initialization. E.g.:

[image: cached exception rethrown on subsequent calls]


After watching Gael’s recent SkillsMatter talk on multithreading, I’ve put together some notes from what was a very educational talk:

 

Hardware Cache Hierarchy

[image: hardware cache hierarchy]

Four levels of memory

  • L1 cache (per core) – typically split into instruction and data caches
  • L2 cache (per core)
  • L3 cache (per die)
  • DRAM (all processors)

Data can be cached in multiple caches, and synchronization happens through an asynchronous message bus.

The latency increases as you go down the different levels of cache:

[image: latency at each level of cache]

 

Memory Reordering

Cache operations are in general optimized for performance as opposed to logical behaviour; hence, depending on the architecture (x86, AMD64, ARM7, etc.), cache load and store operations can be reordered and executed out-of-order:

[image: memory reordering by architecture]

To add to this memory reordering behaviour at a hardware level, the CLR can also:

  • cache data in registers
  • reorder reads and writes
  • coalesce writes

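The volatile keyword mainly constrains compiler and JIT optimizations; it is not a full fence and does not rule out every hardware-level reordering (a volatile write followed by a volatile read of a different location can still be reordered).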

This is where memory barriers come in: they ensure serial access to memory and force data to be flushed and synchronized across all the local caches. In .NET this is done via the Thread.MemoryBarrier method.
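As an illustration, here is a minimal sketch of the classic "publish a value, then raise a flag" pattern using full fences. The `OneShotPublisher` type is a hypothetical name made up for this example, and it assumes a single writer that publishes exactly once.

```csharp
using System.Threading;

public sealed class OneShotPublisher
{
    private int _value;
    private bool _ready;

    // Writer side: store the value first, then raise the flag.
    public void Publish(int value)
    {
        _value = value;
        Thread.MemoryBarrier();   // the write to _value cannot move below this fence
        _ready = true;
    }

    // Reader side: check the flag first, then read the value.
    public bool TryRead(out int value)
    {
        if (_ready)
        {
            Thread.MemoryBarrier();   // the read of _value cannot move above this fence
            value = _value;
            return true;
        }

        value = 0;
        return false;
    }
}
```

Without the two fences, a reader could in principle observe `_ready == true` and still read a stale `_value`.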

 

Atomicity

Operations on longs cannot be performed atomically on a 32-bit architecture, so it’s possible to observe a partially modified value.
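One way to avoid such torn reads is to route every access to the 64-bit field through the Interlocked class. The sketch below assumes a hypothetical `AtomicInt64` wrapper; only `Interlocked.Read` and `Interlocked.Exchange` are framework calls.

```csharp
using System.Threading;

public sealed class AtomicInt64
{
    private long _value;

    // Interlocked.Read returns all 64 bits as a single atomic operation,
    // even in a 32-bit process.
    public long Read()
    {
        return Interlocked.Read(ref _value);
    }

    // Interlocked.Exchange replaces all 64 bits atomically.
    public void Write(long value)
    {
        Interlocked.Exchange(ref _value, value);
    }
}
```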

 

Interlocked

Interlocked instructions provide the only locking mechanism at the hardware level; the .NET framework provides access to these instructions via the Interlocked class.

On the Intel architecture, interlocks are typically implemented on the L3 cache, a fact that’s reflected by the latency associated with using Interlocked increments compared with non-interlocked ones:

[image: latency of interlocked vs non-interlocked increments]

CompareExchange is the most important tool when it comes to implementing lock-free algorithms, but since it’s implemented on the L3 cache, in a multi-processor environment it requires one of the processors to take out a global lock, which is why the contended case above takes much longer.
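The usual shape of a lock-free update built on CompareExchange is a retry loop: read the current value, compute the new one, and publish it only if nobody else changed the value in the meantime. A minimal sketch, where `LockFree.UpdateMax` is a hypothetical helper written for this example:

```csharp
using System.Threading;

public static class LockFree
{
    // Atomically raises 'target' to 'candidate' if candidate is larger.
    // Returns the value stored in 'target' once the call completes.
    public static int UpdateMax(ref int target, int candidate)
    {
        int current;
        do
        {
            current = Volatile.Read(ref target);
            if (candidate <= current)
            {
                return current;   // nothing to do, the existing value wins
            }
            // Swap only if 'target' still holds 'current'; otherwise another
            // thread got in first and we loop to re-read and retry.
        }
        while (Interlocked.CompareExchange(ref target, candidate, current) != current);

        return candidate;
    }
}
```

Under contention the loop simply retries, which is where the extra latency of the contended case shows up.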

You can analyse the performance of your application at a CPU level using Intel’s VTune Amplifier XE tool.

 

Multitasking

Threads do not exist at a hardware level; the CPU only understands tasks and has no concept of ‘wait’. Synchronization constructs such as semaphores and mutexes are built on top of interlocked operations.

One core can never do more than one ‘thing’ at the same time, unless it’s hyper-threaded, in which case the core can do something else whilst waiting on some resource to continue executing the original task.

A task runs until interrupted by hardware (I/O interrupt) or the OS.

 

Windows Kernel

A process has:

  • private virtual address space
  • resources
  • at least 1 thread

A thread is:

  • a program (sequence of instructions)
  • CPU state
  • wait dependencies

Threads can wait for dispatcher objects (WaitHandle) – Mutex, Semaphore, Event, Timer or another thread. When they’re not waiting for anything, they’re placed in the ready queue by the thread scheduler until it is their turn to be executed on the CPU.

After a thread has executed for some time, it is moved back to the ready queue (via a kernel interrupt) to give some other thread a slice of the available CPU time. Alternatively, if the thread needs to wait for a dispatcher object, it goes into the waiting state.

[image: thread states and scheduling]

Dispatcher objects reside in the kernel and can be shared among different processes, which makes them very expensive!

[image: cost of dispatcher objects]

This is why you don’t want to use kernel objects for waits that are typically very short; they’re best used when waiting for something that takes longer to return, e.g. I/O.

Compared to other wait methods (e.g. Thread.Sleep, Thread.Yield, WaitHandle.WaitOne, etc.), Thread.SpinWait is an oddball because it’s not a kernel method: it resembles a continuous loop (it keeps ‘spinning’), but it tells a hyper-threaded CPU that it’s ok to do something else. It’s generally useful when you know the interrupt will happen very quickly, hence saving you an unnecessary context switch. If the interrupt does not happen as quickly as expected, the SpinWait struct falls back to yielding the thread (Thread.Yield, Thread.Sleep) to avoid wasting CPU cycles.
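A minimal sketch of a short spin-based wait using the framework’s `SpinWait.SpinUntil`. The `SpinDemo` class and its `WaitForFlag` method are hypothetical names for this example, and the worker thread merely stands in for whatever event is expected to arrive quickly.

```csharp
using System.Threading;

public static class SpinDemo
{
    // Spins until a worker thread raises a flag, or until the timeout expires.
    // Returns true if the flag was observed within the timeout.
    public static bool WaitForFlag(int timeoutMilliseconds)
    {
        int flag = 0;

        // The worker sets the flag almost immediately.
        var worker = new Thread(() => Interlocked.Exchange(ref flag, 1));
        worker.Start();

        // SpinUntil spins briefly and then starts yielding the thread
        // if the condition is still false.
        bool observed = SpinWait.SpinUntil(
            () => Volatile.Read(ref flag) == 1,
            timeoutMilliseconds);

        worker.Join();
        return observed;
    }
}
```

For waits that are expected to take long, a kernel object such as an event is still the better fit.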

 

.NET Framework Thread Synchronization

[image: .NET Framework thread synchronization constructs]

 

The lock Keyword

  1. start with interlocked operations (no contention)
  2. continue with ‘spin wait’
  3. create kernel event and wait

Good performance if low contention.
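From the caller’s point of view none of these steps are visible; the escalation happens inside the monitor that the lock statement compiles down to. A minimal usage sketch, with `LockedCounter` and `RunDemo` being hypothetical names for this example:

```csharp
using System.Threading;

public sealed class LockedCounter
{
    private readonly object _gate = new object();
    private int _count;

    public void Increment()
    {
        lock (_gate)   // uncontended acquisitions stay on the cheap path
        {
            _count++;
        }
    }

    public int Count
    {
        get { lock (_gate) { return _count; } }
    }

    // Hammers one counter from several threads and returns the final count.
    public static int RunDemo(int threads, int incrementsPerThread)
    {
        var counter = new LockedCounter();
        var workers = new Thread[threads];

        for (int i = 0; i < threads; i++)
        {
            workers[i] = new Thread(() =>
            {
                for (int j = 0; j < incrementsPerThread; j++)
                {
                    counter.Increment();
                }
            });
            workers[i].Start();
        }

        foreach (var worker in workers)
        {
            worker.Join();
        }

        return counter.Count;
    }
}
```

Because the critical section is tiny, most acquisitions should be resolved in the first two steps without ever creating a kernel event.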

 

Design Patterns

  • Thread unsafe
  • Actor
  • Reader-Writer Synchronized

This is where the PostSharp multithreading toolkit comes to the rescue! It can help you implement each of these patterns automatically. Gael has talked more about the toolkit in this blog post.


I’ve covered the topic of using SmartThreadPool and the framework thread pool in more detail here and here. This post will instead focus on a more specific scenario, where the rate of new work items being queued outstrips the pool’s ability to process them, and what happens then.

First, let’s try to quantify the work items being queued when you do something like this:

var threadPool = new SmartThreadPool();
var result = threadPool.QueueWorkItem(....);

The work item being queued is a delegate of some sort, basically some piece of code that needs to be run. Until a thread in the pool becomes available to process the work item, it’ll simply stay in memory as a bunch of 1’s and 0’s, just like everything else.

Now, if new work items are queued at a faster rate than the threads in the pool are able to process them, it’s easy to imagine that the amount of memory required to keep the delegates will follow an upward trend until you eventually run out of available memory and an OutOfMemoryException gets thrown.

Does that sound like a reasonable assumption? Let’s find out what actually happens!

Test 1 – Simple delegate

To simulate a scenario where the thread pool gets overrun by work items, I’m going to instantiate a new smart thread pool and make sure there’s only one thread in the pool at all times. Then I repeatedly queue up an action which puts the thread (the one in the pool) to sleep for a long time, so that there are no threads left to process subsequent work items:

using System;
using System.Threading;
using Amib.Threading;

// instantiate a basic STP with only one thread in the pool
var threadpool = new SmartThreadPool(new STPStartInfo
{
    MaxWorkerThreads = 1,
    MinWorkerThreads = 1,
});

var queuedItemCount = 0;
try
{
    // keep queuing new items which just put the one and only thread
    // in the threadpool to sleep for a very long time
    while (true)
    {
        // put the thread to sleep for a long, long time so it can't handle any more
        // queued work items
        threadpool.QueueWorkItem(() => Thread.Sleep(10000000));
        queuedItemCount++;
    }
}
catch (OutOfMemoryException)
{
    Console.WriteLine("OutOfMemoryException caught after queuing {0} work items", queuedItemCount);
}

The result? As expected, the memory used by the process went on a pretty steep climb, and within a minute it bombed out after eating up just over 1.8GB of RAM:

[images: memory usage of the process during the test]

All the while we managed to queue up 7205254 instances of the simple delegate used in this test. Keep this number in mind as we look at what happens when the closure also requires some expensive piece of data to be kept around in memory.

Test 2 – Delegate with very long string

For this test, I’m going to include a 1000-character string in each closure being queued, so that the string objects need to be kept in memory for as long as the closures are still around. Now let’s see what happens!

using System;
using System.Threading;
using Amib.Threading;

// instantiate a basic STP with only one thread in the pool
var threadpool = new SmartThreadPool(new STPStartInfo
{
    MaxWorkerThreads = 1,
    MinWorkerThreads = 1,
});

var queuedItemCount = 0;
try
{
    // keep queuing new items which just put the one and only thread
    // in the threadpool to sleep for a very long time
    while (true)
    {
        // generate a 1000 character long string; .NET strings are UTF-16,
        // so that's roughly 2000 bytes plus object overhead
        var veryLongText = new string('E', 1000);

        // include the very long string in the closure here
        threadpool.QueueWorkItem(() =>
        {
            Thread.Sleep(10000000);
            Console.WriteLine(veryLongText);
        });
        queuedItemCount++;
    }
}
catch (OutOfMemoryException)
{
    Console.WriteLine("OutOfMemoryException caught after queuing {0} work items", queuedItemCount);
}

Unsurprisingly, the memory was eaten up even faster this time around, and in the end we were only able to queue 782232 work items before we ran out of memory, which is significantly lower than in the previous test:

[image: memory usage of the process during the test]

Parting thoughts…

Besides being a fun little experiment to try out, there is a story here, one that tells of a worst case scenario (highly unlikely, but not impossible) which is worth keeping in the back of your mind when utilising thread pools to deal with highly frequent, data-intensive, blocking calls.


Occasionally you might want to make the value of a static or instance field local to a thread (i.e. each thread holds an independent copy of the field). What you need in this case is thread-local storage.

In C#, there are mainly two ways to do this.

ThreadStatic

You can mark a field with the ThreadStatic attribute:

[ThreadStatic]
public static int _x;
…
Enumerable.Range(1, 10).Select(i => new Thread(() => Console.WriteLine(_x++))).ToList()
          .ForEach(t => t.Start()); // prints 0 ten times

Whilst this is the easiest way to implement thread-local storage in C#, it’s important to understand the limitations here:

  • the ThreadStatic attribute doesn’t work with instance fields: it compiles and runs but does nothing
[ThreadStatic]
public int _x;
…
Enumerable.Range(1, 10).Select(i => new Thread(() => Console.WriteLine(_x++))).ToList()
          .ForEach(t => t.Start()); // typically prints 0, 1, 2, … 9 (in no guaranteed order)
  • fields always start with the default value, because an initializer runs only once, on the thread that initializes the type
[ThreadStatic]
public static int _x = 1;
…
Enumerable.Range(1, 10).Select(i => new Thread(() => Console.WriteLine(_x++))).ToList()
          .ForEach(t => t.Start()); // prints 0 ten times

ThreadLocal<T>

.NET 4 introduced a new class specifically for the thread-local storage of data – the ThreadLocal<T> class:

private readonly ThreadLocal<int> _localX = new ThreadLocal<int>(() => 1);
…
Enumerable.Range(1, 10).Select(i => new Thread(() => Console.WriteLine(_localX.Value++))).ToList()
          .ForEach(t => t.Start()); // prints 1 ten times

There are some bonuses to using the ThreadLocal<T> class:

  • values are lazily evaluated: the factory function runs on the first access from each thread
  • you have more control over the initialization of the field and are able to initialize it with a non-default value

Summary

As you can see, using ThreadLocal<T> has some clear advantages over ThreadStatic, though using 4.0-only features like ThreadLocal<T> means you have to target your project at the .NET 4 framework, so your code is not backward compatible with previous versions of the framework.

It’s also worth noting that besides ThreadLocal<T> and ThreadStatic you can also use Thread.GetData and Thread.SetData to fetch and store thread-specific data from and to a named LocalDataStoreSlot, though this is usually cumbersome…
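For completeness, a minimal sketch of the slot-based approach. The `SlotDemo` class and the slot name "demo-slot" are hypothetical; the Thread.GetNamedDataSlot, Thread.SetData and Thread.GetData calls are the framework API.

```csharp
using System.Threading;

public static class SlotDemo
{
    // Stores a value in a named slot on the calling thread, then reads the
    // same slot from a new thread and returns what that thread saw.
    public static object ReadOnNewThread()
    {
        // Looks up the named slot, allocating it if it does not exist yet.
        LocalDataStoreSlot slot = Thread.GetNamedDataSlot("demo-slot");
        Thread.SetData(slot, "set by the calling thread");

        object seenByWorker = "unset";
        var worker = new Thread(() => seenByWorker = Thread.GetData(slot));
        worker.Start();
        worker.Join();

        // The worker has its own, still empty, copy of the slot.
        return seenByWorker;
    }
}
```

Every access goes through an untyped object and a slot lookup, which is part of what makes this approach cumbersome next to ThreadLocal<T>.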


For a while now I’ve been curious as to whether the CLR uses the ThreadPool to execute a delegate when BeginInvoke is called:

private void InvokeFunc(Func<int> func)
{
    func.BeginInvoke(null, null); // does this execute on a threadpool thread?
}

Whilst common sense dictates that this must surely be true, I couldn’t be certain since I hadn’t managed to find any confirmation in the documentation.

Thanks to Jon Skeet and Jeff Sternal, who provided the answer to my question and a link to the MSDN article which confirms it:

If the BeginInvoke method is called, the common language runtime (CLR) queues the request and returns immediately to the caller. The target method is called asynchronously on a thread from the thread pool.

This, of course, means that if your delegate is likely to take a while to execute, you should not call BeginInvoke on it, so as to avoid tying up ThreadPool threads; instead you could create a new thread or use a SmartThreadPool instance. I’ve discussed these aspects in more detail here and here if you’re interested.
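As a rough illustration of the "create a new thread" option, the sketch below runs a piece of work on a dedicated thread and, for comparison, on a pool thread, reporting where it actually ran via Thread.IsThreadPoolThread. The `LongRunningWork` class and its method names are hypothetical.

```csharp
using System;
using System.Threading;

public static class LongRunningWork
{
    // Runs the work on a dedicated thread and reports whether that thread
    // belonged to the thread pool.
    public static bool RunOnDedicatedThread(Action work)
    {
        bool onPoolThread = true;
        var thread = new Thread(() =>
        {
            onPoolThread = Thread.CurrentThread.IsThreadPoolThread;
            work();
        });
        thread.IsBackground = true;   // don't keep the process alive for this work
        thread.Start();
        thread.Join();
        return onPoolThread;
    }

    // Runs the work on a pool thread for comparison and reports the same flag.
    public static bool RunOnPoolThread(Action work)
    {
        bool onPoolThread = false;
        using (var done = new ManualResetEvent(false))
        {
            ThreadPool.QueueUserWorkItem(state =>
            {
                try
                {
                    onPoolThread = Thread.CurrentThread.IsThreadPoolThread;
                    work();
                }
                finally
                {
                    done.Set();   // signal completion even if the work throws
                }
            });
            done.WaitOne();
        }
        return onPoolThread;
    }
}
```

The Join and WaitOne calls are only there so the sketch can report its result; real long-running work would not normally block the caller like this.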

References:

MSDN – Asynchronous Programming using Delegates

StackOverflow Question – Does Func.BeginInvoke use the ThreadPool
