Be Lazy, but beware of initialization exceptions

.Net 4 introduced the Lazy<T> type, which allows you to create an object that is lazily initialized so that, for instance, you can delay the creation of large objects.

However, if your initialization logic can throw an exception at runtime (e.g. timeout exceptions when reading from some external data source) then you should pay close attention to which constructor you use to create a new instance of the Lazy<T> type. Depending on the selected LazyThreadSafetyMode, exceptions thrown by the initialization code might be cached and rethrown on all subsequent attempts to fetch the lazily initialized value. Whilst this ensures that threads always get the same result, hence removing ambiguity, it does mean that you’ve only got one shot at initializing that value…

 

LazyThreadSafetyMode

In cases where you need to tolerate occasional initialization errors (e.g. reading a large object from S3 can fail from time to time for a number of reasons) and be able to try again, the rule of thumb is to instantiate the Lazy<T> type with LazyThreadSafetyMode set to PublicationOnly. In PublicationOnly mode, multiple threads can invoke the initialization logic and the first thread to complete it successfully sets the value of the Lazy<T> instance; crucially, exceptions thrown by the initialization logic are never cached in this mode.

For example, the following only works under the PublicationOnly mode:
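Here is a minimal sketch of that idea (the exception type and messages are purely illustrative): the second read of .Value only succeeds because the instance was created with PublicationOnly.

var attempts = 0;
var lazyValue = new Lazy<string>(() =>
{
    attempts++;
    // simulate a transient failure on the first attempt, e.g. a timeout reading from S3
    if (attempts == 1) throw new TimeoutException("first attempt failed");
    return "loaded on attempt " + attempts;
}, LazyThreadSafetyMode.PublicationOnly);

try { var ignored = lazyValue.Value; } // first attempt throws
catch (TimeoutException) { }

// because PublicationOnly never caches exceptions, the second attempt re-runs the
// initialization and succeeds; under ExecutionAndPublication the cached TimeoutException
// would be rethrown here instead
Console.WriteLine(lazyValue.Value);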

 

F#

F# provides a slightly nicer syntax for defining a lazy computation:

[Image: F# code defining a lazy computation]

The Control.Lazy<T> type is an abbreviation of the BCL Lazy<T> type, with a Force extension method that under the hood simply returns Lazy<T>.Value.

Presumably the above translates roughly to the following C# code:

var x = 10;
var result = new Lazy<int>(() => x + 10);

and the thread safety mode used by the Lazy(Func<T>) constructor is LazyThreadSafetyMode.ExecutionAndPublication, which caches and rethrows any exception thrown by the initialization code.

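Here is a minimal sketch of that behaviour (the exception type and messages are purely illustrative):

var lazyValue = new Lazy<int>(() =>
{
    Console.WriteLine("initializing...");
    throw new InvalidOperationException("initialization failed");
});

for (var i = 0; i < 3; i++)
{
    try
    {
        Console.WriteLine(lazyValue.Value);
    }
    catch (InvalidOperationException ex)
    {
        // "initializing..." is printed only once - the exception is cached and simply
        // rethrown on every subsequent access to .Value
        Console.WriteLine(ex.Message);
    }
}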

Takeaways from Gael Fraiteur’s multithreading talk

After watching Gael’s recent SkillsMatter talk on multithreading, I’ve put together some notes from what was a very educational talk:

 

Hardware Cache Hierarchy


Four levels of cache

  • L1 (per core) – typically used for instructions
  • L2 (per core)
  • L3 (per die)
  • DRAM (all processors)

Data can be cached in multiple caches, and synchronization happens through an asynchronous message bus.

The latency increases as you go down the different levels of cache, from L1 all the way out to DRAM.


 

Memory Reordering

Cache operations are in general optimized for performance as opposed to logical behaviour, hence depending on the architecture (x86, AMD, ARM7, etc.) cache load and store operations can be reordered and executed out of order.


To add to this memory reordering behaviour at a hardware level, the CLR can also:

  • cache data in registers
  • reorder instructions
  • coalesce writes

The volatile keyword only stops these compiler optimizations; it does not stop the hardware-level optimizations.

This is where memory barriers come in: they ensure serialized access to memory and force data to be flushed and synchronized across the local caches. In .Net this is done via the Thread.MemoryBarrier method.
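As a hedged illustration (this example is mine, not from the talk), the classic flag-and-data publication pattern shows where the full fences need to go:

static int _data;
static bool _ready;

static void Producer()
{
    _data = 42;
    Thread.MemoryBarrier(); // full fence: the write to _data can't be moved past the write to _ready
    _ready = true;
}

static void Consumer()
{
    if (_ready)
    {
        Thread.MemoryBarrier(); // full fence: the read of _data can't be moved before the read of _ready
        Console.WriteLine(_data); // if _ready was seen as true, this is guaranteed to print 42
    }
}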

 

Atomicity

Operations on longs cannot be performed atomically on a 32-bit architecture; it’s possible to read a partially modified value.
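For example (a sketch of my own, not from the talk), the Interlocked class can be used to make 64-bit reads and writes atomic even on 32-bit hardware:

static long _counter;

// a plain read such as "var snapshot = _counter;" compiles to two 32-bit loads on a
// 32-bit CPU and can observe a torn value while another thread is writing
static long ReadCounter()
{
    return Interlocked.Read(ref _counter); // atomic 64-bit read
}

static void WriteCounter(long value)
{
    Interlocked.Exchange(ref _counter, value); // atomic 64-bit write
}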

 

Interlocked

Interlocked operations provide the only locking mechanism at the hardware level; the .Net framework exposes these instructions via the Interlocked class.

On the Intel architecture, interlocks are typically implemented on the L3 cache, a fact that’s reflected in the higher latency of interlocked increments compared with non-interlocked increments.


CompareExchange is the most important tool when it comes to implementing lock-free algorithms, but since it’s implemented on the L3 cache, in a multi-processor environment it requires one of the processors to take out a global lock, which is why the contended case takes so much longer.
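As a hedged illustration of the typical pattern (my example, not from the talk), lock-free updates are usually written as a CompareExchange retry loop:

static int _shared;

static void DoubleShared()
{
    while (true)
    {
        var current = _shared;      // take a snapshot of the current value
        var desired = current * 2;  // compute the new value from the snapshot

        // publish the new value only if nobody else changed _shared in the meantime,
        // otherwise loop and try again with a fresh snapshot
        if (Interlocked.CompareExchange(ref _shared, desired, current) == current)
        {
            return;
        }
    }
}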

You can analyse the performance of your application at the CPU level using Intel’s VTune Amplifier XE tool.

 

Multitasking

Threads do not exist at the hardware level; the CPU only understands tasks and has no concept of ‘wait’. Synchronization constructs such as semaphores and mutexes are built on top of interlocked operations.

One core can never do more than one ‘thing’ at a time, unless it’s hyper-threaded, in which case the core can do something else whilst waiting on some resource before continuing with the original task.

A task runs until interrupted by hardware (I/O interrupt) or OS.

 

Windows Kernel

A process has:

  • private virtual address space
  • resources
  • at least 1 thread

A thread is:

  • a program (sequence of instructions)
  • CPU state
  • wait dependencies

Threads can wait on dispatcher objects (WaitHandle) – Mutex, Semaphore, Event, Timer or another thread. When they’re not waiting for anything, they’re placed in the waiting queue by the thread scheduler until it’s their turn to be executed on the CPU.

After a thread has executed for some time, it is moved back to the waiting queue (via a kernel interrupt) to give other threads a slice of the available CPU time. Alternatively, if the thread needs to wait on a dispatcher object, it goes back to the waiting state.


Dispatcher objects reside in the kernel and can be shared among different processes; they’re very expensive!


This is why you don’t want to use kernel objects for waits that are typically very short; they’re best used when waiting for something that takes longer to return, e.g. I/O.

Compared to other wait methods (e.g. Thread.Sleep, Thread.Yield, WaitHandle.Wait, etc.) Thread.SpinWait is an odd ball because it’s not a kernel method. It resembles a continuous loop (it keeps ‘spinning’), but it tells a hyper-threaded CPU that it’s ok to do something else. It’s generally useful when you know the interrupt will happen very quickly, saving you an unnecessary context switch. If the interrupt does not happen as quickly as expected, the spin wait is turned into a normal thread wait (Thread.Sleep) to avoid wasting CPU cycles.
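As a hedged sketch of this spin-then-wait idea (my example, not from the talk), the SpinWait struct in System.Threading implements exactly this escalation:

static volatile bool _workReady;

static void WaitForWork()
{
    var spinner = new SpinWait();
    while (!_workReady)
    {
        // spins on the CPU for the first few iterations (cheap, no context switch),
        // then starts yielding its time slice / sleeping so it doesn't burn cycles forever
        spinner.SpinOnce();
    }
}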

 

.Net Framework Thread Synchronization

[Image: overview of the thread synchronization constructs in the .Net framework]

 

The lock Keyword

  1. start with interlocked operations (no contention)
  2. continue with ‘spin wait’
  3. create kernel event and wait

Good performance if low contention.
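For reference (a hedged sketch of my own, not from the talk), the lock keyword is just syntactic sugar over Monitor, which implements the escalation steps above:

private static readonly object _sync = new object();

static void DoWork()
{
    lock (_sync)
    {
        // critical section
    }
}

// which the compiler expands to roughly the following
static void DoWorkExpanded()
{
    var lockTaken = false;
    try
    {
        Monitor.Enter(_sync, ref lockTaken);
        // critical section
    }
    finally
    {
        if (lockTaken) Monitor.Exit(_sync);
    }
}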

 

Design Patterns

  • Thread unsafe
  • Actor
  • Reader-Writer Synchronized

This is where the PostSharp multithreading toolkit comes to the rescue! It can help you implement each of these patterns automatically; Gael has talked more about the toolkit in this blog post.

SmartThreadPool – What happens when you bite off more than you can chew

I’ve covered the topic of using SmartThreadPool and the framework thread pool in more detail here and here; this post will instead focus on a more specific scenario – what happens when the rate at which new work items are queued outstrips the pool’s ability to process them.

First, let’s try to quantify the work items being queued when you do something like this:

var threadPool = new SmartThreadPool();
var result = threadPool.QueueWorkItem(....);

The work item being queued is a delegate of some sort – basically some piece of code that needs to be run. Until a thread in the pool becomes available to process it, the work item simply stays in memory as a bunch of 1’s and 0’s, just like everything else.

Now, if new work items are queued at a faster rate than the threads in the pool are able to process them, it’s easy to imagine that the amount of memory required to keep the delegates will follow an upward trend until you eventually run out of available memory and an OutOfMemoryException gets thrown.

Does that sound like a reasonable assumption? Let’s find out what actually happens!

Test 1 – Simple delegate

To simulate a scenario where the thread pool gets overrun by work items, I’m going to instantiate a new smart thread pool and make sure there’s only one thread in the pool at all times. Then I repeatedly queue up an action which puts that one thread to sleep for a long time, so that there are no threads left to process subsequent work items:

// instantiate a basic SmartThreadPool with only one thread in the pool
var threadpool = new SmartThreadPool(new STPStartInfo
                                         {
                                             MaxWorkerThreads = 1,
                                             MinWorkerThreads = 1,
                                         });

var queuedItemCount = 0;
try
{
    // keep queuing new items which just put the one and only thread
    // in the threadpool to sleep for a very long time
    while (true)
    {
        // put the thread to sleep for a long long time so it can't handle any more
        // queued work items
        threadpool.QueueWorkItem(() => Thread.Sleep(10000000));
        queuedItemCount++;
    }
}
catch (OutOfMemoryException)
{
    Console.WriteLine("OutOfMemoryException caught after queuing {0} work items", queuedItemCount);
}

The result? As expected, the memory used by the process went on a pretty steep climb and within a minute it bombed out after eating up just over 1.8GB of RAM.


All the while we managed to queue up 7205254 instances of the simple delegate used in this test; keep this number in mind as we look at what happens when the closure also requires an expensive piece of data to be kept around in memory.

Test 2 – Delegate with very long string

For this test, I’m going to include a 1000-character-long string in each closure being queued, so that the string objects need to be kept in memory for as long as the closures are around. Now let’s see what happens!

// instantiate a basic SmartThreadPool with only one thread in the pool
var threadpool = new SmartThreadPool(new STPStartInfo
                                         {
                                             MaxWorkerThreads = 1,
                                             MinWorkerThreads = 1,
                                         });

var queuedItemCount = 0;
try
{
    // keep queuing new items which just put the one and only thread
    // in the threadpool to sleep for a very long time
    while (true)
    {
        // generate a 1000 character long string, that's ~2KB (2 bytes per char) plus object overhead
        var veryLongText = new string(Enumerable.Range(1, 1000).Select(i => 'E').ToArray());

        // include the very long string in the closure here
        threadpool.QueueWorkItem(() =>
                                     {
                                         Thread.Sleep(10000000);
                                         Console.WriteLine(veryLongText);
                                     });
        queuedItemCount++;
    }
}
catch (OutOfMemoryException)
{
    Console.WriteLine("OutOfMemoryException caught after queuing {0} work items", queuedItemCount);
}

Unsurprisingly, the memory was eaten up even faster this time around, and in the end we were only able to queue 782232 work items before running out of memory – significantly fewer than in the previous test.


Parting thoughts…

Besides being a fun little experiment, there is a story here – one that tells of a worst-case scenario (albeit one that’s highly unlikely, but not impossible) which is worth keeping in the back of your mind when using thread pools to deal with highly frequent, data-intensive, blocking calls.

ThreadStatic vs ThreadLocal<T>

Occasionally you might want to make the value of a static or instance field local to a thread (i.e. each thread holds an independent copy of the field). What you need in this case is thread-local storage.

In C#, there are two main ways to do this.

ThreadStatic

You can mark a field with the ThreadStatic attribute:

[ThreadStatic]
public static int _x;
…
Enumerable.Range(1, 10).Select(i => new Thread(() => Console.WriteLine(_x++))).ToList()
          .ForEach(t => t.Start()); // prints 0 ten times

Whilst this is the easiest way to implement thread-local storage in C#, it’s important to understand its limitations:

  • the ThreadStatic attribute doesn’t work with instance fields; it compiles and runs but does nothing:

[ThreadStatic]
public int _x;
…
Enumerable.Range(1, 10).Select(i => new Thread(() => Console.WriteLine(_x++))).ToList()
          .ForEach(t => t.Start()); // prints 0, 1, 2, … 9

  • the field always starts with the default value, because the field initializer only runs once (on the first thread) and every other thread sees the type’s default value:

[ThreadStatic]
public static int _x = 1;
…
Enumerable.Range(1, 10).Select(i => new Thread(() => Console.WriteLine(_x++))).ToList()
          .ForEach(t => t.Start()); // prints 0 ten times

ThreadLocal<T>

.Net 4 introduced a new class specifically for thread-local storage of data – the ThreadLocal<T> class:

private readonly ThreadLocal<int> _localX = new ThreadLocal<int>(() => 1);
…
Enumerable.Range(1, 10).Select(i => new Thread(() => Console.WriteLine(_localX.Value++))).ToList()
          .ForEach(t => t.Start()); // prints 1 ten times

There are some bonuses to using the ThreadLocal<T> class:

  • values are lazily evaluated; the factory function runs on the first access from each thread
  • you have more control over the initialization of the field and are able to initialize it with a non-default value

Summary

As you can see, ThreadLocal<T> has some clear advantages over ThreadStatic, though using .Net 4-only features like ThreadLocal<T> means you have to target your project at the .Net 4 framework, and it is therefore not backward compatible with previous versions of the framework.

It’s also worth noting that besides ThreadLocal<T> and ThreadStatic, you can also use Thread.GetData and Thread.SetData to fetch and store thread-specific data in a named LocalDataStoreSlot, though this is usually more cumbersome…
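As a hedged sketch of what that looks like (the slot name here is my own):

// allocate (or look up) a named slot, then read/write per-thread data through it
var slot = Thread.GetNamedDataSlot("my-slot");

Thread.SetData(slot, 42);
var value = (int)Thread.GetData(slot); // 42, but only on the thread that called SetData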

.Net Threading – BeginInvoke uses the thread pool

For a while now I’ve been curious as to whether the CLR uses the ThreadPool to execute a delegate when BeginInvoke is called:

private void InvokeFunc(Func<int> func)
{
    func.BeginInvoke(null, null); // does this execute on a threadpool thread?
}

Whilst common sense dictates that this must surely be true, I couldn’t be certain since I hadn’t managed to find any confirmation in the documentation.

Thanks to Jon Skeet and Jeff Sternal who provided the answer to my question and a link to the MSDN article which confirms it:

If the BeginInvoke method is called, the common language runtime (CLR) queues the request and returns immediately to the caller. The target method is called asynchronously on a thread from the thread pool.

This, of course, means that if your delegate is likely to take a while to execute, you should not call BeginInvoke on it, to avoid tying up ThreadPool threads; instead you could create a new thread or use a SmartThreadPool instance. I’ve discussed these aspects in more detail here and here if you’re interested.
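For example, a minimal sketch of the dedicated-thread alternative (the method name is mine):

private void InvokeLongRunningFunc(Func<int> func)
{
    // a long-running delegate gets its own thread instead of tying up a pool thread
    var thread = new Thread(() => func());
    thread.IsBackground = true;
    thread.Start();
}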

References:

MSDN – Asynchronous Programming using Delegates

StackOverflow Question – Does Func.BeginInvoke use the ThreadPool