Just a quick note to say that I have updated the JSON seri­al­iz­ers bench­mark to use the lat­est Nuget ver­sions of ServiceStack.Text, Json.Net and JsonFX.

I have also included the JSON and BSON seri­al­iz­ers from the Mon­goDB C# Dri­ver in the test, and since BSON is a binary for­mat I have included protobuf-net as a ref­er­ence since it’s the fastest binary seri­al­izer I know of on the .Net platform:

image

image

Com­pared to the last set of results, you can see that the lat­est ver­sion of JsonFX seems have gone much much slower on dese­ri­al­iza­tion, whilst ServiceStack.Text is still the fastest JSON seri­al­izer around. The Mon­goDB C# Dri­ver’s JSON and BSON both held up pretty well in the test too.

Share

Note: Don’t for­get to check out Bench­marks page to see the lat­est round up of binary and JSON serializers.

Fol­low­ing on from my pre­vi­ous test, I have now included JsonFx and as well as the Json.Net BSON seri­al­izer in the mix to see how they match up.

The results (in mil­lisec­onds) as well as the aver­age pay­load size (for each of the 100K objects seri­al­ized) are as follows.

image[4]

Graph­i­cally this is how they look:

image

I have included protobuf-net in this test to pro­vide more mean­ing­ful com­par­i­son for Json.Net BSON seri­al­izer since it gen­er­ates a binary pay­load and as such has a dif­fer­ent use case to the other JSON serializers.

In gen­eral, I con­sider JSON to be appro­pri­ate when the seri­al­ized data needs to be human read­able, a binary pay­load on the other hand, is more appro­pri­ate for com­mu­ni­ca­tion between applications/services.

Obser­va­tions

You can see from the results above that the Json.Net BSON seri­al­izer actu­ally gen­er­ates a big­ger pay­load than its JSON coun­ter­part. This is because the sim­ple POCO being seri­al­ized con­tains an array of 10 inte­gers in the range of 1 to 100. When the inte­ger ‘1’ is seri­al­ized as JSON, it’ll take 1 byte to rep­re­sent as one char­ac­ter, but an inte­ger will always take 4 bytes to rep­re­sent as binary!

In com­par­i­son, the pro­to­col buffer for­mat uses varint encod­ing so that smaller num­bers take a smaller num­ber of bytes to rep­re­sent, and it is not self-describing (the prop­erty names are not part of the pay­load) so it’s able to gen­er­ate a much much smaller pay­load com­pared to JSON and BSON.

Lastly, whilst the Json.Net BSON seri­al­izer offers a slightly faster dese­ri­al­iza­tion time com­pared to the Json.Net JSON seri­al­izer, it does how­ever, have a much slower seri­al­iza­tion speed.

Dis­claimers

Bench­marks do not tell the whole story, and the num­bers will nat­u­rally vary depend­ing on a num­ber of fac­tors such as the type of data being tested on. In the real world, you will also need to take into account how you’re likely to inter­act with the data, e.g. if you know you’ll be dese­ri­al­iz­ing data a lot more often than seri­al­iz­ing them then dese­ri­al­iza­tion speed will of course become less impor­tant than seri­al­iza­tion speed!

In the case of BSON and inte­gers, whilst it’s less effi­cient (than JSON) when seri­al­iz­ing small num­bers, it’s more effi­cient when the num­bers are big­ger than 4 digits.

Share

After watch­ing Gael’s recent Skills­Mat­ter talk on mul­ti­thread­ing I’ve put together some notes from a very edu­ca­tional talk:

 

Hard­ware Cache Hierarchy

image

Four lev­els of cache

  • L1 (per core) – typ­i­cally used for instructions
  • L2 (per core)
  • L3 (per die)
  • DRAM (all processors)

Data can be cached in mul­ti­ple caches, and syn­chro­niza­tion hap­pens through an asyn­chro­nous mes­sage bus.

The latency increases as you go down the dif­fer­ent lev­els of cache:

image 

 

Mem­ory Reordering

Cache oper­a­tions are in gen­eral opti­mized for per­for­mance as opposed to log­i­cal behav­iour, hence depend­ing on the archi­tec­ture (x86, AMD, ARM7, etc.) cache loads and store oper­a­tions can be reordered and exe­cuted out-of-order:

image

To add to this mem­ory reorder­ing behav­iour at a hard­ware level, the CLR can also:

  • cache data into register
  • reorder
  • coa­lesce writes

The volatile key­word stops the com­piler opti­miza­tions, that’s all, it does not stop the hard­ware level optimizations.

This is where mem­ory bar­rier comes in, to ensure ser­ial access to mem­ory and to force data to be flushed and syn­chro­nized across all the local cache, this is done via the Thread.MemoryBarrier method in .Net.

 

Atom­ic­ity

Oper­a­tions on longs can­not be per­formed in an atomic way on a 32-bit archi­tec­ture, it’s pos­si­ble to get par­tially mod­i­fied value.

 

Inter­locked

Inter­locks pro­vides the only lock­ing mech­a­nism at hard­ware level, the .Net frame­work pro­vides access to these instruc­tions via the Inter­locked class.

On the Intel archi­tec­ture, inter­locks are typ­i­cally imple­mented on the L3 cache, a fact that’s reflected by the latency asso­ci­ated with using Inter­locked incre­ments com­pared with non-interlocked:

image

Com­pa­re­Ex­change is the most impor­tant tool when it comes to imple­mented lock-free algo­rithms, but since it’s imple­mented on the L3 cache, in a multi-processor envi­ron­ment it would require one of the proces­sor to take out a global lock, hence why the con­tented case above takes much longer.

You can analyse the per­for­mance of your appli­ca­tion at a CPU level using Intel’s vTune Ampli­fier XE tool.

 

Mul­ti­task­ing

Threads do not exist at a hard­ware level, CPU only under­stands tasks and it has no con­cept of ‘wait’. Syn­chro­niza­tion con­structs such as sem­a­phores and mutex are built on top of inter­locked operations.

One core can never do more than 1 ‘thing’ at the same time, unless it’s hyper-threaded in which case the core can do some­thing else whilst wait­ing on some resource to con­tinue exe­cut­ing the orig­i­nal task.

A task runs until inter­rupted by hard­ware (I/O inter­rupt) or OS.

 

Win­dows Kernel

A process has:

  • pri­vate vir­tual address space
  • resources
  • at least 1 thread

A thread is:

  • a pro­gram (sequence of instructions)
  • CPU state
  • wait depen­den­cies

Threads can wait for dis­patcher objects (Wait­Handle) – Mutex, Sem­a­phore, Event, Timer or another thread, when they’re not wait­ing for any­thing they’re placed in the wait­ing queue by the thread sched­uler until it is their turn to be exe­cuted on the CPU.

After a thread has been exe­cuted for some time, it is then moved back to the wait­ing queue (via a ker­nel inter­rupt) to give some other thread a slice of the avail­able CPU time. Alter­na­tively, if the thread needs to wait for a dis­patcher object then it goes back to the wait­ing state.

image

Dis­patcher objects reside in the ker­nel and can be shared among dif­fer­ent processes, they’re very expen­sive!

image

Which is why you don’t want to use ker­nel objects for waits that are typ­i­cally very short, instead they’re best used when wait­ing for some­thing that takes longer to return, e.g. I/O.

Com­pared to other wait meth­ods (e.g. Thread.Sleep, Thread.Yield, WaitHandle.Wait, etc.) Thread.SpinWait is an odd ball because it’s not a ker­nel method, it resem­bles a con­tin­u­ous loop (it keeps ‘spin­ning’) but it tells a hyper-threaded CPU that it’s ok to do some­thing else. It’s gen­er­ally use­ful when you know the inter­rupt will hap­pen very quickly and hence sav­ing you from an unnec­es­sary con­text switch. If the inter­rupt does not hap­pen quickly as expected, the Spin­Wait will be trans­formed into a nor­mal thread wait (Thread.Sleep) to avoid wast­ing CPU cycles.

 

.Net Frame­work Thread Synchronization

image

 

The lock Keyword

  1. start with inter­locked oper­a­tions (no contention)
  2. con­tinue with ‘spin wait’
  3. cre­ate ker­nel event and wait

Good per­for­mance if low contention.

 

Design Pat­terns

  • Thread unsafe
  • Actor
  • Reader-Writer Syn­chro­nized

This is where the Post­Sharp mul­ti­thread­ing toolkit comes to the res­cue! It can help you imple­ment each of these pat­terns auto­mat­i­cally, Gael has talked more about the toolkit in this blog post.

Share

I have heard a few peo­ple argue that when it comes to per­for­mance crit­i­cal code you should pre­fer arrays over other col­lec­tions (such as F#’s lists) as it ben­e­fits from sequen­tial reads (which is faster than seeks) and offers bet­ter mem­ory locality.

To test that the­ory some­what, I wanted to see if there is any dif­fer­ence in how fast you can iter­ate through an array ver­sus a list in F#, and how much faster you can map over an array com­pared to a list:

The result is a lit­tle sur­pris­ing, whilst I wasn’t expect­ing there to be a mas­sive dif­fer­ence in the iter­at­ing through the two types of col­lec­tions, I didn’t think map­ping over a list would be quite as slow in com­par­i­son. I knew that con­struct­ing a list is much heav­ier than con­struct­ing an array, but I didn’t think it’d take 22x as long in this case.

What was even more sur­pris­ing was how much slower the Seq.iter and Seq.map func­tions are com­pared to the Array and List mod­ule equiv­a­lents! This is, accord­ing to John Palmer:

Once you call in to Seq you lose the type infor­ma­tion — mov­ing to the next ele­ment in the list requires a call to IEnumerator.MoveNext. Com­pare to for Array you just incre­ment an index and for List you can just deref­er­ence a pointer. Essen­tially, you are get­ting an extra func­tion call for each ele­ment in the list.

The con­ver­sions back to List and Array also slow the code down for sim­i­lar reasons

Update 2012/06/04:

As a work around, you COULD shadow the Seq mod­ule with iter and map func­tions that adds sim­ple type check­ing and in the case of an array or list sim­ply call the cor­re­spond­ing func­tion in the Array or List mod­ule instead:

Whilst this approach will work to a cer­tain extend, you should be care­ful with which func­tions you shadow. For instance, it’s not safe to shadow Seq.map because it can be used in con­junc­tion with other func­tions such as Seq.takeWhile or Seq.take. In the base imple­men­ta­tion, a line of code such as:

arr |> Seq.map incr |> Seq.take 3

will not map over every ele­ment in the source array.

With the shad­owed ver­sion (see above) of Seq.map, how­ever, this would first cre­ate a new array by apply­ing the map­per func­tion against every ele­ment in the source array before dis­card­ing all but the first three ele­ments in the new array. This, as you can imag­ine, is far less effi­cient and requires much more mem­ory space (for the new array) and defeats the pur­pose of using Seq mod­ule func­tions in most cases.

Share

To find out if a string con­tains a piece of sub­string, here are three sim­ple ways of going about it in C#, just to name a few:

Out of curios­ity I wanted to see if there was any notice­able dif­fer­ence in the per­for­mance of each of these options.

Given a sim­ple string “Mary had a lit­tle lamb”, let’s find out how long it takes to test whether or not this string con­tains the terms ‘lit­tle’ (the match case) and ‘big’ (the no match case) using each of these approaches, repeated over 100k times:

image

As you can see, Regex.IsMatch is by far the slow­est option in this test, although using RegexOptions.Compiled yielded slightly faster exe­cu­tion time. What was also inter­est­ing is that String.Contains turned out to be sig­nif­i­cantly faster than String.IndexOf.

If you take a look at the imple­men­ta­tion for String.Contains in a reflec­tor you will see:

image

So that explains the dif­fer­ence between the exe­cu­tion times for String.Contains and String.IndexOf, and indeed if I change the String.IndexOf test to use StringComparison.Ordinal (default is StringComparison.CurrentCulture) then I get an iden­ti­cal result to String.Contains.

With all that said, String.Contains and String.IndexOf is only use­ful for check­ing the exis­tence of an exact sub­string, but Regex is much more pow­er­ful and allows you to do so much more. How­ever, you do end up pay­ing for them even when you don’t need those addi­tional capabilities!

Share