Smallest .Net ref type is 12 bytes (or why you should consider using value types)

(Update 2015/07/21: read the next post in this series to learn about the places where implicit boxing happens with value types and how you can prevent them)

 

I chanced upon Sasha Goldshtein's excellent book, Pro .Net Performance: Optimize Your C# Application, a few years back and I thoroughly enjoyed it.

[Pro .Net Performance book cover]

Even though it has been almost 3 years since I read it and I have already forgotten much of what I learnt, I still remember this line very clearly:

…in fact, even a class with no instance fields will occupy 12 bytes when instantiated…

This is, of course, on a 32-bit system. On a 64-bit system, the smallest reference type instance will take up 24 bytes of memory!

 

How?!?

The reason is the way .Net objects are laid out in memory, where you have:

  • 4 bytes for the Object Header Word;
  • 4 bytes for the Method Table Pointer; and
  • a minimum of 4 bytes for instance fields

When your class has only a byte field, it'll still take up 4 bytes of space for instance fields because objects have to align with 4-byte multiples. Even when your class has no instance fields, it'll still take up 4 bytes. Therefore, on a 32-bit system, the smallest reference type instance will be 12 bytes.

On a 64-bit system, a word is 8 bytes instead of 4, and objects are aligned to the nearest 8-byte multiple, so the smallest reference type instance in this case will be 24 bytes.
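You can sanity-check these numbers yourself with a rough measurement (a sketch; GC.GetTotalMemory only gives an approximation, so expect a number close to, rather than exactly, 12 or 24):

```csharp
using System;

class Empty { } // no instance fields at all

class Program
{
    static void Main()
    {
        const int Count = 1000000;
        var objects = new object[Count]; // allocated up front so it isn't counted below

        long before = GC.GetTotalMemory(true);
        for (int i = 0; i < Count; i++)
        {
            objects[i] = new Empty();
        }
        long after = GC.GetTotalMemory(true);

        // prints roughly 12 in a 32-bit process, roughly 24 in a 64-bit process
        Console.WriteLine("bytes per instance: " + (after - before) / (double)Count);

        GC.KeepAlive(objects);
    }
}
```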

sidebar : The Object Header Word and Method Table Pointer are used by the JIT and CLR. The book goes into a lot of detail on their structure and purpose, which we'll ignore for this blog post. If you're interested in learning more about them, go buy the book; it'll be money well spent.

 

Reference vs Value Type

There is plenty to talk about when it comes to reference vs value types, including:

  • stack vs heap allocation;
  • default equality semantics, i.e. compare-by-reference vs compare-by-value;
  • pass-by-value vs pass-by-reference;

For the purpose of this post, let's focus on how they differ in terms of memory consumption and cache friendliness when you have a large array of each.

Memory Consumption

Suppose you have a Point2D class with only X and Y integer fields; each instance of this type will occupy a total of 16 bytes (including 8 bytes for X and Y) in the heap. Taking into account the 4-byte reference pointer (again, on a 32-bit system), it brings our total investment per instance of the Point2D type to 20 bytes!

If you have an array with 10M instances of Point2D then you'll have committed 190MB of memory for this array!

On the other hand, if Point2D was a value type then each instance would take up only 8 bytes for the X and Y values, and they'll be tightly packed into the array without each needing an additional 4 bytes for a reference pointer. Overall, the amount of memory you're committing to this array drops to 76MB.
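To make the comparison concrete, here is roughly what the two flavours of Point2D look like (the names Point2DClass and Point2DStruct are illustrative; the field layout is what matters):

```csharp
// Reference type: on a 32-bit system each instance costs
//   4 (Object Header Word) + 4 (Method Table Pointer) + 8 (X and Y) = 16 bytes on the heap,
// plus the 4-byte reference stored in the array itself.
class Point2DClass
{
    public int X;
    public int Y;
}

// Value type: each element of a Point2DStruct[] is just the 8 bytes for X and Y,
// packed inline in the array with no per-instance overhead.
struct Point2DStruct
{
    public int X;
    public int Y;
}
```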

Cache Friendliness

Whilst there are no inherent differences in the speed of accessing stack vs heap allocated memory (they are just different ranges of addresses in the virtual memory, after all), there are a number of performance-related considerations.

Stack allocated memory does not incur GC overhead

The stack is self-managed in the sense that when you leave a scope, you just move the stack pointer back to its previous position and you'd have "deallocated" the previously allocated memory.
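You can see this in action with C#'s stackalloc, which allocates from the stack and involves no GC at all (a sketch; note it requires an unsafe context and compiling with /unsafe):

```csharp
using System;

class StackAllocDemo
{
    static unsafe int SumSmallBuffer()
    {
        // allocated on the current thread's stack; the GC never sees it
        int* buffer = stackalloc int[16];
        for (int i = 0; i < 16; i++)
            buffer[i] = i;

        int sum = 0;
        for (int i = 0; i < 16; i++)
            sum += buffer[i];

        // returning "deallocates" the buffer: the stack pointer simply moves back
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumSmallBuffer()); // prints 120
    }
}
```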

With heap allocated memory you incur the overhead of a rather sophisticated generational GC, which as part of a collection needs to move surviving objects around to compact the memory space (which becomes more and more expensive as you move from gen 0 to gen 2).

sidebar : I remember reading a post a few years ago that discussed how the StackOverflow guys went through their entire codebase and converted as many classes to structs as they could, in order to reduce the number of latency spikes on their servers due to GC pauses. Whilst I'm not advocating for you to do the same, it illustrates that GC pauses are a common problem on web servers. The background GC mode introduced in recent .Net releases would have reduced the frequency of these pauses, but sensible use of value types would still help in this regard.

On the stack, temporal locality implies spatial locality and temporal access locality

This means that objects allocated close together in time are also stored close together in space. Furthermore, objects allocated close together in time (e.g. inside the same method) are also likely to be used close together.

Both spatial and temporal access locality play nicely with how the cache works (i.e. fewer cache misses) and how OS paging works (i.e. fewer virtual memory swaps).

On the stack, memory density is higher

Since value types don't have the overheads that come with reference types (the Object Header Word and Method Table Pointer), you're able to pack more objects into the same amount of memory.

Higher memory density leads to better performance because it means fewer fetches from memory.

On modern hardware, the entire thread stack can fit into the CPU cache

When you spawn a new thread, it's allocated 1MB of stack space by default. CPUs nowadays come with much bigger L3 caches (e.g. Nehalem-based processors have up to 24MB of L3 cache), so the entire stack can fit into the L3 cache, which is a lot faster to access than main memory (have a look at this talk from Gael Fraiteur, creator of PostSharp, to see just how much access times differ).

 

For better or worse, modern hardware is built to take advantage of both spatial and temporal locality, and these optimizations manifest themselves in how data is fetched from main memory in cache lines.

Consider what happens at a hardware level when you iterate through an array of 10 million Point2D objects. If Point2D was a reference type, then each time you iterate over an instance of it in the array:

  • the CPU will fetch a cache line (typically 64 bytes) starting from where the array element is;
  • it'll access the memory to get the reference pointer (4 bytes) for the object in the heap;
  • it'll discard the rest of the cache line;
  • it'll fetch another cache line (again, 64 bytes) starting from where the object is;
  • it'll read the object and do whatever it is you asked it to do;
  • and then it'll do the same for the next element in the array, and so on…

Notice how much work is wasted here due to reference jumping?

On the other hand, if Point2D was a value type with only X and Y values (8 bytes):

  • the CPU will fetch a cache line (64 bytes) starting from where the array element is;
  • it'll read the object (8 bytes) and do its work;
  • it'll try to get the next object, realizing that it's already in the cache (what a lucky day!);
  • it'll read the next object (8 bytes) from the existing cache line, and so on

The CPU is able to read 8 objects from one fetch, vs. 2 fetches per object!

 

To see how this cache friendliness translates into performance numbers, we can measure the one thing that's easy to measure: execution time. Let's consider this simple example:
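Below is a minimal sketch of such a benchmark (the type names Point2DClass and Point2DStruct are illustrative stand-ins for the reference and value type versions of Point2D):

```csharp
using System;
using System.Diagnostics;

class Point2DClass { public int X, Y; }
struct Point2DStruct { public int X, Y; }

class Program
{
    const int Count = 10000000;

    static void Main()
    {
        // create 10M reference type instances
        var sw = Stopwatch.StartNew();
        var classes = new Point2DClass[Count];
        for (int i = 0; i < Count; i++)
            classes[i] = new Point2DClass { X = i, Y = i };
        Console.WriteLine("create 10M ref type instances    : {0}ms", sw.Elapsed.TotalMilliseconds);

        // iterate over them
        sw.Restart();
        long sum = 0;
        for (int i = 0; i < Count; i++)
            sum += classes[i].X + classes[i].Y;
        Console.WriteLine("iterate over 10M ref instances   : {0}ms", sw.Elapsed.TotalMilliseconds);

        // create 10M value type instances, packed inline in the array
        sw.Restart();
        var structs = new Point2DStruct[Count];
        for (int i = 0; i < Count; i++)
        {
            structs[i].X = i;
            structs[i].Y = i;
        }
        Console.WriteLine("create 10M value type instances  : {0}ms", sw.Elapsed.TotalMilliseconds);

        // iterate over them
        sw.Restart();
        sum = 0;
        for (int i = 0; i < Count; i++)
            sum += structs[i].X + structs[i].Y;
        Console.WriteLine("iterate over 10M value instances : {0}ms", sw.Elapsed.TotalMilliseconds);

        GC.KeepAlive(classes);
        GC.KeepAlive(structs);
    }
}
```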

Here is the result of running the tests above in release mode:


| | create 10M objects | iterate over 10M objects |
| --- | --- | --- |
| reference type | 872.425ms | 80.9892ms |
| value type | 83.1042ms | 27.4273ms |

As well as the additional memory overhead, it's also over 10x slower to create, and 3x slower to iterate over, an array of 10 million objects when they are defined as a reference type.

You might say "well, these are still only milliseconds!", but for a web server that needs to deal with a large number of concurrent requests, these margins are very significant!

In some areas of software, such as high frequency trading or arbitrage trading, even nanoseconds can make a big difference to your bottom line.

sidebar : people are doing some really funky stuff in those areas to gain a slight edge over each other, including:

  • building data centres right next to the exchange;
  • using microwave links to transmit data, so that you receive data faster than competitors connected by fiber optic cables;
  • using FPGAs to run your trading logic;
  • etc.

It's an area of finance that is genuinely quite interesting, and I have heard some fascinating tales from friends who are working in these areas.

 

Conclusions

Using value types sensibly is a very powerful way to improve the performance of your .Net application, and this is true for both C# and F#. Without going overboard, you should consider using structs if:

  • the objects are small and you need to create lots of them;
  • you need high memory density;
  • memory consumption is a constraint;
  • you need to iterate over large arrays of them frequently

In addition, here are some tips to help you "get value types right" (a full sketch follows the list):

  • make them immutable;
  • override Equals (the one that takes an object argument);
  • overload Equals to take another instance of the same value type (e.g. Equals(Point2D other));
  • overload the == and != operators;
  • override GetHashCode
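Putting these tips together, a well-behaved Point2D value type might look something like this (a sketch; implementing IEquatable<Point2D> is the idiomatic way to provide the strongly typed Equals):

```csharp
using System;

public struct Point2D : IEquatable<Point2D>
{
    // immutable: fields are readonly and can only be set in the constructor
    public readonly int X;
    public readonly int Y;

    public Point2D(int x, int y)
    {
        X = x;
        Y = y;
    }

    // strongly typed Equals: no boxing when comparing two Point2D values
    public bool Equals(Point2D other)
    {
        return X == other.X && Y == other.Y;
    }

    // override the object-based Equals for consistency
    public override bool Equals(object obj)
    {
        return obj is Point2D && Equals((Point2D)obj);
    }

    public static bool operator ==(Point2D left, Point2D right)
    {
        return left.Equals(right);
    }

    public static bool operator !=(Point2D left, Point2D right)
    {
        return !left.Equals(right);
    }

    public override int GetHashCode()
    {
        // combine the field hashes; 397 is just a commonly used odd prime multiplier
        unchecked
        {
            return (X * 397) ^ Y;
        }
    }
}
```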

In the next post, we'll look at some pitfalls with regard to the use of value types, and why we have the best practices above.

 

Scott Meyers did a great talk on CPU caches at last year's NDC Oslo and touched on other topics such as false sharing. It's a good starting point if you're new to how the CPU cache affects the performance of your code.

Martin Thompson (of Disruptor fame) has an excellent blog where he writes about many related topics.

Martin's posts use Java as the example language, but many of the lessons are directly applicable to .Net too.

 
