Slides for “F# at Gamesys”

I recently gave a talk at the Tokyo F# User Group, during my trip there, on the various use cases we have for F# at Gamesys Social.

The slides are available on SlideShare, and I'll share a link to the recording once it becomes available.

APL — solving Fizz Buzz

Note: see the rest of the series so far.

 

This is a classic interview question, and after some experimentation I ended up with the following solution in APL:

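The original snippet was embedded as a gist; reconstructed from the walkthrough below (and assuming Dyalog's default 1-based indexing, ⎕IO←1), it would have looked something along these lines:

    F←{(⍵,'Fizz' 'Buzz' 'FizzBuzz')[⍵⌈(3+⍴⍵)⌊((1+⍴⍵)×0=3|⍵)+(2+⍴⍵)×0=5|⍵]}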

F ⍳100

=> 1 2 Fizz 4 Buzz Fizz 7 8 Fizz Buzz 11 Fizz 13 14 FizzBuzz 16 17 Fizz 19 Buzz Fizz 22 23 Fizz Buzz 26 Fizz 28 29 FizzBuzz 31 32 Fizz 34 Buzz Fizz 37 38 Fizz Buzz 41 Fizz 43 44 FizzBuzz 46 47 Fizz 49 Buzz Fizz 52 53 Fizz Buzz 56 Fizz 58 59 FizzBuzz 61 62 Fizz 64 Buzz Fizz 67 68 Fizz Buzz 71 Fizz 73 74 FizzBuzz 76 77 Fizz 79 Buzz Fizz 82 83 Fizz Buzz 86 Fizz 88 89 FizzBuzz 91 92 Fizz 94 Buzz Fizz 97 98 Fizz Buzz

 

The answer looks correct, though this is perhaps not the most straightforward solution, so let's walk through what's happening here:

  • first, turn the input array ⍵ into booleans indicating whether each element is divisible by 3 (0=3|⍵) or 5 (0=5|⍵)

0=3|⍵ : 0 0 1 0 0 1 0 0 1 0 0 1 …

0=5|⍵ : 0 0 0 0 1 0 0 0 0 1 0 0 …

  • assuming that ⍵ has a length of 100, transform the boolean arrays from the previous step so that the 1s become 101s (for those divisible by 3) or 102s (for those divisible by 5)

(1+⍴⍵)×0=3|⍵ : 0 0 101 0 0 101 0 0 101 0 0 101…

(2+⍴⍵)×0=5|⍵ : 0 0 0 0 102 0 0 0 0 102 0 0 0…

  • add the two arrays together, so that elements divisible by 3 have the value 101, those divisible by 5 have 102, and those divisible by 15 have 203

0 0 101 0 102 101 0 0 101 102 0 101 0 0 203 0 0 101 …

  • apply minimum ⌊ to this array with 103 as the operand, so that numbers divisible by 15 now have the value 103 instead of 203

0 0 101 0 102 101 0 0 101 102 0 101 0 0 103 0 0 101 …

  • apply maximum ⌈ against the input array ⍵ to end up with the following

1 2 101 4 102 101 7 8 101 102 11 101 13 14 103 16 17 101

  • use this array to index into an array made up of the input array ⍵ with 'Fizz' 'Buzz' 'FizzBuzz' appended to the end, at index positions 101, 102 and 103 respectively
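If you don't speak APL, here is a rough F# rendering of the same trick (a hedged sketch; the function and all names are mine, not from the original post):

    // mirror the APL approach: build a lookup of "1".."n" followed by
    // "Fizz"/"Buzz"/"FizzBuzz", then compute an index per element
    let fizzBuzz n =
        let lookup =
            Array.append [| for x in 1 .. n -> string x |] [| "Fizz"; "Buzz"; "FizzBuzz" |]
        [| 1 .. n |]
        |> Array.map (fun x ->
            let i3 = if x % 3 = 0 then n + 1 else 0    // 101 when n = 100
            let i5 = if x % 5 = 0 then n + 2 else 0    // 102 when n = 100
            let idx = max x (min (n + 3) (i3 + i5))    // the ⌊ and ⌈ steps; 103 caps FizzBuzz
            lookup.[idx - 1])                          // APL indexing is 1-based

fizzBuzz 100 produces the same sequence as the output above.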

 

APL — solving Euler problem 1

Note: see the rest of the series so far.

 

Surprisingly, my last post on APL did well on HackerNews; it turns out there is genuine interest in APL out there, which is great to see!

I have been busy with a number of conferences and talks since then, so I haven't had time to revisit APL. Over the next couple of posts in this series (or rather, what I hope will amount to a series) let's try to solve some simple problems with APL. Mind you, my knowledge of APL is pretty shallow, so what I end up with is likely far from idiomatic APL code - feel free to offer pointers/feedback in the comments section below!

 

To start us off nice and easy, let's look at something really simple, like Euler Problem 1.

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Super easy, right? I knew how to solve this with a one-liner even when I was a total beginner in F#.

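The original snippet was embedded as an image; it was presumably something along these lines:

    let f n = [1 .. n] |> List.filter (fun x -> x % 3 = 0 || x % 5 = 0) |> List.sum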

which roughly translates to the following in APL:

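(again the snippet was an image; based on the steps below, it would have read roughly:)

    F←{+/((0=3|⍵)∨0=5|⍵)/⍵}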

F ⍳9 => 23

Looks great, and running F ⍳999 gives the right answer (I won't spoil it for you, but feel free to run this in TryAPL at your own discretion!). So here's what we did (p.s. | is the remainder operator):

  1. find the elements of the input ⍵ that are divisible by 3 ((3|⍵)=0) or 5 ((5|⍵)=0)
  2. which gives us two arrays of booleans, which we then OR ∨ together
  3. then we use compression / to filter the input array ⍵, returning only the elements that are divisible by 3 or 5
  4. and finally reduce over these elements with addition

p.s. OK, it's not exactly a faithful translation of the F# code above; the equivalent F# code should probably use reduce as well:

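Again a reconstruction of the embedded snippet: the same filter, but summing via reduce to mirror the APL +/:

    let f n = [1 .. n] |> List.filter (fun x -> x % 3 = 0 || x % 5 = 0) |> List.reduce (+)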

UPDATE 24/07/2015 : much to my pleasant surprise this piqued Tomas's interest and he even did a quick AplMode to show how you can mimic the APL code above in F# and solve the problem in 19 characters.

To be fair, I'm writing APL whilst thinking like an F#er, so I'm not doing it justice, but Stefano shared an even shorter APL solution (17 characters):
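(this snippet was an image too; based on the walkthrough below, the 17 characters would have been something like:)

    {+/⍵×∨⌿0=3 5∘.|⍵}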

This took me a little while to understand, but fortunately Dyalog APL's IDE is really quite easy to use, so after some experimentation here's how Stefano's solution works:

(first, you need to understand that in APL you always read from RIGHT-TO-LEFT, unless parentheses ( ) dictate otherwise)

1) generate an outer product (∘.) of remainders (using the residue | operator) between:

  • the array of 3 and 5; and
  • the input array ⍵

this generates (for 1 to 9) the matrix:

1  2  0  1  2  0  1  2 0

1  2  3  4  0  1  2  3 4

2) transform the matrix into booleans by comparing the elements to 0 (0 = …)

0  0  1  0  0  1  0  0 1

0  0  0  0  1  0  0  0 0

3) now we reduce the matrix using the logical OR operator ∨⌿ (the ⌿ symbol is called reduce first, and is the same as reduce / but works along the first dimension of the array):


0  0  1  0  1  1  0  0 1

4) since this array has the same length as our input, we can just multiply the two together!

0  0  3  0  5  6  0  0 9

5) and finally reduce over this array using addition, and voila!
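To make the steps above concrete, here is a hedged F# rendering of Stefano's outer-product approach (names and structure are mine):

    let f n =
        let input = [| 1 .. n |]
        [| 3; 5 |]
        |> Array.map (fun d -> input |> Array.map (fun x -> x % d = 0))  // steps 1-2: outer product of divisibility
        |> Array.reduce (Array.map2 (||))                                // step 3: OR-reduce along the first axis
        |> Array.map2 (fun x b -> if b then x else 0) input              // step 4: multiply with the input
        |> Array.sum                                                     // step 5: reduce with addition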

Beware of implicit boxing of value types

In the last post, we looked at some inefficiencies with reference types in .Net and perhaps oversold value types a little. In any case, now that we've made the initial sale and you're back for more, let's talk about some pitfalls wrt the use of value types you should be aware of. Specifically, let's focus on cases where the CLR will cause implicit boxing of your value types.

 

We all know that when we cast a value type to object, we cause boxing; for instance, when we need to shove an int into an object[] or an ArrayList.
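For example, a minimal illustration of the explicit case (names are mine):

    open System.Collections

    // Add takes an object, so the int is boxed into a new heap object first
    let list = ArrayList()
    list.Add(box 42) |> ignore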

This is not great, but at least we're doing this consciously and have had the chance to make a decision about it. However, there are a number of situations where the CLR will emit a box IL instruction for us implicitly, without us realizing it. These are far worse.

 

When you invoke a virtual method

Value types inherit from System.ValueType, which itself inherits from System.Object. Amongst other things, System.ValueType provides an override for Equals that gives value types their default compare-by-value behaviour.

However, a value type is stored in memory without the Method Table Pointer (see the last post), so in order to dispatch a virtual method call it first has to be boxed into a reference type.

There is an exception to this though, as the CLR is able to call Equals directly if the value type overrides the Equals method (which is why it's a best practice to do so).

Aside from benefiting from the CLR's short-circuiting behaviour above, another good reason to override Equals in your custom value type is that the default implementation uses reflection to fetch the fields to compare.

Furthermore, it then uses UnsafeGetValue to get the value of each field on both the current object and the object being compared to, both of which cause further boxing, as FieldInfo.UnsafeGetValue takes an object as its argument.

But wait, there’s more…

Even if you override Equals(object other), it'll still cause boxing of the input argument. To get around this you'll need to overload Equals to take another Point2D instance instead, i.e. Equals(Point2D other), which is of course another recommended best practice from the previous post.

Given these three versions of Point2D:

  • V1 = plain old struct
  • V2 = overrides Equals(object other)
  • V3 = overloads Equals(Point2DV3 other)
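The post defines these types in C# (see the sidebar below for why); a hedged F# rendering of the V2/V3 ideas might look like this:

    // illustrative only: the post's V2/V3 leave GetHashCode alone, it's included
    // here just to keep the F# compiler from warning
    [<Struct; CustomEquality; NoComparison>]
    type Point2DV3 =
        val X : int
        val Y : int
        new (x, y) = { X = x; Y = y }
        // V2-style: the CLR can now call Equals directly instead of boxing for a
        // virtual dispatch, but the argument is still boxed by the caller
        override this.Equals(other : obj) =
            match other with
            | :? Point2DV3 as p -> this.X = p.X && this.Y = p.Y
            | _ -> false
        override this.GetHashCode() = (this.X <<< 16) ^^^ this.Y
        // V3-style: a strongly typed overload avoids boxing altogether
        member this.Equals(other : Point2DV3) =
            this.X = other.X && this.Y = other.Y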

We can see how they fare when we iterate through 10M instances of each, calling Equals each time.

A couple of things to note from the results:

  • V1 causes twice as much boxing as V2 (which makes sense given the short-circuiting behaviour)
  • V1 also takes nearly four times as long to execute compared to V2
  • V3 causes no boxing, and runs in nearly half the time of V2!

sidebar : I chose to run the test in F# because it's easy to get quick insights (real and CPU time, plus GC counts) when you enable the #time directive. However, I had to define the types in C#, as by default F# compiles struct types with all the optimizations we have talked about, which means:

a. it's a good reason to use F#; but

b. it defeats the purpose of running these tests!

 

Unfortunately, there is still one more edge case…

 

When you call List<T>.Contains

When you call List<T>.Contains, an instance of EqualityComparer<T> will be used to compare the argument against every element of the list.

This eventually causes a new EqualityComparer<T> to be created.

In the default case (where Point2D doesn't implement the IEquatable<T> interface), the ObjectEqualityComparer<T> will be returned, and it is here that the overridden/inherited Equals(object other) method will be used, causing boxing to occur on every comparison!

If, on the other hand, Point2D implements the IEquatable<T> interface, then the outcome is very different: this allows some clever logic to kick in and use the overloaded Equals(Point2D other) instead.

 

So now, let's introduce V4 of Point2D that implements the IEquatable<T> interface and see how it compares to

  • V2 = overridden Equals(Object other); and
  • V3 = overloaded Equals(Point2DV3 other)

when used in a List<T>.
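A hedged sketch of what V4 adds (again, the post's actual type is defined in C#):

    // same as V3, plus IEquatable<'T>, which is what EqualityComparer<'T>.Default
    // looks for in order to skip the Equals(obj) path
    [<Struct; CustomEquality; NoComparison>]
    type Point2DV4 =
        val X : int
        val Y : int
        new (x, y) = { X = x; Y = y }
        member this.Equals(other : Point2DV4) = this.X = other.X && this.Y = other.Y
        override this.Equals(other : obj) =
            match other with
            | :? Point2DV4 as p -> this.Equals(p)
            | _ -> false
        // the post's V4 still relies on the default GetHashCode; this override is
        // only here to satisfy the F# compiler
        override this.GetHashCode() = (this.X <<< 16) ^^^ this.Y
        interface System.IEquatable<Point2DV4> with
            member this.Equals(other) = this.Equals(other)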

For V2, List<T>.Contains performs the same as our hand-coded version, but the improvement we made with V3 is lost due to the reasons outlined above.

V4 rectifies this by allowing the optimization in List<T> to kick in.

 

Which brings us to the next implicit boxing…

 

When you invoke an interface method

Like virtual methods, in order to dispatch an interface method you also need the Method Table Pointer, which means boxing is required.

Fortunately, the CLR is able to short-circuit this by calling the method directly if the compile-time type is resolved to the actual value type (e.g. Point2D) rather than the interface type.

 

For this test, we’ll try:

  • invoke Equals(Point2DV4 other) via the IEquatable<Point2DV4> interface; vs
  • invoke Equals(Point2DV4 other) directly on Point2DV4
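A minimal sketch of the two call shapes, using the V4 struct sketched above (variable names are mine):

    let a = Point2DV4(1, 2)
    let b = Point2DV4(1, 2)

    // compile-time type is the struct itself: the CLR calls Equals directly, no boxing
    let direct = a.Equals(b)

    // the upcast boxes a so the call can be dispatched through the method table
    let viaInterface = (a :> System.IEquatable<Point2DV4>).Equals(b)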

 

As you can see, invoking the interface method Equals(Point2DV4 other) does indeed incur boxing, once for each instance of Point2D.

 

When Dictionary<TKey, TValue> invokes GetHashCode

GetHashCode is used by hash-based collection types, the most common being Dictionary<TKey, TValue> and HashSet<T>.

In both cases, it's invoked through the IEqualityComparer<T> type we talked about earlier, and in both cases the comparer is initialized through EqualityComparer<T>.Default and the CreateComparer method.

GetHashCode is invoked in many places within Dictionary<TKey, TValue> - on Add, ContainsKey, Remove, etc.

For this test let’s find out the effects of:

  • implementing the IEquatable<T> interface, in terms of boxing
  • overriding GetHashCode (given that the default implementation requires the use of reflection)

But first, let's create V5 of our Point2D struct, this time with an overridden GetHashCode implementation (albeit a bad one, which is OK here since we only want to see the performance implication of having one).

In this test, we have:

  • V3 = no GetHashCode override, does not implement IEquatable<T>
  • V4 = no GetHashCode override, implements IEquatable<T>
  • V5 = has GetHashCode override, implements IEquatable<T>
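Roughly the shape of the test, per the description below (a hedged sketch):

    open System.Collections.Generic

    // every Add goes through EqualityComparer<'T>.Default, calling GetHashCode for
    // each key and Equals whenever hash codes collide
    let dict = Dictionary<Point2DV4, int>()
    for i in 1 .. 10000 do
        dict.Add(Point2DV4(i, i), i)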

I used a much smaller sample size here (10K instead of 10M) because the amount of time it took to add even 10K items to a dictionary was sufficient to illustrate the difference. Even at this small sample size the difference is very noticeable; a couple of things to note:

  • since V3 doesn't implement IEquatable<T> it incurs boxing, and a lot of it, because Equals is also called through the IEqualityComparer<T>
  • V4 eliminates the need for boxing, but is still pretty slow due to the default GetHashCode implementation
  • V5 addresses both of these problems!

sidebar : with GetHashCode there are also considerations around what makes a good hash code, e.g. a low chance of collision, being evenly distributed, etc., but that's another discussion for another day. Most of the time I just let tools like ReSharper work out the implementation based on the fields my value type has.

 

Conclusions

As we have seen, there are a number of places where implicit boxing can happen, and these boxings can make a real difference to your performance. So, to reiterate what was already said in the previous post, here are some best practices for using value types:

  • make them immutable;
  • override Equals (the one that takes an object as argument);
  • overload Equals to take another instance of the same value type (e.g. Equals(Point2D other));
  • overload operators == and !=;
  • override GetHashCode

 


Smallest .Net ref type is 12 bytes (or why you should consider using value types)

(Update 2015/07/21 : read the next post in this series to learn about the places where implicit boxing happens with value types and how you can prevent them)

 

I chanced upon Sasha Goldshtein's excellent book, Pro .Net Performance: Optimize Your C# Application, a few years back and I thoroughly enjoyed it.


Even though it has been almost 3 years since I read it, and I have already forgotten many of the things I learnt, I still remember this line very clearly:

…in fact, even a class with no instance fields will occupy 12 bytes when instantiated…

This is, of course, limited to a 32-bit system. On a 64-bit system, the smallest reference type instance will take up 24 bytes of memory!

 

How?!?

The reason for this lies in the way .Net objects are laid out in memory, where you have:

  • 4 bytes for the Object Header Word;
  • 4 bytes for the Method Table Pointer; and
  • a minimum of 4 bytes for instance fields

When your class has only a byte field it'll still take up 4 bytes of space for instance fields, because objects have to align with 4-byte multiples. Even when your class has no instance fields, it'll still take up 4 bytes. Therefore, on a 32-bit system, the smallest reference type instance will be 12 bytes.

On a 64-bit system, a word is 8 bytes instead of 4, and objects are aligned to the nearest 8-byte multiple, so the smallest reference type instance in this case will be 24 bytes.
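If you want a rough, back-of-the-envelope check of these numbers, something like this F# interactive snippet would do (a hedged sketch; GC noise makes it approximate):

    type Empty() = class end

    let before = System.GC.GetTotalMemory true
    let xs = Array.init 1000000 (fun _ -> Empty())
    let after = System.GC.GetTotalMemory true

    // expect roughly 12 + 4 bytes per instance on 32-bit, 24 + 8 on 64-bit,
    // since the delta also includes the reference the array holds per element
    printfn "~%d bytes per instance" ((after - before) / int64 xs.Length)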

sidebar : the Object Header Word and Method Table Pointer are used by the JIT and CLR. The book goes into a lot of detail on the structure and purpose of these, which for this blog post we'll ignore. If you're interested in learning more about them, go buy the book; it'll be money well spent.

 

Reference vs Value Type

There is plenty to talk about when it comes to reference vs value types, including:

  • stack vs heap allocation;
  • default equality semantics, i.e. compare-by-ref vs compare-by-value;
  • pass-by-value vs pass-by-ref;

For the purpose of this post, let's focus on how they differ in terms of memory consumption and cache friendliness when you have a large array of each.

Memory Consumption

Suppose you have a Point2D class with only X and Y integer fields; each instance of this type will occupy a total of 16 bytes in the heap (including 8 bytes for X and Y). Taking into account the 4-byte reference pointer (again, on a 32-bit system), that brings our total investment per instance of the Point2D type to 20 bytes!

If you have an array with 10M instances of Point2D, then you'll have committed 190MB of memory for this array (10M × 20 bytes = 200,000,000 bytes ≈ 190MB)!

On the other hand, if Point2D was a value type then each instance would take up only 8 bytes for the X and Y values, and the instances would be tightly packed into the array without each needing an additional 4 bytes for a reference pointer. Overall, the amount of memory you're committing to this array drops to 76MB.

Cache Friendliness

Whilst there are no inherent differences in the speed of accessing stack vs heap allocated memory (they are just different ranges of addresses in the virtual memory, after all), there are a number of performance-related considerations.

Stack allocated memory does not incur GC overhead

The stack is self-managed, in the sense that when you leave a scope you just move the stack pointer back to its previous position, and you have "deallocated" the previously allocated memory.

With heap allocated memory you incur the overhead of a rather sophisticated generational GC, which as part of a collection needs to move surviving objects around to compact the memory space (which becomes more and more expensive as you move from gen 0 to 2).

sidebar : I remember reading a post a few years ago that discussed how the StackOverflow guys went through their entire codebase and converted as many classes to structs as they could, in order to reduce the number of latency spikes on their servers due to GC pauses. Whilst I'm not advocating that you do the same, it illustrates that GC pauses are a common problem on web servers. The Background GC mode introduced in recent .Net releases will have reduced the frequency of these pauses, but sensible use of value types would still help in this regard.

On the stack, temporal locality implies spatial locality and temporal access locality

This means that objects allocated close together in time are also stored close together in space. Furthermore, objects allocated close together in time (e.g. inside the same method) are also likely to be used close together in time.

Both spatial and temporal access locality play nicely with how the cache works (i.e. fewer cache misses) and how OS paging works (i.e. fewer virtual memory swaps).

On the stack, memory density is higher

Since value types don't have the overheads that come with reference types (the Object Header Word and Method Table Pointer), you're able to pack more objects into the same amount of memory.

Higher memory density leads to better performance because it means fewer fetches from memory.

On modern hardware, the entire thread stack can fit into the CPU cache

When you spawn a new thread, it's allocated 1MB of stack space, whereas CPUs nowadays come with a much bigger L3 cache (e.g. Nehalem has up to 24MB of L3 cache), so the entire stack can fit into the L3 cache, which is a lot faster to access than main memory (have a look at this talk from Gael Fraiteur, creator of PostSharp, to see just how much access times differ).

 

For better or worse, modern hardware is built to take advantage of both spatial and temporal locality, and these optimizations manifest themselves in how data is fetched from main memory in cache lines.

Consider what happens at a hardware level when you iterate through an array of 10 million Point2D objects. If Point2D was a reference type, then each time you iterate over an instance of it in the array:

  • the CPU will fetch a cache line (typically 64 bytes) starting from where the array element is
  • it'll access the memory to get the reference pointer (4 bytes) for the object in the heap
  • it'll discard the rest of the cache line
  • it'll fetch another cache line (again, 64 bytes) starting from where the object is
  • it'll read the object and do whatever it is you asked it to do
  • and then it'll do the same for the next element in the array, and so on…

Notice how much work goes to waste here due to reference jumping?

On the other hand, if Point2D was a value type with only X and Y values (8 bytes):

  • the CPU will fetch a cache line (64 bytes) starting from where the array element is
  • it'll read the object (8 bytes) and do its work
  • it'll try to get the next object, realizing that it's already in the cache (what a lucky day!)
  • it'll read the next object (8 bytes) from the existing cache line, and so on

The CPU is able to read 8 objects from one fetch, vs. 2 fetches per object!

 

To see how this cache friendliness translates into performance numbers, we can easily measure one thing: execution time. Let's consider this simple example:
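The test itself was embedded as an image; it was presumably something along these lines (type and function names are mine):

    type PointClass(x : int, y : int) =
        member __.X = x
        member __.Y = y

    [<Struct>]
    type PointStruct(x : int, y : int) =
        member __.X = x
        member __.Y = y

    // run these in F# interactive with #time enabled to get elapsed time and GC counts
    let classes = Array.init 10000000 (fun i -> PointClass(i, i))
    let structs = Array.init 10000000 (fun i -> PointStruct(i, i))

    let sumClasses () = classes |> Array.sumBy (fun p -> int64 p.X)
    let sumStructs () = structs |> Array.sumBy (fun p -> int64 p.X)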

Here are the results of running the tests above in release mode, in tabular format:

                   create 10M objects    iterate over 10M objects
reference type     872.425ms             80.9892ms
value type         83.1042ms             27.4273ms

As well as the additional memory overhead, it's also over 10x slower to create, and 3x slower to iterate over, an array of 10 million objects when they are defined as a reference type.

You might say "well, these are still only milliseconds!", but for a web server that needs to deal with a large number of concurrent requests, these margins are very significant!

In some areas of software, such as high frequency trading or arbitrage trading, even nanoseconds can make a big difference to your bottom line.

sidebar : people are doing some really funky stuff in those areas to gain a slight edge over each other, including:

  • building data centres right next to the exchange;
  • using microwave links to transmit data, so that you receive data faster than competitors connected by fiber optic cables;
  • using FPGAs to run your trading logic;
  • etc.

It's an area of finance that is genuinely quite interesting, and I have heard some fascinating tales from friends who are working in these areas.

 

Conclusions

Using value types sensibly is a very powerful way to improve the performance of your .Net applications; this is true for both C# and F#. Without going overboard, you should consider using structs if:

  • the objects are small and you need to create lots of them;
  • you need high memory density;
  • memory consumption is a constraint;
  • you need to iterate over large arrays of them frequently

In addition, here are some tips to help you "get value types right":

  • make them immutable;
  • override Equals (the one that takes an object as argument);
  • overload Equals to take another instance of the same value type (e.g. Equals(Point2D other));
  • overload operators == and !=;
  • override GetHashCode

In the next post we'll look at some pitfalls wrt the use of value types, and why we have the best practices above.

 

Scott Meyers did a great talk on CPU caches at last year's NDC Oslo, which also touched on other topics such as false sharing. It's a good starting point if you're new to how the CPU cache affects the performance of your code.

Martin Thompson (of Disruptor fame) has an excellent blog where he writes about many related topics.

Martin's posts use Java as the example language, but many of the lessons are directly applicable to .Net too.

 
