I recently gave a talk at the Tokyo F# User Group, during my trip there, on the various use cases we have for F# at Gamesys Social.
The slides are available on Slideshare and I’ll share links to the recording once it becomes available.
Note: see the rest of the series so far.
This is a classic interview question, and after some experimentation I ended up with the following solution in APL:
=> 1 2 Fizz 4 Buzz Fizz 7 8 Fizz Buzz 11 Fizz 13 14 FizzBuzz 16 17 Fizz 19 Buzz Fizz 22 23 Fizz Buzz 26 Fizz 28 29 FizzBuzz 31 32 Fizz 34 Buzz Fizz 37 38 Fizz Buzz 41 Fizz 43 44 FizzBuzz 46 47 Fizz 49 Buzz Fizz 52 53 Fizz Buzz 56 Fizz 58 59 FizzBuzz 61 62 Fizz 64 Buzz Fizz 67 68 Fizz Buzz 71 Fizz 73 74 FizzBuzz 76 77 Fizz 79 Buzz Fizz 82 83 Fizz Buzz 86 Fizz 88 89 FizzBuzz 91 92 Fizz 94 Buzz Fizz 97 98 Fizz Buzz
The answer looks correct, though this is perhaps not the most straightforward solution, so let’s walk through what’s happening here:
Here, 101, 102 and 103 serve as numeric codes for “Fizz”, “Buzz” and “FizzBuzz” respectively:

1) a boolean mask marking the multiples of 3: 0 0 1 0 0 1 0 0 1 0 0 1 …
2) a boolean mask marking the multiples of 5: 0 0 0 0 1 0 0 0 0 1 0 0 …
3) the first mask scaled by 101: 0 0 101 0 0 101 0 0 101 0 0 101…
4) the second mask scaled by 102: 0 0 0 0 102 0 0 0 0 102 0 0 0…
5) the two added together, so that multiples of 15 become 203: 0 0 101 0 102 101 0 0 101 102 0 101 0 0 203 0 0 101 …
6) 203 re-encoded as 103: 0 0 101 0 102 101 0 0 101 102 0 101 0 0 103 0 0 101 …
7) each remaining 0 replaced by its index, ready for the codes to be swapped for their words: 1 2 101 4 102 101 7 8 101 102 11 101 13 14 103 16 17 101 …
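The same vector-based approach can be sketched in Python. This is an illustration of my reading of the walkthrough above, not a translation of the APL; the helper names are mine, and the 101/102/103 encoding values come straight from the printed vectors:

```python
def fizzbuzz(n=100):
    xs = range(1, n + 1)
    fizz = [x % 3 == 0 for x in xs]           # multiples of 3: 0 0 1 0 0 1 ...
    buzz = [x % 5 == 0 for x in xs]           # multiples of 5: 0 0 0 0 1 ...
    # encode Fizz as 101 and Buzz as 102; their sum, 203, marks FizzBuzz,
    # which we re-encode as 103 so every special case has a single code
    codes = [101 * f + 102 * b for f, b in zip(fizz, buzz)]
    codes = [103 if c == 203 else c for c in codes]
    # wherever the code is 0, fall back to the number itself
    merged = [x if c == 0 else c for x, c in zip(xs, codes)]
    words = {101: "Fizz", 102: "Buzz", 103: "FizzBuzz"}
    return [words.get(v, str(v)) for v in merged]

print(" ".join(fizzbuzz(15)))
# → 1 2 Fizz 4 Buzz Fizz 7 8 Fizz Buzz 11 Fizz 13 14 FizzBuzz
```

The point of the detour through 101/102/103 is that every step stays a whole-array operation, which is the natural way to express the problem in APL.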
I have been busy with a number of conferences and talks since then, so I haven’t had time to revisit APL. Over the next couple of posts in this series (or rather, what I hope will amount to a series) let’s try to solve some simple problems with APL. Mind you, my knowledge of APL is pretty shallow, so what I end up with is likely far from idiomatic APL code — feel free to offer pointers/feedback in the comments section below!
To start us off nice and easy, let’s look at something really simple, like Euler Problem 1.
If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.
Super easy, right? I knew how to solve this with a one-liner even when I was a total beginner in F#.
which roughly translates to the following in APL:
Looks great, and running it gives the right answer (I won’t spoil the answer for you, but feel free to run this in TryAPL at your own discretion!). So here’s what we did (p.s. | is the remainder operator):
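For comparison, the same idea can be sketched in Python (purely illustrative; the function name is mine): keep the numbers below n that are divisible by 3 or 5, then sum-reduce. Running it on the n = 10 example from the problem statement checks out without spoiling the real answer:

```python
def euler1(n):
    # keep the numbers below n divisible by 3 or 5, then sum-reduce
    return sum(x for x in range(1, n) if x % 3 == 0 or x % 5 == 0)

print(euler1(10))   # → 23, matching the worked example in the problem statement
```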
p.s. OK, it’s not exactly a faithful translation of the F# code above, the equivalent F# code should probably use reduce as well:
UPDATE 24/07/2015: much to my pleasant surprise this piqued Tomas’s interest, and he even put together a quick AplMode to show how you can mimic the APL code above in F# and solve the problem in 19 characters.
To be fair, I’m writing APL whilst thinking like an F#er, so I’m not doing it justice — Stefano shared an even shorter APL solution (17 characters):
This took me a little while to understand, but fortunately Dyalog APL’s IDE is really quite easy to use, so after some experimentation here’s how Stefano’s solution works:
(first you need to understand that in APL, you always read from RIGHT-TO-LEFT, unless parentheses ( ) dictate otherwise)
1) generate an outer product of remainders (using the residue | operator):
this generates (for 1 to 9) the matrix:
1 2 0 1 2 0 1 2 0
1 2 3 4 0 1 2 3 4
2) transform the matrix into booleans by comparing each element to 0 (0 = …)
0 0 1 0 0 1 0 0 1
0 0 0 0 1 0 0 0 0
3) so now we reduce the matrix using the logical OR operator (the special symbol you see here is called reduce-first, and is the same as reduce / but applied along the first dimension of the array)
0 0 1 0 1 1 0 0 1
4) since this array has the same length as our input, we can just multiply the two together!
0 0 3 0 5 6 0 0 9
5) and finally reduce over this array using addition, and voilà!
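The five steps above can be mocked up in Python for n = 9 (purely illustrative; Python reads left-to-right, so the steps appear in the order they were described rather than right-to-left as in the APL):

```python
n = 9
xs = list(range(1, n + 1))

# 1) outer product of residues: one row per divisor
residues = [[x % d for x in xs] for d in (3, 5)]

# 2) compare to 0 to get a boolean matrix
is_multiple = [[r == 0 for r in row] for row in residues]

# 3) OR reduce-first, i.e. reduce along the first dimension (down the columns)
mask = [any(col) for col in zip(*is_multiple)]    # 0 0 1 0 1 1 0 0 1

# 4) multiply the mask with the original 1..n vector
kept = [x * m for x, m in zip(xs, mask)]          # 0 0 3 0 5 6 0 0 9

# 5) sum-reduce
print(sum(kept))   # → 23
```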
In the last post, we looked at some inefficiencies with reference types in .Net, and perhaps oversold value types a little. In any case, now that we’ve made the initial sale and you’re back for more, let’s talk about some pitfalls in the use of value types that you should be aware of. Specifically, let’s focus on cases where the CLR causes implicit boxing of your value types.
We all know that when we cast a value type to object, we cause boxing. For instance, if we need to shove an int into an object or an ArrayList.
This is not great, but at least we’re doing this consciously and have had the chance to make a decision about it. However, there are a number of situations where the CLR will emit a box IL instruction for us implicitly without us realizing. These are far worse.
Value types inherit from System.ValueType, which itself inherits from System.Object. Amongst other things, System.ValueType provides an override for Equals that gives value types the default compare-by-value behaviour.
However, a value type is stored in memory without the Method Table Pointer (see the last post), so in order to dispatch a virtual method call it first has to be boxed into a reference type.
There is an exception to this though, as the CLR is able to call Equals directly if the value type overrides the Equals method (which is why it’s a best practice to do so).
Aside from benefiting from CLR’s short-circuit behaviour above, another good reason to override Equals in your custom value type is because the default implementation uses reflection to fetch its fields to compare.
Furthermore, it then uses UnsafeGetValue to get the value of the field on both the current object and the object being compared to, both of which cause further boxing as FieldInfo.UnsafeGetValue takes an object as argument.
Even if you override Equals(object other), it’ll still cause boxing of the input argument. To get around this you’ll need to overload Equals to take another Point2D instance instead, i.e. Equals(Point2D other), which is of course another recommended best practice from the previous post.
Given these three versions of Point2D:
We can see how they fared when we iterate through 10M instances of each and calling Equals each time.
Couple of things to note from the above:
sidebar : I chose to run the test in F# because it’s easy to get quick insight (real and CPU time + GC counts) when you enable the #time directive. However, I had to define the types in C# as by default F# compiles struct types with all the optimizations we have talked about - which means:
a. it’s a good reason to use F#; but
b. defeats the purpose of running these tests!
Unfortunately, there is still one more edge case…
List&lt;T&gt;.Contains, for instance, eventually causes a new EqualityComparer&lt;T&gt; to be created.
In the default case (where Point2D doesn’t implement the IEquatable&lt;T&gt; interface), an ObjectEqualityComparer&lt;T&gt; will be returned. And it is here that the overridden/inherited Equals(object other) method will be used, causing boxing to occur on every comparison!
If, on the other hand, Point2D implements the IEquatable<T> interface then the outcome will be very different. This allows some clever logic to kick in and use the overloaded Equals(Point2D other) instead.
So now, let’s introduce V4 of Point2D, which implements the IEquatable&lt;T&gt; interface, and see how it compares to the earlier versions when used in a List&lt;T&gt;.
For V2, List&lt;T&gt;.Contains performs the same as our hand-coded version, but the improvements we made with V3 are lost due to the reasons outlined above.
V4 rectified this by allowing the optimization in List<T> to kick in.
Which brings us to the next implicit boxing…
Like virtual methods, in order to dispatch an interface method you also need the Method Table Pointer, which means boxing is required.
Fortunately, the CLR is able to short-circuit this by calling the method directly if the compile-time type is resolved to the actual value type (e.g. Point2D) rather than the interface type.
For this test, we’ll try:
As you can see, invoking the interface method Equals(Point2DV4 other) does indeed incur boxing once for each instance of Point2D.
In both cases, it’s invoked through the IEqualityComparer&lt;T&gt; type we talked about earlier, and in both cases the comparer is initialized through EqualityComparer&lt;T&gt;.Default and the CreateComparer method.
For this test let’s find out the effects of:
but first, let’s create V5 of our Point2D struct, this time with an overridden GetHashCode implementation (albeit a bad one, which is OK here since we only want to see the performance implication of having one).
In this test, we have:
I used a much smaller sample size here (10K instead of 10M) because the time it took to add even 10K items to a dictionary was sufficient to illustrate the difference. Even at this small sample size the difference is very noticeable; a couple of things to note:
sidebar : with GetHashCode there are also considerations around what makes a good hash code, e.g. low chance of collision, evenly distributed, etc. but that’s another discussion for another day. Most times I just let tools like Resharper work out the implementation based on the fields my value type has.
As we have seen, there are a number of places where implicit boxing can happen, and it can make a real difference to your performance. So, to reiterate what was already said in the previous post, here are some best practices for using value types:
(Update 2015/07/21 : read the next post in this series to learn about the places where implicit boxing happens with value types and how you can prevent them)
I chanced upon Sasha Goldshtein’s excellent book, Pro .Net Performance: Optimize Your C# Application, a few years back and I thoroughly enjoyed it.
Even though it has been almost 3 years since I read it, and I have already forgotten many of the things I learnt, I still remember this line very clearly:
…in fact, even a class with no instance fields will occupy 12 bytes when instantiated…
This is, of course, on a 32-bit system. On a 64-bit system, the smallest reference type instance will take up 24 bytes of memory!
The reason for this is the way .Net objects are laid out in memory, where you have:
When your class has only a byte field it’ll still take up 4 bytes of space for instance fields because objects have to align with 4-byte multiples. Even when your class has no instance fields, it’ll still take up 4 bytes. Therefore, on a 32-bit system, the smallest reference type instance will be 12 bytes.
On a 64-bit system, a word is 8 bytes instead of 4, and objects are aligned to the nearest 8-byte multiple, so the smallest reference type instance in this case will be 24 bytes.
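As a quick sanity check, the arithmetic behind those two numbers can be written out (assuming one word each for the Object Header Word, the Method Table Pointer, and the minimum field area, as described above):

```python
def min_instance_size(word):
    """Smallest possible reference type instance, given the word size in bytes."""
    header = word        # Object Header Word
    mtp = word           # Method Table Pointer
    min_fields = word    # fields area is padded to at least one word,
                         # even for a class with no instance fields
    return header + mtp + min_fields

print(min_instance_size(4))   # → 12 bytes on a 32-bit system
print(min_instance_size(8))   # → 24 bytes on a 64-bit system
```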
sidebar : The Object Header Word and Method Table Pointer are used by the JIT and CLR. The book goes into a lot of detail on the structure and purpose of these, which for this blog post we’ll ignore. If you’re interested in learning more about them, go buy the book, it’ll be money well spent.
There is plenty to talk about when it comes to reference vs value types, including:
For the purpose of this post let’s focus on how they differ in terms of memory consumption and cache friendliness when you have a large array of each.
Suppose you have a Point2D class with only X and Y integer fields; each instance of this type will occupy a total of 16 bytes (including 8 bytes for X and Y) on the heap. Taking into account the 4-byte reference pointer (again, on a 32-bit system), that brings our total investment per instance of the Point2D type to 20 bytes!
If you have an array with 10M instances of Point2D then you’ll have committed 190MB of memory for this array!
On the other hand, if Point2D was a value type then each instance would only take up 8 bytes for the X and Y values, and they’ll be tightly packed into the array without each needing an additional 4 bytes for reference pointer. Overall the amount of memory you’re committing to this array would drop to 76MB.
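The figures quoted above follow from some quick arithmetic (MB here being mebibytes, 2^20 bytes):

```python
count = 10_000_000
ref_bytes = count * (16 + 4)   # 16-byte heap object + 4-byte reference (32-bit)
val_bytes = count * 8          # just X and Y, tightly packed in the array

print(round(ref_bytes / 2**20))   # → 191, i.e. the ~190MB quoted above
print(round(val_bytes / 2**20))   # → 76, i.e. the 76MB quoted above
```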
Whilst there are no inherent differences in the speed of accessing stack vs heap allocated memory (they are just different ranges of addresses in the virtual memory after all) there are a number of performance-related considerations.
Stack allocated memory does not incur GC overhead
The stack is self-managed in the sense that when you leave a scope you just move the pointer back to the previous position and you’d have “deallocated” the previously allocated memory.
With heap allocated memory you incur the overhead of a rather sophisticated generational GC, which as part of a collection needs to move surviving objects around to compact the memory space (which becomes more and more expensive as you move from gen 0 to 2).
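A toy model of the “just move the pointer back” behaviour described above (purely illustrative; a real stack is managed by the CPU and the calling convention, not by a class like this):

```python
class StackModel:
    """Toy model: the stack deallocates memory by moving one pointer."""
    def __init__(self):
        self.sp = 0                # stack pointer

    def enter_scope(self):
        return self.sp             # remember where this frame began

    def alloc(self, nbytes):
        self.sp += nbytes          # "allocate" locals by bumping the pointer

    def leave_scope(self, base):
        self.sp = base             # one assignment frees the whole frame

s = StackModel()
frame = s.enter_scope()
s.alloc(16)    # e.g. two Point2D-sized locals
s.alloc(8)     # one more
s.leave_scope(frame)
print(s.sp)    # → 0: no GC, no free list, just a pointer move
```

Contrast this with the generational GC: no object graph needs to be traced and nothing has to be moved to compact the space.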
sidebar : I remember reading a post a few years ago that discussed how the StackOverflow guys went through their entire codebase and converted as many classes to structs as they could, in order to reduce the latency spikes on their servers caused by GC pauses. Whilst I’m not advocating that you do the same, it illustrates that GC pauses are a common problem on web servers. The background GC mode introduced in recent .Net releases has reduced the frequency of these pauses, but sensible use of value types would still help in this regard.
On the stack, temporal allocation locality implies both spatial locality and temporal access locality
This means that objects allocated close together in time are also stored close together in space. Furthermore, objects allocated close together in time (e.g. inside the same method) are also likely to be used close together.
Both spatial and temporal access locality play nicely with how the cache works (i.e. fewer cache misses) and how OS paging works (i.e. fewer virtual memory swaps).
On the stack, memory density is higher
Since value types don’t carry the overheads that come with reference types (the Object Header Word and Method Table Pointer), you’re able to pack more objects into the same amount of memory.
Higher memory density leads to better performance because it means fewer fetches from memory.
On modern hardware, the entire thread stack can fit into CPU cache
When you spawn a new thread, it’s allocated 1MB of stack space, whereas CPUs nowadays come with much bigger L3 caches (e.g. Nehalem has up to 24MB of L3 cache), so the entire stack can fit into the L3 cache, which is a lot faster to access than main memory (have a look at this talk from Gael Fraiteur, creator of PostSharp, to see just how much access times differ).
For better or worse, modern hardware is built to take advantage of both spatial and temporal localities and these optimizations manifest themselves in how data is fetched from main memory in cache lines.
Consider what happens at a hardware level when you iterate through an array of 10 million Point2D objects. If Point2D was a reference type, then each time you iterate over an instance of it in the array:
notice how much work goes to waste here due to reference jumping?
On the other hand, if Point2D was a value type with only X and Y values (8 bytes):
the CPU is able to read 8 objects from one fetch vs. 2 fetches per object!
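The “8 objects per fetch” figure follows directly from a typical 64-byte cache line (an assumption; the exact line size varies by CPU):

```python
cache_line = 64          # typical cache line size in bytes (CPU-dependent)
value_point = 8          # value-type Point2D: just X and Y, 4 bytes each

# value-type instances sit back-to-back in the array, so one fetch
# of a cache line brings in several of them at once
print(cache_line // value_point)   # → 8 points per fetch

# a reference-type Point2D instead costs two fetches per element: one for
# the 4-byte reference stored in the array, and another to chase it to the
# 16-byte object elsewhere on the heap
```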
To see how this cache friendliness translates into a performance number we can easily measure (execution time), let’s consider this simple example:
Here is the result of running the tests above in release mode:
or in tabular format:
create 10M objects
iterate over 10M objects
As well as the additional memory overhead, it’s also over 10x slower to create and 3x slower to iterate over an array of 10 million objects when they are defined as a reference type.
You might say “well, these are still only milliseconds!”, but for a web server that needs to deal with a large number of concurrent requests, these margins are very significant!
In some areas of software, such as high frequency trading or arbitrage trading, even nanoseconds can make a big difference to your bottom line.
sidebar : people are doing some really funky stuff in those areas to gain a slight edge over each other, including:
It’s an area of finance that is genuinely quite interesting and I have heard some fascinating tales from friends who are working in these areas.
Using value types sensibly is a very powerful way to improve the performance of your .Net application — this is true for both C# and F#. Without going overboard, you should consider using structs if:
In addition, here are some tips to help you “get value types right”:
In the next post we’ll look at some pitfalls in the use of value types, and the reasons behind the best practices above.
Scott Meyers did a great talk on CPU cache at last year’s NDC Oslo and touched on other topics such as false sharing, etc. It’s a good starting point if you’re new to how CPU cache affects the performance of your code.
Martin’s posts use Java for their examples, but many of the lessons are directly applicable to .Net too.