Slides and video from my webinar on localization and design pattern automation

Hello,

The slides and recording of my webinar on Tuesday are now live, thanks to the folks at PostSharp for the quick turnaround!

Beware of implicit boxing of value types

In the last post, we looked at some inefficiencies with reference types in .Net and perhaps oversold value types a little. In any case, now that we’ve made the initial sale and you’re back for more, let’s talk about some pitfalls in the use of value types you should be aware of. Specifically, let’s focus on cases where the CLR will cause implicit boxing of your value types.

 

We all know that when we cast a value type to object, we cause boxing – for instance, when we shove an int into an object[] or an ArrayList.
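A contrived sketch (not code from the original post) to make the point – both of the assignments below compile to a box instruction:

using System.Collections;

class BoxingExample
{
    static void Main()
    {
        var list = new ArrayList();
        for (int i = 0; i < 10; i++)
        {
            list.Add(i);        // each int is boxed into an object here
        }

        object[] array = new object[1];
        array[0] = 42;          // boxed again - object[] holds object references
    }
}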

This is not great, but at least we’re doing this consciously and have had the chance to make a decision about it. However, there are a number of situations where the CLR will emit a box IL instruction for us implicitly without us realizing. These are far worse.

 

When you invoke a virtual method

Value types inherit from System.ValueType, which itself inherits from System.Object. Amongst other things, System.ValueType provides an override for Equals that gives value types their default compare-by-value behaviour.

However, a value type is stored in memory without the Method Table Pointer (see the last post), so in order to dispatch a virtual method call it first has to be boxed into a reference type.

There is an exception to this, though: the CLR is able to call Equals directly if the value type overrides the Equals method (which is why it’s a best practice to do so).

Aside from benefiting from the CLR’s short-circuit behaviour above, another good reason to override Equals in your custom value type is that the default implementation uses reflection to fetch the fields to compare.

Furthermore, it then uses FieldInfo.UnsafeGetValue to get the value of each field on both the current object and the object being compared to, which causes further boxing because UnsafeGetValue takes an object as its argument.

But wait, there’s more…

Even if you override Equals(object other), it’ll still cause the input argument to be boxed. To get around this you’ll need to overload Equals to take another Point2D instance instead, i.e. Equals(Point2D other), which is of course another recommended best practice from the previous post.

Given these three versions of Point2D (sketched below):

  • V1 = plain old struct
  • V2 = overrides Equals(object other)
  • V3 = overloads Equals(Point2DV3 other)
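The original definitions aren’t reproduced here, but the three versions would look roughly like this (the field layout and member bodies are my assumptions):

// V1 - plain old struct, relies on the inherited ValueType.Equals
public struct Point2DV1
{
    public int X;
    public int Y;
}

// V2 - overrides Equals(object other); the CLR can now call Equals directly
// (overriding Equals without GetHashCode produces a compiler warning, which is
//  deliberately left alone here to match the scenario described in the post)
public struct Point2DV2
{
    public int X;
    public int Y;

    public override bool Equals(object obj)
    {
        if (!(obj is Point2DV2)) return false;
        var other = (Point2DV2)obj;          // the argument still arrives boxed
        return X == other.X && Y == other.Y;
    }
}

// V3 - additionally overloads Equals to take another Point2DV3, so no boxing at all
public struct Point2DV3
{
    public int X;
    public int Y;

    public override bool Equals(object obj)
    {
        return obj is Point2DV3 && Equals((Point2DV3)obj);
    }

    public bool Equals(Point2DV3 other)
    {
        return X == other.X && Y == other.Y;
    }
}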

We can see how they fare when we iterate through 10M instances of each, calling Equals each time.

A couple of things to note from the results:

  • V1 causes twice as much boxing as V2 (which makes sense given the short-circuiting behaviour)
  • V1 also takes nearly four times as long to execute as V2
  • V3 causes no boxing at all, and runs in nearly half the time of V2!

sidebar : I chose to run the test in F# because it’s easy to get quick insight (real and CPU time, plus GC counts) when you enable the #time directive. However, I had to define the types in C# because, by default, F# compiles struct types with all the optimizations we have talked about – which means:

a. it’s a good reason to use F#; but

b. it defeats the purpose of running these tests!

 

Unfortunately, there is still one more edge case…

 

When you call List<T>.Contains

When you call List<T>.Contains, an instance of EqualityComparer<T> will be used to compare the argument against every element of the list.

The comparer comes from EqualityComparer<T>.Default, which eventually calls the CreateComparer method to create the comparer instance.

In the default case (where Point2D doesn’t implement the IEquatable<T> interface), an ObjectEqualityComparer<T> will be returned, and it is here that the overridden/inherited Equals(object other) method will be used – causing boxing to occur on every comparison!

If, on the other hand, Point2D implements the IEquatable<T> interface then the outcome will be very different. This allows some clever logic to kick in and use the overloaded Equals(Point2D other) instead.

 

So now, let’s introduce V4 of Point2D that implements the IEquatable<T> interface and see how it compares to

  • V2 = overridden Equals(object other); and
  • V3 = overloaded Equals(Point2DV3 other)

when used in a List<T>.
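A sketch of what V4 might look like – the only difference from V3 is the IEquatable<Point2DV4> declaration (the member bodies are assumptions, as before):

using System;

// V4 - implements IEquatable<Point2DV4>, so EqualityComparer<Point2DV4>.Default
// can pick a comparer that calls the strongly-typed Equals without boxing
public struct Point2DV4 : IEquatable<Point2DV4>
{
    public int X;
    public int Y;

    public override bool Equals(object obj)
    {
        return obj is Point2DV4 && Equals((Point2DV4)obj);
    }

    public bool Equals(Point2DV4 other)    // satisfies IEquatable<Point2DV4>
    {
        return X == other.X && Y == other.Y;
    }
}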

For V2, List<T>.Contains performs the same as our hand-coded version, but the improvements we made with V3 are lost due to the reasons outlined above.

V4 rectifies this by allowing the optimization in List<T> to kick in.

 

Which brings us to the next implicit boxing…

 

When you invoke an interface method

Like virtual methods, in order to dispatch an interface method you also need the Method Table Pointer, which means boxing is required.

Fortunately, the CLR is able to short-circuit this by calling the method directly if the compile-time type is resolved to the actual value type (e.g. Point2D) rather than the interface type.

 

For this test, we’ll try:

  • invoke Equals(Point2DV4 other) via the IEquatable<Point2DV4> interface; vs
  • invoke Equals(Point2DV4 other) directly on Point2DV4
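The shape of that test might look something like this (a sketch, not the original benchmark):

using System;

class InterfaceDispatchTest
{
    static void Run(Point2DV4[] points, Point2DV4 target)
    {
        foreach (var p in points)
        {
            // the compile-time type here is the interface, so the struct has to be
            // boxed before the interface method can be dispatched
            IEquatable<Point2DV4> boxed = p;
            boxed.Equals(target);

            // the compile-time type is Point2DV4 itself, so the CLR calls
            // Equals(Point2DV4 other) directly - no boxing
            p.Equals(target);
        }
    }
}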

 

As you can see, invoking the interface method Equals(Point2DV4 other) does indeed incur boxing once for each instance of Point2D.

 

When Dictionary<TKey, TValue> invokes GetHashCode

GetHashCode is used by hash-based collection types, the most common being Dictionary<TKey, TValue> and HashSet<T>.

In both cases, it’s invoked through the IEqualityComparer<T> type we talked about earlier, and in both cases the comparer is initialized through EqualityComparer<T>.Default and the CreateComparer method.

GetHashCode is invoked in many places within Dictionary<TKey, TValue> – on Add, ContainsKey, Remove, etc.

For this test let’s find out the effects of:

  • implementing the IEquatable<T> interface in terms of boxing
  • overriding GetHashCode (assuming the default implementation requires the use of reflection)

But first, let’s create V5 of our Point2D struct, this time with an overridden GetHashCode implementation (albeit a bad one, which is OK here since we only want to see the performance implication of having one).
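A sketch of V5 – the actual hash function used in the post isn’t shown, so the one below is just a stand-in for “a bad but cheap hash”:

using System;

public struct Point2DV5 : IEquatable<Point2DV5>
{
    public int X;
    public int Y;

    public override bool Equals(object obj)
    {
        return obj is Point2DV5 && Equals((Point2DV5)obj);
    }

    public bool Equals(Point2DV5 other)
    {
        return X == other.X && Y == other.Y;
    }

    public override int GetHashCode()
    {
        return X;   // a poor hash, but enough to avoid the reflection-based default
    }
}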

In this test, we have:

  • V3 = no GetHashCode override, does not implement IEquatable<T>
  • V4 = no GetHashCode override, implements IEquatable<T>
  • V5 = overrides GetHashCode, implements IEquatable<T>

I used a much smaller sample size here (10K instead of 10M) because the amount of time it took to add even 10K items to a dictionary was sufficient to illustrate the difference. Even at this small sample size, the difference is very noticeable. A couple of things to note:

  • since V3 doesn’t implement IEquatable<T> it incurs boxing – and a lot of it, because Equals is also called through the IEqualityComparer<T>
  • V4 eliminates the need for boxing but is still pretty slow due to the default GetHashCode implementation
  • V5 addresses both of these problems!

sidebar : with GetHashCode there are also considerations around what makes a good hash code, e.g. a low chance of collision, an even distribution, etc., but that’s a discussion for another day. Most of the time I just let tools like ReSharper work out the implementation based on the fields my value type has.

 

Conclusions

As we have seen, there are a number of places where implicit boxing can happen, and these can make a real difference to your performance. So, to reiterate what was already said in the previous post, here are some best practices for using value types:

  • make them immutable;
  • override Equals (the one that takes an object as argument);
  • overload Equals to take another instance of the same value type (e.g. Equals(Point2D other));
  • overload operators == and !=;
  • override GetHashCode
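Put together, a value type that follows all of these guidelines might look something like this (the hash function is illustrative – e.g. the kind of thing ReSharper would generate – rather than taken from the post):

using System;

public struct Point2D : IEquatable<Point2D>
{
    // immutable: readonly fields, set only in the constructor
    public readonly int X;
    public readonly int Y;

    public Point2D(int x, int y)
    {
        X = x;
        Y = y;
    }

    public override bool Equals(object obj)
    {
        return obj is Point2D && Equals((Point2D)obj);
    }

    public bool Equals(Point2D other)      // also satisfies IEquatable<Point2D>
    {
        return X == other.X && Y == other.Y;
    }

    public override int GetHashCode()
    {
        unchecked { return (X * 397) ^ Y; }
    }

    public static bool operator ==(Point2D left, Point2D right) { return left.Equals(right); }
    public static bool operator !=(Point2D left, Point2D right) { return !left.Equals(right); }
}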

 

Links

Smallest .Net ref type is 12 bytes (or why you should consider using value types)

(Update 2015/07/21 : read the next post in this series to learn about the places where implicit boxing happens with value types and how you can prevent them)

 

I chanced upon Sasha Goldshtein’s excellent book, Pro .NET Performance: Optimize Your C# Applications, a few years back and I thoroughly enjoyed it.

[image: Pro .NET Performance book cover]

Even though it has been almost 3 years since I read it, and I have already forgotten many of the things I learnt, I still remember this line very clearly:

…in fact, even a class with no instance fields will occupy 12 bytes when instantiated…

This is, of course, limited to a 32-bit system. On a 64-bit system, the smallest reference type instance will take up 24 bytes of memory!

 

How?!?

The reason for this is the way .Net objects are laid out in memory, where you have:

  • 4 bytes for Object Header Word;
  • 4 bytes for Method Table Pointer; and
  • min of 4 bytes for instance fields

When your class has only a single byte field it’ll still take up 4 bytes of space for instance fields, because objects have to be aligned to 4-byte multiples. Even when your class has no instance fields at all, it’ll still take up those 4 bytes. Therefore, on a 32-bit system, the smallest reference type instance will be 12 bytes.

On a 64-bit system, a word is 8 bytes instead of 4, and objects are aligned to the nearest 8-byte multiple, so the smallest reference type instance in this case will be 24 bytes.

sidebar : The Object Header Word and Method Table Pointer are used by the JIT and CLR. The book goes into a lot of detail on the structure and purpose of these, which for this blog post we’ll ignore. If you’re interested in learning more about them, go buy the book, it’ll be money well spent.

 

Reference vs Value Type

There is plenty to talk about when it comes to reference vs value types, including:

  • stack vs heap allocation;
  • default equality semantics, i.e. compare-by-ref vs compare-by-value;
  • pass-by-value vs pass-by-ref;

For the purpose of this post let’s focus on how they differ in terms of memory consumption and cache friendliness when you have a large array of each.

Memory Consumption

Suppose you have a Point2D class with only X and Y integer fields. Each instance of this type will occupy a total of 16 bytes in the heap (including 8 bytes for X and Y). Taking into account the 4-byte reference pointer (again, on a 32-bit system), that brings our total investment per instance of Point2D to 20 bytes!

If you have an array with 10M instances of Point2D then you’ll have committed 190MB of memory for this array!

On the other hand, if Point2D were a value type then each instance would only take up 8 bytes for the X and Y values, and the instances would be tightly packed into the array without each needing an additional 4 bytes for a reference pointer. Overall, the amount of memory you’re committing to this array drops to 76MB.

Cache Friendliness

Whilst there are no inherent differences in the speed of accessing stack- vs heap-allocated memory (they are just different ranges of addresses in virtual memory, after all), there are a number of performance-related considerations.

Stack allocated memory does not incur GC overhead

The stack is self-managed in the sense that when you leave a scope you just move the pointer back to the previous position and you’d have “deallocated” the previously allocated memory.

With heap allocated memory you incur the overhead of a rather sophisticated generational GC, which as part of a collection needs to move surviving objects around to compact the memory space (which becomes more and more expensive as you move from gen 0 to 2).

sidebar : I remember reading a post a few years ago that discussed how the StackOverflow guys went through their entire codebase and converted as many classes to structs as they could, in order to reduce the latency spikes on their servers caused by GC pauses. Whilst I’m not advocating that you do the same, it illustrates that GC pauses are a common problem on web servers. The Background GC mode introduced in recent .Net releases will have reduced the frequency of these pauses, but sensible use of value types would still help in this regard.

On the stack, temporal allocation locality implies spatial locality and temporal access locality

This means that objects that are allocated close together in time are also stored close together in space. Furthermore, objects allocated close together in time (e.g. inside the same method) are also likely to be used close together in time.

Both spatial and temporal access locality play nicely with how the cache works (i.e. fewer cache misses) and how OS paging works (i.e. fewer virtual memory swaps).

On the stack, memory density is higher

Since value types don’t have the overheads that come with reference types – the Object Header Word and Method Table Pointer – you’re able to pack more objects into the same amount of memory.

Higher memory density leads to better performance because it means fewer fetches from memory.

On modern hardware, the entire thread stack can fit into CPU cache

When you spawn a new thread, it’s allocated 1MB of stack space, whereas modern CPUs come with much bigger L3 caches (e.g. Nehalem has up to 24MB of L3 cache), so the entire stack can fit inside the L3 cache, which is a lot faster to access than main memory (have a look at this talk from Gael Fraiteur, creator of PostSharp, to see just how much access times differ).

 

For better or worse, modern hardware is built to take advantage of both spatial and temporal locality, and these optimizations manifest themselves in how data is fetched from main memory in cache lines.

Consider what happens at a hardware level when you iterate through an array of 10 million Point2D objects. If Point2D was a reference type, then each time you iterate over an instance of it in the array:

  • CPU will fetch a cache line (typically 64 bytes) starting from where the array element is
  • it’ll access the memory to get the reference pointer (4 bytes) for the object in the heap
  • it’ll discard the rest of the cache line
  • it’ll fetch another cache line (again, 64 bytes) starting from where the object is
  • it’ll read the object and do whatever it is you asked it to do
  • and then it’ll do the same for the next element in the array and so on…

Notice how much work is wasted here due to reference jumping?

On the other hand, if Point2D was a value type with only X and Y values (8 bytes):

  • CPU will fetch a cache line (64 bytes) starting from where the array element is
  • it’ll read the object (8 bytes) and do its work
  • it’ll try to get the next object, realizing that it’s already in the cache (what a lucky day!)
  • it’ll read the next object (8 bytes) from the existing cache line and so on

The CPU is able to read 8 objects from a single fetch, vs. 2 fetches per object in the reference type case!

 

To see how this cache friendliness translates into a performance number we can easily measure – execution time – let’s consider a simple example:
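The original code isn’t reproduced here; a rough C# sketch of what’s being measured (create 10M instances of each, then iterate over them) might look like this:

using System;
using System.Diagnostics;

class PointClass   { public int X; public int Y; }   // reference type version
struct PointStruct { public int X; public int Y; }   // value type version

class Program
{
    const int Count = 10000000;

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        var classes = new PointClass[Count];
        for (int i = 0; i < Count; i++) classes[i] = new PointClass { X = i, Y = i };
        Console.WriteLine("create  (reference type): {0}ms", sw.ElapsedMilliseconds);

        sw.Restart();
        long sum = 0;
        for (int i = 0; i < Count; i++) sum += classes[i].X;
        Console.WriteLine("iterate (reference type): {0}ms", sw.ElapsedMilliseconds);

        sw.Restart();
        var structs = new PointStruct[Count];
        for (int i = 0; i < Count; i++) structs[i] = new PointStruct { X = i, Y = i };
        Console.WriteLine("create  (value type)    : {0}ms", sw.ElapsedMilliseconds);

        sw.Restart();
        sum = 0;
        for (int i = 0; i < Count; i++) sum += structs[i].X;
        Console.WriteLine("iterate (value type)    : {0}ms", sw.ElapsedMilliseconds);
    }
}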

Here are the results of running the tests above in release mode, in tabular format:

                    create 10M objects      iterate over 10M objects
  reference type    872.425ms               80.9892ms
  value type        83.1042ms               27.4273ms

As well as the additional memory overhead, it’s also over 10x slower to create and 3x slower to iterate over an array of 10 million objects when they are defined as a reference type.

You might say “well, these are still only milliseconds!”, but for a web server that needs to deal with a large number of concurrent requests, these margins are very significant!

In some areas of software, such as high frequency trading or arbitrage trading, even nanoseconds can make a big difference to your bottom line.

sidebar : people are doing some really funky stuff in those areas to gain a slight edge over each other, including:

  • building data centres right next to the exchange;
  • using microwave links to transmit data between exchanges, because it’s faster than fibre optic cables;
  • using FPGA to run your trading logic;
  • etc.

It’s an area of finance that is genuinely quite interesting and I have heard some fascinating tales from friends who are working in these areas.

 

Conclusions

Using value types sensibly is a very powerful way to improve the performance of your .Net application – this is true for both C# and F#. Without going overboard, you should consider using structs if:

  • the objects are small and you need to create lots of them;
  • you need high memory density;
  • memory consumption is a constraint;
  • you need to iterate large arrays of them frequently

In addition, here are some tips to help you “get value types right”:

  • make them immutable;
  • override Equals (the one that takes an object as argument);
  • overload Equals to take another instance of the same value type (e.g. Equals(Point2D other));
  • overload operators == and !=;
  • override GetHashCode

In the next post we’ll look at some pitfalls wrt the use of value types, and why we have the best practices above.

 

Scott Meyers did a great talk on CPU caches at last year’s NDC Oslo and touched on related topics such as false sharing. It’s a good starting point if you’re new to how CPU caches affect the performance of your code.

Martin Thompson (of Disruptor fame) has an excellent blog where he writes about many related topics.

Martin’s posts use Java as the example language, but many of the lessons are directly applicable to .Net too.

 

Links

Design for Latency issues

The most common issues I have encountered in production are latency/performance related. They can be symptoms of a whole host of underlying causes, ranging from AWS network issues (which can also manifest themselves as latency/error-rate spikes in any of the AWS services) and overloaded servers, to simple GC pauses.

Latency issues are inevitable – as much as you can improve the performance of your application, things will go wrong, eventually, and often they’re out of your control.

So you must design for them, and degrade the quality of your application gracefully to minimize the impact on your users’ experiences.

As backend developers, one of the fallacies that we often fall into is to allow our dev environments to be too lenient. Servers and databases are never under load during development, so we lure client developers into a false sense of comfort and set them up to fail when the application runs into a slow-responding server in production for the first time.

 

Latency Injection

To program my fellow client developers to always be mindful of latency spikes, we decided to inject random latency delays on every request:

  1. check if we should inject a random delay;
  2. if yes, then work out how much latency to inject and sleep the thread;
  3. and finally invoke the original method to process the request

This is an implementation pattern that can be automated. I wrote a simple PostSharp attribute to do this, whilst piggybacking on existing configuration mechanisms to control its behaviour at runtime.
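The original attribute isn’t shown here; a minimal sketch using PostSharp’s MethodInterceptionAspect might look like the following, where the Probability and MaxDelayMs knobs are stand-ins for the real runtime configuration mechanism:

using System;
using System.Threading;
using PostSharp.Aspects;

[Serializable]
public class InjectLatencyAttribute : MethodInterceptionAspect
{
    // not thread-safe, but good enough for a dev-environment sketch
    private static readonly Random Rng = new Random();

    // hypothetical knobs - the real values come from runtime configuration
    public double Probability = 0.1;
    public int MaxDelayMs = 3000;

    public override void OnInvoke(MethodInterceptionArgs args)
    {
        // 1. check if we should inject a random delay
        if (Rng.NextDouble() < Probability)
        {
            // 2. work out how much latency to inject and sleep the thread
            Thread.Sleep(Rng.Next(0, MaxDelayMs));
        }

        // 3. finally, invoke the original method to process the request
        args.Proceed();
    }
}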

Then I multicast the attribute to all our service endpoints and my work was done!

We ran latency injection in our dev environment; it helped identify numerous bugs in the client application and proved to be a worthwhile exercise.

But we didn’t stop there.

 

Error Injection

We throttle user requests to all of our services to stop mischievous players from spamming our servers using proxy tools such as Charles and Fiddler, or handcrafted bots.

But, occasionally, legitimate requests can also be throttled as a result of client bugs, an over-zealous retry strategy, or an incorrectly configured throttling threshold.

Once again, we decided to make these errors much more visible in the dev environment so that client developers expect them and handle them gracefully.

To do that, we:

  1. set the threshold very low in dev
  2. used a PostSharp attribute to randomly inject throttling error on operations where it makes sense

The attribute that injects throttling error is very simple, and looks something along the lines of:
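Something along these lines, where ThrottledException is a hypothetical stand-in for whatever error type the real throttling mechanism surfaces to the client:

using System;
using PostSharp.Aspects;

// hypothetical exception type representing the throttling error
public class ThrottledException : Exception
{
    public ThrottledException() : base("Request has been throttled") { }
}

[Serializable]
public class InjectThrottlingErrorAttribute : MethodInterceptionAspect
{
    private static readonly Random Rng = new Random();

    public double Probability = 0.05;   // made-up default

    public override void OnInvoke(MethodInterceptionArgs args)
    {
        // randomly fail a small percentage of requests with a throttling error
        if (Rng.NextDouble() < Probability)
        {
            throw new ThrottledException();
        }

        args.Proceed();
    }
}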

The same approach can be taken to include any service-specific errors that the client should be able to gracefully recover from – session expiration, state out-of-sync, etc.

 

Design for Failure

Simulating latency issues and other errors falls under the practice of Design for Failure, which Simon Wardley identifies as one of the characteristics of a next-generation tech company.


p.s. you should check out Simon’s work on value chain mapping if you haven’t already – it’s inspiring.

 

Chaos Engines

Netflix’s use of Chaos Monkey and Chaos Gorilla is a shining example of Design for Failure at scale.

Chaos Monkey randomly terminates instances to simulate hardware failures and test your system’s ability to withstand such failures.

Chaos Gorilla takes this exercise to the next level and simulates outages of entire Amazon availability zones, to test your system’s ability to automatically re-balance to other availability zones without user-visible impact or manual intervention.

Netflix has taken a lot of inspiration from Release It! by Michael Nygard and Drift into Failure by Sidney Dekker. Both books are awesome and I highly recommend them.


 

Global redundancy, or not

Based on reactions to AWS outages on social media, it’s clear that many of us (ourselves included) do not take full advantage of the cloud for global redundancy.

You might scoff at that but for many the decision to not have a globally redundant infrastructure is a conscious one because the cost of such redundancy is not always justifiable.

It’s possible to raise the level of your single point of failure (SPOF) from individual resources/instances, to AZs, to regions, all the way up to cloud providers.

But you’re incurring additional costs at each turn:

  • your infrastructure becomes more complex and difficult to reason about;
  • you might need more engineers to manage that complexity;
  • you will need to invest in better tooling for monitoring and automation;
  • you might need more engineers to build those tools;
  • you incur more wastage in CPU/memory/bandwidth/etc. (it is called redundancy for a reason);
  • you have higher network latency for cross-AZ/region communications;

 

Global redundancy at Uber

On the other hand, for many organizations the cost of downtime outweighs the cost of global redundancy.

For instance, for Uber’s customers the cost of switching to a competitor is low, which means availability is of paramount importance for Uber.

Uber devised a rather simple, elegant mechanism for their client applications to fail over seamlessly in the event of a datacentre outage. See this post for more details.

 

Latency Propagation

Finally, as more and more companies adopt a microservices approach, a whole host of challenges will become evident (many of which are discussed in Michael Nygard’s Release It!).

One of these challenges is the propagation of latency through inter-service communications.

If each of your services has a 99th percentile latency of 1s, then only 1% of calls will take longer than 1s when you depend on just one service. But if you depend on 100 services, then 63% of calls will take more than 1s!
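(The arithmetic, assuming the calls are independent: each call has a 0.99 probability of finishing within 1s, so the probability of all 100 finishing within 1s is 0.99^100 ≈ 0.366 – meaning roughly 63% of requests will see at least one call take longer than 1s.)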

In this regard, Google fellow Jeff Dean’s paper Achieving Rapid Response Times in Large Online Services presents an elegant solution to this problem.


I haven’t put this into practice myself, but I imagine this can be easily implemented using Rx’s amb operator.
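For what it’s worth, a rough sketch of the “backup request” idea using Rx.NET – the call delegate is a placeholder for however you’d issue the actual service request:

using System;
using System.Reactive.Linq;

public static class BackupRequest
{
    // issue the primary request immediately; if it hasn't produced a result
    // within 'backupAfter', issue a second request and take whichever
    // responds first (Amb = "ambiguous": the first source to react wins)
    public static IObservable<T> WithBackup<T>(
        Func<IObservable<T>> call, TimeSpan backupAfter)
    {
        var primary = call();
        var backup  = Observable.Timer(backupAfter).SelectMany(_ => call());
        return primary.Amb(backup);
    }
}

// usage (hypothetical service call):
// var response = BackupRequest.WithBackup(
//     () => client.GetUserProfile(id), TimeSpan.FromMilliseconds(50));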

 

Links

Being visually honest with F#

It’s been a busy month – some top-quality conferences (Code Mesh, Build Stuff, FuncBy and NDC London) all crammed into the space of 4 weeks. It has been a blast: lots of talks and valuable takeaways, and it was great to hang out with old friends and meet new ones. As soon as I find time I’ll put together some posts with my key takeaways from the conferences.

During these conferences, Kevlin Henney’s numerous talks left a lasting impression on me. In his “Seven Ineffective Coding Habits of Many Programmers” talk at Build Stuff, he described the lack of visual honesty in code such as this:

public int howNotToLayoutMethodHeader(int firstArgument,
    string secondArgument)

and on what visual honesty means, he presented a number of quotes from Daniel Higginbotham’s excellent Clean Up Your Mess website:

“To answer the question “What is clean design?” most succinctly: a clean design is one that supports visual thinking so people can meet their information needs with a minimum of conscious effort.”

 

“You convey information by the way you arrange a design’s elements in relation to each other. This information is understood immediately, if not consciously, by the people viewing your design.”

 

“This is great if the visual relationships are obvious and accurate, but if they’re not, your audience is going to get confused. They’ll have to examine your work carefully, going back and forth between the different parts to make sure they understand.”

The quotes talk about laying out information so that the visual relationships are obvious and accurate.

So if you lay out your method arguments in such a way that their visual relationships are not accurate, and you do that purposefully, then you’re in fact being dishonest.


 

As I sat there, I finally understood why F# pipes are so awesome. I always knew they make for cleaner and more readable code, and that they’re intuitive, but I hadn’t been able to find the words to explain why – the trouble with being able to understand something with minimal conscious effort is that your conscious mind can’t explain how it understood it.

Not anymore, now I finally understand it.

 

When we’re reading a piece of regular English text, we read from left to right, then top to bottom. This convention controls the flow of information we receive as we read, so when we’re laying out information for people to consume, we lay it out in the order of left-to-right, then top-to-bottom.

[diagram: English text is read left-to-right, then top-to-bottom]

But what about code?

When it comes to writing nested function calls, somehow this flow of information has been reversed!

[diagram: nested function calls read inside-out, from right to left]

With F#’s pipes (which have been adopted in both Elm and Elixir, by the way), we have finally managed to restore some sanity and present a sequence of function calls in a way that matches how we consume any other textual information.

[diagram: with F# pipes, the sequence of function calls reads left to right]

Visual honesty right before your eyes!

 

Links

Clean Up Your Mess – A guide to Visual Design for everyone

NDC Oslo 2014 – Takeaways from keynote “it’s a write/read web”

NDC Oslo 2014 – Takeaways from “career reboot for the developer mind”

Takeaways from Theo Schlossnagle’s talk on Scalable Internet Architecture

Takeaways from Hewitt, Meijer and Szyperski’s talk on the Actor model

Takeaways from Gael Fraiteur’s multithreading talk