LINQ – Lambda Expression vs Query Expression

As you probably already know, LINQ comes in two flavours, one using Lambda expressions and the other using SQL-like query expressions:

Func<int, bool> isEven = i => i % 2 == 0;
int[] ints = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

// using Query expression
var evensQuery = from i in ints where isEven(i) select i;
// using Lambda expression
var evensLambda = ints.Where(isEven);

Both yield the same result because the compiler translates query expressions into the equivalent method calls and lambda expressions before compiling them. So performance-wise, there’s no meaningful difference between the two.

Which one you should use is mostly a matter of personal preference. Many people prefer lambda expressions because they’re shorter and more concise, but having worked extensively with SQL I personally prefer the query syntax. That said, it’s important to bear in mind that there are situations where one is better suited than the other.

Joins

Here’s an example of how you can join two sequences together using Lambda and query expressions:

class Person
{
    public string Name { get; set; }
}
class Pet
{
    public string Name { get; set; }
    public Person Owner { get; set; }
}

void Main()
{
    var magnus = new Person { Name = "Hedlund, Magnus" };
    var terry = new Person { Name = "Adams, Terry" };
    var charlotte = new Person { Name = "Weiss, Charlotte" };
    var barley = new Pet { Name = "Barley", Owner = terry };
    var boots = new Pet { Name = "Boots", Owner = terry };
    var whiskers = new Pet { Name = "Whiskers", Owner = charlotte };
    var daisy = new Pet { Name = "Daisy", Owner = magnus };
    var people = new List<Person> { magnus, terry, charlotte };
    var pets = new List<Pet> { barley, boots, whiskers, daisy };

    // using lambda expression
    var lambda = people.Join(pets,              // inner sequence (people is the outer one)
                             person => person,  // outer sequence key
                             pet => pet.Owner,  // inner sequence key
                             (person, pet) =>
                                 new { OwnerName = person.Name, Pet = pet.Name });

    // using query expression
    var query = from person in people
                join pet in pets on person equals pet.Owner
                select new { OwnerName = person.Name, Pet = pet.Name };
}

Again, both yield the same result and there are no performance penalties associated with either, but it’s easy to see why the query syntax is far more readable and expressive of your intent here than the lambda expression!

Lambda-Only Functions

A number of methods are only available through the Lambda syntax: Single(), Take(), Skip() and First(), to name just a few. You can, however, mix and match the two by calling the Lambda-only methods at the end of the query:

// mix and match query and Lambda syntax
var query = (from person in people
             join pet in pets on person equals pet.Owner
             select new { OwnerName = person.Name, Pet = pet.Name }).Skip(1).Take(2);

As this reduces the readability of your code, it’s generally better to first assign the result of the query expression to a variable and then call the Lambda-only methods on that variable:

var query = from person in people
            join pet in pets on person equals pet.Owner
            select new { OwnerName = person.Name, Pet = pet.Name };

var result = query.Skip(1).Take(2);

Both versions return the same result because of delayed execution (the query is not executed against the underlying list until you iterate through the result variable). And because query expressions are translated to Lambda expressions before being compiled, there is no performance penalty either. BUT, if you don’t want delayed execution, or you need one of the aggregate functions such as Average() or Sum() (which execute the query immediately), be aware that the underlying sequence might be modified between the assignments to query and result, in which case the two versions can give different answers. In this case, I’d argue it’s best to use Lambda expressions to start with, or to add the Lambda-only methods directly to the query expression.
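To make the delayed execution point concrete, here’s a minimal sketch (the class and variable names are my own, purely for illustration) where the underlying list is modified after the query has been defined. The deferred query picks up the change, whereas the copy taken with ToList() does not:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredExecutionDemo
{
    // Returns { sum seen by the deferred query, sum seen by the eager snapshot }
    public static int[] Run()
    {
        var numbers = new List<int> { 1, 2, 3 };

        // defined, but not executed yet
        var deferred = from n in numbers where n > 1 select n;
        // ToList() forces the query to execute right now
        var snapshot = (from n in numbers where n > 1 select n).ToList();

        // the underlying sequence changes after the query was defined
        numbers.Add(4);

        // Sum() executes the deferred query against the modified list
        return new[] { deferred.Sum(), snapshot.Sum() };
    }

    static void Main()
    {
        var sums = Run();
        Console.WriteLine(sums[0]); // 9, the deferred query sees 2, 3 and 4
        Console.WriteLine(sums[1]); // 5, the snapshot only holds 2 and 3
    }
}
```

If you need a stable result, forcing execution with ToList() or ToArray() at the point of definition is one way to get it.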

The Stack, The Heap And The Memory Pitfalls

In the last couple of days or so I have spent some time reading Karl Seguin’s excellent and FREE to download ebook – Foundations of Programming which covers many topics from dependency injection to best practices for dealing with exceptions.

The main topic that took my fancy was the Back to Basics: Memory section; here’s a summary I put together with some additional examples.

In C#, variables are stored in either the Stack or the Heap based on their type:

  • Value types go on the stack
  • Reference types go on the heap

Remember, a struct in C# is a value type, as is an enum, so they both go on the stack. This is why it’s generally recommended (for better performance) that you prefer a struct to a reference type for small objects which are mainly used for storing data.

Value types that are fields of a reference type, however, go on the heap along with the instance of that reference type.
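You can’t observe directly where a variable lives, but you can observe the difference in copy semantics that comes with it. The following is a small sketch using two made-up types (PointStruct and PointClass are hypothetical names of mine): assigning a struct copies the value, whereas assigning a class instance only copies the reference:

```csharp
using System;

struct PointStruct { public int X; }
class PointClass { public int X; }

static class ValueVsReferenceDemo
{
    // Returns { s1.X, c1.X } after modifying the copies s2 and c2
    public static int[] Run()
    {
        var s1 = new PointStruct { X = 1 };
        var s2 = s1;   // copies the whole value
        s2.X = 99;     // only the copy changes

        var c1 = new PointClass { X = 1 };
        var c2 = c1;   // copies the reference, both point to the same object
        c2.X = 99;     // the shared object changes

        return new[] { s1.X, c1.X };
    }

    static void Main()
    {
        var result = Run();
        Console.WriteLine(result[0]); // 1
        Console.WriteLine(result[1]); // 99
    }
}
```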

In Java, everything apart from the primitive types is a reference type, so all objects go on the heap, making the size of the heap one of the most important attributes that determine the performance of a Java application. The C# creators saw this as inefficient and unnecessary, which is why we have user-defined value types in C# today :-)

The Stack

Values on the stack are automatically managed even without garbage collection because items are added and removed from the stack in a LIFO fashion every time you enter/exit a scope (be it a method or a statement), which is precisely why variables defined within a for loop or if statement aren’t available outside that scope.

You will receive a StackOverflowException when you’ve used up all the available space on the stack, though it’s almost certainly the symptom of infinite recursion (bug!) or a poorly designed system which involves near-endless recursive calls.

The Heap

Most heap-based memory allocations occur when we create a new object, at which point the runtime figures out how much memory we’ll need, allocates an appropriate amount of memory space and returns a pointer to the allocated memory.

Unlike the stack, objects on the heap aren’t local to a given scope. Instead, most are referenced, often through several levels of nesting, by other heap objects. In unmanaged languages like C, it’s the programmer’s responsibility to free any allocated memory, a manual process which has inevitably led to many memory leaks over the years!

In managed languages, the runtime takes care of cleaning up memory. The .NET Framework uses a generational garbage collector, which sorts objects into generations based on their age and collects the most recently created objects more often.

How they work together

As mentioned earlier, every time you create a new object, some memory gets allocated and what you assign to your variable is actually a reference, a pointer to the start of that block of memory. This reference is just a number (a memory address, usually displayed in hexadecimal) and, like an integer, it resides on the stack unless it is part of a reference object.

So for example, the following code will result in two values on the stack, one of which is a pointer to the string:

int intValue = 1;
string stringValue = "Hello World";

When these two variables go out of scope, the values are popped off the stack, but the memory allocated on the heap is not cleared. Whilst this results in a memory leak in C/C++, the garbage collector (GC) will free up the allocated memory for you in a managed language like C# or Java.

Pitfalls in C#

Despite having the GC to do all the dirty work so you don’t have to, there are still a number of pitfalls which might sting you:

Boxing & Unboxing

Boxing occurs when a value type is ‘boxed’ into a reference type (when you put a value type into an ArrayList for example). Unboxing occurs when a reference type is converted back into a value type (when you cast an item from the ArrayList back to its original type for example).

The generics feature introduced in .NET 2.0 increases type-safety and also removes the performance hit resulting from boxing and unboxing.
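As a rough illustration (the class and method names below are made up for this sketch), here is the same sum computed once through a non-generic ArrayList, where every int is boxed on the way in and unboxed on the way out, and once through a generic List<int>, which stores the ints directly:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

static class BoxingDemo
{
    public static int SumBoxed()
    {
        var list = new ArrayList();
        for (int i = 1; i <= 3; i++)
        {
            list.Add(i); // boxing: each int is wrapped in a new object on the heap
        }

        int total = 0;
        foreach (object item in list)
        {
            total += (int)item; // unboxing: the cast copies the value back out
        }
        return total;
    }

    public static int SumGeneric()
    {
        var list = new List<int> { 1, 2, 3 }; // no boxing, the ints are stored as ints

        int total = 0;
        foreach (int item in list)
        {
            total += item;
        }
        return total;
    }

    static void Main()
    {
        Console.WriteLine(SumBoxed());   // 6
        Console.WriteLine(SumGeneric()); // 6
    }
}
```

Both methods return the same answer; the difference is in the extra heap allocations and casts that the ArrayList version needs.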

ByRef

Most developers understand the implication of passing a value type by reference, but few understand why you’d want to pass a reference type by reference. When you pass a reference type by value you are actually passing a copy of the reference pointer, but when you pass a reference type by reference (with the ref keyword) you’re passing the reference pointer itself.

The only reason to pass a reference type by reference is if you want to modify the pointer itself – as in where it points to. However, this can lead to some nasty bugs:

void Main()
{
    List<string> list = new List<string> { "Hello", "World" };

    // pass a copy of the reference pointer
    NoBug(list);
    // no error here
    Console.WriteLine(list.Count);

    // pass the actual reference pointer
    BadBug(ref list);
    // reference pointer has been amended, this throws NullReferenceException!
    Console.WriteLine(list.Count);
}

public void BadBug(ref List<string> list)
{
    list = null; // this changes the original reference pointer
}

public void NoBug(List<string> list)
{
    list = null; // this changes the local copy of the reference pointer
}

In almost all cases, you should use an out parameter or a simple assignment instead (whichever expresses your intention more clearly).

Whilst I’m on the topic, do you know the difference between using out and using ref? When you pass a parameter to a method using the out keyword, the parameter must be assigned inside the method before it returns; when you pass a parameter to a method using the ref keyword, the variable must be assigned before it’s passed to the method.
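Here’s a short sketch of the difference, using two hypothetical helper methods (TryHalve and Increment are names I’ve made up for the example):

```csharp
using System;

static class OutVsRefDemo
{
    // out: the method must assign 'result' before it returns
    public static bool TryHalve(int value, out int result)
    {
        result = value / 2;
        return value % 2 == 0;
    }

    // ref: the caller must have assigned 'counter' before the call
    public static void Increment(ref int counter)
    {
        counter++;
    }

    static void Main()
    {
        int half; // no initial value needed for an out argument
        bool exact = TryHalve(10, out half);
        Console.WriteLine(exact + " " + half); // True 5

        int count = 41; // must be assigned before it is passed by ref
        Increment(ref count);
        Console.WriteLine(count); // 42
    }
}
```

Leaving result unassigned in TryHalve, or passing an unassigned count to Increment, would both be compile time errors.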

Managed Memory Leaks

Yes, memory leaks are still possible in a managed language! Typically, this type of memory leak happens when you hold on to a reference indefinitely. Most of the time this won’t have any noticeable impact on your application, but it can sting you rather unexpectedly as the system matures and starts to handle greater loads of data. For example, I ran into a platform bug with ADO.NET a little while back and it took the best part of a week to figure out and fix! There are memory profilers out there that can help hunt down memory leaks in a .NET application, the best ones being dotTrace and ANTS Profiler. For memory profiling, I prefer ANTS Profiler, which allows you to easily compare two snapshots of your memory usage.

One specific situation worth mentioning as a common cause of memory leaks is events. If your class registers for an event, the event source holds a reference to your object. Unless you de-register from the event, your object’s lifetime will ultimately be determined by the event source. Two solutions exist:

1. de-registering from events when you’re done (the IDisposable pattern is ideal here)

2. use the WeakEvent Pattern or a simplified version.
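The first option might look something like the sketch below. Publisher and Subscriber are hypothetical classes of my own, and SubscriberCount is only there so the effect of de-registering can be observed:

```csharp
using System;

class Publisher
{
    public event EventHandler Tick;

    public void RaiseTick()
    {
        var handler = Tick;
        if (handler != null) handler(this, EventArgs.Empty);
    }

    // number of handlers currently registered (for demonstration only)
    public int SubscriberCount
    {
        get { return Tick == null ? 0 : Tick.GetInvocationList().Length; }
    }
}

class Subscriber : IDisposable
{
    private readonly Publisher _publisher;
    public int TicksSeen { get; private set; }

    public Subscriber(Publisher publisher)
    {
        _publisher = publisher;
        _publisher.Tick += OnTick; // the publisher now holds a reference to this object
    }

    private void OnTick(object sender, EventArgs e) { TicksSeen++; }

    public void Dispose()
    {
        _publisher.Tick -= OnTick; // drop the publisher's reference to this object
    }
}

static class EventLeakDemo
{
    // Returns { ticks seen by the subscriber, handlers left on the publisher }
    public static int[] Run()
    {
        var publisher = new Publisher();
        int ticks;
        using (var subscriber = new Subscriber(publisher))
        {
            publisher.RaiseTick();
            ticks = subscriber.TicksSeen;
        }
        return new[] { ticks, publisher.SubscriberCount };
    }

    static void Main()
    {
        var result = Run();
        Console.WriteLine(result[0]); // 1
        Console.WriteLine(result[1]); // 0
    }
}
```

Once Dispose has run, the publisher no longer references the subscriber, so the GC is free to collect it.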

Another potential source of memory leaks is a caching mechanism implemented without any expiration policy, in which case your cache is likely to keep growing until it takes up all available memory and triggers an OutOfMemoryException.

Fragmentation

As your program runs its course, the heap becomes increasingly fragmented and you could end up with a lot of unusable memory space spread out between usable chunks of memory.

Usually, the GC will take care of this by compacting the heap and the .Net framework will update the references accordingly, but there are times when the .Net framework can’t move an object – when the object is pinned to a specific memory location.

Pinning

Pinned memory occurs when an object is locked to a specific address on the heap. This is usually a result of interaction with unmanaged code: the GC updates object references in managed code when it compacts the heap, but has no way of updating the references held by unmanaged code, and therefore objects must be pinned in memory before they are passed to unmanaged code.

A common way to get around this is to allocate a few large objects, which don’t cause as much fragmentation as many small ones. Large objects (roughly 85,000 bytes or more) are placed in a special heap called the Large Object Heap (LOH), which isn’t compacted by default. For more information on pinning, here’s a good article on pinning and asynchronous sockets.

Another reason why an object might be pinned is if you compile your assembly with the unsafe option, which then allows you to pin an object via the fixed statement. The fixed statement can greatly improve performance by allowing objects to be manipulated directly with pointer arithmetic, which isn’t possible if the object isn’t pinned because the GC might relocate your object.

Under normal circumstances however, you should never mark your assembly as unsafe and use the fixed statement!

Garbage Spewers

Already discussed here.

Setting things to null

You don’t need to set your reference-type variables to null after you’re done with them, because once a variable falls out of scope it will be popped off the stack anyway.

Deterministic Finalization

Even in a managed environment, developers still need to manage some of their references such as file handles or database connections because these resources are limited and therefore should be freed as soon as possible. This is where deterministic finalization and the Dispose pattern come into play, because deterministic finalization releases resources, not memory.

If you don’t call Dispose on an object which implements IDisposable, the underlying resource will only be cleaned up eventually, when the GC runs the object’s finalizer (and only if the type has one). So, in order to release precious resources such as file handles or DB connections in a timely fashion, you should use the using statement wherever possible.
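As a small sketch of what the using statement buys you (TrackedResource is a made-up stand-in for a real resource such as a file handle or DB connection), Dispose is called as soon as the block is exited, even when an exception is thrown inside it:

```csharp
using System;

// hypothetical stand-in for a limited resource
class TrackedResource : IDisposable
{
    public bool IsDisposed { get; private set; }

    public void Dispose()
    {
        IsDisposed = true; // a real implementation would release the resource here
    }
}

static class UsingDemo
{
    public static bool Run()
    {
        var resource = new TrackedResource();
        try
        {
            using (resource)
            {
                throw new InvalidOperationException("something went wrong");
            }
        }
        catch (InvalidOperationException)
        {
            // Dispose has already been called by the time we get here
        }
        return resource.IsDisposed;
    }

    static void Main()
    {
        Console.WriteLine(Run()); // True
    }
}
```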

Pattern for dealing with null handler for events in C#

If you’ve used events in C# before, you’ve probably written code like this too:

public event EventHandler Started;
...
// make sure Started is not null before firing the event
// else, NullReferenceException will be thrown
if (Started != null)
{
    Started(this, some_event_args);
}

This is perfectly ok and normal to do, but can quickly become tiresome if you have to fire events in multiple places in your code and have to do a null reference check every time!

So instead, I have been using this pattern for a while:

// initialise with empty event handler so there's no need for Null reference check later
public event EventHandler Started = (s, e) => { };
...
// no need for Null reference check anymore as Started is never null
Started(this, some_event_args);

If all you need is the ability to add/remove handlers, then this pattern would do you fine as the event will never be null because there’s no way for you to remove the anonymous method the event was initialised with unless you set the event to null.

However, if you occasionally need to clear ALL the event handlers and set the event to null, then don’t use this pattern as you might start seeing NullReferenceException being thrown before you add event handlers back in.

Covariance and Contravariance in C# 4.0

Over the last couple of days I have read a number of blogs on the topic of covariance and contravariance, a new feature to be introduced in the upcoming C# 4.0 release. Whilst most of the blog posts were happy to provide a detailed account of what these terms mean and provide some simple examples, none really hit home how they differ from polymorphism (or assignment compatibility, the ability to assign a more specific type to a variable of a less specific type). That is, until I stumbled across this excellent post on Eric Lippert’s blog which explains exactly that!

Covariance

For those of you who are new to the topic of covariance and contravariance, here are some code snippets to demonstrate what covariance means and the problem with it:

class Animal {}
class Dog : Animal {}
class Cat : Animal {}

void ChangeToDog(Animal[] animals)
{
    animals[0] = new Dog(); // throws ArrayTypeMismatchException at runtime if animals is really a Cat[]
}

void Main()
{
    // this is fine, by virtue of polymorphism
    // this is also known as 'type variance' because types of the two variables 'vary'
    Animal animal = new Cat();

    // this is also fine, this is called 'covariance'
    // this works because the rules for arrays are co-ordinated with the rules for variance
    // of the array element types
    Animal[] animals = new Cat[] { new Cat() };

    // here's the problem - because the ChangeToDog method accepts an array of Animal
    // objects, it also has to accept an array of Cat objects, and the subsequent
    // assignment is also valid to the compiler because it follows the rules of
    // variance, but clearly, you can't assign a Dog to an array of Cat objects!
    ChangeToDog(new Cat[] { new Cat() });
}

Firstly, the problem demonstrated above throws a runtime error, which is not ideal as we C# developers prefer compile time errors. But the main issue here is that this is a WTF error because it catches you completely by surprise: as you’re writing this method, there is no way for you to know that the consumer of your method will call it covariantly and cause a runtime exception to be thrown for an otherwise perfectly valid assignment.

So when generics became available in C# 2.0, the designers of C# took away the ability to use type variance with generics as you can see from the code snippets below:

class Animal {}
class Dog : Animal {}
class Cat : Animal {}

void ChangeToDog<T>(List<T> animals) where T : Animal
{
    // compile time error
    animals[0] = new Dog();

    // same as above, compile time error
    animals[0] = (T) new Dog();

    // this compiles but throws a runtime type mismatch error as before, but no longer
    // a WTF error because this has 'suspicious looking code' written all over it!
    animals[0] = (T) (object) new Dog();

    // this is pretty much the safest way to do the type of casting we want
    if (animals is List<Dog>)
    {
        animals[0] = (T) (object) new Dog();
    }
}

void Main()
{
    // this no longer works!
    List<Animal> animals = new List<Cat>() { new Cat() };
}

The inheritance relationship between a Cat and an Animal is not preserved because there was no covariance support for generics in C# 3.0 (.NET 3.5).

In C# 4.0, you will be able to use covariance with generic interfaces and delegates whose type parameters are only used in output positions, which makes it safe and prevents the WTF exception from the first example from happening.

To use covariance in your code, you need to mark a type parameter with the out keyword. Here are the new-look IEnumerable<T> and IEnumerator<T> in C# 4.0:

public interface IEnumerable<out T> : IEnumerable
{
    IEnumerator<T> GetEnumerator();
}

public interface IEnumerator<out T> : IDisposable, IEnumerator
{
    // MoveNext() is inherited from the non-generic IEnumerator
    T Current { get; }
}
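Here’s a minimal sketch of what this means in practice. It assumes a compiler and framework with C# 4.0 variance support, and the Name property and Describe method are additions of mine for the sake of the example. A List<Cat> can now be used wherever an IEnumerable<Animal> is expected, which is safe because IEnumerable<T> only ever hands values out:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Animal { public virtual string Name { get { return "Animal"; } } }
class Cat : Animal { public override string Name { get { return "Cat"; } } }

static class CovarianceDemo
{
    public static string Describe(IEnumerable<Animal> animals)
    {
        return string.Join(", ", animals.Select(a => a.Name).ToArray());
    }

    public static string Run()
    {
        List<Cat> cats = new List<Cat> { new Cat(), new Cat() };

        // allowed from C# 4.0 because IEnumerable<out T> is covariant in T
        IEnumerable<Animal> animals = cats;

        return Describe(animals);
    }

    static void Main()
    {
        Console.WriteLine(Run()); // Cat, Cat
    }
}
```

Note that assigning the same list to a List<Animal> variable is still a compile time error, since List<T> uses T in input positions too.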

Contravariance

Conversely to covariance, contravariance reverses the inheritance relationship of T:

public class CustomEventArgs : EventArgs { }
// delegate with handler which accepts type CustomEventArgs
public EventHandler<CustomEventArgs> MyDelegate;
// handler with a less specific type EventArgs
public void MyHandler(object sender, EventArgs e) { }

void Main()
{
    // this works thanks to contravariance
    MyDelegate += MyHandler;
}

Right now, contravariance is supported when converting a method group to a delegate, as demonstrated above with the handler’s event argument, but not in the type parameter T of a generic delegate, so this bit of code doesn’t compile:

class Person {}
class Student : Person {}

Action<Person> personAction = p => { /* do something */ };
// compile error for type mismatch (before C# 4.0) because Action<T> is not contravariant in T
Action<Student> studentAction = personAction;

In C# 4.0, you can use contravariance on a type parameter that is only used in input positions, which allows conversions in the reverse direction (e.g. using something that handles any Animal where something that handles a Dog is expected).

The in keyword marks a type parameter as contravariant. The limitation is that variance only applies to reference conversions, so no boxing allowed!

public interface IComparer<in T>
{
    int Compare(T left, T right);
}

This means you will be able to use an IComparer<object> as though it was an IComparer<string>.

As you’d imagine, covariance and contravariance are also supported in the type parameters for generic delegates, and the default Func and Action delegates have all been marked with the appropriate in or out keywords for each type parameter.
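A quick sketch of both directions with the built-in delegates, again assuming C# 4.0 or later (the Name field and the returned string are only there to make the behaviour visible):

```csharp
using System;

class Person { public string Name = "someone"; }
class Student : Person { }

static class DelegateVarianceDemo
{
    public static string Run()
    {
        string log = "";

        // contravariance: Action<in T> lets a handler of any Person stand in
        // for a handler of Student
        Action<Person> greetPerson = p => log += "hello " + p.Name;
        Action<Student> greetStudent = greetPerson;
        greetStudent(new Student());

        // covariance: Func<out TResult> lets a factory of Student stand in
        // for a factory of Person
        Func<Student> makeStudent = () => new Student();
        Func<Person> makePerson = makeStudent;
        Person created = makePerson();

        return log + "; created a " + created.GetType().Name;
    }

    static void Main()
    {
        Console.WriteLine(Run()); // hello someone; created a Student
    }
}
```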

Further reading

Eric Lippert’s 11-part (so far!) blog entries on Covariance and Contravariance

Joe Albahari’s presentation on What’s New in C# 4.0

Functional programming with Linq – IEnumerable.Aggregate

As I was learning functional programming with F# I came across the List.reduce function which iterates through a list and builds up an accumulator value by running another function against each element in the list.

Back to the more familiar C# territory, LINQ has introduced some functional features to C# and one of these is the Aggregate function on IEnumerable<T> which works in the same way as the List.reduce function.

In the following example, you can use the Aggregate function to build up a comma-separated string from a list of strings:

var strings = new List<string> { "Jack", "Jill", "Jim", "Joe", "Jane" };
// this returns "Jack, Jill, Jim, Joe, Jane"
var comSeparatedStrings = strings.Aggregate((acc, item) => acc + ", " + item);
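Two things worth knowing about Aggregate, sketched below with a couple of helper methods of my own naming: there is an overload that takes a seed value, which lets the accumulator be a different type from the elements, and the seedless overload used above throws an InvalidOperationException on an empty sequence, so you may want to guard against that:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AggregateDemo
{
    // seeded overload: the accumulator is an int, the elements are strings
    public static int TotalLength(IEnumerable<string> strings)
    {
        return strings.Aggregate(0, (acc, item) => acc + item.Length);
    }

    // the seedless overload throws on an empty sequence, so check first
    public static string JoinOrEmpty(IEnumerable<string> strings)
    {
        return strings.Any()
            ? strings.Aggregate((acc, item) => acc + ", " + item)
            : "";
    }

    static void Main()
    {
        var strings = new List<string> { "Jack", "Jill", "Jim", "Joe", "Jane" };
        Console.WriteLine(TotalLength(strings));             // 18
        Console.WriteLine(JoinOrEmpty(strings));             // Jack, Jill, Jim, Joe, Jane
        Console.WriteLine(JoinOrEmpty(new List<string>()));  // prints an empty line
    }
}
```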