Memory leak in ADO.NET DataSet

Over the last cou­ple of years, there have been many discussions/debates on DataSet vs Col­lec­tions, and there was a very good arti­cle in MSDN mag­a­zine on just that:

http://msdn.microsoft.com/en-gb/magazine/cc163751.aspx#S7

To add to the Dark Sides of DataSet, there is a lit­tle known feature/bug/annoyance in the DataTable.Select() method — every time you call the Select() method it cre­ates a new index implic­it­ly with­out you hav­ing any con­trol over it, and the index is not cleared until you call DataTable.AcceptChanges().

If your appli­ca­tion has to deal with a large amount of data and have to use the Select() method repeat­ed­ly with­out call­ing AcceptChanges() then you might have a prob­lem! Why? Con­sid­er these two fac­tors:

1. the big­ger the DataT­able, the big­ger the index, and if the index object is big­ger than 85kb it gets allo­cat­ed to the Large Object Heap which are not cleared auto­mat­i­cal­ly by the Garbage Collector/takes much longer to clear than small objects

2. in a 32-bit win­dows sys­tem, there’s a 2GB Vir­tu­al Address Space lim­it for each process, and in prac­tice, you will usu­al­ly get an Out­OfMem­o­ryEx­cep­tion when your process has used around 1.2GB — 1.5GB of RAM

com­bine them and it’s not hard to imag­ine a sce­nario where your process might actu­al­ly run out of mem­o­ry and crash out before it com­pletes its task! (Believe me, it was a hard learned les­son from my per­son­al expe­ri­ence!)

Solu­tions:

1. unless you actu­al­ly need some of the fea­tures DataSet offers such as the abil­i­ty to keep mul­ti­ple ver­sions of the same row (Orig­i­nal, Cur­rent, etc.) you might be bet­ter off with using POCO (plain old CLR object) instead which are sim­ple, light­weight and you can use LINQ to Objects with i4o to get some impres­sive per­for­mance improve­ments. After I imple­ment­ed this change, my appli­ca­tion went from crash­ing out with Out­OfMem­o­ryEx­cep­tion to max­ing out at 70MB through­out its life­time and fin­ished in about 15% of the time it’d have tak­en using DataSet.

2. if get­ting rid of DataSet alto­geth­er takes a lit­tle too much time and effort than you can afford, then there’s a quick workaround by using a DataView and dynam­i­cal­ly change the Fil­ter string every time you intend to call the Select() method.

If you wish to learn more about Garbage Col­lec­tion in gen­er­al, you should read Maoni’s WebLog which cov­ers all things CLR Garbage Col­lec­tor! He also wrote a nice arti­cle focused on Large Object Heap back in June 2008 which is well worth a read:

http://msdn.microsoft.com/en-us/magazine/cc534993.aspx