One of the best-known code smells in .NET development is explicitly messing with the Garbage Collector. In my coding travels, I’ve had at least a couple of situations where a performance or memory consumption issue was traced back to a GC.Collect() call. In my own applications, I’ve generally followed Rico Mariani’s Rule #1, “don’t use it.” I figure the .NET Framework team has a much better idea of how to handle that sort of stuff than I do, and for the most part, their ideas work just fine.
I’ve been working on a project for the last month or so that requires the use of an OCR API to read typewritten words from TIFF images and then run them up against a set of regular expression patterns looking for useful data. There are several OCR libraries available for purchase, but after significant testing and analysis, we decided that the Microsoft Office Document Imaging (MODI) library that comes with Office 2003 works about as well as anything else, and has the added benefit of being free, since our workstations are still running Office 2003 and haven’t been upgraded to a newer version of Office. I spent some time figuring out how to use MODI for OCR, then found a nice tutorial on Code Project that would have saved me some time.
Our TIFF images are stored in a homegrown database/file system setup that, after years of working with FileNet, is a cool breeze on a hot day. Through more prototyping, I found that MODI OCR’s performance was best if I copied each multipage TIFF file to the local system, split it into single-page files, and then ran each file through the OCR process, cleaning everything up afterward. Here’s a chopped down version:
    MODI.Document modiDoc = null;
    MODI.Image modiImage = null;
    MODI.Word modiWord = null;
    List<String> filesToProcess = null;
    try
    {
        // Split the multipage TIFF into single-page files in the working folder
        filesToProcess = SplitTif(inputFile, workingFolder);
        foreach (var fileToProcess in filesToProcess)
        {
            try
            {
                modiDoc = new MODI.Document();
                modiDoc.Create(fileToProcess);
                modiDoc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
                for (var i = 0; i < modiDoc.Images.Count; i++)
                {
                    // Note: modiImage and modiWord are reassigned on every pass, so
                    // only the last reference of each reaches the finally block below
                    modiImage = (MODI.Image)modiDoc.Images[i];
                    for (var j = 0; j < modiImage.Layout.Words.Count; j++)
                    {
                        modiWord = (MODI.Word)modiImage.Layout.Words[j];
                        if (System.Text.RegularExpressions.Regex.IsMatch(modiWord.Text, regexPattern))
                        {
                            //Do stuff that needs to be done when there's a hit
                        }
                    }
                }
            }
            finally
            {
                // Release the COM references explicitly, innermost object first
                if (modiWord != null)
                    DisposeCom(modiWord);
                if (modiImage != null)
                    DisposeCom(modiImage);
                if (modiDoc != null)
                    DisposeCom(modiDoc);
                modiWord = null; //This may be redundant...?
                modiImage = null;
                modiDoc = null;
            }
        }
    }
    catch (Exception ex)
    {
        //handle it
    }
    finally
    {
        //Cleanup temp files
        if (filesToProcess != null && rdoNIS.Checked)
            filesToProcess.ForEach(System.IO.File.Delete);
    }
Notice all of the cleanup. I’ve learned the hard way that it’s never a bad idea to be very explicit about cleaning up COM objects, especially since a lot of COM objects don’t implement anything close to IDisposable. So, with this code, I thought I was good to go.
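The listing calls a DisposeCom helper without showing it. As a rough idea of what such a helper might look like, here is a minimal sketch. The class name `ComCleanup`, the `bool` return value, and the choice of `Marshal.FinalReleaseComObject` are all assumptions made for illustration; the real helper may well differ.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ComCleanup
{
    // Hypothetical stand-in for the DisposeCom helper used in the listing above.
    // Returns true only if a COM reference was actually released.
    public static bool DisposeCom(object comObject)
    {
        // Null and plain managed objects are ignored rather than treated as errors.
        if (comObject == null || !Marshal.IsComObject(comObject))
            return false;

        // Drops every reference this runtime callable wrapper holds on the
        // underlying COM object; the wrapper must not be used afterwards.
        Marshal.FinalReleaseComObject(comObject);
        return true;
    }
}
```

With something like this in place, `ComCleanup.DisposeCom(modiDoc)` can be called safely even when the variable was never assigned. It only releases the wrapper it is handed, so any MODI object that was never passed to it still waits on the garbage collector.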
And for the most part, I was. My company has two main data centers: one in the building where I work, in a city I’ll call St. Small, and one that’s roughly 1,300 miles away, in a city I’ll call Sunville. The TIFF images and the database I’m persisting data to are in Sunville. The workstations that I started out running my client application for the OCR process are all in St. Small. They worked fine, albeit slowly due to network latency and crappy hardware. But I wanted the process to run faster, since we have a lot of images to sift through, so I nabbed up the only workstation I could find in Sunville, set up my client, and started to run a batch.
I was quite surprised when I repeatedly and inconsistently received OutOfMemory exceptions from the MODI OCR method. I checked all of the system resources, running programs, and RAM, and everything looked fine. The Sunville workstation has a dual-core processor at 2.4 GHz with 2 GB of RAM, which should be totally adequate for the MODI process, right? Wrong. No matter how hard I tried, I could not get these exceptions to go away. What was even more interesting is that I wasn’t getting the errors on the workstations in St. Small.
So what’s the difference? Duh. Since the Sunville workstation doesn’t have nearly as much network latency to deal with, it does run quite a bit faster. Since it’s able to run faster, it’s creating and destroying MODI COM instances much more frequently than the copies running on the St. Small systems. So I did some more web research, more testing, and with a cringe, I added this line of code (and the comment) to my finally block:
    //THIS IS VERY VERY BAD YOU BAD BOY
    GC.Collect();
I tested it on my workstation and performance didn’t seem to suffer too much. So I dropped the new version of my application on the Sunville workstation and fired it up, thinking I’d solved the problem.
Nope. The Sunville system still threw random OutOfMemory exceptions.
What gives? I thought GC.Collect() was the magic baseball bat that beat the crap out of everything? If it isn’t, why is it so terrible to use it? Well, the answer is, it is terrible to use. Read Rico’s post and the many other articles on Garbage Collection, if you don't believe me. But in my situation it seems necessary, since we have so many images to process and I don’t want to babysit every instance of the application. I still had the problem, though. Why wasn’t GC.Collect() working?
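Because I wasn’t using it correctly, that’s why. The client application I wrote was a quick and dirty Windows Forms app, so all of the MODI OCR calls were synchronous. GC.Collect() itself blocks while it collects, but it doesn’t run finalizers. Those are queued up for a separate finalizer thread, and as far as I can tell, the runtime’s wrappers around COM objects don’t let go of the underlying object until they’ve been finalized. So my loop was racing ahead and creating new MODI instances before the old ones were really gone. You can make the calling thread wait for that queue to drain by adding one line of code, which I did, and now my application runs wherever I want it to.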
    //THIS IS STILL VERY BAD AND YOU ARE STILL A BAD BOY
    GC.Collect();
    GC.WaitForPendingFinalizers();
It’s funny: I can actually see the points when the code execution is sitting on that line. It doesn’t happen often, but it clears up all of my errors. If anyone knows a better way, I would love to hear it. I don’t want to use it, but for this project, I’ve come to believe that forcing garbage collection is a necessary evil.
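For anyone curious why the extra call matters, here is a small sketch with no MODI in it at all. Finalizers run on a dedicated finalizer thread, so GC.Collect() on its own only queues them; GC.WaitForPendingFinalizers() blocks the caller until that queue is empty. The `TrackedResource` and `ForcedCleanup` names are made up for the demo, and the second GC.Collect() is a commonly used addition, not something from my application.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

public sealed class TrackedResource
{
    private static int finalizedCount;

    // Number of instances whose finalizer has run so far.
    public static int FinalizedCount
    {
        get { return Volatile.Read(ref finalizedCount); }
    }

    // Runs on the finalizer thread, not on the thread that called GC.Collect().
    ~TrackedResource()
    {
        Interlocked.Increment(ref finalizedCount);
    }
}

public static class ForcedCleanup
{
    // Kept out of line so no reference to the garbage lingers in the caller's frame.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void AllocateGarbage(int count)
    {
        for (int i = 0; i < count; i++)
        {
            new TrackedResource();
        }
    }

    public static void CollectAndWait()
    {
        GC.Collect();                  // finds unreachable objects and queues their finalizers
        GC.WaitForPendingFinalizers(); // blocks until the finalizer thread has drained the queue
        GC.Collect();                  // reclaims the memory of objects that were just finalized
    }
}
```

Calling `ForcedCleanup.AllocateGarbage(100)` followed by `ForcedCleanup.CollectAndWait()` and then reading `TrackedResource.FinalizedCount` shows the finalizers have run by the time the wait returns. Without the wait, the count read right after GC.Collect() can lag behind, which is roughly the race my OCR loop was losing.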