this article is a continuation of an earlier post
after gorging myself over the holidays, i got to thinking about how the .Net framework handles string types (what goes better w/ turkey than thinking about code?). there is a tool located here that i sometimes use to profile libraries i write/hijack for performance and just to see what's going on under the hood. it's the .Net equivalent of SQL Server's query analyzer execution plan, though in my opinion much more powerful. two of the biggest performance hogs in .Net (expectedly) are boxing/unboxing and string concat'ing. i will save boxing/unboxing for another post; here i wanted to touch on how .Net handles strings.
most of the built-in types in .Net (int, bool, double and so on) are value types, but the built-in type 'string' is an exception...despite the value type style syntax for instantiation/assignment, it is in fact a heap based reference type. the fact that .Net has a built in type for strings comes as a great relief for C/C++ guys, as strings in C (and in plenty of C++ code) were nothing more than an array of characters (while not entirely different in .Net, at least the wrapper 'string' class has already been written for us, allowing us to focus on more important things than writing our own). here is the caveat of .Net string types though: they are immutable. take the following code for instance:
    string s1 = "a string";
    string s2 = s1;
    Console.WriteLine("s1 is " + s1);
    Console.WriteLine("s2 is " + s2);
    s1 = "another string";
    Console.WriteLine("s1 is now " + s1);
    Console.WriteLine("s2 is now " + s2);
    Console.ReadLine();
the output from this is:
    s1 is a string
    s2 is a string
    s1 is now another string
    s2 is now a string
in other words, pointing s1 at a new value had no effect on s2, which may not be what you'd expect from a reference type. the assignment did not modify the original string object; it made s1 reference a brand new one, while s2 still references the original. once a string is initialized with a value, that particular string object will always hold that value, it will never change, and any operation that appears to modify a string actually hands back a new one.
which brings me to my second point, the overhead required for concat'ing strings. so what happens when you declare, initialize, and then try to concat another string to an existing string? an object of type System.String is created with just enough memory for however many characters are in the string. when you code up a concat (someString += "some more text here"), it would appear syntactically that "some more text here" is simply being tacked on to the end of someString, however this is not the case. what is really happening is that an entirely new string is created with just enough memory allocated to store the combined text (if someString.Length = 40 and you want to append 60 more chars to it, the CLR creates an entirely new string that is 100 chars long), then the variable someString is updated to reference the new string, and the old string is orphaned and will be cleaned up when the GC comes along (which for the most part is out of our control). i won't supply an example here, but clearly if your application makes extensive use of text processing (such as the project i am working on right now, more on that later), repeated concats can cause some pretty severe performance problems. fortunately for us, MS has supplied the StringBuilder class, which lives in the System.Text namespace and alleviates most of the memory issues associated with repeated string concat's.
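to make the concat behaviour a little more concrete, here is a minimal sketch (the class and method names are mine, purely for illustration). it keeps a second reference to a string, concats onto the first, and then checks whether both variables still reference the same object:

```csharp
using System;

public static class ConcatDemo
{
    // hypothetical helper: returns true if the variable still references
    // the same object after the += (for non-empty text it will not)
    public static bool ConcatReusesObject(string start, string extra)
    {
        string before = start;   // second reference to the original object
        start += extra;          // allocates a new, longer string
        return object.ReferenceEquals(before, start);
    }

    public static void Main()
    {
        string original = "some string";
        string alias = original;          // both reference the same object

        original += " plus more text";    // original now references a new object

        Console.WriteLine(alias);         // prints "some string"
        Console.WriteLine(original);      // prints "some string plus more text"
        Console.WriteLine(ConcatReusesObject("abc", "def")); // prints "False"
    }
}
```

the old string is never touched; alias still sees it exactly as it was, and it only becomes garbage once nothing references it any more.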
i won't bore anyone with specifics as to the inner workings of the StringBuilder, but basically when the Append method is called, the new text is written into the buffer the StringBuilder already owns, and if the StringBuilder runs out of allocated space, it roughly doubles its own size transparently and keeps on truckin' without mangling your application's memory usage.
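as a rough sketch of what that looks like in client code (the names BuilderDemo and BuildLines are made up for this example), the loop below appends to a single StringBuilder instead of creating a new string on every pass. the starting capacity of 16 is deliberately tiny so the builder has to grow along the way:

```csharp
using System;
using System.Text;

public static class BuilderDemo
{
    // hypothetical helper: builds "line 0\nline 1\n..." in one StringBuilder
    public static string BuildLines(int count)
    {
        StringBuilder sb = new StringBuilder(16); // small initial capacity
        for (int i = 0; i < count; i++)
        {
            // each Append writes into the builder's existing buffer,
            // which grows on its own when it fills up
            sb.Append("line ").Append(i).Append('\n');
        }
        return sb.ToString(); // one final string is created here
    }

    public static void Main()
    {
        string text = BuildLines(1000);
        Console.WriteLine(text.Length);

        StringBuilder sb = new StringBuilder(16);
        sb.Append("this text is longer than sixteen characters");
        Console.WriteLine(sb.Capacity >= sb.Length); // prints "True"
    }
}
```

how exactly the capacity grows is an implementation detail that has changed between framework versions, so i wouldn't rely on a particular number, only on the fact that the growth is handled for you.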
one of the projects i am working on right now is creating a flat file for our end of year tax application to consume (fortunately our parent company is actually processing this file for us; unfortunately their system is written in COBOL, so it expects a very specific file format with large amounts of whitespace/zeros between fields. the file itself is 2071 bytes wide and each and every byte needs to be occupied by a specific piece of data, be it something pulled from our DB, or whitespace). i initially set out to write it as a procedural class in C# utilizing lots of calls to StringBuilder.Append, however this quickly became very tedious, as literally in hundreds of places i was making calls like StringBuilder.Append("                    ") <-- in this case appending 20 spaces as a field delimiter for the file (in some places the specification calls for literally 200 spaces between fields, which is extremely tedious to code by hand. of course i can code a loop for that, but it's redundant to do this in hundreds of places). i don't like redundant code, and i especially don't like redundant loops. code can become obfuscated using this approach, plus keying in all the spaces by hand is error prone, not to mention that the specification is written per byte and is inclusive (in other words, if it calls for bytes 200-220 to be keyed as spaces, it's actually 21 spaces, not 20). so i set out to write my own text processing library to handle all of this for me. the result has tidied up the original file building application by hundreds of lines of code, and the memory savings from using a StringBuilder are substantial.
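the inclusive byte arithmetic and the padding can be sketched with nothing but framework calls. this is not the real file layout; the field names, values and byte positions below are invented for illustration, and RangeWidth is a hypothetical helper:

```csharp
using System;
using System.Text;

public static class FixedWidthDemo
{
    // width of an inclusive byte range from a spec,
    // e.g. bytes 200-220 are 220 - 200 + 1 = 21 bytes wide
    public static int RangeWidth(int firstByte, int lastByte)
    {
        return lastByte - firstByte + 1;
    }

    // builds a made-up 50 byte record: name (1-20), filler (21-41), amount (42-50)
    public static string BuildRecord(string name, string amount)
    {
        StringBuilder record = new StringBuilder(50);
        record.Append(name.PadRight(RangeWidth(1, 20)));        // text, space padded
        record.Append(' ', RangeWidth(21, 41));                 // 21 spaces, no loop needed
        record.Append(amount.PadLeft(RangeWidth(42, 50), '0')); // number, zero filled
        return record.ToString();
    }

    public static void Main()
    {
        string record = BuildRecord("SMITH", "42");
        Console.WriteLine(RangeWidth(200, 220)); // prints "21"
        Console.WriteLine(record.Length);        // prints "50"
    }
}
```

StringBuilder.Append(char, int) repeats a character a given number of times, which takes care of the 20 or 200 space runs without typing them out. note that PadRight and PadLeft only pad; a value longer than its field would still need to be truncated or rejected.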
the functionality of the StringManagement class i wrote encompasses the following behavior: provided with an array of the delimiting characters, the number of times to append these characters, and an array of strings...return a nicely formatted string or StringBuilder object (with the trailing delimiters removed after the final string in the provided array, otherwise you'd end up with the following if you wanted 5 spaces delimiting your text: "some text     some more text     even more text     " <-- note the 5 spaces after the last string, these will be removed by this library). here is the source code.
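for anyone who just wants the gist without reading the source, here is a stripped down sketch of that behavior. it is not the StringManagement class itself; the DelimiterSketch class and the JoinWithDelimiter method and its signature are invented for this example, and it handles only a single delimiting character:

```csharp
using System;
using System.Text;

public static class DelimiterSketch
{
    // hypothetical helper: joins the fields with 'delimiter' repeated
    // 'repeat' times between them, and nothing after the last field
    public static string JoinWithDelimiter(char delimiter, int repeat, string[] fields)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.Length; i++)
        {
            sb.Append(fields[i]);
            if (i < fields.Length - 1)
            {
                sb.Append(delimiter, repeat); // skipped for the final field
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string[] fields = { "some text", "some more text", "even more text" };
        string line = JoinWithDelimiter(' ', 5, fields);
        Console.WriteLine("[" + line + "]");
    }
}
```

skipping the delimiter on the last pass avoids having to trim the trailing delimiter afterwards; removing it after the fact gives the same result.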
Note that the BuildString method accepts an IDictionary parameter...i recommend using a ListDictionary. i originally wrote client code using a Hashtable object, however you have no control over the order in which the key-value pairs will be read during enumeration, so the delimiters will more than likely be written in a different order than you would like. a SortedList of course will be sorted by the key, and index positions can change internally, so that doesn't work either. i am in the process of finding another suitable collection, as ListDictionary is recommended only for 10 items or fewer. i will add overloaded methods as i think of them, recommendations are always welcome. sample client code is in the /// summary section of the BuildString method.
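here is a small sketch of the ordering point. the keys and values are placeholders (i am not showing the actual contract of BuildString here), and KeysInOrder is a hypothetical helper. a ListDictionary is a singly linked list under the hood, so in practice it enumerates in the order the items were added:

```csharp
using System;
using System.Collections;
using System.Collections.Specialized;
using System.Text;

public static class OrderingDemo
{
    // hypothetical helper: lists the keys in enumeration order
    public static string KeysInOrder(IDictionary dictionary)
    {
        StringBuilder sb = new StringBuilder();
        foreach (DictionaryEntry entry in dictionary)
        {
            if (sb.Length > 0)
            {
                sb.Append(", ");
            }
            sb.Append(entry.Key);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        ListDictionary list = new ListDictionary();
        list.Add("zeros", 9);     // placeholder entries, added in this order
        list.Add("spaces", 20);
        list.Add("filler", 200);

        Console.WriteLine(KeysInOrder(list)); // prints "zeros, spaces, filler"
    }
}
```

passing a Hashtable with the same three entries to KeysInOrder compiles and runs just as well, but the order of the keys depends on their hash codes, which is exactly the problem described above.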