CSharp Cookbook-CSharp Cookbook

Recipe 10.1 Creating a Hash Code for a Data Type

Problem

You have created a class or structure that will be used as a key in a Hashtable. You need to overload the GetHashCode method in order to return a good distribution of hash values to be used in a Hashtable (the Discussion section defines a good distribution of hash values). You also need to choose the best hash code algorithm to use in the GetHashCode method of your object.

Solution

The following procedures implement hash code algorithms and can be used to override the GetHashCode method. Included in the discussion of each method are the pros and cons of using it, as well as why you would want to use one instead of another.

In addition, it is desirable, for performance reasons, to use the return value of the GetHashCode method to determine whether the data contained within two objects is equal. Calling GetHashCode to return a hash value of two objects and comparing their hash values can be faster than calling Equals, which individually tests the equality of all pertinent data within two objects. In fact, some developers even opt to compare hash code values returned from GetHashCode, within their overloaded Equals method.

The simple hash

This hash accepts a variable number of integer values and XORs each value to obtain a hash code. This simple algorithm has a good chance of producing an adequate distribution and good performance. Remember to profile and measure it to confirm that it works as well for your particular data set. It fails when you need to integrate more than just numeric values equal or smaller in size to an integer. Its code is:

public int SimpleHash(params int[] values)
{
    int hashCode = 0;
    if (values != null)
    {
        foreach (int val in values)
        {
            hashCode ^= val;
        }
    }

    return (hashCode);
}

The folding hash

This hash allows you to integrate the long data type into a hash algorithm. It takes the upper 32 bits of the long value and folds them over the lower 32 bits of this value. The actual process of folding the two values is implemented by XORing them and using the result. Once again, this is a good performing algorithm with good distribution properties, but, again, it fails when you need to go beyond the long data type. A sample implementation is:

public int FoldingHash(params long[] values)
{
    int hashCode = 0;
    if (values != null)
    {
        int tempLowerVal = 0;
        int tempUpperVal = 0;
        foreach (long val in values)
        {
            tempLowerVal = (int)(val & 0x000000007FFFFFFF);
            tempUpperVal = (int)((val >> 32) & 0xFFFFFFFF);
            hashCode^= tempLowerVal ^ tempUpperVal;
        }
    }

    return (hashCode);
}

The contained object cache

This hash obtains the hash codes from a variable number of object types. The only types that should be passed in to this method are reference type fields contained within your object. This method XORs all the values returned by the GetHashCode method of each object. Its source code is:

public int ContainedObjHash(params object[] values)
{
    int hashCode = 0;
    if (values != null)
    {
        foreach (int val in values)
        {
            hashCode ^= val.GetHashCode( );
        }
    }

    return (hashCode);
}

The CryptoHash method

Potentially the best method of obtaining a hash value for an object is to use the hashing classes built in to the FCL. The CryptoHash method returns a hash value for some input using the MACTripleDES class. This method returns a very good distribution for the hash value, although you may pay for it in performance. If you do not require a near perfect hash value and are looking for an excellent distribution, consider using this approach to calculate a hash value:

public int CryptoHash(string strValue)
{
    int hashCode = 0;
    if (strValue != null)
    {
        byte[] encodedUnHashedString = 
                Encoding.Unicode.GetBytes(strValue);

        // Replace the following Key with your own 
        // key value
        byte[] Key = new byte[16] {1,122,3,11,65,7,9,45,42,98,
                                        77,34,99,45,167,211};

        MACTripleDES hashingObj = new MACTripleDES(Key);
        byte[] code = 
                hashingObj.ComputeHash(encodedUnHashedString);

        // use the BitConverter class to take the 
        // first 4 bytes and use them as an int for
        // the hash code
        hashCode = BitConverter.ToInt32(code,0);    
    }

    return (hashCode);
}

The CryptoHash method using a nonstring

This method shows how other, nonstring data types can be used with the built-in hashing classes to obtain a hash code. This method converts a numeric value to a string and then to a byte array. The array is then used to create the hash value using the SHA256Managed class. Finally, each value in the byte array is added together to obtain a hash code. The code is:

public int CryptoHash(long intValue)
{
    int hashCode = 0;
    byte[] encodedUnHashedString = 
                Encoder.Unicode.GetBytes(intValue.ToString( ));

    SHA256Managed hashingObj = new SHA256Managed( );
    byte[] code = hashingObj.ComputeHash(encodedUnHashedString);

    // use the BitConverter class to take the 
    // first 4 bytes and use them as an int for
    // the hash code
    hashCode = BitConverter.ToInt32(code,0);    

    return (hashCode);
}

The shift and add hash

This method uses each character in the input string, strValue, to determine a hash value. This algorithm produces a good distribution of hash codes even when this method is fed similar strings. However, this method will break down when long strings that end with the same characters are passed. While this may not happen many times with your applications, it is something to be aware of. If performance is critical, this is an excellent method to use. Its code is:

public int ShiftAndAddHash (string strValue)
{
    int hashCode = 0;
    long workHashCode = 0;

    if (strValue != null)
    {
        for (int counter=0; counter<strValue.Length; counter++)
        {
            workHashCode = (workHashCode << (counter % 4)) + 
                       (int)strValue[counter];
        }
        workHashCode = workHashCode % (127);
    }
    hashCode = (int)workHashCode;

    return (hashCode);
}

The calculated hash

This method is a rather widely accepted method of creating a good hash value that accepts several different data types and uses a different algorithm to compute the hash value for each. It calculates the hash code as follows:

It assigns an arbitrary odd primary number to the HashCode variable. This variable will eventually contain the final hash code. Good primary numbers to use are 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, or 67. Obviously, others exist beyond this set, but this should give you a good starting point.
For numeric types equal to or less than the size of an int and char data types, it multiplies the current HashCode by the primary number selected and then adds to this value the value of the numeric type cast to an integer.
For numeric types greater than the size of an int, it multiplies the current HashCode by the primary number selected and then adds to this the folded version of this numeric value. (For more information on folding, see The folding hash method earlier in this recipe.)
For char, floating point, or decimal data types, it multiplies the current HashCode by the primary number selected, casts the numeric value to an integer, and then uses the folding method to calculate its value.
For bool data types, it multiplies the current HashCode by the primary number selected and then adds a 1 for true and 0 for false (you can reverse this behavior if you wish).
For object data types, it multiplies the current HashCode by the primary number selected and then adds the return value of GetHashCode called on this object. If an object is set to null, use the value 0 in your calculations.
For an array or collection, it determines the contained type(s) and uses each element of the array or collection to calculate the hash value, as follows (in the case of an integer array named MyArray):
```
foreach (int element in myArray)
{
    hashCode = (hashCode * 31) + element;
}
```

This algorithm will produce a good distributed hash code for your object and has the added benefit of the flexibility to employ any type of data type. This is a high performing algorithm for simple, moderately complex, and even many complex objects. However, for extremely complex objects—ones that contain many large arrays, large Hashtables, or other objects that use a slower hash code algorithm—this algorithm will start performing badly. In this extreme case, you may want to consider switching to another hash code algorithm to speed performance or simply paring down the amount of fields used in the calculation. Be careful if you choose this second method to increase performance; you could inadvertently cause the algorithm to produce similar values for differing objects. The code for the calculated hash method is:

public int CalcHash(short someShort, int someInt, long someLong,
                    float someFloat, object someObject)
{
    int hashCode = 7;
    hashCode = hashCode * 31 + (int)someShort;
    hashCode = hashCode * 31 + someInt;
    hashCode = hashCode * 31 + 
                        (int)(someLong ^ (someLong >> 32));
    long someFloatToLong = (long)someFloat;
    hashCode = hashCode * 31 + 
             (int)(someFloatToLong ^ (someFloatToLong >> 32));

    if (someObject != null)
    {
        hashCode = hashCode * 31 + 
                            someObject.GetHashCode( );
    }

    return (hashCode);
}

The string concatenation hash

This technique converts its input into a string, and then uses that string's GetHashCode method to automatically generate a hash code for an object. It accepts an integer array, but you could substitute any type that can be converted into a string. You could also use several different types of arguments as input to this method. This method iterates through each integer in the array passed as an argument to the method. The ToString method is called on each value to return a string. The ToString method of an int data type returns the value contained in that int. Each string value is appended to the string variable HashString. Finally, the GetHashCode method is called on the HashString variable to return a suitable hash code.

This method is simple and efficient, but it does not work well with objects that have not overridden the ToString method to return something other than their data type. It may be best to simply call the GetHashCode method on each of these objects individually. You should use your own judgment and the rules found in this recipe to make your decision:

public int ConcatStringGetHashCode(int[] someIntArray)
{
    int hashCode = 0
    StringBuilder hashString = new StringBuilder( );

    if (someIntArray != null)
    {
        foreach (int i in someIntArray)
        {
            hashString.Append(i.ToString( ) + "^");
        }
    }
    hashCode = hashString.GetHashCode( );

    return (hashCode);
}

The following using directives must be added to any file containing this code:

using System;
using System.Text;
using System.Security.Cryptography;

Discussion

The GetHashCode method is called when you are using an instance of this class as the key in a Hashtable object. Whenever your object is added to a Hashtable as a key, the GetHashCode method is called on your object to obtain a hash code to be used by the Hashtable. A hash code is also obtained from your object when a search is performed for your object in the Hashtable.

The following class implements the SimpleHash algorithm for the overloaded GetHashCode method:

public class SimpleClass
{
    private int x = 0;
    private int y = 0;

    public override int GetHashCode( )
    {
        return(SimpleHash(x, y));
    }

    public int SimpleHash(params int[] values)
    {
        int hashCode = 0;
        if (values != null)
        {
            foreach (int val in values)
            {
                hashCode ^= val;
            }
        }

        return (hashCode);
    }
}

This class could then be used as a key in a Hashtable through the following code:

SimpleClass simpleClass = new SimpleClass( );

Hashtable hashTable = new Hashtable( );
hashTable.Add(simpleClass, 100);

There are several rules for writing a good GetHashCode method and a good hash code algorithm:

This method should return the same value for two different objects that have value equality. Value equality means that two objects contain the same data.
The hash algorithm should return a good distribution of values for the best performance in a Hashtable. A good distribution of values means that the hash values returned by the GetHashCode method are usually different for objects of the same type, unless those objects have value equality. Note that objects containing very similar data should also return a unique hash value. This distribution allows the Hashtable to work more efficiently and thus perform better.
This method should not throw an exception.
Both the Equals method and GetHashCode method should be overridden together.
The GetHashCode method should compute the hash code using the exact set of variables that the overridden Equals method uses when calculating equality.
The hash algorithm should be as fast as possible to speed up the process of adding and searching for keys in a Hashtable.
When creating structures, it is always best to override the GetHashCode method if you think this hash code will ever be used, since the ValueType.GetHashCode method is slow. (ValueType.GetHashCode makes use of reflection to obtain a hash value from a type's internal fields.)
Unless the base class is object, use the base class's GetHashCode value when calculating the hash code.
Use the GetHashCode values of any contained objects when calculating the hash code of the parent object.
Use the GetHashCode values of all elements of an array when calculating the array's hash code.

The System.Int32, System.UInt32, and System.IntPtr data types in the FCL use an additional hash code algorithm not covered in the Solution section. Basically, these data types return the value that they contain as a hash code. Most likely, your objects will not be so simple as to contain a single numeric value, but if they are, this method works extremely well.

You may also want to combine specific algorithms to suit your purposes. For instance, if your object contains one or more string types and one or more long data types, you could combine the ContainedObjHash method and the FoldingHash method to create a hash value for your object. The return values from each method could either be added or XORed together.

Once an object is in use as a key in a Hashtable, it should never return a different value for the hash code. Originally, it was documented that hash codes must be immutable, as the authors of Hashtable thought that this should be dealt with by whomever writes GetHashCode. It doesn't take much thought to realize that for mutable types, if you require both that the hashcode never changes, and that Equals represents the equality of the mutable objects, and that if a.Equals(b), then a.GetHashCode( ) == b.GetHashCode( ), then the only possible value implementation of GetHashCode is one that returns the same integer constant for all values.

The GetHashCode method is called when you are using this object as the key in a Hashtable object. Whenever your object is added to a Hashtable as a key, the GetHashCode method is called on your object to obtain a hash code. This hash code must not change while your object is a key in the Hashtable. If it does, the Hashtable will not be able to find your object.

From the perspective of memory consumption, this object will not be able to have its memory freed until the Hashtable is collected causing a noteworthy degree of memory retention.