Hashtables

A hashtable is a data structure for storing unique keys. Usually these keys have values associated with them which themselves do not have to be unique.

Hashtables are commonly used to implement Dictionaries, a data structure for mapping unique keys to values. Hashtables can also be used to implement sets and multisets.

Hashtables can only store unique keys, but I want to store a key and a value for every element in a hashtable. In order to do that, I will make a simple class to store the pair, and overload the comparison operators to only compare keys:

template <typename Key, typename Value>
class Pair {
    Key key;
    mutable Value value;

    Pair() : key(), value() {}
    Pair(const Key &k, const Value &v) : key(k), value(v) {}

    Pair& operator = (const Pair &p)
        { key = p.key; value = p.value; return *this; }
    bool operator == (const Pair &pair) const
        { return key == pair.key; }
    bool operator != (const Pair &pair) const
        { return key != pair.key; }
};

Each time a key is inserted into a hashtable, the index of the where to store the key in the underlying array gets calculated using a special hash function:

template <typename T>
unsigned int hash(const T &key);

Similarly, if you want to lookup a key in a hashtable, just pass the key into the hash function to retrieve it’s index. This theoretically makes all hashtable operations perform in constant time, making it the fastest data structure (that I know of).

The hash function should be designed to output a unique integer for every key passed to it. So if A != B, then hash(A) != hash(B). And if A = C, then hash(A) = hash(C).

The default hash function for GNU C++ is actually collection of template hash functions. For all integer related types (bool, char, short, int, long), a specialized template hash function will simply return it’s value. For pointer types, the pointer will be returned casted to a size_t. For all other types, the function will work by iterating over the bytes that make up that value. GNU C++ 4.6 implements it similarly to this:

template <typename T>
size_t hash(const T &key, size_t size) {
    size_t hash = 2166136261UL, len = size;
    const char *cptr = reinterpret_cast<const char*>(&key);
    for (; len; --len) hash = (hash * 131) + *cptr++;
    return hash;
}

Even though it’s supposed to be unlikely, it is very possible that a hash function could output the same value for different keys. This is known as a hash collision. If this happens, hash tables usually store both of the keys anyway. This can be done by having the array store a list of keys in each slot (rather than a single key); otherwise known as a bucket. This type of solution to hash collisions is known as separate chaining. The type of list to use with separate chaining is up to you, but they are usually arrays or linked lists. So the underlying data type of a hashtable with separate chaining is an array of buckets of keys.

And the following steps must be taken to lookup a key in a hashtable:

  1. Pass the key to the hash function to obtain an index.
  2. Go to the slot at that index in the array.
  3. Iterate over the keys stored in that slot until you’ve found the matching key.

This function will find a given key in a hashtable and return a pointer to where it was found:

template <typename Key>
const Key* Hashtable<Key>::find(const Key &key) const {
    size_t hash;
    const bucket_t *bucket;
    typename bucket_t::const_iterator it;
    hash = Hash(key) % table.size();
    bucket = &table[hash];
    if (bucket->empty()) return NULL;
    for (it = bucket->begin(); it != bucket->end(); ++it) {
        if (*it == key) return &(*it);
    }
    return NULL;
}

Notice that the modulo of the hash function output and the table size was used. This is because the large range of the hash function must be minimized to match the limited range of the array in the hashtable.

If a dictionary was implemented on top of such a hash table with Pair used as the key, the following function can be used to lookup a value associated with a key:

template <typename Key, typename Value>
Value& Dictionary<Key, Value>::find(const Key &key) const {
    return hashtable.find(Pair<Key, Value>(key, Value())->value;
}

Unless your hashtable has a fixed size, the most costly operation is resizing the table and performing a rehash of every element. Most implementations perform a rehash when the hashtable reaches 70 to 80% capacity. When this happens, a new table will be allocated which usually twice the size of the original table. Then all elements in the old table are iterated and their new indices for the new hashtable are calculated with the modulo of the hash function and the new hashtable size before being inserted in the new hashtable.

A rehash function might look like this:

template <typename Key>
void Hashtable<Key>::rehash(size_t new_size) {
    size_t hash;
    table_t new_table(new_size);
    typename table_t::iterator table_iter;
    typename bucket_t::iterator bucket_iter;
    for (table_iter = table.begin(); table_iter != table.end(); ++table_iter) {
        if (table_iter->empty()) continue;
        for (bucket_iter = table_iter->begin(); bucket_iter != table_iter->end(); ++bucket_iter) {
            hash = Hash(*bucket_iter) % new_size;
            new_table[hash].push_back(*bucket_iter);
        }
    }
    table = new_table;
}

Other than the occasional rehash, the constant time for inserting, deleting, and searching on a hashtable makes it the data structure of choice if you have a need for speed. A clear disadvantage to hashtables though, is the large range of slots allocated in the table takes up a lot of memory. A more conservative approach to implementing dictionaries is with a binary search tree, using significantly less memory and performance in logarithmic time.

I will talk about binary search trees in the next post.

Leave a comment