C++ HashMap

Contents

Introduction

The reason for this experiment? I was interested in doing some more C++ templates stuff. Not because it is particularily nice, but because boost offers some interesting stuff and you really need to understand the concepts if you want to use it in your daily work. I am- by heart - an object oriented programmer. I am not a big fan of templates (I did my first experiments with C++ templates in 1994) , anyway they can be quite useful for software structure (but not software architecture). Therefore - as an exercise - I decided to write an standard-compliant1 container. I was also interested in doing some experiments with string hashing algs, which I looked up in my copies of the CLR and Sedgewick's famous book. These books cover the theory of hash maps quite sufficiently and therefore I will not provide any theory.

Scope

Some hash_map or rather unordered_map implementations are a bit inefficient. Anyway, until lately they weren't standardized at all - they were called sgi extensions. I wanted my class to run on Linux and Win32.

This is not an TR-1 drop-in replacement, I have no time to ensure library quality code for some stuff I put up on my homepage in my spare time. This code is work in progress and will be updated every now and then. At present this thing is incomplete and unoptimized, and resembles the sgi hashmap. Hell, I do not have enough time.

I actually also want to create a LRU-type of CacheMap, and it should share most of the code of the original HashMap implementation. This is an idea for the future. At present the whole fiddling to make this container somehow standard-compliant is ruining 200% of the fun I had while writing the initial straightforward implementation.

Implementation

A little story about the original (not this) implementation of this class... Ok, In my initial design I wanted to know how good a straightforward implementation using standard containers would perform. Or rather, how BADLY... The creation of huge number of buckets as objects was bound to kill performance, I knew that right along. The main problem is that an stl container owns the objects and the resizing of the buckets in my implementation enforced the invocation of millions of object constructions and destructions along with heap allocations. I basically wrote a benchmark for small object allocations on the heap, something that you need to take care of in C++. I knew this, I just wanted to know how expensive it would be. Well, in the case of 1000000 bucket-entries of std::pair<std::string,whatever> it had to be expensive, so I dumped it.

Back to school

Or, as Joel Spolsky said, we are "going back to ones and zeros". Since we do have a von Neumann computer architecture, I decided to use it efficiently. Since I did care about insertion speed, I had to tweak for performance from the very beginning. Massive object constructions had be avoided.

The current Implementation uses a vector of Pointer to Buckets (vector<Bucket*> and beats the living crap out of the original implementation. Still, heap allocation schemes vastly differ among platforms (Win32 and Linux) and STL vector implementations (Dinkumware/Win32, STLPort/Win32, libstd++/Linux) and significantly affect runtime performance. But then again, when you are up to creating a hashmap of 1 million entries for fast lookups and writes you already are considering a special use case and not the general one, and therefore special modifications can be feasible, given these requirements.

Some other points:

  • Not threadsafe
  • There are tons of considerations for writing a real library-quality iterator. That's why the boost iterator library exists.
  • Does not remotely represent TR-1 state.

Download

Download zip sourcecode: ContainerPrj.zip. Since the class fits in one header file, I've put it online for a quick glimpse: HashMap.h (Warning: I did not heavily document that class yet because Its not complete stable yet and I want to keep changes as cheap as possible. Once the software structure has stabilized I will. Also, the interfaces to the sgi hash_map or the unordered_map are publicly available.)

Here are the requirements:

On Win32 you can simply use the provided solution (.sln) file. On Linux you have to boostrap and configure first, f.e.

chmod +x ./bootstrap
./bootstrap
./configure
make all
./src/ContainerPrj

I used GNU Debian Sarge and gcc3.3.5. On Win32 I achieved some performance boosts by using STLPort 5.02. This software is released under a BSD-style license.

Test Program (ContainerPrj)

The main target, ContainerPrj will do some tests and checks. It tests the string->long hashmap use case with std::string. I wanted to test the worst case: lots of collisions and lots of insertions, in comparison to std::map. it compares our hashmap with the corresponding std::map like this:

  1. always test with one million entries {std::string --> long}
  2. test insertions and queries with names that (probably) are bound to collide (i.e. 'Name 0', 'Name 1', 'Name 3443').
  3. test insertions and queries with random alphanumeric strings.
  4. run both tests again with presized buckets.

Changes

2006-10-14:

Build00054 - dumped old design / page rewrite

2006-09-24:

Build00053 - hashmap2 initial impl

2006-09-18:

Fixes to make this compile with gcc3.4.x (ouch!)

2006-09-16:

First official release (Build00051)

2006-09-09:

Text on homepage

[1] Actually I am going to use the word STL container here; I know its the standard library and not the STL anymore.