|
|
|
C++ HashMap
The reason for this experiment? I was interested in doing some more
C++ templates stuff. Not because it is particularily nice, but
because boost offers some interesting stuff and you really need to
understand the concepts if you want to use it in your daily work. I
am- by heart - an object oriented programmer. I am not a big fan of
templates (I did my first experiments with C++ templates in 1994) ,
anyway they can be quite useful for software structure (but not
software architecture). Therefore - as an exercise - I decided to
write an standard-compliant container. I was also interested in
doing some experiments with string hashing algs, which I looked up in
my copies of the CLR and Sedgewick's famous book. These books
cover the theory of hash maps quite sufficiently and therefore I will
not provide any theory.
Some hash_map or rather unordered_map implementations are a bit
inefficient. Anyway, until lately they weren't standardized at all -
they were called sgi extensions. I wanted my class to run on Linux and
Win32.
This is not an TR-1 drop-in replacement, I have no time to ensure
library quality code for some stuff I put up on my homepage in my
spare time. This code is work in progress and will be updated every
now and then. At present this thing is incomplete and unoptimized,
and resembles the sgi hashmap. Hell, I do not have enough time.
I actually also want to create a LRU-type of CacheMap, and it should
share most of the code of the original HashMap implementation. This is
an idea for the future. At present the whole fiddling to make this
container somehow standard-compliant is ruining 200% of the fun I had
while writing the initial straightforward implementation.
A little story about the original (not this) implementation of this
class... Ok, In my initial design I wanted to know how good a
straightforward implementation using standard containers would
perform. Or rather, how BADLY... The creation of huge number of
buckets as objects was bound to kill performance, I knew that right
along. The main problem is that an stl container owns the objects
and the resizing of the buckets in my implementation enforced the
invocation of millions of object constructions and destructions along
with heap allocations. I basically wrote a benchmark for small object
allocations on the heap, something that you need to take care of in
C++. I knew this, I just wanted to know how expensive it would
be. Well, in the case of 1000000 bucket-entries of
std::pair<std::string,whatever> it had to be expensive, so I dumped
it.
Or, as Joel Spolsky said, we are "going back to ones and zeros".
Since we do have a von Neumann computer architecture, I decided to use
it efficiently. Since I did care about insertion speed, I had to
tweak for performance from the very beginning. Massive object
constructions had be avoided.
The current Implementation uses a vector of Pointer to Buckets
(vector<Bucket*> and beats the living crap out of the original
implementation. Still, heap allocation schemes vastly differ among
platforms (Win32 and Linux) and STL vector implementations
(Dinkumware/Win32, STLPort/Win32, libstd++/Linux) and significantly
affect runtime performance. But then again, when you are up to
creating a hashmap of 1 million entries for fast lookups and writes
you already are considering a special use case and not the general
one, and therefore special modifications can be feasible, given these
requirements.
Some other points:
- Not threadsafe
- There are tons of considerations for writing a real library-quality
iterator. That's why the boost iterator library exists.
- Does not remotely represent TR-1 state.
Download zip sourcecode: ContainerPrj.zip. Since the class fits in
one header file, I've put it online for a quick glimpse: HashMap.h
(Warning: I did not heavily document that class yet because Its not
complete stable yet and I want to keep changes as cheap as possible.
Once the software structure has stabilized I will. Also, the
interfaces to the sgi hash_map or the unordered_map are publicly
available.)
Here are the requirements:
- Linux: gnu toolchain (gcc,make,autoconf,automake). Tested with
g++ 3.3.5, g++ 3.4.x. g++ 4.x is probably problematic
- Win32: VS.NET 2003 / VS.NET 2005 (Express edition)
On Win32 you can simply use the provided solution (.sln) file. On
Linux you have to boostrap and configure first, f.e.
chmod +x ./bootstrap
./bootstrap
./configure
make all
./src/ContainerPrj
I used GNU Debian Sarge and gcc3.3.5. On Win32 I achieved some
performance boosts by using STLPort 5.02. This software is released
under a BSD-style license.
The main target, ContainerPrj will do some tests and checks. It tests
the string->long hashmap use case with std::string. I wanted to test
the worst case: lots of collisions and lots of insertions, in
comparison to std::map. it compares our hashmap with the corresponding
std::map like this:
- always test with one million entries {std::string --> long}
- test insertions and queries with names that (probably) are bound to
collide (i.e. 'Name 0', 'Name 1', 'Name 3443').
- test insertions and queries with random alphanumeric strings.
- run both tests again with presized buckets.
2006-10-14: | Build00054 - dumped old design / page rewrite
|
2006-09-24: | Build00053 - hashmap2 initial impl
|
2006-09-18: | Fixes to make this compile with gcc3.4.x (ouch!)
|
2006-09-16: | First official release (Build00051)
|
2006-09-09: | Text on homepage
|
|
|