[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[gawk] hashing
From: |
Bonzini |
Subject: |
[gawk] hashing |
Date: |
Fri, 7 Jun 2002 23:28:52 +0200 |
Bruno Haible and I recently played with hash functions in gettext, and he
observed that gawk's function performs terribly on keys like 'ABCD' or
'1234'. Hashing into a 2k table the 1296 strings [A-F][A-F][A-F][A-F] only
used 228 buckets.
If you are interested I can plug into gawk a different hash function from
Bob Jenkins that performs extremely well. As a complementary advantage,
this hash function supports hash table sizes that are powers of two because
it scrambles the bits very well (no need for a slow modulo-prime operation
to do further scrambling).
Here are some statistics for this hash function (in this case it was used in
a double hashing scheme, but that does not matter much). First I hashed the
4602 msgid strings from gcc.pot, using a table size of 8192 for Jenkins'
hash and 8089 for others.
Strings with no collisions: your hash 3277, gawk 3912, Jenkins 3922
Extra lookups needed: your hash 3605, gawk 1535, Jenkins 1518
(re. the second figure: 1 is added when there is a collision on the primary
hash only, 2 when there is also one on the secondary hash, 3 when there are
two collisions on the secondary hash, etc.)
gawk's hash made a good job on English text.
The 4-letter words from {A,B,C,D,E,F} (e.g. "ABAF") are 6^4=1296 strings. I
used 2048 for Jenkins hash and 2029 for others (2027 is prime). This is
indeed a worst case for gawk's hash, which used only 10% of the available
buckets:
Strings with no collisions: your hash 931, gawk 228, Jenkins 1180
Extra lookups needed: your hash 1199, gawk 3605, Jenkins 262
Are you interested in me making a patch for this?
Paolo Bonzini
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- [gawk] hashing,
Bonzini <=