[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

scm_string_hash suboptimal?

From: Florian Weimer
Subject: scm_string_hash suboptimal?
Date: 20 May 2001 13:01:27 +0200
User-agent: Gnus/5.090004 (Oort Gnus v0.04) Emacs/20.7

Looking for a good hash function for strings, I stumbled across
scm_string_hash guile-core/libguile, but it doesn't look too well:

     unsigned long 
     scm_string_hash (const unsigned char *str, scm_sizet len)
       if (len > 5)
           scm_sizet i = 5;
           unsigned long h = 264;
/* (1) */  while (i--)                                
/* (2) */    h = (h << 8) + (unsigned) str[h % len];
           return h;
           scm_sizet i = len;
           unsigned long h = 0;
           while (i)
             h = (h << 8) + (unsigned) str[--i];
           return h;

The loop (1) is executed five times, but on most targets, only four
iterations really contribute to the hash value (because of the size of
an unsigned long).  The division in (2) is quite expensive.  If it's
avoided, probably more characters of the string could be incorporated
into the hash, and the computation would still be faster.

In addition, a few tests revealed that this hash functions tends
to produce lots of collisions, both on decimal numbers and
(lowercase/titlecase) English words, and that, especially with English
words, a few bit patterns are returned extremly often due to resonance
effects.  For example, each word starting with 'p' which is eight
characters long is hashed to the same value, 0x70707070, because both
264 % 8 == 0 and 'p' % 8 == 0.  Similar effects occur for words of
eight characters starting with other letters as well.

I don't know how critical scm_string_hash is for the performance of
GUILE, but I guess you could gain a few cycles by choosing a better
hash function.

If you want to play around with some other hash function, here's the
one from Python.  (Beware that the Python license is probably not
compatible with the GPL, so this code cannot be incorporated into
GUILE.)  Among the English words I use for tests, there is only one
collision, and among the first 100000 decimal numbers, there's only
one, too.  On my machine, the hash function is 5% to 10% faster than
the GUILE one.

unsigned long 
scm_string_hash (const unsigned char *p, scm_sizet len)
  int l = len;
  unsigned long x = *p << 7;
  while (--l >= 0)
    x = (1000003*x) ^ *p++;
  x ^= len;
  return x;

reply via email to

[Prev in Thread] Current Thread [Next in Thread]