hash resizing bug (was: [PATCH] hash: declare some functions with the wa

bug-gnulib

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

hash resizing bug (was: [PATCH] hash: declare some functions with the wa

From:	Eric Blake
Subject:	hash resizing bug (was: [PATCH] hash: declare some functions with the warn_unused_result attribute)
Date:	Wed, 17 Jun 2009 18:21:45 +0000 (UTC)
User-agent:	Loom/3.14 (http://gmane.org/)

Jim Meyering <jim <at> meyering.net> writes:

> > Additionally, it looks like hash_rehash has a memory leak - if new_table
> > is allocated, but we later fail to allocate new_entry, the function
> > returns immediately without reclaiming new_table.  Which means that even
> > if an application is prepared to deal with a false return by trying other
> > means to reduce memory usage rather than just giving up with xalloc_die,
> > the leaked memory from the previous attempt will interfere.
> 
> Good catch.
> That is definitely a bug.

I found another bug in hash.c.  hash_rehash can get the table into a state 
where it will NEVER rehash again, regardless of how much else you call 
hash_insert.  I triggered this bug in m4 by hashing the 
strings "m1", "m2", ... "m100000"; m4 uses the default hash_tuning, and an 
initial capacity request of 511.  It also uses the following hash function (the 
hash is over the string plus its length, to allow hashing embedded NUL bytes):

static size_t
symtab_hasher (const void *entry, size_t buckets)
{
  const symbol *sym = (const symbol *) entry;
  return hash (SYMBOL_NAME (sym), SYMBOL_NAME_LEN (sym)) % buckets;
}
static size_t
hash (const char *s, size_t len)
{
  size_t val = len;

  /* This algorithm was originally borrowed from GNU Emacs, but has
     been modified to allow embedded NUL.  */
  while (len--)
    val = (val << 7) + (val >> (sizeof val * CHAR_BIT - 7)) + to_uchar (*s++);
  return val;
}

Note what happens in the call to hash_rehash, just after everything has been 
copied to the new table, but before the old table is freed:

(gdb) p *table
$1 = {bucket = 0x716318, bucket_limit = 0x717720, n_buckets = 641, 
  n_buckets_used = 513, n_entries = 3289, tuning = 0x443cd8, 
  hasher = 0x417ab5 <symtab_hasher>, 
  comparator = 0x417af1 <symtab_comparator>, 
  data_freer = 0x417b45 <symtab_free_entry>, free_entry_list = 0x0}
(gdb) p *new_table
$2 = {bucket = 0x798970, bucket_limit = 0x79a5c8, n_buckets = 907, 
  n_buckets_used = 907, n_entries = 0, tuning = 0x443cd8, 
  hasher = 0x417ab5 <symtab_hasher>, 
  comparator = 0x417af1 <symtab_comparator>, 
  data_freer = 0x417b45 <symtab_free_entry>, free_entry_list = 0x75b100}

Before the rehash, about 20% of the buckets were empty.  But after changing the 
prime factor, the hasher function scatters better, and _every single bucket_ is 
occupied.  Which means ALL subsequent calls to hash_insert will trigger an 
overflow, rather than inserting into a bucket head.  And since the resize 
checking logic in hash_rehash is only triggered after insertion into a bucket 
head, the hash table has effectively been locked into a fixed size that will 
never grow again, no matter how many hash_insert calls are made, even though 
the table is certainly exceeding the requested threshold of 20% empty buckets.  
Ouch.

In other words, this statement in hash_rehash:

  /* If the growth threshold of the buckets in use has been reached, increase
     the table size and rehash.  There's no point in checking the number of
     entries:  if the hashing function is ill-conditioned, rehashing is not
     likely to improve it.  */

is not quite right.  A hashing function may be somewhat ill-conditioned for one 
bucket size, but much better conditioned for another size.  At any rate, it 
looks like the correct fix is to check for rehashing thresholds based on number 
of entries, not number of buckets, and that this check needs to happen on every 
insertion, not just insertions that land in an empty bucket.  Or, at the very 
least, check for threshold PRIOR to insertion, rather than after, such that in 
the above case with m4, the very next call to hash_insert would notice the 
table was at 100%, and force another resize above 907 buckets.

Thoughts, before I implement something along these lines?

-- 
Eric Blake

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [PATCH] hash: declare some functions with the warn_unused_result attribute, (continued)

Prev by Date: alt hash implementation (was: [PATCH] hash: declare some functions with the warn_unused_result attribute)
Next by Date: Re: alt hash implementation
Previous by thread: Re: hash_rehash allocation failure
Next by thread: Re: hash resizing bug
Index(es):
- Date
- Thread