Re: linkedhash-list vs. hash

bug-gnulib

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: linkedhash-list vs. hash

From:	Bruno Haible
Subject:	Re: linkedhash-list vs. hash
Date:	Wed, 23 Jul 2008 13:40:14 +0200
User-agent:	KMail/1.5.4

Eric Blake wrote:
> both linkedhash-list and hash appear to fit the bill

The main difference between linkedhash-list and hash is that the former
implements a list in its own right, i.e. iterating over the elements always
returns them in the defined order. Whereas in 'hash', you are effectively
iterating over a set, not a list; i.e. the elements come out in random order.

> Does it really matter whether the set size is prime vs. 2^n-1 in how
> likely a modulo operation in the hash is to cause collisions?

Yes, I think. The usual hash functions use every bit in the input just once
or twice - much less than cryptographic hashes like MD5. If the input has
characteristic patterns, such as a difference of 56 between subsequent
input addresses, the hash values will likely exhibit similar patterns.
Then you have - more often than you would like to - hash values that differ
only by multiples of 3 or 7. Then, when your table size is also a multiple
of 3 or 7, you are filling only 1/3rd or 1/7th of the table and get lots
of collisions.

I would not compromise on this - a modulo operation is fast on all hardware
since 1995, and the time to search for the next prime around n is
O(sqrt(n)*log(n)) - much less than the O(n) that you need for resizing the
table.

Also, when your hash table size is 2^n-1, you are calling malloc(8*(2^n-1)),
which - in some malloc implementations - may end up allocating twice as much
memory, because sizeof(malloc_header_size) + 8*(2^n-1) is just slightly larger
than a power of two.

> Another difference: linkedlist-hash is hard-coded on its growth
> parameters, while hash allows the user to specify both fullness threshold
> and growth factors

Whether you need this or not, depends on the ratio between the number of
lookups in the table vs. the number of insertions in the table. If this
ratio is low, you profit from a larger growth factor (maybe around 1.5 or 2.0);
if it is high, a smaller growth factor (around 1.2 or 1.4) is likely better.
Anyway, I would first do some profiling to see whether it matters at all.

> Is the memory cost of
> saving the size_t hash worth this speedup, or are collisions and resizing
> generally rare enough that the number of extra callbacks is in the noise
> compared to the memory savings?

That too depends on the ratio between lookups and insertions.

> It seems like linkedhash-list has some other speed improvements over hash.
> ... In linkedhash-list, the hash function needs only compute a size_t value,
> which is then cached alongside the entry

linkedhash-list also uses more memory: 5 memory words per hash entry, in
contrast to 2 memory words in the 'hash' module.

I would say, the main advantage of linkedhash-list is that it has all the
list operations already coded, from insertion at a specified index to the
reverse-order iterator. If you already know that you want a fast hash table
(rather than a flexible data structure), then the 'hash' module is certainly
better.

> how either gnulib module could borrow ideas from each other?

If someone convincingly shows that in linkedhash-list the resize factor of 1.2
is too bad in some situations, I could make it parametrizable.

Bruno

[Prev in Thread]

Current Thread

[Next in Thread]

linkedhash-list vs. hash, Eric Blake, 2008/07/22
- Re: linkedhash-list vs. hash, Ben Pfaff, 2008/07/23
- Re: linkedhash-list vs. hash, Jim Meyering, 2008/07/23
  - Re: abundant memory?, Bruno Haible, 2008/07/23
    - Re: abundant memory?, Jim Meyering, 2008/07/24
    - Re: abundant memory?, James Youngman, 2008/07/26
- Re: linkedhash-list vs. hash, Bruno Haible <=

Prev by Date: Re: FYI, sync canonicalize-lgpl.c from glibc: Hurd fix
Next by Date: [PATCH] mktime.c: normalize tp->tm_isdst value to -1/0/1.
Previous by thread: Re: abundant memory?
Next by thread: FYI, sync canonicalize-lgpl.c from glibc: Hurd fix
Index(es):
- Date
- Thread