[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: linkedhash-list vs. hash

From: Jim Meyering
Subject: Re: linkedhash-list vs. hash
Date: Wed, 23 Jul 2008 12:32:10 +0200

Eric Blake <address@hidden> wrote:
> I plan on adding automatic table resizing in m4.  I could borrow from the

Hi Eric,

That sounds good!

> master branch, which was patched several years ago to always using 2^n-1
> as the new bucket count.  Or I could use a gnulib module - both
> linkedhash-list and hash appear to fit the bill.  But there are some
> differences:
> Does it really matter whether the set size is prime vs. 2^n-1 in how
> likely a modulo operation in the hash is to cause collisions?  Both gnulib

It's hard to argue with Knuth about hash function design ;-)
The gaps between successive values of 2^n-1 are too large for my taste.

> modules always use a prime (linkedhash-list via a hard-coded table of
> primes in gl_anyhash_list2.h, hash via a runtime check next_prime in
> hash.c).  Of the options, I don't like the speed penalty of a runtime
> computation.

I would expect the time required to resize a hash table to dwarf that
required to call next_prime, and I don't recall ever seeing next_prime
anywhere but at the bottom when profiling.  So I prefer the maybe-slower-
but-precise approach to the fast-but-sloppy (up to 20% overallocation).

> It seems like linkedhash-list has some other speed improvements over hash.
> ~ In hash, the hashing callback is given the number of buckets, and must
> perform the modulo itself (not to mention that the sample function
> hash_string, when !USE_DIFF_HASH, computes a division on every byte of the
> string, rather than once at the end).  Since the hash depends on the
> number of buckets, the hash callback must be called every time the table
> is resized, and it does not make sense to cache the hash value, so every
> comparison of table elements must invoke the comparison callback.  In
> linkedhash-list, the hash function needs only compute a size_t value,
> which is then cached alongside the entry; that way, resizing the table
> does not need to call the hash callback, and when doing element
> comparisons, the comparison callback need only be called in the unlikely
> case that the cached hash values do not match.  Is the memory cost of

This sounds like an obvious win for linkedhash-list, in these days of
inexpensive RAM and the 8GB hobby system.  Back when I wrote hash.c,
data structure size was the primary constraint in the computational
geometry applications I cared about.

WRT resizing, it'd be a big win only if you're resizing frequently,
but if you're resizing frequently, you've already lost the game,
or you're dealing with pathological inputs (example below).
Hash tables are not optimized for rehash speed.

However, in some applications (esp interactive ones) making the
expensive-but-infrequent resize operation less onerous is worth a lot.
Though if worst case insertion speed is an issue, you should be using
a different data structure.

> saving the size_t hash worth this speedup, or are collisions and resizing
> generally rare enough that the number of extra callbacks is in the noise
> compared to the memory savings?
> Another difference: linkedlist-hash is hard-coded on its growth
> parameters, while hash allows the user to specify both fullness threshold
> and growth factors (and even shrink factors, as elements are removed).
> I'm not sure if m4 needs the flexibility of dynamic growth factors, so I
> could go either way.

It depends on how often you expect to have to resize tables.  I'd do
a few benchmarks to see if you can find a reasonable table size that's
large enough for most existing applications, and then not worry (much)
about the cost of resizing, since it will then happen infrequently.

> The hash module has a few more convenience wrappers, such as
> hash_do_for_each or hash_get_entries.  On the other hand, linkedlist-hash
> allows the notion of a sorted hash table, where the elements can be
> retrieved in sorted order rather than having to call qsort at the end, at
> the expense of a slower insertion.
> m4 is already using linkedhash-list (indirectly, via its use of
> avltree-oset), so at this point, I'm leaning towards ditching m4's
> hand-rolled implementation and using linkedhash-list for minimal code
> size.  Any other comments on recommended hash implementations or how
> either gnulib module could borrow ideas from each other?

You've outlined a few ways in which hash.c can be improved.
Thanks for doing that.  If anyone is interested in implementing
them, patches would be most welcome.

One case in which I'd expect a marked difference is when using "rm -rf"
to remove a pathologically deep tree, say 100K-1M levels deep.
In that case, rm spends a lot of its time adding to an ever-growing
directory-loop-detection hash table, and hence rehashing periodically.
When I profiled cases like that, rehashing was near the top.
But how often does this happen in "real life"?  Is it worth


reply via email to

[Prev in Thread] Current Thread [Next in Thread]