[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
linkedhash-list vs. hash
linkedhash-list vs. hash
Tue, 22 Jul 2008 20:48:30 -0600
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:184.108.40.206) Gecko/20080421 Thunderbird/220.127.116.11 Mnenhy/0.7.5.666
-----BEGIN PGP SIGNED MESSAGE-----
I've noticed some sluggishness in m4's hand-written symbol hash table when
the user has lots of symbols, particularly after some recent work on
autoconf experimenting with O(n) unique set entry (compared to the current
O(n^2) m4_append_uniq) by creating a new witness macro for each unique
entry. In m4 1.4.x, the symbol table is hardcoded to 509 buckets unless
you use -H, but autom4te does not use it. Therefore, with coreutils'
configure.ac having more than 450 AC_SUBST, along with all the other
m4sugar macros, the pigeonhole principle says that running autoconf on
coreutils is suffering from quite a few hash collisions, such that hash
lookup is no longer O(1).
I plan on adding automatic table resizing in m4. I could borrow from the
master branch, which was patched several years ago to always using 2^n-1
as the new bucket count. Or I could use a gnulib module - both
linkedhash-list and hash appear to fit the bill. But there are some
Does it really matter whether the set size is prime vs. 2^n-1 in how
likely a modulo operation in the hash is to cause collisions? Both gnulib
modules always use a prime (linkedhash-list via a hard-coded table of
primes in gl_anyhash_list2.h, hash via a runtime check next_prime in
hash.c). Of the options, I don't like the speed penalty of a runtime
It seems like linkedhash-list has some other speed improvements over hash.
~ In hash, the hashing callback is given the number of buckets, and must
perform the modulo itself (not to mention that the sample function
hash_string, when !USE_DIFF_HASH, computes a division on every byte of the
string, rather than once at the end). Since the hash depends on the
number of buckets, the hash callback must be called every time the table
is resized, and it does not make sense to cache the hash value, so every
comparison of table elements must invoke the comparison callback. In
linkedhash-list, the hash function needs only compute a size_t value,
which is then cached alongside the entry; that way, resizing the table
does not need to call the hash callback, and when doing element
comparisons, the comparison callback need only be called in the unlikely
case that the cached hash values do not match. Is the memory cost of
saving the size_t hash worth this speedup, or are collisions and resizing
generally rare enough that the number of extra callbacks is in the noise
compared to the memory savings?
Another difference: linkedlist-hash is hard-coded on its growth
parameters, while hash allows the user to specify both fullness threshold
and growth factors (and even shrink factors, as elements are removed).
I'm not sure if m4 needs the flexibility of dynamic growth factors, so I
could go either way.
The hash module has a few more convenience wrappers, such as
hash_do_for_each or hash_get_entries. On the other hand, linkedlist-hash
allows the notion of a sorted hash table, where the elements can be
retrieved in sorted order rather than having to call qsort at the end, at
the expense of a slower insertion.
m4 is already using linkedhash-list (indirectly, via its use of
avltree-oset), so at this point, I'm leaning towards ditching m4's
hand-rolled implementation and using linkedhash-list for minimal code
size. Any other comments on recommended hash implementations or how
either gnulib module could borrow ideas from each other?
Don't work too hard, make some time for fun as well!
Eric Blake address@hidden
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
-----END PGP SIGNATURE-----