Re: Dictionary data structures for path name strings
From: Stefan Vargyas
Subject: Re: Dictionary data structures for path name strings
Date: Wed, 21 Sep 2016 09:39:03 -0700
Thank you Pádraig for your reply!
> >
> > The results obtained so far are not spectacular, yet not negligible either:
> > for two sets of path names, the gains in *node* memory consumption relative
> > to (1) the size of the input and to (2) the node memory usage of Gnulib's
> > generic hash table implementation (which is currently used by 'cp') were as
> > shown by the following table:
> >
> > ----- ----------------- ----- -----
> > set platform (1) (2)
> > ----- ----------------- ----- -----
> > 1 32-bit pointers 35% 42%
> > 2 32-bit pointers 27% 36%
> > ----- ----------------- ----- -----
> > 1 32-bit offsets 23% 38%
> > 2 32-bit offsets 13% 32%
> > ----- ----------------- ----- -----
>
> Cool. So about a third less mem.
The percentages above refer to *node* memory consumption only.
I admit this is a bit misleading, since the values above do not
represent the total amount of memory used by the data structures.
E.g., in the case of Gnulib's hash table, one should also account
for the memory consumed by the hash table itself: the struct
'hash_table', the array of 'hash_entry' structs, 'bucket', and the
list of 'hash_entry' structs, 'free_entry_list'.
Maybe Path-Set should define, for each dictionary structure, a new
stat parameter 'total-mem' that, besides 'total-node-mem', includes
all the other kinds of memory the respective structure claims from
the free store.
> I presume that's dependent on the depth of the paths in question?
Indeed, it seems so. The experiments I carried out with Path-Set
indicate that quite a few combinatorial factors influence the memory
consumption of these data structures.
A first combinatorial case that comes to one's attention -- please see
'doc/path-set.txt' -- is that of 64-bit platforms using 64-bit pointers,
where *no* memory consumption gain is obtained at all. This is the
reason for using 32-bit offsets instead of 64-bit pointers in the node
structures on such platforms (see section 1 of 'doc/path-set.txt').
Another such combinatorial case: these data structures -- including the
one providing the positive outcomes seen in the table above -- do not
fulfill the initial goal (see it defined in section 1.a of the README
file) for smallish input path name sets:
$ shopt -s extglob
$ . src/commands.sh
$ cd tests/2016-08-28
$ size-fmt() {
    local self='size-fmt'
    local p=1
    local s='K'
    case "$1" in
      ''|-k) ;;
      -m) p=2; s='M';;
      *) error "invalid arg '$1'"; return 1;;
    esac
    bc -q <(echo "scale=1; print read() / 1024^$p, \"$s\n\"")
  }
Random '100000' path strings out of ~400K on a 32-bit platform:
$ path-set-test-grep -Re --separators='/' \
--input=tests/files.txt --config-32bit --stat-line-mem -t 100000|
datamash --header-in mean 3|size-fmt -m
6.2M
$ path-set-test-grep -Fe --separators='/' \
--input=tests/files.txt --config-32bit \
--stat-total-node-mem={6.2M,7.1M} \
--not --ternary-tree -t 100000
struct-type set-type total-node-mem 6.2M 7.1M
path-trie linear-hash 4.6M 26% 36%
path-trie gnulib-hash 5.1M 17% 28%
plain-set linear-hash 6.7M -9% 5%
plain-set gnulib-hash 7.1M -15% 0%
Random '10000' path strings out of ~400K on a 32-bit platform:
$ path-set-test-grep -Re --separators='/' \
--input=tests/files.txt --config-32bit --stat-line-mem -t 10000|
datamash --header-in mean 3|size-fmt -k
642.0K
$ path-set-test-grep -Fe --separators='/' \
--input=tests/files.txt --config-32bit \
--stat-total-node-mem={642.0K,728.4K} \
--not --ternary-tree -t 10000
struct-type set-type total-node-mem 642.0K 728.4K
path-trie linear-hash 603.6K 6% 17%
plain-set linear-hash 690.8K -8% 5%
path-trie gnulib-hash 691.6K -8% 5%
plain-set gnulib-hash 728.4K -13% 0%
Random '1000' path strings out of ~400K on a 32-bit platform:
$ path-set-test-grep -Re --separators='/' \
--input=tests/files.txt --config-32bit --stat-line-mem -t 1000|
datamash --header-in mean 3|size-fmt -k
64.2K
$ path-set-test-grep -Fe --separators='/' \
--input=tests/files.txt --config-32bit \
--stat-total-node-mem={64.2K,73.1K} \
--not --ternary-tree -t 1000
struct-type set-type total-node-mem 64.2K 73.1K
plain-set linear-hash 69.1K -8% 6%
plain-set gnulib-hash 73.1K -14% 0%
path-trie linear-hash 85.8K -34% -17%
path-trie gnulib-hash 99.9K -56% -37%
> > [4] Path-Set: Dictionary Data Structures for Path Name Strings
> >
> > http://www.nongnu.org/path-set/
>
> 404 ?
Still waiting for Savannah to approve hosting Path-Set.
> Lots of info here. Good work!
> Having this available as an abstracted data structure seems very useful.
> Would it make sense to include it in gnulib for easier
> integration with various projects?
I think this indeed makes sense. But I must honestly acknowledge that
there is quite some work to be done on Path-Set prior to that.
Just one example: the winning structure -- the linear hash path trie --
as currently implemented is *not* suited for a production environment:
it lacks the 'rehash' operation required to handle the amortized growth
of the inner hash table. However, rehashing a linear hash table is
(far?) more costly than rehashing a chained hash table like Gnulib's.
(One only has to take into account that rehashing a linear hash table
amounts to creating a new table from scratch out of the old one, by
inserting each element of the old table into the new one, one by one;
rehashing a chained hash table implies significantly less work than
that -- see 'hash_rehash' in 'lib/gnulib/hash.c'.)
Consequently, I believe a new chained hash table has to come to life,
one based on Gnulib's hash table and implementing all the optimizations
applied to Path-Set's linear hash table (section 3 of 'doc/path-set.txt').
Sincerely,
Stefan V.