coreutils

Re: Dictionary data structures for path name strings


From: Stefan Vargyas
Subject: Re: Dictionary data structures for path name strings
Date: Wed, 21 Sep 2016 09:39:03 -0700

Thank you Pádraig for your reply!

> > 
> > The results obtained so far are not spectacular, yet not negligible either:
> > for two sets of path names, the gains in *node* memory consumption relative
> > to (1) the size of the input and to (2) the node memory usage of Gnulib's
> > generic hash table implementation (which is currently used by 'cp') were as
> > shown by the following table:
> > 
> >   ----- ----------------- ----- -----
> >    set       platform      (1)   (2)
> >   ----- ----------------- ----- -----
> >     1    32-bit pointers   35%   42%
> >     2    32-bit pointers   27%   36%
> >   ----- ----------------- ----- -----
> >     1    32-bit offsets    23%   38%
> >     2    32-bit offsets    13%   32%
> >   ----- ----------------- ----- -----
> 
> Cool. So about a third less mem.

The percentages above refer to *node* memory consumption only.

I admit this is a bit misleading, since the values above do not
represent the total amount of memory used by the data structures.

E.g. in the case of Gnulib's hash table, one should also account for
the memory consumed by the hash table itself: the struct 'hash_table',
the array of 'hash_entry' structs pointed to by 'bucket', and the
list of 'hash_entry' structs headed by 'free_entry_list'.

Maybe Path-Set should define, for each dictionary structure, a new stat
parameter 'total-mem' that, besides 'total-node-mem', includes all the
other kinds of memory the respective structure claims from the free
store.
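To make the idea concrete, here is a minimal sketch of how such a
'total-mem' statistic could be computed for a chained hash table. The
struct fields below are simplified stand-ins modeled loosely on Gnulib's
'hash.c' ('n_overflow' and 'n_free' are hypothetical counters, not
Gnulib's actual members):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for Gnulib's hash table types ('lib/hash.c');
   the field layout here is illustrative only. */
struct hash_entry { void *data; struct hash_entry *next; };
struct hash_table {
    struct hash_entry *bucket;   /* array of n_buckets entries */
    size_t n_buckets;
    size_t n_overflow;           /* entries on overflow chains */
    size_t n_free;               /* entries on free_entry_list */
};

/* A hypothetical 'total-mem' statistic: the node memory plus the
   table's own overhead claimed from the free store. */
static size_t total_mem(const struct hash_table *t, size_t node_mem)
{
    size_t table_mem = sizeof *t
        + t->n_buckets * sizeof(struct hash_entry)          /* bucket array */
        + (t->n_overflow + t->n_free) * sizeof(struct hash_entry);
    return node_mem + table_mem;
}
```

The point being that for small tables the second term is far from
negligible relative to 'node_mem' alone.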

> I presume that's dependent on the depth of the paths in question?

Indeed, it seems so. The experiments I carried out with Path-Set indicate
that quite a few combinatorial factors influence the memory consumption
of these data structures.

A first combinatorial case that comes to one's attention -- please see
'doc/path-set.txt' -- is that of 64-bit platforms using 64-bit pointers,
where *no* memory consumption gain is obtained at all. This is the reason
for using 32-bit offsets instead of 64-bit pointers in the node structures
on such platforms (see section 1 of 'doc/path-set.txt').
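The effect of that choice can be illustrated with two hypothetical trie
node layouts (these are not Path-Set's actual structures): on an LP64
platform, each pointer link costs 8 bytes, while a 32-bit offset into a
node arena costs 4:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical node layouts, for illustration only. */
struct node_ptr {                /* pointer-linked: 24 bytes on LP64 */
    const char *key;
    struct node_ptr *child;
    struct node_ptr *sibling;
};
struct node_off {                /* offset-linked: 12 bytes */
    uint32_t key;                /* offset into a string arena */
    uint32_t child;              /* offset into the node arena */
    uint32_t sibling;
};
```

On a 64-bit platform the offset variant roughly halves the per-node
memory; on a 32-bit platform the two layouts are the same size, which
is consistent with the table above showing smaller (but still real)
gains for 32-bit offsets.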

Another such combinatorial case: these data structures -- including the
one producing the positive outcomes seen in the table above -- do not
fulfill the initiating goal (defined in section 1.a of the README file)
for smallish input path name sets:

  $ shopt -s extglob

  $ . src/commands.sh

  $ cd tests/2016-08-28

  $ size-fmt() {
        local self='size-fmt' p=1 s='K'
        case "$1" in
        ''|-k) ;;
        -m) p=2; s='M';;
        *) error "invalid arg '$1'"; return 1;;
        esac
        bc -q <(echo "scale=1; print read() / 1024^$p, \"$s\n\"")
    }

Random '100000' path strings out of ~400K on a 32-bit platform:

  $ path-set-test-grep -Re --separators='/' \
  --input=tests/files.txt --config-32bit --stat-line-mem -t 100000|
  datamash --header-in mean 3|size-fmt -m
  6.2M

  $ path-set-test-grep -Fe --separators='/' \
  --input=tests/files.txt --config-32bit \
  --stat-total-node-mem={6.2M,7.1M} \
  --not --ternary-tree -t 100000
  struct-type  set-type     total-node-mem  6.2M  7.1M
  path-trie    linear-hash            4.6M   26%   36%
  path-trie    gnulib-hash            5.1M   17%   28%
  plain-set    linear-hash            6.7M   -9%    5%
  plain-set    gnulib-hash            7.1M  -15%    0%

Random '10000' path strings out of ~400K on a 32-bit platform:

  $ path-set-test-grep -Re --separators='/' \
  --input=tests/files.txt --config-32bit --stat-line-mem -t 10000|
  datamash --header-in mean 3|size-fmt -k
  642.0K

  $ path-set-test-grep -Fe --separators='/' \
  --input=tests/files.txt --config-32bit \
  --stat-total-node-mem={642.0K,728.4K} \
  --not --ternary-tree -t 10000
  struct-type  set-type     total-node-mem  642.0K  728.4K
  path-trie    linear-hash          603.6K      6%     17%
  plain-set    linear-hash          690.8K     -8%      5%
  path-trie    gnulib-hash          691.6K     -8%      5%
  plain-set    gnulib-hash          728.4K    -13%      0%

Random '1000' path strings out of ~400K on a 32-bit platform:

  $ path-set-test-grep -Re --separators='/' \
  --input=tests/files.txt --config-32bit --stat-line-mem -t 1000|
  datamash --header-in mean 3|size-fmt -k
  64.2K

  $ path-set-test-grep -Fe --separators='/' \
  --input=tests/files.txt --config-32bit \
  --stat-total-node-mem={64.2K,73.1K} \
  --not --ternary-tree -t 1000
  struct-type  set-type     total-node-mem  64.2K  73.1K
  plain-set    linear-hash           69.1K    -8%     6%
  plain-set    gnulib-hash           73.1K   -14%     0%
  path-trie    linear-hash           85.8K   -34%   -17%
  path-trie    gnulib-hash           99.9K   -56%   -37%

> > [4] Path-Set: Dictionary Data Structures for Path Name Strings
> > 
> >     http://www.nongnu.org/path-set/
> 
> 404 ?

Still waiting for Savannah to approve hosting Path-Set.

> Lots of info here. Good work!
> Having this available as an abstracted data structure seems very useful.
> Would it make sense to include it in gnulib for easier
> integration with various projects?

I think this indeed makes sense. But I must also honestly acknowledge
that quite some work remains to be done inside Path-Set prior to that.

Just one example: the winning structure -- the linear hash path trie --
as currently implemented is *not* suited for a production environment:
it lacks the 'rehash' operation required to handle the amortized growth
of the inner hash table. However, rehashing a linear hash table is (far?)
more costly than rehashing a chained hash table like Gnulib's. (One has
only to take into account that rehashing a linear hash table amounts to
creating a new table from scratch out of the old one, inserting each
element of the old table one by one into the new table; rehashing a
chained hash table implies significantly less work than that -- see
'hash_rehash' in 'lib/gnulib/hash.c'.)
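A minimal sketch of the costly case, assuming a toy open-addressing
table of string pointers (hypothetical code, not Path-Set's): growing
the table forces every element through a fresh insertion, recomputing
its probe sequence from scratch.

```c
#include <assert.h>
#include <stdlib.h>

/* A toy linear-probing (open-addressing) table of string pointers. */
struct lin_table { const char **slot; size_t size, used; };

static size_t hash_str(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h;
}

static void lin_insert(struct lin_table *t, const char *s)
{
    size_t i = hash_str(s) % t->size;
    while (t->slot[i])                   /* linear probing */
        i = (i + 1) % t->size;
    t->slot[i] = s;
    t->used++;
}

/* Rehashing amounts to building a new table from scratch and
   re-inserting every element of the old one. */
static void lin_rehash(struct lin_table *t, size_t new_size)
{
    struct lin_table n = { calloc(new_size, sizeof *n.slot), new_size, 0 };
    for (size_t i = 0; i < t->size; i++)
        if (t->slot[i])
            lin_insert(&n, t->slot[i]);
    free(t->slot);
    *t = n;
}
```

Every occupied slot pays a full hash-plus-probe on each growth step,
which is exactly the work a chained table's relinking rehash avoids.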

Consequently, I think a new chained hash table has to come to life, one
based on Gnulib's hash table and implementing all the optimizations
applied to Path-Set's linear hash table (section 3 of 'doc/path-set.txt').
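For contrast with the linear case, here is an illustration (not Gnulib's
actual code) of why a chained table rehashes more cheaply: the existing
entry cells are moved onto the new bucket lists by pointer relinking,
with no re-allocation and only the bucket index recomputed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy chained-table entry; 'key' stands in for a cached hash value. */
struct entry { unsigned key; struct entry *next; };

/* Move every entry from the old bucket array to the new one by
   relinking; no entry is allocated, freed, or copied. */
static void chained_rehash(struct entry **old_b, size_t old_n,
                           struct entry **new_b, size_t new_n)
{
    for (size_t i = 0; i < old_n; i++)
        while (old_b[i]) {
            struct entry *e = old_b[i];
            old_b[i] = e->next;          /* unlink from old bucket */
            size_t j = e->key % new_n;   /* recompute bucket index only */
            e->next = new_b[j];          /* relink into new bucket */
            new_b[j] = e;
        }
}
```

Compared with re-inserting string keys, no key is re-hashed and no
probing is done, which is the essence of the 'hash_rehash' advantage
mentioned above.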

Sincerely,

Stefan V.





