[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Dictionary data structures for path name strings
From: |
Stefan Vargyas |
Subject: |
Dictionary data structures for path name strings |
Date: |
Wed, 21 Sep 2016 01:53:39 -0700 |
Dear GNU'ers,
Early at the start of thread [2] by Rasmus Borup Hansen, I was thinking
of playing a bit around the problem of online dictionary data structures
for path name strings, using Bentley and Sedgewick's [1] ternary search
trees (or tries).
Path name sets obtained e.g. from 'find' have an interesting property which
TSTs can make use of: the strings belonging to such sets share a great deal
of common prefixes. Later, on the same discussion thread, Pádraig Brady [3]
noted this fact too.
However, only this summer, I actually did give substance to these ideas [4].
The results obtained so far are not spectacular, yet not negligible either:
for two sets of path names, the gains in *node* memory consumption relative
to (1) the size of the input and to (2) the node memory usage of Gnulib's
generic hash table implementation (which is currently used by 'cp') were as
shown by the following table:
----- ----------------- ----- -----
set platform (1) (2)
----- ----------------- ----- -----
1 32-bit pointers 35% 42%
2 32-bit pointers 27% 36%
----- ----------------- ----- -----
1 32-bit offsets 23% 38%
2 32-bit offsets 13% 32%
----- ----------------- ----- -----
The first of the path name sets has cardinality 396K and size 26M, while
the second one -- cardinality 811K and size 47M.
The '32-bit offset' platform is to be understood as a 64-bit platform where
Path-Set is using 32-bit offsets instead of 64-bit pointers for referencing
the node structures within the node structure themselves.
Read the story to its full extent in the file 'doc/path-set.txt' of [4].
I will gladly provide you guys with more insights into the innards of the
implementation code and the choices made therein.
Happy hacking!
Sincerely,
Stefan Vargyas.
References:
[1] Jon L. Bentley, Robert Sedgewick:
Fast Algorithms for Sorting and Searching Strings
8th Symposium on Discrete Algorithms, 1997, 360-369.
http://www.cs.princeton.edu/~rs/strings/paper.pdf
[2] Rasmus Borup Hansen: My experience with using cp to copy a lot of files
(432 millions, 39 TB)
Coreutils Archives, 11 Aug 2014
https://lists.gnu.org/archive/html/coreutils/2014-08/msg00012.html
[3] Pádraig Brady: Re: My experience with using cp to copy a lot of files
(432 millions, 39 TB)
Coreutils Archives, 14 Sep 2014
http://lists.gnu.org/archive/html/coreutils/2014-09/msg00020.html:
BTW there is a lot of redundancy in all the stored paths,
which could be improved by using a prefix compression method.
[4] Path-Set: Dictionary Data Structures for Path Name Strings
http://www.nongnu.org/path-set/
For about three weeks, Savannah administation seems to have stalled its
activity. The second package I submitted for Savannah's approval can be
obtained from the site by:
$ wget -O path-set-Sep-16-2016.tar.bz2 \
https://savannah.gnu.org/support/download.php?file_id=38530
- Dictionary data structures for path name strings,
Stefan Vargyas <=