coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Dictionary data structures for path name strings


From: Stefan Vargyas
Subject: Dictionary data structures for path name strings
Date: Wed, 21 Sep 2016 01:53:39 -0700

Dear GNU'ers,

Early at the start of thread [2] by Rasmus Borup Hansen, I was thinking
of playing a bit around the problem of online dictionary data structures
for path name strings, using Bentley and Sedgewick's [1] ternary search
trees (or tries).

Path name sets obtained e.g. from 'find' have an interesting property which
TSTs can make use of: the strings belonging to such sets share a great deal
of common prefixes. Later, on the same discussion thread, Pádraig Brady [3]
noted this fact too.

However, only this summer, I actually did give substance to these ideas [4].

The results obtained so far are not spectacular, yet not negligible either:
for two sets of path names, the gains in *node* memory consumption relative
to (1) the size of the input and to (2) the node memory usage of Gnulib's
generic hash table implementation (which is currently used by 'cp') were as
shown by the following table:

  ----- ----------------- ----- -----
   set       platform      (1)   (2)
  ----- ----------------- ----- -----
    1    32-bit pointers   35%   42%
    2    32-bit pointers   27%   36%
  ----- ----------------- ----- -----
    1    32-bit offsets    23%   38%
    2    32-bit offsets    13%   32%
  ----- ----------------- ----- -----

The first of the path name sets has cardinality 396K and size 26M, while
the second one -- cardinality 811K and size 47M.

The '32-bit offset' platform is to be understood as a 64-bit platform where
Path-Set is using 32-bit offsets instead of 64-bit pointers for referencing
the node structures within the node structure themselves. 

Read the story to its full extent in the file 'doc/path-set.txt' of [4].

I will gladly provide you guys with more insights into the innards of the
implementation code and the choices made therein.

Happy hacking!

Sincerely,

Stefan Vargyas.


References:

[1] Jon L. Bentley, Robert Sedgewick:
    Fast Algorithms for Sorting and Searching Strings
    8th Symposium on Discrete Algorithms, 1997, 360-369.

    http://www.cs.princeton.edu/~rs/strings/paper.pdf

[2] Rasmus Borup Hansen: My experience with using cp to copy a lot of files
      (432 millions, 39 TB)
    Coreutils Archives, 11 Aug 2014

    https://lists.gnu.org/archive/html/coreutils/2014-08/msg00012.html

[3] Pádraig Brady: Re: My experience with using cp to copy a lot of files
      (432 millions, 39 TB)
    Coreutils Archives, 14 Sep 2014

    http://lists.gnu.org/archive/html/coreutils/2014-09/msg00020.html:

      BTW there is a lot of redundancy in all the stored paths,
      which could be improved by using a prefix compression method.

[4] Path-Set: Dictionary Data Structures for Path Name Strings

    http://www.nongnu.org/path-set/

    For about three weeks, Savannah administation seems to have stalled its
    activity. The second package I submitted for Savannah's approval can be
    obtained from the site by:

      $ wget -O path-set-Sep-16-2016.tar.bz2 \
      https://savannah.gnu.org/support/download.php?file_id=38530






reply via email to

[Prev in Thread] Current Thread [Next in Thread]