
bug#42162: Recovering source tarballs

From: zimoun
Subject: bug#42162: Recovering source tarballs
Date: Thu, 27 Aug 2020 11:41:24 +0200


On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
>> One question is how this database scales?
>> For example, a quick back-of-the-envelope estimate leads to ~1.2GB of metadata
>> for ~14k packages and then an increase of ~700MB per year, both with
>> Ludo’s code [1].
>> [1] <http://issues.guix.gnu.org/issue/42162#11>
> It’s a good question.  A good part of the size comes from the
> representation rather than the data.  Compression helps a lot here.  I
> have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> little better than your estimation).  If I pass each file through Lzip,
> it shrinks down to 60M.  That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K).  I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.

Thank you for these numbers.  Really interesting!
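
To make the comparison concrete, here is a small sketch of the arithmetic
in Python.  It assumes that “M” means MiB and that the average size per
package stays the same when scaling up to ~14k packages; both are
assumptions for illustration, not measurements.

```python
# Back-of-the-envelope sizes from the figures quoted above.
# Assumptions: "M" means MiB, and the average per-package size
# stays the same when scaling up to ~14k packages.

PACKAGES_MEASURED = 3912
PACKAGES_TOTAL = 14_000  # rough package count used in the earlier estimate


def per_package_kib(total_mib, packages):
    """Average metadata size per package, in KiB."""
    return total_mib * 1024 / packages


def extrapolate_mib(total_mib, measured, target):
    """Linearly scale a measured total to a larger package count."""
    return total_mib * target / measured


for label, total_mib in (("uncompressed", 295), ("lzip", 60)):
    kib = per_package_kib(total_mib, PACKAGES_MEASURED)
    scaled = extrapolate_mib(total_mib, PACKAGES_MEASURED, PACKAGES_TOTAL)
    print(f"{label}: {kib:.1f} KiB/package, "
          f"~{scaled:.0f} MiB for {PACKAGES_TOTAL} packages")
```

The linear extrapolation is only a rough guide, since large packages
could skew the average in either direction.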

First, I do not know whether the database needs to be stored in Git.  What
would be the advantage? (naive question :-))

On SWH T2430 [1], you explain the “default-header” trick to cut down the
size.  Nice!

Moreover, the format is a long list, e.g.,

--8<---------------cut here---------------start------------->8---
    ((name "raptor2-2.0.15/")
     (mode 493)
     (mtime 1414909500)
     (chksum 4225)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/")
     (mode 493)
     (mtime 1414909497)
     (chksum 4797)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/ltversion.m4")
     (size 690)
     (mtime 1414908273)
     (chksum 5958))

--8<---------------cut here---------------end--------------->8---

which is human-readable.  Is it useful?

Instead, one could imagine shorter keywords:

    ((na "raptor2-2.0.15/")
     (mo 493)
     (mt 1414909500)
     (ch 4225)
     (ty 53))

which, using your database (commit fc50927), reduces the size from 295MB to 279MB.
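
A minimal sketch of such a renaming, in Python, applied to the text of one
entry.  The keyword table is hypothetical: the first five abbreviations
come from the example above, and “sz” for size is only a guess.

```python
import re

# Hypothetical long -> short keyword table; only the first five
# abbreviations appear in the example above, "sz" for size is a guess.
SHORT = {
    "name": "na",
    "mode": "mo",
    "mtime": "mt",
    "chksum": "ch",
    "typeflag": "ty",
    "size": "sz",
}

# Match a keyword only right after an opening parenthesis and before
# whitespace, so a bare word such as "size" inside a file name is kept.
KEYWORD = re.compile(r"\((" + "|".join(SHORT) + r")(?=\s)")


def shorten(text):
    """Rewrite the long keywords of a metadata S-expression as short ones."""
    return KEYWORD.sub(lambda m: "(" + SHORT[m.group(1)], text)


entry = '''((name "raptor2-2.0.15/")
 (mode 493)
 (mtime 1414909500)
 (chksum 4225)
 (typeflag 53))'''

short = shorten(entry)
print(short)
print(len(entry), "->", len(short), "bytes")
```

This works on the text rather than on parsed data, so it is only good for
estimating the size gain, not as a real converter.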

Or even a plain list:

   (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
   (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)

where the first element provides the “type” of the list, to make the reader's job easier.
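
A rough sketch of how such a tagged plain list could be written and read
back, in Python.  The two layouts and the tags 0 and 1 are hypothetical;
they are simply derived from the two example entries above.

```python
# Hypothetical field layouts, one per record "type"; the tags 0 and 1
# mirror the \x00 and \x01 markers in the example above.
LAYOUTS = {
    0: ("name", "mode", "mtime", "chksum", "typeflag"),  # directory entry
    1: ("name", "size", "mtime", "chksum"),              # regular file entry
}


def to_plain(record):
    """Turn a keyed record into (tag, value, ...) using the matching layout."""
    for tag, fields in LAYOUTS.items():
        if set(record) == set(fields):
            return (tag, *(record[f] for f in fields))
    raise ValueError(f"no layout for fields {sorted(record)}")


def from_plain(plain):
    """Rebuild the keyed record from a plain list by looking up its tag."""
    tag, *values = plain
    fields = LAYOUTS[tag]
    if len(values) != len(fields):
        raise ValueError("wrong number of values for this layout")
    return dict(zip(fields, values))


directory = {"name": "raptor2-2.0.15/", "mode": 493,
             "mtime": 1414909500, "chksum": 4225, "typeflag": 53}
regular = {"name": "raptor2-2.0.15/build/ltversion.m4", "size": 690,
           "mtime": 1414908273, "chksum": 5958}

for record in (directory, regular):
    plain = to_plain(record)
    assert from_plain(plain) == record  # the round trip loses nothing
    print(plain)
```

One drawback is that every combination of present fields needs its own
layout, so the number of “types” could grow with the “default-header”
trick, where fields equal to the default are omitted.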

Well, the two naive questions are: does it make sense to
 - have the database stored in Git?
 - have a human-readable format?

Thank you again for pushing forward this topic. :-)

All the best,

[1] https://forge.softwareheritage.org/T2430#47522
