bug#42162: Recovering source tarballs

From: zimoun
Subject: bug#42162: Recovering source tarballs
Date: Wed, 26 Aug 2020 12:04:55 +0200

Dear Timothy,

On Thu, 30 Jul 2020 at 13:36, Timothy Sample <samplet@ngyro.com> wrote:

> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
> command-line interface so you can do
>     $ disarchive save software-1.0.tar.gz
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable.  Next, you can run
>     $ disarchive load hash-of-something-in-the-db
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Really nice!  Thank you!

>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!).  A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above.  The nice thing is that the Git repo itself
>> could be archived by SWH.  :-)
> You mean like <https://git.ngyro.com/disarchive-db/>?  :)


> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly.  I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process.  The metadata directory
> ended up being 13M and the directory cache 2G.

One question is how this database scales.

For example, a quick back-of-the-envelope estimate gives ~1.2GB of
metadata for ~14k packages, then an increase of ~700MB per year, both
based on Ludo’s code [1].

[1] <http://issues.guix.gnu.org/issue/42162#11>

> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.  I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work.  Other stuff like changing the serialization format
> to JSON would be pretty easy, too.  I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

This service could be really useful.  Yes, it could be easy to update
the database each time Guix produces a new “sources.json”.
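
To make the idea concrete, here is a minimal sketch of what such a
service could do, not a definitive implementation.  It assumes that
“sources.json” has a top-level “sources” list whose entries carry a
“urls” list, and it only handles Gzip’d tarballs.  The function names
are hypothetical; the only Disarchive pieces it relies on are the
“disarchive save” command and the “DISARCHIVE_DB” variable described
above.

```python
import json
import os
import subprocess
import tempfile
import urllib.request


def gzip_tarball_urls(sources_json_text):
    """Return the first URL of each entry that looks like a Gzip'd tarball.

    Assumed layout: {"sources": [{"urls": ["https://...", ...]}, ...]}.
    Entries without a "urls" list (e.g. VCS checkouts) are skipped.
    """
    document = json.loads(sources_json_text)
    urls = []
    for entry in document.get("sources", []):
        candidates = entry.get("urls", [])
        if candidates and candidates[0].endswith((".tar.gz", ".tgz")):
            urls.append(candidates[0])
    return urls


def save_to_database(url, database_dir,
                     fetch=urllib.request.urlretrieve,
                     run=subprocess.run):
    """Download one tarball and hand it to `disarchive save`.

    `fetch` and `run` are injectable so the function can be exercised
    without network access or an installed Disarchive.
    """
    with tempfile.TemporaryDirectory() as workdir:
        tarball = os.path.join(workdir, os.path.basename(url))
        fetch(url, tarball)
        # Point Disarchive at the database directory for this call only.
        env = dict(os.environ, DISARCHIVE_DB=database_dir)
        run(["disarchive", "save", tarball], env=env, check=True)
```

A driver would loop over gzip_tarball_urls() and call
save_to_database() for each URL, then commit and push the database
directory to a Git repo; that last step is left out here.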

As mentioned in [2], should this service be part of SWH (as a download
“cooking” task), or should it live on the project side?

[2] <https://forge.softwareheritage.org/T2430#47486>

Thank you again for this piece of work.

All the best,
