bug#42162: Recovering source tarballs
Wed, 05 Aug 2020 19:14:12 +0200
Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
Timothy Sample <email@example.com> skribis:
> Ludovic Courtès <firstname.lastname@example.org> writes:
>> Wooohoo! Is it that time of the year when people give presents to one
>> another? I can’t believe it. :-)
> Not to be too cynical, but I think it’s just the time of year that I get
> frustrated with what I should be working on, and start fantasizing about
> green-field projects. :p
>> Timothy Sample <email@example.com> skribis:
>>> The header and footer are read directly from the file. Finding the
>>> compressor is harder. I followed the approach taken by the pristine-tar
>>> project. That is, try a bunch of compressors and hope for a match.
>>> Currently, I have:
>>> • gnu-best
>>> • gnu-best-rsync
>>> • gnu
>>> • gnu-rsync
>>> • gnu-fast
>>> • gnu-fast-rsync
>>> • zlib-best
>>> • zlib
>>> • zlib-fast
>>> • zlib-best-perl
>>> • zlib-perl
>>> • zlib-fast-perl
>>> • gnu-best-rsync-1.4
>>> • gnu-rsync-1.4
>>> • gnu-fast-rsync-1.4
>> I would have used the integers that zlib supports, but I guess that
>> doesn’t capture this whole gamut of compression setups. And yeah, it’s
>> not great that we actually have to try and find the right compression
>> levels, but there’s no way around it, it seems, and as you write, we can
>> expect a couple of variants to be the most commonly used ones.
> My first instinct was “this is impossible – a DEFLATE compressor can do
> just about whatever it wants!” Then I looked at pristine-tar and
> realized that their hack probably works pretty well. If I had infinite
> time, I would think about some kind of fully general, parameterized LZ77
> algorithm that could describe any implementation. If I had a lot of
> time I would peel back the curtain on Gzip and zlib and expose their
> tuning parameters. That would be nicer, but keep in mind we will have
> to cover XZ, bzip2, and ZIP, too! There’s a bit of balance between
> quality and coverage. Any improvement to the representation of the
> compression algorithm could be implemented easily: just replace the
> names with their improved representation.
Yup, it makes sense to not spend too much time on this bit. I guess
we’d already have good coverage with gzip and xz.
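
For illustration, the “try a bunch of compressors and hope for a match” approach quoted above could be sketched roughly like this in Python. This is only a sketch under assumptions: the candidate table and the function names are hypothetical, it covers just three plain zlib levels (the gnu-* variants would need the actual gzip implementation), and it assumes the gzip header and footer have already been stripped so that only the raw DEFLATE stream is compared.

```python
import zlib

# Hypothetical candidate table: compressor name -> zlib compression level.
# Order matters: the first candidate that reproduces the stream wins.
CANDIDATES = {"zlib-best": 9, "zlib": 6, "zlib-fast": 1}


def raw_deflate(data: bytes, level: int) -> bytes:
    """Compress DATA as a raw DEFLATE stream (no zlib or gzip framing)."""
    # A negative wbits value asks zlib for a raw stream without header/trailer.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def guess_compressor(payload: bytes, deflate_stream: bytes):
    """Return the first candidate name that reproduces DEFLATE_STREAM
    bit-for-bit from PAYLOAD, or None if no candidate matches."""
    for name, level in CANDIDATES.items():
        if raw_deflate(payload, level) == deflate_stream:
            return name
    return None
```

Swapping the names for a richer description of the compressor later would only change the keys and values of the table, not the search itself.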
>> (BTW the code I posted or the one in Disarchive could perhaps replace
>> the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’
>> procedure there, notably.)
> I really like “fold-archive”. One of the reasons I started doing this
> is to possibly share code with Gash-Utils. It’s not as easy as I was
> hoping, but I’m planning on improving things there based on my
> experience here. I’ve now worked with four Scheme tar implementations,
> maybe if I write a really good one I could cap that number at five!
Heh. :-) The needs are different anyway. In Gash-Utils the focus is
probably on simplicity/maintainability, whereas here you really want to
cover all the details of the wire representation.
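
To make the ‘fold-archive’ idea concrete, here is a rough Python analogue. It is a sketch and not the code from either project: it assumes ‘fold-archive’ is an ordinary left fold over archive entries, it relies on Python’s standard tarfile module instead of a hand-written parser, and make_example_archive is a hypothetical helper that exists only for the demonstration.

```python
import io
import tarfile


def fold_archive(proc, seed, fileobj):
    """Left-fold PROC over the members of the tar archive read from FILEOBJ.

    PROC is called as proc(member, result) and returns the new result."""
    result = seed
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            result = proc(member, result)
    return result


def make_example_archive(files):
    """Hypothetical helper: build an in-memory tar archive from a dict
    mapping file names to their contents (bytes)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    buffer.seek(0)
    return buffer
```

For example, fold_archive(lambda member, total: total + member.size, 0, archive) would sum the sizes of all entries in one pass.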
>>> To avoid hitting the SWH archive at all, I introduced a directory cache
>>> so that I can store the directories locally. If the directory cache is
>>> available, directories are stored and retrieved from it.
>> I guess we can get back to them eventually to estimate our coverage ratio.
> It would be nice to know, but pretty hard to find out with the rate
> limit. I guess it will improve immensely when we set up a
> “sources.json” file.
Note that we have <https://guix.gnu.org/sources.json>. Last I checked,
SWH was ingesting it in its “qualification” instance, so it should be
ingesting it for good real soon if it’s not doing it already.
>>> You mean like <https://git.ngyro.com/disarchive-db/>? :)
>> Woow. :-)
>> We could actually have a CI job to create the database: it would
>> basically do ‘disarchive save’ for each tarball and store that using a
>> layout like the one you used. Then we could have a job somewhere that
>> periodically fetches that and adds it to the database. WDYT?
> Maybe.... I assume that Disarchive would fail for a few of them. We
> would need a plan for monitoring those failures so that Disarchive can
> be improved. Also, unless I’m misunderstanding something, this means
> building the whole database at every commit, no? That would take a lot
> of time and space. On the other hand, it would be easy enough to try.
> If it works, it’s a lot easier than setting up a whole other service.
One can easily write a procedure that takes a tarball and returns a
<computed-file> that builds its database entry. So at each commit, we’d
just rebuild things that have changed.
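
The “only rebuild what changed” property can be illustrated outside of Guix with a small content-addressed cache. This is an analogy in Python, not how <computed-file> is implemented: tarball_key, update_database, and the build_entry callback are all hypothetical names, and a plain dictionary stands in for the store.

```python
import hashlib


def tarball_key(data: bytes) -> str:
    """Identify a tarball by the SHA-256 hash of its contents."""
    return hashlib.sha256(data).hexdigest()


def update_database(tarballs, database, build_entry):
    """Build entries only for tarballs that DATABASE does not know yet.

    TARBALLS is an iterable of bytes, DATABASE a dict mapping keys to
    entries, and BUILD_ENTRY a (hypothetical) function doing the expensive
    work.  Return the list of keys that were actually built."""
    built = []
    for data in tarballs:
        key = tarball_key(data)
        if key not in database:
            database[key] = build_entry(data)
            built.append(key)
    return built
```

Running update_database again with the same tarballs plus one new one would call build_entry once, for the new tarball only.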
>> I think we should leave room for other hash algorithms (in the sexps
>> above too).
> It works for different hash algorithms, but not for different directory
> hashing methods (like you mention below).
>> So it does mean that we could pretty much right away add a fall-back in
>> (guix download) that looks up tarballs in your database and uses
>> Disarchive to reconstruct it, right? I love solved problems. :-)
>> Of course we could improve Disarchive and the database, but it seems to
>> me that we already have enough to improve the situation. WDYT?
> I would say that we are darn close! In theory it would work. It would
> be much more practical if we had better coverage in the SWH archive
> (i.e., “sources.json”) and a way to get metadata for a source archive
> without downloading the entire Disarchive database. It’s 13M now, but
> it will likely be 500M with all the Gzip’d tarballs from a recent commit
> of Guix. It will only grow after that, too.
If we expose the database over HTTP (like over cgit), we can arrange so
that (guix download) simply GETs db.example.org/sha256/xyz. No need to
fetch the whole database.
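
As a sketch of what such a lookup might look like on the client side, here it is in Python for brevity. The “<base>/<algorithm>/<hash>” layout follows the db.example.org/sha256/xyz example above; the function names, the https scheme, and the choice to treat a 404 as “no entry” are assumptions of this sketch and not an existing interface.

```python
import urllib.error
import urllib.request


def entry_url(base_url: str, algorithm: str, digest: str) -> str:
    """Build the URL of a single database entry: BASE/ALGORITHM/DIGEST."""
    return f"{base_url.rstrip('/')}/{algorithm}/{digest}"


def fetch_entry(base_url, algorithm, digest, opener=urllib.request.urlopen):
    """GET one entry and return its bytes, or None if the server says 404.

    OPENER is injectable so the function can be exercised without network
    access; by default it is urllib.request.urlopen."""
    try:
        with opener(entry_url(base_url, algorithm, digest)) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None  # Assumption: a missing entry is reported as 404.
        raise
```

Because only the URL layout is fixed, the same client code would keep working if cgit were later replaced by a dedicated service answering at the same paths.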
It might be more reasonable to have a real database and a real service
around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
that could easily be implemented by a “real” service instead of cgit in
> Of course those are not hard blockers, so ‘(guix download)’ could start
> using Disarchive as soon as we package it. I’ve started looking into
> it, but I’m confused about getting access to Disarchive from the
> “out-of-band” download system. Would it have to become a dependency of
Yes. It could be a behind-the-scenes dependency of “builtin:download”;
it doesn’t have to be a dependency of each and every fixed-output
> I was imagining an escape hatch beyond this, where one could look up a
> provenance record from when Disarchive ingested and verified a source
> code archive. The provenance record would tell you which version of
> Guix was used when saving the archive, so you could try your luck with
> using “guix time-machine” to reproduce Disarchive’s original
> computation. If we perform database migrations, you would need to
> travel back in time in the database, too. The idea is that you could
> work around breakages in Disarchive automatically using the Power of
> Guix™. Just a stray thought, really.
Seems to me it Shouldn’t Be Necessary? :-)
I mean, as long as the format is extensible and “future-proof”, we’ll
always be able to rebuild tarballs and then re-disassemble them if we
need to compute new hashes or whatever.
>> If you feel like it, you’re welcome to point them to your work in the
>> discussion at <https://forge.softwareheritage.org/T2430>. There’s one
>> person from NixOS (lewo) participating in the discussion and I’m sure
>> they’d be interested. Perhaps they’ll tell whether they care about
>> having it available as JSON.
> Good idea. I will work out a few more kinks and then bring it up there.
> I’ve already rewritten the parts that used the Guix daemon. Disarchive
> now only needs a handful of Guix modules ('base32', 'serialization', and
> 'swh' are the ones that would be hard to remove).
An option would be to use (gcrypt base64); another one would be to
bundle (guix base32).
I was thinking that it might be best to not use Guix for computations.
For example, have “disarchive save” not build derivations and instead do
everything “here and now”. That would make it easier for others to
adopt. Wait, looking at the Git history, it looks like you already
addressed that point, neat. :-)