bug#42162: Recovering source tarballs

From: zimoun
Subject: bug#42162: Recovering source tarballs
Date: Wed, 15 Jul 2020 18:55:21 +0200

Hi Ludo,

Well, you are broadening the discussion beyond the issue of the 5
"url-fetch" packages on gforge.inria.fr :-)

First of all, you wrote [1] ``Migration away from tarballs is already
happening as more and more software is distributed straight from
content-addressed VCS repositories, though progress has been relatively
slow since we first discussed it in 2016.''  On the other hand, Guix
very often uses [2] "url-fetch" even when "git-fetch" is available
upstream.  In other words, I am not convinced the migration is really
happening.

The issue would be mitigated if Guix transitioned from "url-fetch" to
"git-fetch" whenever possible.

1: https://forge.softwareheritage.org/T2430#45800
2: https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html

Second, while trying to compute some statistics about the SWH coverage,
I noted that a non-negligible number of "url-fetch" sources are
reachable via "lookup-content".  Measuring the coverage is not
straightforward because of the 120-requests-per-hour rate limit and
occasional unexpected server errors.  But that is another story.

Well, I would like to have numbers because I do not know what the issue
concretely is: how many "url-fetch" packages are reachable?  And for
those that are unreachable, is it because they are not in the archive
yet?  Or because Guix does not have enough information to look them up?
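
Such numbers could be collected by probing SWH per package.  A rough
sketch, assuming the "lookup-content" procedure from (guix swh) named
in this thread (exact signatures unchecked, and the rate limit applies
to every call):

  (use-modules (guix swh)
               (guix packages))

  ;; Return true when SWH has the package's source content, assuming
  ;; lookup-content takes the hash bytevector and the algorithm name.
  (define (swh-has-source? package)
    (let ((hash (origin-sha256 (package-source package))))
      (and hash
           (lookup-content hash "sha256"))))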

On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:

> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs.  There’s no way around it.

Yes, but for example all the packages in gnu/packages/bioconductor.scm
could use "git-fetch".  Today their source is fetched with "url-fetch",
but it could be fetched with "git-fetch" from
https://git.bioconductor.org/packages/flowCore, for example.
Another example: among the packages in gnu/packages/emacs-xyz.scm, the
ones from elpa.gnu.org use "url-fetch" and could use "git-fetch", for
example from the corresponding upstream Git repositories.

So I would be more reserved about the "no way around it". :-)  I mean,
the 70% could be reduced somewhat.

> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time.  Cuirass
> jobset?  Mcron job to preserve GC roots?  Ideas?

Yes, preserving source tarballs for an indefinite amount of time will
help.  At least for all the packages where "lookup-content" returns #f,
which means they are either not in SWH or unreachable -- both are
equivalent from Guix's point of view.

What about additionally pushing to IPFS?  Would that be feasible?
Would lookup be an issue?

> For the future, we could store nar hashes of unpacked tarballs instead
> of hashes over tarballs.  But that raises two questions:
>   • If we no longer deal with tarballs but upstreams keep signing
>     tarballs (not raw directory hashes), how can we authenticate our
>     code after the fact?

Does Guix automatically authenticate code using signed tarballs?

>   • SWH internally store Git-tree hashes, not nar hashes, so we still
>     wouldn’t be able to fetch our unpacked trees from SWH.
> (Both issues were previously discussed at
> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>   tarball = metadata + tree

There are different issues at different levels:

 1. how to look up? what information do we need to keep/store to be
    able to query SWH?
 2. how to check integrity? what information do we need to keep/store
    to be able to verify that SWH returns what Guix expects?
 3. how to authenticate? where does the tarball metadata have to be
    stored in case SWH removes it?

Basically, the git-fetch source stores 3 identifiers:

 - upstream url
 - commit / tag
 - integrity (sha256)

Fetching from SWH requires only the commit (lookup-revision) or the
tag+url (lookup-origin-revision); then the integrity of the data
downloaded from the returned revision is checked against the stored
sha256, right?
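
For illustration, here are the three identifiers as they appear in a
"git-fetch" origin, with the SWH lookups sketched as comments (URL,
commit, and hash are placeholders):

  (origin
    (method git-fetch)
    (uri (git-reference
          (url "https://example.org/foo.git")   ; upstream url
          (commit "v1.2.3")))                   ; commit / tag
    (sha256
     (base32 "0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk")))

  ;; Fall-back lookup, roughly:
  ;;   (lookup-revision commit)           ; when the commit id is known
  ;;   (lookup-origin-revision url tag)   ; when only url + tag are known
  ;; then check the checkout against the stored sha256.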

Therefore, one way to fix the lookup of "url-fetch" sources is to add
an extra field mimicking the commit's role.

The easiest would be to store a SWHID, or an identifier from which one
can be deduced.

I have not checked the code, but something like this:


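A hypothetical sketch of such an origin -- the 'swhid' field does not
exist in Guix; the name is made up purely for illustration:

  (origin
    (method url-fetch)
    (uri "https://example.org/foo-1.2.3.tar.gz")
    (sha256
     (base32 "...."))       ; integrity, as today
    (swhid
     "swh:1:dir:..."))      ; extra lookup identifier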
and at packaging time, this identifier would be added, similarly to the
integrity field.

As an aside, does Guix use the authentication metadata that tarballs
provide?

(BTW, I failed [3,4] to package swh.model, so if someone wants to give
it a try:
3: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00158.html
4: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00161.html )

> After all, tarballs are byproducts and should be no exception: we should
> build them from source.  :-)


> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:


> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…

Where do you plan to store the "disassembled" metadata?
And where do you plan to "assemble-archive"?

I mean,

 What is pushed to SWH? And how?
 What is fetched from SWH? And how?

(Well, answer below. :-))

> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:

Yes, it is 120 requests per hour and 10 saves per hour.  Well, I do not
think they will increase these numbers much in general.  However, they
seem open to exceptions for specific machines.  So, I do not want to
speak for them, but we could for example ask for a higher rate limit
for ci.guix.gnu.org.  Then we would need to distinguish between source
substitutes and binary substitutes.  Basically, when a user runs
"guix build foo", if the source is available neither upstream nor
already on ci.guix.gnu.org, then ci.guix.gnu.org fetches the missing
sources from SWH and delivers them to the user.

>   https://archive.softwareheritage.org/api/#rate-limiting
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.

Well, the limited resources of SWH are an issue, but SWH is not a
mirror, it is an archive. :-)

And as I wrote above, we could ask SWH to increase the rate limit for
specific machines such as ci.guix.gnu.org.

> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!).  A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above.  The nice thing is that the Git repo itself
> could be archived by SWH.  :-)

How should this database that maps tarball hashes to metadata be
maintained?  A Git push hook?  A cron task?

What about foreign channels?  Should they maintain their own map?
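
A lookup against such a Git-backed database could be sketched like
this, assuming entries live under "sha256/<nix-base32-hash>" as in your
example ('checkout' being a hypothetical local clone of the repo):

  (use-modules (ice-9 textual-ports))

  ;; Return the stored metadata for a tarball hash, or #f when the
  ;; database has no entry for it.
  (define (tarball-metadata checkout sha256-base32)
    (let ((file (string-append checkout "/sha256/" sha256-base32)))
      (and (file-exists? file)
           (call-with-input-file file get-string-all))))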

To summarize, it would work like this, right?

at package time:
 - store an integrity identifier (today, the nix-base32-encoded sha256)
 - disassemble the tarball
 - commit the metadata to another repo, using the hash as path (address)
 - push to packages-repo *and* metadata-database-repo

at future time: (upstream has disappeared, say!)
 - use the integrity identifier to query the database repo
 - lookup the SWHID from the database repo
 - fetch the data from SWH
 - or lookup the IPFS identifier from the database repo and fetch the
   data from IPFS, for another example
 - re-assemble the tarball using the metadata from the database repo
 - check integrity, authentication, etc.
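
The "future time" steps above amount to a fall-back chain; a sketch,
where every procedure name is hypothetical:

  ;; Try upstream first, then re-assemble from archived pieces.
  (define (fetch-source url sha256)
    (or (fetch-from-upstream url sha256)
        (let ((meta (tarball-metadata sha256)))       ; database repo
          (and meta
               (let ((tree (or (fetch-from-swh (metadata-swhid meta))
                               (fetch-from-ipfs (metadata-ipfs-cid meta)))))
                 (and tree
                      (let ((tarball (assemble-archive meta tree)))
                        (and (check-integrity tarball sha256)
                             tarball))))))))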

Well, right, it is better than only adding a lookup identifier as I
described above, because it is more general and flexible than having
only SWH as a fall-back.

The metadata format (for disassembly) that you propose is Schemish
(obviously! :-)), but we could propose something more JSON-like.

All the best,
