guix-devel


Re: intrinsic vs extrinsic identifier: toward more robustness?


From: Maxime Devos
Subject: Re: intrinsic vs extrinsic identifier: toward more robustness?
Date: Mon, 6 Mar 2023 13:22:24 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.7.2

Op 05-03-2023 om 21:21 schreef Simon Tournier:
Whatever the intrinsic identifier we consider – even ones based on very
weak cryptographic hash function as MD5, or based on non-cryptographic
hash function as Pearson hashing, etc. – the integrity check is
currently done by SHA256.

How about using the hash of the integrity check as an intrinsic
identifier, like is done currently?  I mean, we hash it with sha256
for the integrity check anyway, so we might as well reuse it.

Maybe ask GNUnet folk to address by NAR+SHA256 instead on their
specification. ;-)

Obviously, Guix should replace NAR+SHA256 by GNUnet FS URIs /j.

Kidding aside, your comment raises two points of view:

  1. Guix is fetching data from elsewhere and this elsewhere is not using
     NAR+SHA256 intrinsic identifier.  Therefore, the question is how to
     adapt the source origin for taking into account this elsewhere?

  2. Replace the NAR+SHA256 integrity checksum by what content-addressed
     systems use as intrinsic identifier.  IMHO, that’s a bad idea for
     two reasons: (a) security, for instance SHA1 as used by SWH is not
     secure and (b) it will be unmanageable in practise.

I was thinking of (1), not (2).

All that said, Guix uses extrinsic identifiers for almost all origins,
if not all.  Even for the ’git-fetch’ method.

For git-fetch, the value of the 'commit' field is intrinsic (except when
it's a tag instead).

No, it is imprecise.  The exception is *not* label tag as value for the
’commit’ field but the exception is Git commit hash as value.

Are you referring to the fact that currently, the 'commit' field usually contains a tag name, and that it containing a commit is the exception?
If so, that doesn't contradict my claim.

This can be solved by placing the actual commit in the 'commit' field of
git-reference, instead of the tag name, then things are completely
unambiguous -- this and its opposite were discussed in ‘On raw strings
in <origin> commit field’ (*), IIRC.

The thread you are referencing [1] is based on misunderstandings.  I
would like to move forward, hence my detailed email. :-)

1: <https://yhetil.org/guix/6e451a878b749d4afb6eede9b476e5faabb0d609.camel@gmail.com/#r>

Your email is about intrinsic identifiers and more robustness, yet it doesn't mention using git commits more anywhere. As such, I do not follow ‘hence my detailed email’ -- it contains detail, but it misses some relevant detail that I pointed out in my previous response.

Also, with ‘move forward’, do you mean ‘move forward’, or ‘maintain status quo’? Because given that you are replying to the proposed solution (that even avoids problems pointed out in those threads) by saying nothing of technical importance and by pointing to some contentious things, it really appears to be the latter to me.

(*) Also maybe that thread about tricking peer review.

I didn't understand the position that commit field should contain the
(indirect, fragile) tag instead of the (direct, robust) commit, but
those differences could be sidestepped by having both a 'tag' field and
a 'commit' field, IIUC.

I would not frame this way.  My view is not to replace something by
something else, instead, is to add something and/or several things.

I was thinking of adding the commit (intrinsic) to the git-reference, instead of only having a tag (extrinsic) in the git-reference as is mostly done currently.

I also want to mention that, except for a general notion of 'more robustness' and a specific command "guix freeze -m manifest.scm" and such, you never mentioned what your view was, so I had to guess.

The problem then was to somehow map the NAR hash to the FS identifier.

Yes, that’s the problem. :-) GNUnet FS identifier is one case.  And my
discussion here is: could we augment source origin to be able to deal
with various identifiers?


A straightforward solution would be to just replace the https:// by
gnunet:// in the origin (like in https://issues.guix.gnu.org/44199,
except that patch doesn't support fallbacks to other URLs like url-fetch
does).

Somehow, your proposition would be to have a list as URI, right?

      (origin
        (method gnunet-fetch)
        (uri
         (list
           (string-append "mirror://gnu/hello/hello-" version
                          ".tar.gz")
           "gnunet://fs/chk/TY48PGS5RVX643NT2B7GDNFCBT4DWG692PF4YNHERR96K6MSFRZ4ZWRPQ4KVKZV29MGRZTWAMY9ETTST4B6VFM47JR2JS5PWBTPVXB0.8A9HRYABJ7HDA7B0"
           "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
        (file-name "gnunet-hello-2.10.tar.gz")
        (sha256
         (base32
          "0ssi1wpaf7plaswqqjwigppsg5fyh99vdlb9kzl7c9lng89ndq1i")))

Yes, though in a proper version of 44199 (which doesn't exist yet) it would just be integrated into url-fetch instead of having a separate gnunet-fetch.
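
To make the fallback idea concrete, here is a minimal shell sketch. It is not Guix's url-fetch code: the function name fetch_with_fallback is made up, the "locations" are plain file paths so that it runs offline, and it assumes the coreutils sha256sum tool and a hex digest rather than Guix's base32 encoding. It tries each location in order and accepts the first one whose content matches the expected SHA-256.

```shell
# fetch_with_fallback EXPECTED_SHA256 OUTPUT LOCATION...
# Hypothetical helper: copy the first LOCATION whose SHA-256 matches.
# A real fetcher would dispatch on the URI scheme (https://, gnunet://,
# swh:...) instead of calling cp.
fetch_with_fallback() {
    want=$1
    out=$2
    shift 2
    for loc in "$@"; do
        if cp "$loc" "$out" 2>/dev/null; then
            got=$(sha256sum "$out" | cut -d' ' -f1)
            if [ "$got" = "$want" ]; then
                echo "fetched from $loc"
                return 0
            fi
            rm -f "$out"    # wrong content: discard it and try the next one
        fi
    done
    echo "all locations failed" >&2
    return 1
}

# Usage: one missing location, one with wrong content, one good copy.
demo=$(mktemp -d)
printf 'hello\n' > "$demo/good"
printf 'tampered\n' > "$demo/bad"
digest=$(sha256sum "$demo/good" | cut -d' ' -f1)
fetch_with_fallback "$digest" "$demo/out" "$demo/missing" "$demo/bad" "$demo/good"
```

The point of the sketch is that the locations are interchangeable and untrusted; only the digest decides whether a download is accepted.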

It is neither affordable nor wanted to switch from the current
extrinsic identification to a complete intrinsic one.  Although it would
fix many issues. ;-)

How about in-between: include both an intrinsic identifier (the
sha256sum) and an extrinsic identifier (the URLs to locate the object
at), like the status quo.

That’s what I am proposing between the lines. :-)

I recommend being explicit.

The question is which design.  For instance, it could go under the field
’properties’ similarly as “upstream name” or potentially other
“metadata”.  Or it could go under the source origin field.

Well, however, as you pointed out, being a ’properties’ would not be as
easy.  And as you also pointed out, the integrity field could be something
else than ’sha256’, so maybe we could have a list here.

To be clear, my comment on Guix supporting other things than sha256 was just a statement of fact, not a proposal to use that mechanism (and neither a proposal to not use that mechanism).

The discussion could also fit how to distribute using ERIS.

ERIS is not a method on its own; you need to combine it with a P2P
network that uses ERIS.  I do not understand the special focus on ERIS.

Yes, indeed.  However, to my knowledge, each P2P can use its own
identifier and from my understanding, ERIS relies on whatever P2P.
Therefore, wanting guix-daemon to be able to use ERIS somehow
implies a discussion about the identifiers used by the P2P networks.

Do I miss something?

I don't have any issue with ERIS itself (*). The issue I have with ERIS, is that it often appears to be treated as some panacea that transcends all P2P systems and is fundamentally different from other identifiers used by other P2P systems, but <https://xkcd.com/927/> applies here -- while it might become some universal standard, it isn't yet.

Hence, ‘I do not understand the __special__ focus on ERIS’ (emphasis added). As long as the ERIS identifier is treated as one among many instead of somehow being considered special, it's fine to me.

(*) Besides several technical issues in its current implementation -- the implementation of ERIS is optimised for classical transports instead of P2P transports, ERIS is only implemented for IPFS currently, and ERIS doesn't have a deduplication system for directories. (In GNUnet, and I think in IPFS and BitTorrent too, if two directories (e.g. store items) that have a file in common were put into the P2P network, then for the network's purposes the common file is one and the same file, so availability of one store item aids the availability of the other store item.)

At some point, I was thinking to have something like “guix freeze -m
manifest.scm” returning a map of all the sources from the deep bootstrap
to the leaf packages described in manifest.scm.  However, maybe
something is poor in the metadata we collect at package time.

That sounds like "guix build --sources=transitive" to me, except for
being even more transitive.  I propose making this an additional option
for the --sources argument instead.

No.  “guix build --sources=transitive” returns an archive containing all
the sources.  Instead, I would like all the various identifiers (URL,
NAR, SWHID, GNUnet, etc.) of all the transitive sources.

I do not see how making a list of all identifiers helps with robustness -- you need the object the identifiers point to, not the identifier itself.

Unless the goal is to use the map of package->identifiers to determine which packages are currently lacking redundancy (i.e., have few identifiers), which to be clear seems reasonable to me.
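
A rough sketch of that use, assuming the map is available as a "package<TAB>identifier" table. The table format, the function name lacking_redundancy and the sample rows are all made up for illustration; nothing like this is produced by Guix today.

```shell
# Hypothetical table: one "package<TAB>identifier" row per known identifier.
identifiers=$(mktemp)
printf '%s\t%s\n' \
    hello 'mirror://gnu/hello/hello-2.12.1.tar.gz' \
    hello 'swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a' \
    foo   'https://example.org/foo-1.0.tar.gz' \
    > "$identifiers"

# lacking_redundancy THRESHOLD TABLE
# Print the packages that have fewer than THRESHOLD identifiers.
lacking_redundancy() {
    awk -F'\t' -v threshold="$1" '
        { count[$1]++ }
        END { for (p in count) if (count[p] < threshold) print p }
    ' "$2" | sort
}

lacking_redundancy 2 "$identifiers"    # prints: foo
```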

Cheers,
simon

PS:

However the fields ’swhid’ and the other SHA256 ’digest’ are different
from above.  That’s because the dots [...] part.  It probably comes from
the normalization process. Well, I am not sure to deeply understand why
it is different but that’s another story. :-)

The reason for the normalisation was something about SWH only providing
tarballs whose contents are equal to the ingested tarball; the tarballs
are not bit-for-bit identical to the ingested tarball.  But Guix needs
bit-for-bit identical tarballs, so Disarchive contains the information
that was stripped out by SWH to complement the tarballs provided by
SWH.

SWH is not in the picture with the example I provided. :-)  Yes, the
dots part is related to some normalization and “metadata”.

Your question was about where the differences come from. The answer is ‘because SWH normalisation stuff’. As such, SWH is in the picture.

What I do not understand is, if “guix build hello -S” is manually
uncompressed and untarred, the content corresponds to:

     $ guix hash -S git -H sha256 -f hex hello-2.12.1
     cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4

The tool ’disarchive’ disassembles the compressed archive; it first
provides the hash of the compressed archive (.tar.gz), then stores
metadata about compression level, algorithm etc., then provides the hash
of the uncompressed archive (.tar), then stores metadata about files and
last it provides the hash of the tree. It reads,

     (input (directory-ref
              (version 0)
              (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
              (addresses
                (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
              (digest
                (sha256
                  "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))

and I do not understand why it is not the same as manually computed; see
above.   Well, that’s a detail and not relevant to the current
discussion since it is part of how Disarchive works internally.

You are hashing the 'hello-2.12.1' directory, which is the only directory in the tarball. However, while it is considered bad practice, a tarball can contain multiple top-level entries. As such, you should consider the tarball as an encoding of a directory that happens to contain the 'hello-2.12.1' directory, and hash the wrapper directory instead of its member hello-2.12.1:

$ mkdir a
$ cd a
$ tar -xf /gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz
$ guix hash -Sgit -H sha256 -f hex .
1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0

Using these steps, the value in the (digest (sha256 ...)) is recovered.
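
The difference between the two views can be seen without Guix. This is a toy sketch: the hello-0.1 tarball is made up, and it only lists paths instead of computing a Git-style hash. Seen from the wrapper directory, every path carries the extra 'hello-0.1' component, so any directory hash that covers entry names will differ between the wrapper and its member.

```shell
# Build a toy tarball with a single top-level directory.
work=$(mktemp -d)
mkdir -p "$work/src/hello-0.1"
printf 'hello\n' > "$work/src/hello-0.1/README"
tar -C "$work/src" -cf "$work/hello-0.1.tar" hello-0.1

# Extract into a fresh wrapper directory, as in the steps above.
mkdir "$work/a"
tar -C "$work/a" -xf "$work/hello-0.1.tar"

# File paths as seen from the wrapper directory and from its only member.
wrapper_paths=$(cd "$work/a" && find . -type f | sort)
member_paths=$(cd "$work/a/hello-0.1" && find . -type f | sort)
echo "wrapper: $wrapper_paths"    # wrapper: ./hello-0.1/README
echo "member:  $member_paths"     # member:  ./README
```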

Greetings,
Maxime.


