bug#42162: Recovering source tarballs

bug-guix

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#42162: Recovering source tarballs

From:	Bengt Richter
Subject:	bug#42162: Recovering source tarballs
Date:	Thu, 27 Aug 2020 20:06:51 +0200
User-agent:	Mutt/1.10.1 (2018-07-13)

Hi,

On +2020-08-27 11:41:24 +0200, zimoun wrote:
> Hi,
> 
> On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> > zimoun <zimon.toutoune@gmail.com> writes:
> >
> >> One question is how this database scales?
> >>
> >> For example, a quick back-to-envelop estimation leads to ~1.2GB metadata
> >> for ~14k packages and then an increase of ~700MB per year, both with the
> >> Ludo’s code [1].
> >>
> >> [1] <http://issues.guix.gnu.org/issue/42162#11>
> >
> > It’s a good question.  A good part of the size comes from the
> > representation rather than the data.  Compression helps a lot here.  I
> > have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> > little better than your estimation).  If I pass each file through Lzip,
> > it shrinks down to 60M.  That’s more like 15.5K per package, which is
> > almost an order of magnitude smaller than the estimation you used
> > (120K).  I think that makes the numbers rather pleasant, but it comes at
> > the expense of easy storing in Git.
> 
> Thank you for these numbers.  Really interesting!
> 
> First, I do not know if the database needs to be stored with Git.  What
> should be the advantage? (naive question :-))
> 
> 
> On SWH T2430 [1], you explain the “default-header” trick to cut down the
> size.  Nice!
> 
> Moreover, the format is a long list, e.g.,
> 
> --8<---------------cut here---------------start------------->8---
> (headers

How about
    (X-v1-headers
(borrowing from rfc2045 MIME usage indicating as-yet-not-a-formal-standard)
The idea is to make it easy to script the change to "(headers" once
there is consensus for declaring a new standard. The "v1-" part could allow
a simultaneous "(X-v2-headers" alternative for zimoun's concise suggestion,
or even a base64 of a compressed format. There's lots that could be borrowed 
from
the MIME rfc's :)

--8<---------------cut here---------------start------------->8---
6.3.  New Content-Transfer-Encodings

   Implementors may, if necessary, define private Content-Transfer-
   Encoding values, but must use an x-token, which is a name prefixed by
   "X-", to indicate its non-standard status, e.g., "Content-Transfer-
   Encoding: x-my-new-encoding".  Additional standardized Content-
   Transfer-Encoding values must be specified by a standards-track RFC.
   The requirements such specifications must meet are given in RFC 2048.
   As such, all content-transfer-encoding namespace except that
   beginning with "X-" is explicitly reserved to the IETF for future
   use.

   Unlike media types and subtypes, the creation of new Content-
   Transfer-Encoding values is STRONGLY discouraged, as it seems likely
   to hinder interoperability with little potential benefit
--8<---------------cut here---------------end--------------->8---

>     ((name "raptor2-2.0.15/")
>      (mode 493)
If you want to be more human-readable with mode, I would put
a chmod argument in place of 493 :)

--8<---------------cut here---------------start------------->8---
$ printf "%o\n" 493
755
$ 
--8<---------------cut here---------------end--------------->8---

Hm, could this be a security risk??
I mean, could a mode typo here inadvertently open a door for a nasty mod
by oportunistic code buried in a later-executed apparently unrelated app?

>      (mtime 1414909500)
One of these might be more human-recognizable :)
--8<---------------cut here---------------start------------->8---
$ date --date='@1414909497' -Is
2014-11-02T07:24:57+01:00
$ date --date='@1414909497' -uIs
2014-11-02T06:24:57+00:00
$ TZ=America/Buenos_Aires date --date='@1414909497' -Is
2014-11-02T03:24:57-03:00
$
$ date --date='@1414909497' -u '+%Y%m%d_%H%M%S'
20141102_062457
# vs 1414909497, which, yes, costs 5 chars less
$ 
--8<---------------cut here---------------end--------------->8---

>      (chksum 4225)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/")
>      (mode 493)
>      (mtime 1414909497)
>      (chksum 4797)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/ltversion.m4")
>      (size 690)
>      (mtime 1414908273)
>      (chksum 5958))
> 
>      […])
> --8<---------------cut here---------------end--------------->8---
> 
> which is human-readable.  Is it useful?
> 
> 
> Instead, one could imagine shorter keywords:
>
(X-v2-headers  ;; ;-)
>     ((na "raptor2-2.0.15/")
>      (mo 493)
>      (mt 1414909500)
>      (ch 4225)
>      (ty 53))
> 
> which using your database (commit fc50927) reduces from 295MB to 279MB.
> 
> Or even plain list:
>
(X-v3-headers
>    (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
>    (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
> 
> where the first element provides the “type” of list to ease the reader.
> 
> 
> Well, the 2 naive questions are: does it make sense to
>  - have the database stored under Git?
>  - have an human-readable format?
> 
> 
> Thank you again for pushing forward this topic. :-)
> 
> All the best,
> simon
> 
> [1] https://forge.softwareheritage.org/T2430#47522
> 
> 
> 

Prefixing "X-" can obviously be used with any tentative name for anything.

I am suggesting it as a counter to premature (and likely clashing) bindings
of valuable names, which IMO is as bad as premature optimization :)

Naming is too important to be defined by first-user flag-planting, ISTM.
-- 
Regards,
Bengt Richter

[Prev in Thread]

Current Thread

[Next in Thread]

bug#42162: Recovering source tarballs, Timothy Sample, 2020/08/03
- bug#42162: Recovering source tarballs, Ludovic Courtès, 2020/08/05
  - bug#42162: Recovering source tarballs, Timothy Sample, 2020/08/05
    - bug#42162: Recovering source tarballs, Ludovic Courtès, 2020/08/23
- bug#42162: Recovering source tarballs, Ricardo Wurmus, 2020/08/03
- bug#42162: Recovering source tarballs, zimoun, 2020/08/26
  - bug#42162: Recovering source tarballs, Timothy Sample, 2020/08/26
    - bug#42162: Recovering source tarballs, zimoun, 2020/08/27
    - bug#42162: Recovering source tarballs, Ludovic Courtès, 2020/08/27
    - bug#42162: Recovering source tarballs, Bengt Richter <=

Prev by Date: bug#42162: Recovering source tarballs
Next by Date: bug#43013: System reconfigure: expection reached when executing `start` on "file-system-/sys/kernel/debug" // and an unrelated "nouveau"-bug?
Previous by thread: bug#42162: Recovering source tarballs
Next by thread: bug#42702: catch2 build failure on ARM
Index(es):
- Date
- Thread