[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Identical files across subsequent package revisions

From: Ludovic Courtès
Subject: Re: Identical files across subsequent package revisions
Date: Wed, 06 Jan 2021 10:27:28 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)


pukkamustard <> skribis:

> Your research inspired me to do conduct some experiments towards
> de-duplication.
> For two similar packages (emacs-27.1 and emacs-26.3) I was able to
> de-duplicate ~12% using EROFS and ERIS. Still far from the ~85%
> similarity, but an attempt I'd like to share.
> The two main ingredients:
> - EROFS (Enhanced Read-Only File-System) is a read-only,   compressed
>  file-system comparable to SquashFS. It has some properties that
>  make
>  it more suitable than SquashFS (it aligns content to fixed block
>  size). EROFS is in mainline Linux Kernel since v5.4.
> - ERIS (Encoding for Robust Immutable Storage) is an encoding of
>   content
>  into uniformly sized blocks that I've been working on. It
>  de-couples
>  encoding of content from storage and transport layer. Transport
>  layers
>  can be things like IPFS, GNUNet, Named Data Network or just a   plain
>  old HTTP service.
> I make EROFS images of the packages and encode them with ERIS, which
> de-duplicates blocks as part of the encoding process.
> With this I manage to de-duplicate between 12-17% (depending on some
> parameters).

Very nice!  I wonder what the file-level similarity is compared to the
block-level similarity.

> This could allow:
> - Directly mounting packages instead of unarchiving (a la distri)

Yeah, I’m not sure about this part.  It would be a radical change for
Guix in terms of code, and I also wonder about the efficiency: sure
mounting the package would be instantaneous, but subsequent reads would
be slowed down compared to the current approach.  Maybe the slowdown is
only on the first hit though, and maybe it’s hardly measurable, dunno.

> - Peer-to-peer distribution of packages (that's what ERIS is for)

Yup, looking forward to that.

> - De-duplicating common content in packages to a certain extent
>   (topic
>  of this thread)
> A more in-depth write-up:

Great writeup and nice tooling that you have here!


reply via email to

[Prev in Thread] Current Thread [Next in Thread]