[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Identical files across subsequent package revisions

From: Ludovic Courtès
Subject: Identical files across subsequent package revisions
Date: Tue, 22 Dec 2020 23:01:17 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

Hello Guix!

Every time a package changes, we end up downloading complete substitutes
for itself and for all its dependents, even though we have the intuition
that a large fraction of the files in those store items are unchanged.

I wanted to evaluate that by looking at store items corresponding to
subsequent revisions of a package (be it different versions or rebuilds
induced by dependencies), and this is what the program below does.

Here are preliminary results for a few packages:

PNG image

It reads as follows:

  • Pairwise similarities of subsequent revisions of IceCat contain
    identical files representing ~10% of its total size (it’s the ratio
    of the size of the identical files to the total file size.)

    The first bar represents similarity between the first and the second
    revision; the second bar shows the similarity between the second one
    and the third one, etc.

  • For Emacs, identical files between subsequent revisions represent
    ~85% of its total size.  Intuitively, this is because the package
    contains mostly source code (.el.gz files) and bytecode, which
    remains unchanged.

  • Diffoscope is written in Python so it’s similar to Emacs: its .py
    file remain unchanged across revisions, and they represent ~60% of
    its total size.

  • GIMP has lots of headers, locale data, and icons that don’t change.

  • For R, we see the effect of the upgrades to 4.0.2 and then 4.0.3,
    where similarity drops to ~25% instead of ~80% when changes are in

  • For Open MPI, which is compiled C + headers, ~25% is shared across

The reason I’m looking at this is to understand how much would be gained
in terms of bandwidth usage if we were able to avoid downloading
individual files already in the store.  It would seem to be rather

Below is the program that does that.  It grabs revision history from, fetches nars from, computes a
“digest” (list of files along with their hash and size), compares
package digests pairwise, and plots the result with Guile-Charting.
Example REPL session:

--8<---------------cut here---------------start------------->8---
scheme@(similarities)>  (pairwise-similarities (package-archive-contents 
"icecat" #:max 4))
updating substitutes from ''... 100.0%
$86 = (17363915/196387883 11380615/98152193)
scheme@(similarities)> (map exact->inexact $86)
$87 = (0.08841642740249916 0.11594865740799087)


scheme@(similarities)> ,pp (at-most 10 (package-instances "r-minimal") )
$100 = (#<<package-instance> version: "4.0.3" output: 
 #<<package-instance> version: "4.0.3" output: 
 #<<package-instance> version: "4.0.3" output: 
 #<<package-instance> version: "4.0.2" output: 
 #<<package-instance> version: "4.0.2" output: 
 #<<package-instance> version: "4.0.2" output: 
 #<<package-instance> version: "4.0.1" output: 
 #<<package-instance> version: "4.0.1" output: 
 #<<package-instance> version: "4.0.0" output: 
 #<<package-instance> version: "4.0.0" output: 
$101 = (#<<package-instance> version: "3.6.3" output: 
 #<<package-instance> version: "3.6.2" output: 
 #<<package-instance> version: "3.6.1" output: 
 #<<package-instance> version: "3.6.1" output: 
 #<<package-instance> version: "3.6.1" output: 
 #<<package-instance> version: "3.6.1" output: 
 #<<package-instance> version: "3.6.1" output: 
 #<<package-instance> version: "3.6.0" output: 
 #<<package-instance> version: "3.6.0" output: 
 #<<package-instance> version: "3.6.0" output: 
 #<<package-instance> version: "3.6.0" output: 
 #<<package-instance> version: "3.5.3" output: 
 #<<package-instance> version: "3.5.3" output: 
 #<<package-instance> version: "3.5.1" output: 
 #<<package-instance> version: "3.5.1" output: 
 #<<package-instance> version: "3.5.1" output: 
scheme@(similarities)> (similarity-chart '("icecat" "gimp" "openmpi" "emacs" 
"diffoscope" "r-minimal") "/tmp/t.png" #:max 8)
updating substitutes from ''... 100.0%
$102 = #<cairo-surface 7f502c7adb10>
--8<---------------cut here---------------end--------------->8---

Thoughts?  :-)


;;; Copyright © 2020 Ludovic Courtès <>
;;; Hereby placed under the GNU General Public License, version 3 or later.

(define-module (similarities)
  #:use-module (json)
  #:use-module ((gcrypt hash) #:select (open-sha256-port))
  #:use-module ((guix store) #:select (%default-substitute-urls))
  #:use-module ((guix progress) #:hide (dump-port*))
  #:use-module (guix http-client)
  #:use-module (guix serialization)
  #:use-module (guix scripts substitute)
  #:use-module (srfi srfi-1)
  #:use-module (srfi srfi-9)
  #:use-module (srfi srfi-11)
  #:use-module (rnrs bytevectors)
  #:use-module (charting)
  #:use-module (ice-9 match))

;;; Data Service client.

(define-json-mapping <package-instance> make-package-instance
  (version     package-instance-version)
  (output      package-instance-output "derivation"))

(define %data-service-base-url
  (make-parameter "";))

(define* (package-instances package #:key (branch "master"))
  "Return a list of <package-instance> representing instances of PACKAGE over
time known to the Data Service."
  (match (assoc-ref (json->scm
                     (http-fetch (string-append (%data-service-base-url)
                                                branch "/package/" package
    (#(lst ...)
     (map json->package-instance lst))

;;; Similarity measurement.

(define (port-sha256* port size)               ;from (guix scripts challenge)
  ;; Like 'port-sha256', but limited to SIZE bytes.
  (let-values (((out get) (open-sha256-port)))
    (dump-port* port out size)
    (close-port out)

(define-record-type <file-info>
  (file-info name type size sha256)
  (name   file-info-name)
  (type   file-info-type)
  (size   file-info-size)
  (sha256 file-info-sha256))

(define (archive-contents port)
  "Return a list of <file-info> records from the nar read from PORT."
  ;; As in (guix scripts challenge), but return <file-info> records that
  ;; include file size and ignore symlinks.
  (fold-archive (lambda (file type contents result)
                  (match type
                    ((or 'regular 'executable)
                     (match contents
                       ((port . size)
                        (cons (file-info file type size
                                         (port-sha256* port size))
                    ('directory result)
                    ('directory-complete result)
                    ('symlink result)))

(define (narinfo-contents narinfo)             ;from (guix scripts challenge)
  "Fetch the nar described by NARINFO and return a list representing the file
it contains."
  ((@@ (guix scripts challenge) call-with-nar) narinfo archive-contents))

(define (at-most max-length lst)              ;from (guix scripts substitute)
  "If LST is shorter than MAX-LENGTH, return it and the empty list; otherwise
return its MAX-LENGTH first elements and its tail."
  (let loop ((len 0)
             (lst lst)
             (result '()))
    (match lst
       (values (reverse result) '()))
      ((head . tail)
       (if (>= len max-length)
           (values (reverse result) lst)
           (loop (+ 1 len) tail (cons head result)))))))

(define* (package-archive-contents package
                                   #:key (max 10)
                                   (substitute-urls %default-substitute-urls))
  "Look at the MAX latest instances of PACKAGE, fetch them, and return a
summary of their contents as returned by 'narinfo-contents'."
  (let ((instances (at-most max (package-instances package))))
    (map narinfo-contents
         (lookup-narinfos (first substitute-urls)
                          (map package-instance-output instances)))))

(define (similarity contents1 contents2)
  "Return two values: the ratio of identical bytes between CONTENTS2 and
CONTENTS2, and the ratio of identical files."
  (define (matches name)
    (lambda (info)
      (string=? (file-info-name info) name)))

  (let ((files (delete-duplicates
                (append (map file-info-name contents1)
                        (map file-info-name contents2)))))
    (let loop ((files files)
               (seen 0)
               (identical 0)
               (seen-bytes 0)
               (identical-bytes 0))
      (match files
         (values (/ identical-bytes seen-bytes)
                 (/ identical seen)))
        ((head . tail)
         (let ((file1 (find (matches head) contents1))
               (file2 (find (matches head) contents2)))
           (cond ((not file1)
                  (loop tail
                        (+ seen 1) identical
                        (+ seen-bytes (file-info-size file2))
                 ((not file2)
                  (loop tail
                        (+ seen 1) identical
                        (+ seen-bytes (file-info-size file1))
                  (let ((identical?
                         (and (= (file-info-size file1)
                                 (file-info-size file2))
                              (bytevector=? (file-info-sha256 file1)
                                            (file-info-sha256 file2)))))
                    (loop tail
                          (+ seen 1)
                          (if identical?
                              (+ identical 1)
                          (+ seen-bytes
                             (max (file-info-size file1)
                                  (file-info-size file2)))
                          (if identical?
                              (+ identical-bytes (file-info-size file1))

(define (pairwise-similarities contents)
  (let loop ((contents contents)
             (similarities '()))
    (match contents
      ((or () (_))
       (reverse similarities))
      ((a b . tail)
       (loop (cons b tail)
             (cons (similarity a b) similarities))))))

(define* (similarity-chart packages file
                           #:key (max 20))
   "Similarity across subsequent package revisions"
   (map (lambda (package)
          (let* ((contents     (package-archive-contents package
                                                         #:max max))
                 (similarities (pairwise-similarities contents))
                 (labels       (iota (length similarities))))
               ,@(map (lambda (label ratio)
                        `(,(* ratio 100.) ,(number->string label)))
                      labels similarities))))
   #:chart-params '(#:x-axis-label "package"
                    #:y-axis-label "similarity ratio (%)")
   #:write-to-png file))

reply via email to

[Prev in Thread] Current Thread [Next in Thread]