
Re: [Bug-xorriso] generating reproducible ISOs with xorriso


From: Daniel Kahn Gillmor
Subject: Re: [Bug-xorriso] generating reproducible ISOs with xorriso
Date: Thu, 04 Jun 2015 19:34:32 -0400
User-agent: Notmuch/0.20.1 (http://notmuchmail.org) Emacs/24.4.1 (x86_64-pc-linux-gnu)

Hi Thomas--

On Thu 2015-06-04 15:32:43 -0400, Thomas Schmitt wrote:
> The files are grafted into a red-black tree according to
> their inode and device numbers on hard disk. This is done
> to merge hardlinks.
> The tree is then serialized into an array which gets
> sorted according to the weight values. The extent addresses
> are then assigned according to the sorted sequence.

Thanks for this summary.  It makes sense why it is this way, and it also
makes sense why this does not look good for reproducibility. :/

> Brute force would be a giant weight list which gives
> the sorting rank of each file. Not very appealing to users.

well, --sort-weight-list is an option designed to do exactly that; i
could try it just to validate assumptions.  I have a couple questions
about the --sort-weight-list option:

 * it says each line in the file is a weight and an iso_rr_path.

    - is the weight always interpreted as decimal, or can it be supplied
      in hex (with a 0x prefix)?  (the latter would be easier to convert
      the output of an md5sum file)

    - is the iso_rr_path the raw ISO name (e.g. "BOOT/GRUB/GRUB.CFG;1")
      or will it work from the source name relative to the root
      (e.g. "boot/grub/grub.cfg") ?

 * it says "if iso_rr_path leads to a directory then all regular files
   underneath will get the weight number" -- what if the regular files
   themselves are specified?

 * if a file has a weight specified multiple times, which specified
   weight "wins" -- first or last?


If we can use the normal filesystem path, then the following pipeline
ought to work to generate the --sort-weight-list file for the current
directory:

 (find . -type f -print0 | xargs -0 md5sum | sort | cut -f2- -d/ ;
  find . -mindepth 1 \! -type f | sort | cut -f2- -d/ ) |
  awk '{ N=N+1; print N " " $0 }'

It puts the files first (ordered by md5sum) and then all the non-files
(hopefully directories), ordered by pathname.  This works for me (it
fixes the extents), but i still end up with some variance in the bytes
of the isos:


makewlist() {
 ( cd "$1" &&
   (find . -type f -print0 | xargs -0 md5sum | sort | cut -f2- -d/ ;
    find . -mindepth 1 \! -type f | sort | cut -f2- -d/ ) |
    awk '{ N=N+1; print N " " $0 }'
 )
}

timestamp="$(date -u +%Y%m%d%H%M%S00)"

args="-volume_date c $timestamp -volume_date m $timestamp"
args="$args -volume_date x default -volume_date f default"
args="$args -chmod_r a=rx / --"
args="$args -chown_r 0 / --"
args="$args -alter_date_r b $timestamp / --"

mkisos() {
  local base="$(basename "$1")"
  makewlist "$1" > "$base.weights"
  xorriso -as mkisofs -r -J -o "$base.iso" "$1" -- $args
  sleep 3
  xorriso -as mkisofs -r -J -o "$base.weighted.iso" "$1" \
    --sort-weight-list "$base.weights" -- $args
}


cmpiso() {
 printf "comparing %s and %s:\n" "$1" "$2"
 diff -u <(isoinfo -d -i "$1") <(isoinfo -d -i "$2")
 diff -u <(isoinfo -l -R -i "$1") <(isoinfo -l -R -i "$2")
}

testtree() {
  if ! [ -d "$1" ]; then
    echo "needs a directory!"
    return 1
  fi

  local base="$(basename "$1")"
  cp -aT "$1" a
  sleep 3
  mkisos "$1"
  sleep 3
  mkisos a
  sleep 3
  sha1sum *.iso
  cmpiso "$base.iso" a.iso | head
  cmpiso "$base.weighted.iso" a.weighted.iso | head
}
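
A hedged variant of the makewlist helper above, as a sketch only: sort
orders lines according to the current locale, so pinning LC_ALL=C should
make the weight list independent of the environment it is generated in.
The name makewlist_c is hypothetical, GNU find, xargs and md5sum are
assumed, and file names containing newlines are not handled.

```shell
# Hypothetical variant of makewlist: same ordering idea (regular files by
# md5sum, then everything else by path), but with the sort locale pinned.
makewlist_c() {
  ( cd "$1" &&
    { find . -type f -print0 | xargs -0 -r md5sum | LC_ALL=C sort | cut -d/ -f2- ;
      find . -mindepth 1 \! -type f | LC_ALL=C sort | cut -d/ -f2- ; } |
    awk '{ print NR " " $0 }'   # NR is the running line number, used as the weight
  )
}

# Usage (illustrative):
#   makewlist_c /boot/grub > grub.weights
```

The -r flag keeps xargs from running md5sum at all when the tree has no
regular files.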



I run this in a clean directory (it will create files in the current
working directory) as:

 testtree /boot/grub


and the final output is:


------------
[...]

007a753430906832cd8497b9f799cf4e462e876e  a.iso
9dff629e89ac83f2155e602a1f64edba2cec5504  a.weighted.iso
e7af3a1c173d9ed677f93bba3d459746d3b57de4  grub.iso
d511790827e47d5fd7e61c61e91ee596a28d1ae0  grub.weighted.iso
comparing grub.iso and a.iso:
--- /dev/fd/63  2015-06-04 19:20:27.482645873 -0400
+++ /dev/fd/62  2015-06-04 19:20:27.482645873 -0400
@@ -2,321 +2,321 @@
 Directory listing of /
 d---------   0    0    0            2048 Jun  4 2015 [     19 02]  . 
 d---------   0    0    0            2048 Jun  4 2015 [     19 02]  .. 
-----------   0    0    0          117760 Jun  4 2015 [   1170 00]  CORE.EFI;1 
-----------   0    0    0             127 Jun  4 2015 [     66 00]  DEVICE.MAP;1 
+----------   0    0    0          117760 Jun  4 2015 [   1632 00]  CORE.EFI;1 
comparing grub.weighted.iso and a.weighted.iso:
-------------

So that's good: the isoinfo listings of the two weighted ISOs are
identical, and we've solved the extents issue.

But the ISOs still have bytes that differ.  Here's a sample:

0 address@hidden:/tmp/cdtemp.NBVnsL$ diff -u <(hd < grub.weighted.iso ) <(hd < a.weighted.iso )| head
--- /dev/fd/63  2015-06-04 19:32:40.261987884 -0400
+++ /dev/fd/62  2015-06-04 19:32:40.261987884 -0400
@@ -79,7 +79,7 @@
 00009830  00 00 00 41 6d 01 00 00  00 00 00 00 01 00 00 00  |...Am...........|
 00009840  00 00 00 00 00 00 00 00  00 00 00 00 00 54 46 1a  |.............TF.|
 00009850  01 0e 73 06 04 16 39 1e  00 73 06 04 16 39 1e 00  |..s...9..s...9..|
-00009860  73 06 04 17 1d 04 00 43  45 1c 01 14 00 00 00 00  |s......CE.......|
+00009860  73 06 04 17 1d 0b 00 43  45 1c 01 14 00 00 00 00  |s......CE.......|
 00009870  00 00 14 00 00 00 00 00  00 00 00 ed 00 00 00 00  |................|
 00009880  00 00 ed 00 60 00 13 00  00 00 00 00 00 13 00 08  |....`...........|
0 address@hidden:/tmp/cdtemp.NBVnsL$ 

Any thoughts?  Can you reproduce the same behavior with the scripts
here?
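
One possible reading of the hex diff above, offered as an assumption and
not a confirmed diagnosis: the bytes "54 46 1a 01 0e" look like a Rock
Ridge "TF" (time stamp) entry of length 0x1a with flags 0x0e, followed by
three 7-byte timestamps (years since 1900, month, day, hour, minute,
second, zone offset).  The hypothetical helper decode_ts7 below decodes
one such timestamp from its hex bytes; it ignores the zone byte, which is
00 in the dump.

```shell
# Hypothetical helper: decode a 7-byte timestamp given as hex bytes.
# Assumes the layout: years since 1900, month, day, hour, minute, second, zone.
decode_ts7() {
  printf '%04d-%02d-%02d %02d:%02d:%02d\n' \
    "$((1900 + 0x$1))" "$((0x$2))" "$((0x$3))" \
    "$((0x$4))" "$((0x$5))" "$((0x$6))"
}

decode_ts7 73 06 04 16 39 1e 00   # first and second timestamp, same in both ISOs
decode_ts7 73 06 04 17 1d 04 00   # third timestamp in grub.weighted.iso
decode_ts7 73 06 04 17 1d 0b 00   # third timestamp in a.weighted.iso
```

If that reading is right, only the seconds of the third timestamp differ
(by 7) between the two ISOs, while the first two match.  That would
suggest a per-run timestamp that the commands in $args are not pinning,
and it would be consistent with the sleeps between the xorriso runs.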


> I will have to think about this.
>
> Maybe determine MD5 of file content and sort according to
> it before sorting for weights ?

You're using the MD5 of the content instead of just the name of the file
because you want to preserve the hardlink mechanism, right?

What would you do for directories?  Since there should be no hardlinked
directories, maybe for those we just hash the name of the directory?  in
that case, you'd want to distinguish between a file with contents
"boot/grub" and the directory "boot/grub", so maybe you need to prefix
both digests:  for files, it's MD5("f"||content) and for directories,
it's MD5("d"||name).
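
A minimal sketch of that prefixing idea, assuming md5sum from GNU
coreutils.  The helper names file_key and dir_key are hypothetical and
are not part of xorriso or libisofs; this only illustrates how the two
digests stay distinct.

```shell
# Hypothetical helpers illustrating MD5("f"||content) and MD5("d"||name).
file_key() {
  # Digest over the letter "f" followed by the file's content.
  { printf 'f'; cat -- "$1"; } | md5sum | cut -d' ' -f1
}

dir_key() {
  # Digest over the letter "d" followed by the directory's name.
  printf 'd%s' "$1" | md5sum | cut -d' ' -f1
}

# Usage (illustrative):
#   file_key boot/grub/grub.cfg
#   dir_key boot/grub
```

With these, a regular file whose content is the string "boot/grub" and
the directory named "boot/grub" get different keys, while hardlinked or
otherwise identical files still share one key.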

> Time consuming and needs at least 16 extra bytes per file,
> but would work automatically.
> CRC32 would be smaller but the birthday paradox would hit
> us within the expectable number of files.

We definitely don't want CRC32, that's way too small of an output space,
and files with the contents "ehhhhipt" would end up colliding with files
with the contents "bbceqquy" (for example) -- by the birthday bound we'd
expect a collision within about 64Ki (2^16) files.

> Half an MD5 would be good for about 4 billion files.

I think your math is right as well (64 bit keyspace, birthday paradox
suggests a collision at around 2^32 files, which is about 4 billion).
Can you even put 4Gi files in an ISO?  That seems safe against
randomness (and malicious attempts to break reproducibility are nothing
to worry about, i think).
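
The rule of thumb behind both numbers can be sketched as follows,
assuming digests behave like uniformly random n-bit values: a collision
becomes likely once roughly 2^(n/2) files have been hashed.  The function
name birthday_bound is hypothetical.

```shell
# Hypothetical helper: rough file count at which a collision among random
# n-bit digests becomes likely, using the 2^(n/2) rule of thumb.
birthday_bound() {
  awk -v bits="$1" 'BEGIN { printf "%.0f\n", 2 ^ (bits / 2) }'
}

birthday_bound 32   # CRC32: prints 65536 (64Ki)
birthday_bound 64   # half an MD5: prints 4294967296 (about 4 billion)
```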

> Or optionally omit the red-black tree and rather create
> directly the array from traversing the tree of upcoming
> ISO files ? (That was my idea before Vreixo Formoso implemented
> the red-black tree for hardlink detection.)
> The ISO might get larger that way. Not every user will like.

giving up the hardlink detection would be a sad waste -- plus, i might
want an ISO to have both properties: hardlink detection and
reproducibility.

> Intermediate syntax was that "--" separates options of
> grub-mkrescue and of xorriso -as mkisofs. (Actually the
> better CLI. But the old one was released with GRUB 2.00.)
> If "--" was desired for xorriso, it had to be given twice.

Duly noted, i think we're out of the woods on this already in debian,
but i'll make sure we look out for it.

Regards,

     --dkg


