Re: [Gnu-arch-users] Binary Diff System in Arch

From: Tom Lord
Subject: Re: [Gnu-arch-users] Binary Diff System in Arch
Date: Wed, 4 Feb 2004 08:59:18 -0800 (PST)

    > From: Sam Phillips <address@hidden>

    >      I'm starting to contemplate work on a project, and I want to use
    >  Arch for revision control.  As part of the project I need to be able to
    >  store non-human readable data inside the tree. 

    >     I remember a thread a while back (it may have been about a year ago
    > now) where people were talking about having some kind of hooks for
    > diff/patch so that you could use Xdelta -- or some other delta'ing tool
    > -- in Arch.  I was wondering if anyone has been working on this and what
    > state this idea is in right now.

    >     If this needs works and is still worth doing I might take a stab at
    > it soon.

You got lots of good replies which could be summarized:

~ nobody is known to be working on it
~ we've talked about the design a bit
~ the trick is to get diff and patch to "Do the Right Thing"
~ one way to do that is by wrapping diff and patch, another by
  modifying them
~ people will need your changes in one form or another if they
  are going to be able to read your archives

I'm going to follow up with two additional points:

~ in addition to changing diff and patch, you'll also need to 
  change diff3.

~ enclosed is an explanation of how things work and an outline of what
  needs to be done


* text vs. binary files

  GNU diff currently uses a heuristic to decide that a file is a 
  binary file before it does any further work.  _As_I_recall_, the
  heuristic is to look at the first 1K (or is it 4K) of the file and,
  if any 0-bytes are found, assume that the file is binary.

  (GNU diff _may_ currently have -- or could be expected to have in
  the future -- additional heuristics.)
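
  A minimal sketch of the heuristic as recalled above, not GNU diff's
  actual code.  The function name and the 4096-byte probe size are
  assumptions for illustration (the size is explicitly uncertain above):

```python
def looks_binary(data: bytes, probe_size: int = 4096) -> bool:
    """Guess "binary" if a 0-byte appears in the first probe_size bytes.

    probe_size is an assumed value; the real prefix length that GNU diff
    inspects may differ.
    """
    return b"\0" in data[:probe_size]
```

  Note that a 0-byte beyond the probed prefix goes unnoticed, so a
  heuristic of this kind can misclassify some files.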

  When comparing binary files, rather than emitting useful diffs, diff
  just says "binary files differ" or "binary files are the same" as
  the case may be.

* diff output vs. patch

  Patch reads its input file and tries (tolerantly of extra junk)
  to parse it as one of the output formats of diff.   Of course,
  a string like "binary files differ" means nothing to patch
  and it will exit with an error given such input.

* tla binary file detection

  tla trusts diff to detect binary files.  It uses a crude
  technique: looking for the string "binary files" at the
  start of diff output.   Strangely enough, this seems to work
  regardless of how a user's locale is set.

  If diff has reported that at least one of the two files which differ
  is a binary file, then rather than storing the diff output,
  tla falls back to the strategy of storing whole-text copies
  of both files.
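
  The decision just described could be sketched as follows.  This is
  not tla's code: the function names and the dictionary layout are
  hypothetical, and the case-insensitive match is an assumption made
  so that a capitalized message is also recognized:

```python
def diff_reports_binary(diff_output: str) -> bool:
    """Crude check: does the diff output begin with "binary files"?"""
    return diff_output.lower().startswith("binary files")


def changeset_entry(diff_output: str, old: bytes, new: bytes) -> dict:
    """Store the diff if there is one, else fall back to whole copies.

    The returned dictionary is a made-up stand-in for whatever the
    changeset really records.
    """
    if diff_reports_binary(diff_output):
        return {"kind": "whole-files", "old": old, "new": new}
    return {"kind": "patch", "patch": diff_output}
```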

* critical properties of diff and patch

  If diff _can_ diff two files, tla relies on two key properties 
  of diff and patch:

  Suppose we have two text files A and B.   We can write
  the diff between them:

        diff (A, B)

  Patch provides two operations, forward and backward patching, both
  of which arch relies upon.  These operations have the properties:

        patch (A, diff(A,B)) == B

        reverse_patch (B, diff(A,B)) == A
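
  To make the two equations concrete, here is a toy model of exact
  patching built on Python's difflib.  It is only an illustration:
  the delta representation is invented for this sketch and has
  nothing to do with the real diff output formats, and it makes no
  attempt at inexact patching:

```python
import difflib


def diff(a, b):
    """Record every changed region together with its old and new lines."""
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(i1, i2, j1, j2, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]


def patch(a, delta):
    """Forward patching: turn A into B."""
    out, pos = [], 0
    for i1, i2, _j1, _j2, old, new in delta:
        assert a[i1:i2] == old, "exact patching only in this sketch"
        out += a[pos:i1] + new
        pos = i2
    return out + a[pos:]


def reverse_patch(b, delta):
    """Backward patching: turn B back into A."""
    out, pos = [], 0
    for _i1, _i2, j1, j2, old, new in delta:
        assert b[j1:j2] == new, "exact patching only in this sketch"
        out += b[pos:j1] + old
        pos = j2
    return out + b[pos:]


A = ["one\n", "two\n", "three\n"]
B = ["one\n", "2\n", "three\n", "four\n"]
assert patch(A, diff(A, B)) == B
assert reverse_patch(B, diff(A, B)) == A
```

  The point of storing both the old and the new lines in the delta is
  that the same delta can then be applied in either direction.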

  Additionally, patch provides for "inexact patching".  So that if we
  have a file C, we get:

        patch (C, diff(A,B)) --> C', modified "similarly" to B
                                 relative to A, plus a .rej file
                                 if conflicts occurred

        reverse_patch (C, diff(A,B)) --> C', modified "similarly" to A
                                 relative to B, plus a .rej file
                                 if conflicts occurred

  The exit statuses of the diff and patch processes are critical:

    diff exit statuses
        0 -- files are the same
        1 -- files differ
        2 -- an error during processing

    patch exit statuses
        0 -- patch applied without conflicts
        1 -- patch applied with conflicts
        2 -- an error during processing

  The output format of diff has a critical property:  it is a 
  plain-text format suitable for displaying on a terminal or
  sending in email.   This is critical for the output of commands
  such as "tla changes".

* critical properties of diff3

  In some situations, tla is operating on three text files which could
  be called ANCESTOR, MERGED-FROM, and MERGED-TO.

  It wants to compute what is in essence:

        patch (MERGED-TO, diff(ANCESTOR, MERGED-FROM))

  but does that in a single step using diff3:

        diff3 (MERGED-TO, ANCESTOR, MERGED-FROM)

  There are two reasons for doing that.   First, diff3 can sometimes
  do a better job of avoiding conflicts.   Second, if conflicts occur,
  diff3 can generate a kind of in-file conflict markers that many
  people like (comparable to those generated by CVS, for example).

  diff3 exit status is critical and similar to patch's:

    diff3 exit statuses
        0 -- merged without conflicts
        1 -- merged with conflicts
        2 -- an error during processing

  tla will never attempt to use diff3 if diff reports that at least
  one of ANCESTOR and MERGED-FROM is a binary file.

* So, What to Do? (part 1)

  When people say "modify [or wrap] diff and patch" the general
  idea they have in mind is to make it so that diff _never_ 
  reports "binary files differ".   Instead, it should emit some
  kind of "binary diff" (such as xdelta might produce).   Similarly,
  patch should know how to apply the binary diff.

  All of the critical properties listed above for diff and patch
  must be preserved:

        xpatch (a.jpg, xdiff(a.jpg, b.jpg)) == b.jpg

        xreverse_patch (b.jpg, xdiff(a.jpg, b.jpg)) == a.jpg

        xpatch (c.jpg, xdiff(a.jpg, b.jpg))
          --> c'.jpg, modified "similarly" to b.jpg with
              a .rej file if necessary

        xreverse_patch (c.jpg, xdiff(a.jpg, b.jpg))
          --> c'.jpg, modified "similarly" to a.jpg with
              a .rej file if necessary

        exit statuses as before

        xdiff output must be plain-text -- suitable for what-changed
        output and email.

  The first two equations -- exact patching -- should be fairly easy.
  A good suggestion was made at one point that the output of

        % xdiff a.jpg b.jpg

  should include md5 checksums for both a.jpg and b.jpg, along with
  the xdelta output.  (I'm assuming that xdelta doesn't already
  include those checksums in its output).

  The format of that xdiff output should be such that it can not be
  mistaken for a textual diff.

  The xpatch program should recognize that format and compare the
  checksum for the file being patched to that of a.jpg.   If they
  match, it should apply the xdelta, producing b.jpg.

  And similarly: the xreverse_patch program should recognize that
  format and compare the checksum for the file being patched to that
  of b.jpg.  If they match, it should apply the xdelta in reverse,
  producing a.jpg.

  What if the checksums don't match?  Unlike textual patching, 
  binary patching can't reasonably "fudge it".   So if the checksums
  don't match, then xpatch and xreverse_patch should leave the
  file being patched unmodified -- and store the entire xdiff output
  as a .rej file.
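
  The scheme above might be sketched like this.  Everything specific
  here is an assumption made for illustration: the header line, the
  field names, and above all the delta itself, which is a toy XOR of
  the two files standing in for real xdelta output.  Only the overall
  shape (plain-text output, md5 checksums of both files, refusing to
  patch on a checksum mismatch) comes from the description above:

```python
import base64
import hashlib

# Hypothetical first line; chosen so it cannot be mistaken for a textual diff.
MAGIC = "%%xdiff-binary-delta v0"


def _md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def _xor(x: bytes, y: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one."""
    n = max(len(x), len(y))
    return bytes(p ^ q for p, q in zip(x.ljust(n, b"\0"), y.ljust(n, b"\0")))


def xdiff(old: bytes, new: bytes) -> str:
    """Emit a plain-text delta carrying checksums for both files."""
    return "\n".join([
        MAGIC,
        f"old-md5: {_md5(old)}",
        f"new-md5: {_md5(new)}",
        f"old-len: {len(old)}",
        f"new-len: {len(new)}",
        base64.b64encode(_xor(old, new)).decode("ascii"),
    ]) + "\n"


def xpatch(target: bytes, text: str, reverse: bool = False):
    """Return (exit_status, resulting_bytes, rej_text_or_None)."""
    lines = text.splitlines()
    if len(lines) < 6 or lines[0] != MAGIC:
        raise ValueError("not an xdiff binary delta")
    fields = dict(line.split(": ", 1) for line in lines[1:5])
    delta = base64.b64decode(lines[5])
    src, dst = ("new", "old") if reverse else ("old", "new")
    if _md5(target) != fields[f"{src}-md5"]:
        # No fudging for binaries: leave the file alone and reject
        # the entire xdiff output.
        return 1, target, text
    return 0, _xor(target, delta)[: int(fields[f"{dst}-len"])], None
```

  With these definitions, xpatch(a, xdiff(a, b)) yields status 0 and
  the bytes of b, xpatch(b, xdiff(a, b), reverse=True) yields status 0
  and the bytes of a, and any other input file comes back unmodified
  with status 1 and the whole delta as the .rej text.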

  xdiff3 is easy, given xdiff.  If ANCESTOR and MERGED-TO are exactly
  the same, then the merged file output should just be a copy of
  MERGED-FROM.   Otherwise, the output should be a copy of MERGED-TO 
  and "xdiff ANCESTOR MERGED-FROM" stored as a .rej file.  (Note -- 
  tla will need a slight modification to how it handles the .rej
  file in this case but nothing major.)
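
  The xdiff3 rule is short enough to write out directly.  In this
  sketch the xdiff function is passed in as a parameter so that the
  example stands alone; the (status, merged, rej) return shape is an
  assumption for illustration:

```python
def xdiff3(ancestor: bytes, merged_to: bytes, merged_from: bytes, xdiff):
    """Three-way "merge" for binaries: all or nothing.

    Returns (exit_status, merged_output, rej_or_None), following the
    diff3 convention of 0 for a clean merge and 1 for conflicts.
    """
    if ancestor == merged_to:
        # Nothing changed locally, so just take the other side's file.
        return 0, merged_from, None
    # Local changes exist: keep our file and reject the other side's delta.
    return 1, merged_to, xdiff(ancestor, merged_from)
```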

* What's the Result?

  If you configure arch to use xdiff, xpatch, and xdiff3:

  ~ your revisions that involve binary files will, in general,
    _not_ be readable by people not using xdiff, xpatch, and
    xreverse_patch.   (But if xdiff, xpatch, and xreverse_patch are
    good, either they'll be folded into diff/patch or we'll make sure
    that they are distributed with tla -- so this incompatibility
    will be only a temporary problem.)

  ~ archives created by people not using xdiff, xpatch, and xdiff3
    will be readable by you.

  ~ your revisions that don't involve binary files will be readable
    by everyone else.

  ~ if you _merge_ changes involving binary files, and there are 
    conflicts in the binary files,  you'll get conflict markers but
    you _won't_ get a full-text copy of the merged-from file.

    In many situations that's no loss at all.   You can fetch the 
    merged-from file by other means if you need it.

  ~ if you _merge_ changes involving binary files, but there are
    no conflicts (the copy in your tree is the same as in the base
    revision from which the merge changeset was computed) -- that
    will work as expected:  you'll wind up with an accurately modified
    version of the binary file.

  ~ The contents of .rej files will change: sometimes they will
    contain rejected xdeltas instead of rejected text hunks from diff.

* So, What to Do? (part 2)

  The only drawback to all of this is the result that says:

    ~ if you _merge_ changes involving binary files, and there are
      conflicts in the binary files, you'll get conflict markers but
      you _won't_ get a full-text copy of the merged-from file.

  Sometimes that's desirable, sometimes not.   I think it should be
  an option.

  A reasonable way to make it an option is to allow users to set a
  parameter in ./{arch} which will turn on the xdelta behavior and
  otherwise stick with the current mechanism of using whole-text copies
  of binary files.    It's easy to modify tla to check for this
  parameter and pass an option to diff.

  The implication for xdiff is that it should take an option
  (something like "--xdelta-if-binary") and, absent that option,
  behave exactly like normal GNU diff.
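
  One way a wrapper could handle such an option is to peel it off the
  command line and hand everything else to the ordinary diff
  untouched.  A sketch, assuming the option is spelled
  "--xdelta-if-binary" as suggested above; the function name is
  hypothetical:

```python
import argparse


def split_xdiff_args(argv):
    """Separate the wrapper's own option from the arguments meant for diff.

    Returns (xdelta_if_binary, remaining_args).
    """
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--xdelta-if-binary", action="store_true")
    opts, rest = parser.parse_known_args(argv)
    return opts.xdelta_if_binary, rest
```

  When the first element of the result is False, the wrapper would
  simply run the normal diff on the remaining arguments.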

  Additionally, even if a changeset includes xdelta patches, a 
  user may sometimes want (as a convenience) tla to produce
  full-texts of merged-from files when conflicts occur.   We 
  can add options for that, too -- though doing so is essentially
  orthogonal to all of the other xdiff/xpatch/xdiff3 work.

* Summary:

  You can get the functionality you're after by writing xdiff, xpatch,
  and xdiff3.   It will work reasonably without any modifications to 
  tla at all.

  When that's done, we can make a few small changes to tla to make 
  the functionality slightly more convenient to use.
