
[Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'


From: Tom Lord
Subject: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'
Date: Wed, 20 Apr 2005 14:32:29 -0700 (PDT)

   From: John A Meinel <address@hidden>

   But I have a question about blobs. They are stored compressed, and
   the sha checksum is for the *compressed* form. I understand this is
   probably for performance reasons. I'm concerned, though, that
   compression routines may not be 100% deterministic across all
   platforms.

That is an *excellent* concern and I implore you to research it further
and report back.

My understanding of that detail is still superficial:  I gather that
zip formats are standardized by an IETF document.  I am not certain
that the spec implies deterministic output.   I am not certain that
the way I'm driving `libz', so as to be compatible with Linus' code,
is the right way to do it.

Please, by all means, dig in and nail details.  The goal here is
to produce the high-quality-gem version of `git' rather than the
rough-and-ready-works-for-me version.

It is desirable to checksum the compressed rather than uncompressed
blobs so that intermediate nodes in a circuit can validate blobs
without having to pay for expanding them.
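
A minimal sketch of that point, in Python rather than the C the real
code uses.  The function names `store_blob' and `relay_check' are
hypothetical, and naming a blob by the hex SHA-1 of its compressed
bytes is an assumption made for illustration, not the actual format.

```python
import hashlib
import zlib


def store_blob(content):
    """Compress content and name it by the SHA-1 of the *compressed* bytes."""
    compressed = zlib.compress(content, zlib.Z_BEST_COMPRESSION)
    return hashlib.sha1(compressed).hexdigest(), compressed


def relay_check(name, compressed):
    """What an intermediate node can do: hash the bytes exactly as
    received, without ever inflating them."""
    return hashlib.sha1(compressed).hexdigest() == name


name, wire_bytes = store_blob(b"hello, blob\n" * 100)

# An intact blob validates without being expanded.
ok = relay_check(name, wire_bytes)

# A blob damaged in transit (last byte inverted) is rejected.
damaged = wire_bytes[:-1] + bytes([wire_bytes[-1] ^ 0xFF])
rejected = not relay_check(name, damaged)
```

The relay only pays for one hash pass over the compressed bytes; the
cost is that the name now depends on the exact compressor output.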


   Certainly just changing the compression level will
   change the compressed output.

The actual implementation of `libz' is a train-wreck.  It has lots
of subtle bugs.   I am using the `Z_BEST_COMPRESSION' macro to select
the compression level but I won't be surprised if you are right that
this isn't the best choice.  (I'm just copying Linus in that regard,
for speed-of-impl and compatibility).
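
The concern about compression levels is easy to reproduce.  The sketch
below uses Python's `zlib' module (a wrapper around libz); the sample
content is made up.  It shows only that the level changes the
compressed bytes, and therefore any checksum taken over them.  It says
nothing about whether a single fixed level is deterministic across
platforms or libz versions.

```python
import hashlib
import zlib

# Made-up sample content, large enough for the levels to matter.
content = b"".join(b"line %d of an example blob\n" % i for i in range(2000))

fast = zlib.compress(content, zlib.Z_BEST_SPEED)
best = zlib.compress(content, zlib.Z_BEST_COMPRESSION)

# Both streams inflate to the same content...
same_content = zlib.decompress(fast) == zlib.decompress(best) == content

# ...but the compressed bytes differ, so their SHA-1 sums differ too.
same_bytes = fast == best
sha_fast = hashlib.sha1(fast).hexdigest()
sha_best = hashlib.sha1(best).hexdigest()

# A checksum over the uncompressed content is unaffected by the level.
sha_plain = hashlib.sha1(content).hexdigest()
```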

Rewriting or cleaning up libz would be another great task for someone.
One big problem in the current `libz' is that many of the types used
for various fields are chosen poorly (e.g., `unsigned long' where `size_t'
is the right answer -- that kind of thing).


   Having the handle fixed at 160 bits also seems limiting. It ties the
   entire archive format into exactly one hash.

Yes it does.  That's a longer discussion.   Note that there are only a 
finite number of valid blob contents, too.

The situation admits intense mathematical analysis --- in no small part
because we pick a particular hash and address size.

BTW -- the handles are actually 192 bits.  I've upward-compatibly
generalized Linus' code to make clearer something that is muddled
in his presentation: the blob size (zip form) is part of the handle
(what I call an "address").
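
One way to picture a 192-bit address: 160 bits of SHA-1 plus 32 bits
of compressed size.  The sketch below is a guess at a layout, not the
real one.  The field order, the big-endian 32-bit size encoding, and
the name `blob_address' are all assumptions; the message states only
that the handle is 192 bits and includes the size of the zipped form.

```python
import hashlib
import struct
import zlib


def blob_address(content):
    """Build a hypothetical 192-bit address for a blob."""
    compressed = zlib.compress(content, zlib.Z_BEST_COMPRESSION)
    digest = hashlib.sha1(compressed).digest()   # 20 bytes = 160 bits
    size = struct.pack(">I", len(compressed))    # 4 bytes = 32 bits (assumed encoding)
    return digest + size


def split_address(address):
    """Recover the digest and the compressed size from an address."""
    digest, size_field = address[:20], address[20:]
    (size,) = struct.unpack(">I", size_field)
    return digest, size


addr = blob_address(b"example contents\n")
digest, size = split_address(addr)
```

With the size in the address, a receiver knows how many bytes to
expect before it has hashed anything.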

Also -- I have cleaned up Linus' design by making my spec robust
against the possibility of a small number of successful SHA1 forgeries.
My design *won't* withstand an attack that can turn any text into a 
semantically equivalent text with a desired SHA1 sum.



   I suppose as long as there is a version marker to allow new blob db
   versions, and the specific compression routine parameters are well
   defined. I just want to make sure that is done up front.

Separate concerns.  Blobs themselves are one thing -- blob dbs another.


   Also, this doesn't seem to work really well as a revlib format. It
   probably makes a great archive format, but revlibs need to know the
   contents so they can diff against each other.

You'll see how it fits :-)

-t



