[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gnu-arch-users] [ANNOUNCEMENT] /Arch/ embraces `git'

From: Jacob Gorm Hansen
Subject: Re: [Gnu-arch-users] [ANNOUNCEMENT] /Arch/ embraces `git'
Date: Thu, 21 Apr 2005 17:26:47 -0700
User-agent: Mozilla Thunderbird 1.0 (X11/20050302)

Tom Lord wrote:
`git', by Linus Torvalds, contains some very good ideas and some
very entertaining source code -- recommended reading for hackers.

/GNU Arch/ will adopt `git':

The source code may be entertaining but the idea is bad. No
hash-function is collision-free, not even when you weaken the hash by
adding info such as file-size at the end*. Using a hash-function for
verifying the contents is fine or to speed up things as in a traditional
hash table, using it as a unique identifier IMHO is not, unless there is
a way of detecting and dealing with collisions. I would rather have a
slow than an incorrect and silently failing system.

It also seems unclear how much this actually helps anything
performance-wise. The idea of a traditional hash-table is to reduce the
search space to a size where it will fit in memory to allow for O(1)
lookups, and then deal with collisions by degrading performance to O(n)
in these cases. But with a cryptographic hash, where the output search
space is 2^160 or soon 2^256, this buys you nothing as there is no way
that the output space will ever fit in memory (the directory-layout
discussion further down this thread is good evidence of this problem).
The main problem with hash tables is that they only scale well if you
know the access-pattern up front, which we do not in an SCS or other
type of file-store. The only general solution to 'context-addressing' is
a search-tree using an ordering function, which will be both faster and
more space-efficient.

I am sorry for the sake of the Linux project that Linus Torvalds is
wasting his time toying with this naive idea (not that he is the first
one to have it btw, it was popular in systems research a few years ago,
but the interest seems to have faded, probably due to the problems
described above.)

Tom: Your my-id--archive--etc tuples are right on the mark. You get a
managable, yet collision-free search space. Hashing fails on both of
these accounts. If you need context-addressing, lets use a B-tree or
something similar.


* see also:

reply via email to

[Prev in Thread] Current Thread [Next in Thread]