From: Asger Kunuk Ottar Alstrup
Subject: [Monotone-devel] RE: Support for binary files, scalability and Windows port
Date: Tue, 20 Jan 2004 09:37:17 +0100

graydon hoare wrote:
>> This is because I do not think this use case has priority over the
>> use case where I revert a change, but another party does not.
> ah, right. again, my background concerns are all about source code, in
> which the concern is to make merging robust and painless (well, and
> doing strong QA, which also benefits from primacy of hashed
> identifiers), so I very much feel that it takes priority.

OK, of course that is a sensible position to take. 

It is difficult to say which is more common, but in source code, it
does happen relatively frequently that two people make the same change
independently, while reverting a file is probably not as frequent.

> in any case, reversion of a manifest can be represented as a back-edge
> in the ancestry version graph or a cancellation of the forward edge
> (and note: you'd have to revert the entire manifest, not just a file,
> because only manifests are chained together in a history graph).

Well, I am not convinced that this is robust in the most general sense.

Consider a file which toggles between the content A and B continuously.
This could be a setup.h file, which defines whether the product is
compiled in debugging mode or release mode. During the lifetime of the
source code, this content can be changed many times, and independently
by many people who only synchronise from time to time.

In order to represent this accurately, in the face of distributed use, I
think you need to represent every single change as an edge in the graph
somehow, and the order in which they happened. In other words, you
effectively have to record an ordering of your back-edges or
cancellation edges.

So, in effect, you end up with another way of representing a DAG, where
each instance of a file's content is a separate node, even if it has the
same contents as an earlier one.

I can probably work out a concrete example for you to test your data
structure on if the above does not convince you - the gist of it is to
consider such a toggling file, which is synchronised randomly, and
randomly changed to other contents and sometimes back again, potentially
introducing conflicts. Can you reliably detect conflicts in such a
scenario with the proposed data structure?
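
A minimal sketch of the simplest such case, in Python. This is a toy
model and not monotone's actual data structures; the names content_id,
reachable and record are hypothetical. Alice toggles the file A -> B -> A
while Bob independently changes it A -> B. When nodes are identified
only by content hash, each head appears to be an ancestor of the other;
when every instance is its own node, the two heads are visibly divergent.

```python
import hashlib
from itertools import count


def content_id(content):
    """Identify a version purely by a hash of its content (toy model)."""
    return hashlib.sha1(content.encode()).hexdigest()[:8]


def reachable(edges, start, goal):
    """True if goal can be reached from start along parent -> child edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(child for parent, child in edges if parent == node)
    return False


# Alice toggles the file and reverts it; Bob makes the first change only.
alice = ["A", "B", "A"]
bob = ["A", "B"]

# Model 1: one node per distinct content, so a revert becomes a back-edge.
hash_edges = set()
for history in (alice, bob):
    for old, new in zip(history, history[1:]):
        hash_edges.add((content_id(old), content_id(new)))

alice_head, bob_head = content_id(alice[-1]), content_id(bob[-1])
# Each head is reachable from the other: ancestry no longer orders them.
hash_ambiguous = (reachable(hash_edges, alice_head, bob_head)
                  and reachable(hash_edges, bob_head, alice_head))

# Model 2: one node per recorded instance, even if the content repeats.
fresh = count()
root = (next(fresh), "A")  # the shared starting version


def record(history, start):
    """Append one new node per change after the shared start node."""
    edges, node = set(), start
    for content in history[1:]:
        child = (next(fresh), content)
        edges.add((node, child))
        node = child
    return edges, node


alice_edges, alice_tip = record(alice, root)
bob_edges, bob_tip = record(bob, root)
instance_edges = alice_edges | bob_edges
# Neither tip descends from the other, so the divergence is detectable.
diverged = (not reachable(instance_edges, alice_tip, bob_tip)
            and not reachable(instance_edges, bob_tip, alice_tip))

print("hash-keyed graph ambiguous:", hash_ambiguous)
print("per-instance graph diverged:", diverged)
```

Note that this only shows that the divergence can be detected; how the
resulting merge should be resolved is a separate question.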

> these, to me, are heavy costs.

Yes, I agree that separating the versioning history from the
representation has a cost: it requires a link between the versions and
the representation, and this introduces yet another point of failure.
However, I don't see how you can avoid this, given the above discussion.

> let me present an alternative which,
> from my perspective, shifts the workload to the user who has this (imo
> unusual) need to version-control large video files without ever
> hashing or merging them:

That is a good proposal, and that might work for video files. I think I
need to give you a little background on where I am coming from.

We are currently using CVS for picture files which are edited
continuously. The same goes for sample files and video files. We also
version control the binaries we build, along with CD-images of the
final production CDs. Basically, we put all the electronic artifacts,
including Word documents, graphics, schematics, project plans and
everything else we need for our production, under version control.

We have 15 software developers in Copenhagen, 15 in Bangalore, 15 in
Russia, and 5 in Norway that all work on the same source code. In
addition, we have a number of medical doctors that work on
specifications and patient descriptions which also use CVS.

Our main CVS repository for source code and builds is 28 GB, while the
picture CVS alone is 99 GB, and growing by 50 MB each week - we have
an additional 15 people working in India on picture manipulation. We are
continuously saturating the highest bandwidth internet connection we can
get there with an rsync job to get the files back to Copenhagen.

Next week, we will have ten people record samples simultaneously in ten
different hotel rooms for ten hours for ten days in a row, and we need
to put the resulting files under version control, in order to make sure
that we do not lose any data. These files will be postprocessed in a
number of steps by different people, some of them might be re-recorded,
and so on.

When we do picture shootings, we take 40 GB of JPG pictures in three
days using 4 different cameras in a custom-built studio.

So, we put a lot of files under version control - most of the binary
files do not allow much compression from a delta-representation, but we
need the history and the distributed environment, and we do not mind
paying for the necessary hard disc and RAID capacity. However, we have
to work under internet bandwidth limitations, and since everybody works
with these big files all the time, the system has to be efficient when
working with big binary files. And we need to make sure that the system
is robust enough to be reliable.

We are getting by with CVS right now, and with discipline and the proper
rsync jobs, it kind of works. But a reference-based solution would not
work for us.

We have investigated commercial options, and our conclusion is that the
best bet, BitKeeper, cannot do the job for us. First, it does not handle
binary files to the extent needed at all. Second, the licensing price
for BitKeeper is so high that we think it would be more cost-efficient
to help an open source project meet our requirements.

Concretely, within the next month or two, we will have software
development resources available that could start working on monotone, or
another project, full-time for maybe 6 months, maybe more depending on
what it takes. I am the person who makes that decision, so I am now
investigating the different options in some detail to get a clearer
picture of what we should do.

Best regards,
Asger Ottar Alstrup
