Re: cvs (or something!) on very large scales with non-source code objects


From: Donald Sharp
Subject: Re: cvs (or something!) on very large scales with non-source code objects
Date: Fri, 1 Feb 2002 11:14:06 -0500
User-agent: Mutt/1.2.4i

I disagree.  TIFF images are binary files, and CVS does not store
binary files efficiently: its RCS-style deltas are text diffs, so each
new revision of a binary file typically ends up stored at close to
full size.  The entire system also gets slower as the repository
grows.  It sounds like this guy is going to be adding *huge* numbers
of files every month.
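
As a rough illustration in Python (the ~50 KB per bitonal TIFF figure
is my assumption, not a number from the thread):

    new_files_per_month = 400_000
    avg_tiff_bytes = 50 * 1024          # assumed ~50 KB per bitonal TIFF
    growth_gb = new_files_per_month * avg_tiff_bytes / 1024 ** 3
    print(f"~{growth_gb:.0f} GB/month")  # ~19 GB, before any edits

And every later change to a binary file adds roughly another
full-sized delta on top of that.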

I would recommend looking at a different system than cvs...

donald
On Fri, Feb 01, 2002 at 09:56:06AM -0600, Daniels, David wrote:
> I think CVS would probably do quite well for the system you're describing.
> You're already doing a primitive form of versioning when you rename the
> files to FILE.yyyy.mm.dd.hh.mm.
> 
> 
> -----Original Message-----
> From: Nigel Kerr [mailto:address@hidden]
> Sent: Friday, February 01, 2002 9:25 AM
> To: address@hidden
> Subject: cvs (or something!) on very large scales with non-source code
> objects
> 
> 
> 
> good folk,
> 
> i ask this forum because i'm not at all sure where to start looking for
> ideas on how to address my problems.  cvs may not be the right tool
> for what i have, but any ideas or suggestions or redirections to other
> fora are welcome and desired.
> 
> i have several million objects ("very large scales"): roughly half of
> them are bitonal TIFF files, scanned page images of printed material;
> the other half are OCR'd text of those same TIFF files.  there are a
> relatively small number of other kinds of files: metadata about chunks
> of these data, and auxiliary images of parts of some of the pages.
> right now the top level chunks of this corpus number about 3,000, with
> sub-chunks inside those top-level chunks.
> 
> at any moment, it might be discovered that there is an error or
> problem with any of these objects, which will need to be fixed:
> 
>     the TIFF file might be bad/corrupt/unclear
>     the ocr'd text might be bad/corrupt/unclear
>     the metadata might be found to be wrong
>     the auxiliary images might be bad/corrupt/unclear
> 
> we might make a change to a small number of things at a time, or we
> might make a batch change to thousands of things at once.  back when
> we had fewer than 500 top-level chunks, our life was relatively easy:
> we had a home-grown edit-history-type system that basically (see the
> sketch after this list):
> 
>     moved the old file FILE to FILE.yyyy.mm.dd.hh.mm
> 
>     moved the new version of FILE into place
> 
>     wrote in a date-stamped log file a message meaning "i changed
>     this!", with the message phrased differently depending on what
>     got changed.
> 
>     used the doughty mirror perl script on our different machines to
>     get the changed data from the master to the slave machines.
> 
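> in python, that edit-history step looks roughly like this (the
> paths, log name, and timestamp format here are illustrative, not
> our exact code):
> 
>     import os, shutil, time
> 
>     def replace_with_history(path, new_path, log="changes.log"):
>         # move the old file aside under a timestamped name
>         stamp = time.strftime("%Y.%m.%d.%H.%M")
>         if os.path.exists(path):
>             shutil.move(path, f"{path}.{stamp}")
>         # move the new version into place
>         shutil.move(new_path, path)
>         # append a date-stamped "i changed this!" message
>         with open(log, "a") as out:
>             out.write(f"{stamp} changed {path}\n")
> 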
> we're still using that system.  we get about 400,000 new items a
> month, spread across 30-50 new top-level chunks (a top-level chunk
> varies considerably in size).  the growth of our corpus will never
> slow down.
> 
> our stated *goals* for using this system are two-fold:
> 
>     a method for communicating from the master to the slave machines
>     about what has changed, and what they should try to update.
> 
>     a record of everything that has ever changed, so that if we had
>     to start over from the original source media (the cd-roms the
>     data arrive on), we could, and only update what needed updating.
> 
> i don't have much problem with the first goal: we need some
> communication method from master to slave.  i am increasingly nervous
> about the second goal as we get larger and larger, and am looking for
> other ways to address or consider that problem.
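> 
> the sort of thing i have in mind for the first goal, as a rough
> python sketch (the changelist name and the fetch hook are made up):
> 
>     def record_change(path, changelist="changes.2002.02"):
>         # master side: note each changed path so the slaves
>         # know what to fetch on their next pass
>         with open(changelist, "a") as out:
>             out.write(path + "\n")
> 
>     def slave_update(changelist, fetch):
>         # slave side: fetch only the paths named in the changelist,
>         # e.g. via an scp or mirror wrapper passed in as fetch
>         for line in open(changelist):
>             fetch(line.strip())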
> 
> it might be that we:
> 
>     give up on "record of everything that has ever changed", and try
>     to go for "record of what has changed since our last complete
>     checkpoint of the corpus" (sketched below), keep using our change
>     system, and give up on the "restore from original media" idea.
> 
>     use a version control system that can efficiently handle
>     millions of things changing (which would be?!), and the
>     master-to-slave transport of those changes.
> 
>     keep going about things as we have, and just hope we never have to
>     restore from scratch.
> 
>     something else?
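> 
> by "complete checkpoint" i imagine something like a checksum
> manifest of the whole corpus, so that diffing two manifests tells us
> exactly what changed in between (a rough sketch, not code we have):
> 
>     import hashlib, os
> 
>     def checkpoint(root, manifest):
>         # record a checksum for every file under root
>         with open(manifest, "w") as out:
>             for dirpath, _, names in os.walk(root):
>                 for name in sorted(names):
>                     path = os.path.join(dirpath, name)
>                     h = hashlib.md5()
>                     with open(path, "rb") as f:
>                         for chunk in iter(lambda: f.read(1 << 20), b""):
>                             h.update(chunk)
>                     out.write(h.hexdigest() + "  " + path + "\n")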
> 
> anyone here approached this kind of problem, know someone who has, or
> have any ideas about it?  people/places i can seek advice from?
> anything is appreciated, thank you.
> 
> cheers,
> nigel kerr
> address@hidden
> 
> 
> _______________________________________________
> Info-cvs mailing list
> address@hidden
> http://mail.gnu.org/mailman/listinfo/info-cvs
> 


