monotone-devel



From: Markus Schiltknecht
Subject: Re: cvssync (was Re: [Monotone-devel] Re: big repositories inconveniences (partial pull?))
Date: Fri, 08 Sep 2006 11:25:38 +0200
User-agent: Thunderbird 1.5.0.5 (X11/20060812)

Hi,

Please excuse the longish mail. I got a little carried away with two or three thoughts...

Christof Petig wrote:
> Now I have to come up with a coding for push certificates (which, in the
> past were a simple xdiff to a specified .mtn-sync-cvs file). And I have
> to think about flagging a revision as synched (a changed attribute might
> still indicate that this revision is synched).
>
> I don't want to attach another certificate to each and every revision
> (which it would easily gain if certificates flag synchronisation).

Hm.. that makes me think again about how certificates are stored. I know certs can store arbitrary text, but this cert only stores a flag, i.e. having the cert vs. not having it would be enough. Other certs change their text values only rarely, or only in part.

To understand how certs are stored, I took a look at schema.sql and found:

CREATE TABLE revision_certs
(
  hash not null unique,   -- hash of remaining fields separated by ":"
  id not null,            -- joins with revisions.id
  name not null,          -- opaque string chosen by user
  value not null,         -- opaque blob
  keypair not null,       -- joins with public_keys.id
  signature not null,     -- RSA/SHA1 signature of "address@hidden:val]"
  unique(name, id, value, keypair, signature)
);

Now, I understand most of it, but what are the 'remaining fields'? (The same question applies to manifest_certs and public_keys.)

I was thinking about delta-compressing cert values, but it is clear that this can't be done that easily, i.e. one would need to choose a good base cert to delta-compress upon. 'Good' meaning one which comes from a revision in the same branch, which gives good compression, and which is close to being a base (not nested too deeply in the delta chain).

How about only using compression? (Or is the cert value already compressed?)
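
To get a rough feeling for what plain compression might buy, here is a minimal sketch in Python using zlib. The cert value is entirely made up (synthetic file names and version numbers, one "path version" line per file), so the ratio it prints says nothing about real monotone certs; it only shows the mechanics.

```python
import zlib

# Hypothetical cert value: one "path version" line per file, roughly what a
# CVS import might record. Names and versions are synthetic assumptions.
lines = [f"src/module_{i:04d}.cc 1.{i % 40 + 1}" for i in range(1800)]
cert_value = "\n".join(lines).encode("utf-8")

# Compress the whole value in one go and check that it round-trips.
compressed = zlib.compress(cert_value, 9)
restored = zlib.decompress(compressed)

ratio = len(compressed) / len(cert_value)
print(f"raw: {len(cert_value)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.2f}")
```

Real file names are less regular than these, so a real cert would likely compress less well than this synthetic one.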

To be humble and more realistic now: is this an issue at all? (Except for CVS revision info, which would be better stored elsewhere.) If not, at least I understand monotone better now ;-)


Another thought I had was using some sort of 'inverted indexes' to store 'flag-certs' (which don't have a value, but are boolean in the sense that attached = true, missing = false), i.e.:

flag cert 'PUSHED' is attached to revisions A, C, D and E,
flag cert 'COMPILES_CLEANLY' is attached to revisions A, B, C and E

but at least that would also need to take into account the keypair, so it would look more like:

flag cert 'PUSHED' with key 'address@hidden'
   is attached to rev A, C and E
flag cert 'PUSHED' with key 'address@hidden'
   is attached to rev B
etc..

And it would not be trivial to implement such an inverted index in sqlite. (I would expect performance problems as soon as you have lots of revisions to store.)
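
Outside of sqlite, the idea itself is small. A minimal in-memory sketch in Python, assuming hypothetical key names ("key-1", "key-2") and single-letter revision ids; this is not how monotone stores anything, just the inverted-index shape described above:

```python
from collections import defaultdict


def new_flag_index():
    """Map (cert name, signing key) -> set of revision ids carrying the flag."""
    return defaultdict(set)


def attach(index, name, key, revision):
    """Record that `key` attached flag cert `name` to `revision`."""
    index[(name, key)].add(revision)


def has_flag(index, name, key, revision):
    """Attached = true, missing = false."""
    return revision in index.get((name, key), set())


index = new_flag_index()
for rev in ("A", "C", "E"):
    attach(index, "PUSHED", "key-1", rev)
attach(index, "PUSHED", "key-2", "B")

print(sorted(index[("PUSHED", "key-1")]))
print(has_flag(index, "PUSHED", "key-2", "A"))
```

The lookup by (name, key) is cheap; the hard part in a database is keeping the per-flag revision sets compact and updatable, which this sketch does not address.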

Regarding the CVS information again:

Nathaniel Smith wrote in another mail:
> E.g., if monotone's tree has ~1800 files, and if it were created by
> importing from cvs, writing down such a cert would take on the order
> of 64kB.  (Calculated by 'mtn ls known | wc -c' to get filename
> lengths, plus some fudge for the version numbers.)  Certs are not
> delta compressed nor, in the current implementation, even gzipped.
> My database has ~7000 revisions in it.  If every revision in it had
> such a cert on it (again, as if it were imported from CVS), then that
> would come to ~450 megabytes of certs, so almost 7 times more data
> than the entire history combined.
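
The arithmetic in the quoted estimate can be reproduced directly. The sketch below assumes "64kB" means 64,000 bytes and takes the file and revision counts at face value:

```python
# Figures from the quoted estimate; treating 1 kB as 1000 bytes is an assumption.
files_in_tree = 1800
cert_bytes = 64 * 1000
revisions = 7000

# Total size if every revision carried one such cert.
total_bytes = cert_bytes * revisions
print(f"total: {total_bytes / 1e6:.0f} MB")

# Average cost per file entry (file name plus version number).
bytes_per_file = cert_bytes / files_in_tree
print(f"per file: {bytes_per_file:.1f} bytes")
```

This gives 448 MB, which matches the "~450 megabytes" above, at roughly 36 bytes per file entry.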

The filenames of all files are already stored in the manifest of the revision, right? Why not cut them from the calculation above and only store RCS versions in the cert, in the same order as the files appear in the manifest? I.e.:

sample manifest:

   format_version "1"

   dir ""

   dir "fs"

      file "fs/readdir.c"
   content [f2e5719b97...]

      file "fs/read_write.c"
   content [fe238a9d34...]

CVS revision cert:

   1.6
   1.1

That would reduce the amount of CVS history data stored per revision to the minimum required.
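
Reading such a cert back would mean pairing its lines with the manifest's file entries by position. A minimal sketch in Python, using the sample manifest and cert from above; the function names are hypothetical and the parsing is a simplification (it ignores escaped quotes in file names and is not monotone's real manifest parser):

```python
import re

MANIFEST = '''format_version "1"

dir ""

dir "fs"

   file "fs/readdir.c"
content [f2e5719b97...]

   file "fs/read_write.c"
content [fe238a9d34...]
'''

CERT = "1.6\n1.1\n"


def manifest_files(manifest_text):
    """Return file paths in the order they appear in the manifest."""
    return re.findall(r'^[ \t]*file "([^"]*)"', manifest_text, flags=re.MULTILINE)


def pair_revisions(manifest_text, cert_text):
    """Match the n-th RCS version in the cert to the n-th file in the manifest."""
    files = manifest_files(manifest_text)
    versions = cert_text.split()
    if len(files) != len(versions):
        raise ValueError("cert and manifest disagree on the number of files")
    return dict(zip(files, versions))


print(pair_revisions(MANIFEST, CERT))
```

The length check matters: since the cert carries no file names, any mismatch between cert and manifest would otherwise silently shift versions onto the wrong files.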

Or even better: use the manifest to store that information... AFAICT manifests are delta compressed and store the filenames and file revisions anyway. Why not store 'origin VCS' information from imports there? That would be stored per revision, and it looks like a much better fit for other VCSs like svn and git, too. I.e.:

sample manifest:

   format_version "1"

   dir ""

   dir "fs"

      file "fs/readdir.c"
   content [f2e5719b97...]
   RCS_rev "1.6"

      file "fs/read_write.c"
   content [fe238a9d34...]
   RCS_rev "1.1"

   CVS_server_path ":pserver:address@hidden:/foo/cvsroot"
   CVS_module "monotone"
   CVS_revision_date_start "02/07/1997 13:41:05"
   CVS_revision_date_end "02/07/1997 13:41:12"

   CVS_revision_conflicts_with [e6a903d31...]


That would suffice to store all the information known at import time. Of course this cannot be changed later on, but for most CVS imports that's not necessary.

Regards

Markus




