[Monotone-devel] Re: long RFC: "contexts"


From: Jerome Fisher
Subject: [Monotone-devel] Re: long RFC: "contexts"
Date: Thu, 27 May 2004 07:48:55 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040514

graydon hoare wrote:

the idea -- in case you missed it in all the other VC systems! -- would be to add a textual object to monotone which describes (all at once) the contents of a number of certs and a certain amount of currently-synthetic information:


What do you mean by "currently-synthetic information"? I think you're referring to storing the changes to each parent, which currently will be derived from manifest comparison and file rename certs. However, I'd like to be sure I'm not misinterpreting this.

manifest: <manifest-sha1>
date: <contents-of-current-date-cert>
author: <contents-of-current-author-cert>
summary: "line of text"
parent: <first-parent-context-sha1> {
  manifest: <manifest-sha1>
  renames: [<filename>, <filename>]
  adds: [<filename>, <file-sha1>] ...
  dels: <filename> ...
  patches: [<filename> <file-sha1> <file-sha1>] ...
}
parent: <second-parent-context-sha1> {
  manifest: <manifest-sha1>
  renames: [<filename>, <filename>] ...
  adds: [<filename>, <file-sha1>] ...
  dels: <filename> ...
  patches: [<filename> <file-sha1> <file-sha1>] ...
}

remainder is changelog
^D


I have a few problems with this specific textual representation of a context (commas, braces, etc.), and the names of some elements, but I don't think that needs to be discussed yet. I think your main aim was in showing what information would be included, anyway.

I think that getting the definition of a context right the first time is quite important. Context definitions and IDs are going to be so pervasively used that it will be very difficult to change them in future without great disturbance. I think it's best to keep only essential information, and eliminate - as much as is practical - that which does not directly relate to the primary goals. I consider these goals to be:

- Uniquely identifying a location in the history DAG.
- Allowing the associated changes to be accurately determined.
- Allowing the resultant state to be determined.


Essential properties, as I see it:

(1) Referencing each parent context.
   - In the case of merges, this partially addresses the question of how the new state was reached.
   - It has the effect that contexts with different ancestry will have different IDs, which is more or less essential for reasons that have been covered in other mails.

(2) Specifying, for each parent context, whatever changes were performed to get from its state to the new state that CAN'T be derived by simply comparing those states.
   (currently only renames)
   - This provides a partial set of changes between states. Extra information regarding these types of changes would otherwise be lost.
   - It has the effect that changes to the same parent(s) resulting in the same new state but produced in different ways will result in different context IDs. This is almost certainly a good thing.
   - See "EXPLICIT CHANGES" below.

(3a) Specifying, for each parent context, whatever changes were performed to get from its state to the new state that CAN be derived by simply comparing those states.
   (currently adds, dels and patches)
   - This allows the full set of changes to be known immediately.
   - It's redundant if you can determine every parent's state and the new state.
   - You never have to go through the expense of working out the changes through state comparison. This speeds up operations like netsync and log.

OR

(3b) Referencing the absolute representation (manifest) of the new state.
   - This allows the new state to be known immediately.
   - It's redundant if you have full knowledge of the changes and can determine the state of one parent (if there are any parents).
   - You never have to go through the expense of applying the changes to a parent state to determine the new one.
   - It allows for the stripping of old contexts, manifests and file data to save space.
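
The redundancy between (3a) and (3b) can be sketched in a few lines. This is only an illustration, assuming a manifest is modelled as a mapping from path to file hash; the function names and sample hashes are hypothetical and are not monotone's code. Renames are left out on purpose, because a plain comparison would report them as a delete plus an add.

```python
# Hypothetical model: a manifest maps a path to a file hash.

def diff_manifests(old, new):
    """Derive adds, dels and patches purely by comparing two states (3a)."""
    adds = {p: h for p, h in new.items() if p not in old}
    dels = sorted(p for p in old if p not in new)
    patches = {p: (old[p], new[p])
               for p in old if p in new and old[p] != new[p]}
    return adds, dels, patches

def apply_changes(old, adds, dels, patches):
    """Rebuild the new state from a parent state plus the changes (3b)."""
    new = dict(old)
    for path in dels:
        del new[path]
    for path, (old_hash, new_hash) in patches.items():
        # A patch only makes sense against the file version it was made from.
        assert new[path] == old_hash, "patch does not match parent state"
        new[path] = new_hash
    new.update(adds)
    return new

parent = {"README": "aaa1", "main.c": "bbb2", "old.h": "ccc3"}
child = {"README": "aaa1", "main.c": "ddd4", "new.h": "eee5"}

adds, dels, patches = diff_manifests(parent, child)
rebuilt = apply_changes(parent, adds, dels, patches)
```

Either direction recovers the other, which is why only one of the two is strictly necessary; storing both trades space for never having to do the work.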


Additional properties in your proposal:

(4) Specifying the author of the change, the author's idea of time when making the change, the author's summary of the change, and the author's full description of the change.
   - This has the effect that exactly the same changes to the same parent(s) will result in multiple nodes in the history DAG if any of these attributes differ. This will happen quite often (especially with people auto-merging), and I consider this to be unnecessary and probably bad. The "badness" suspicion is mostly gut feeling, but I'm thinking about being able to correct, append to or enhance this change metadata later - this shouldn't have to use a completely different system like certs, and certainly shouldn't result in a change of context ID.

(5) Referencing, for each parent context, the absolute representation (manifest) of its state.
   - I don't see how this is useful at all unless the parent's context is stripped or not yet downloaded, and then what do you want with the manifest ID? I think I'm missing something (I have no clue about the internals of netsync, or anything else in monotone for that matter).


Only one of (3a) and (3b) is strictly necessary. As each provides very important benefits, I think they should both remain.

Unless there's a good reason to have them that I'm not aware of, I think (4) and (5) are unnecessary, and in the case of (4) possibly evil.

So I would suggest:
- Removing the "manifest" field from the "parent" sections.
- Removing the "date", "author" and "summary" fields, and the changelog area.
- Attaching the "date", "author", "summary" and "changelog" information to the context independently (using certs).
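
To illustrate the effect of these suggestions, here is a minimal sketch in which the context ID is a SHA-1 over only the remaining fields (the new manifest, the parent context IDs and the renames), while author information is attached separately. The serialization, the sorting of parents, the cert store and all sample values are assumptions made for the example, not a proposed format.

```python
import hashlib

def context_id(manifest, parents):
    """Hash only the fields that would stay in the context.

    parents is a list of (parent_context_id, renames), where renames
    is a list of (old_path, new_path) pairs.
    """
    lines = ["manifest: " + manifest]
    # Sorting by parent ID is an assumption, so parent order cannot change the ID.
    for parent_id, renames in sorted(parents, key=lambda p: p[0]):
        lines.append("parent: " + parent_id)
        for old, new in renames:
            lines.append("  rename: " + old + " -> " + new)
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

certs = {}  # hypothetical cert store: context ID -> list of (name, value)

def attach_cert(cid, name, value):
    certs.setdefault(cid, []).append((name, value))

# Two people auto-merge the same parents and reach the same manifest.
alice = context_id("m123", [("p1", []), ("p2", [])])
bob = context_id("m123", [("p2", []), ("p1", [])])
attach_cert(alice, "author", "alice")
attach_cert(bob, "author", "bob")

# A merge that involved a rename is a different context.
renamed = context_id("m123", [("p1", [("a.c", "b.c")]), ("p2", [])])
```

Both merges land on a single node in the history DAG carrying two author certs, and the metadata can later be corrected or extended without the context ID changing.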


EXPLICIT CHANGES

I think it's important to note that it's highly desirable to store as well as possible the changes that _were actually_ performed, not merely changes that _can be_ performed to get from one state to another. It's a subtle but important distinction. The only place where we currently recognise this is in the support of "rename". It would be possible to define rename in terms of "add" and "delete", but we would then lose important information on what the author of the change actually did.

In future, for example, we might have:

replaces: [<filename>, <file-sha1>] ...
for completely replacing a file (meaning that the files are not related, they just have the same path - diffs and auto-merges don't make sense).*

copies: [<original_filename>, <copy_filename>]
for cloning a file. This is important for merging as well as documenting the author's intention.

cherrypicks: [<context>, <parent_context>] ...
for auto-merging all changes from an edge into the current state. Unlike the other examples, this potentially affects multiple files.

And 3rd party change types like:

xyzzypatches: [<filename>, <xyzzypatch-sha1>] ...
for when a file's changes have been stored in a magic patch format that accurately documents exactly what a user did (e.g. renamed this variable, added a parameter to this function). Generation of these patches would be done by the author's tools (e.g. a refactoring editor). It would not necessarily be possible to extract the same information on what was changed, how and why by generic textual comparison (e.g. diff) of the former and latter states.

Note that the order in which changes are applied is significant, and the same change type could be used multiple times with different change types in between. It may be clearer (though less efficient) to define change types in the singular and list them one by one separately.

* The "replaces" change type could equally well be represented by a "dels" of the filename, followed by an "adds" of the same filename with the new hash. It's just an example.
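
The two points above (the same state can be reached by different recorded changes, and the order of changes is significant) can be shown with a small sketch. It assumes each change is listed singly as a tuple; the change type names and the `apply_ops` helper are hypothetical.

```python
def apply_ops(manifest, ops):
    """Apply an ordered list of explicit changes to a path -> hash mapping."""
    state = dict(manifest)
    for op in ops:
        kind = op[0]
        if kind == "add":
            _, path, file_hash = op
            state[path] = file_hash
        elif kind == "del":
            _, path = op
            del state[path]
        elif kind == "rename":
            _, old, new = op
            state[new] = state.pop(old)
        elif kind == "copy":
            _, original, copy = op
            state[copy] = state[original]
        else:
            raise ValueError("unknown change type: " + kind)
    return state

base = {"a.c": "h1"}

# Same resulting state, but only the first records what the author did.
as_rename = [("rename", "a.c", "b.c")]
as_del_add = [("del", "a.c"), ("add", "b.c", "h1")]

# Order matters: copying a.c works before the rename, not after it.
copy_then_rename = [("copy", "a.c", "c.c"), ("rename", "a.c", "b.c")]
rename_then_copy = [("rename", "a.c", "b.c"), ("copy", "a.c", "c.c")]

try:
    apply_ops(base, rename_then_copy)
    order_matters = False
except KeyError:
    order_matters = True
```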

   be it. the only remaining "missing" concept would be "file GUIDs",
   which I consider mostly meaningless anyways; imo if you have enough
   shared history to have a shared GUID, you probably have enough to
   work out the naming relationship by tracing through rename history.


I agree with this, though currently it's not possible to do things like "resurrect" a file in a way that allows accurate tracking of that file through history (though unreliable heuristics could be used). There are ways to do this perfectly without file GUIDs, though (e.g. through new change types).

 - make a clear future distinction between certs which are about
   a change (context certs) and certs which are about a particular
   tree state (manifest certs). this difference is evident for example
   in the difference between approval (context) and testresults
   (manifest), but it's not really as clear at the moment.


I'm still not convinced that there's a need for manifest certs... I think even testresults certs should apply to a context. Branch certs can only sensibly apply to a context, not a manifest; different branches can be completely different projects; completely different projects can have completely different procedures for determining testresults. Of course, this example isn't very clever (it's unlikely that you'd get the same manifest in different projects), but there are several other reasons I don't think it makes sense to apply any certs to a manifest.

   - I'd have an excuse to unpack and index the fields which I know
     the substructure of (author, date, ancestor, etc.) which would
     speed and simplify a lot of local operations.


This information could equally well be extracted from certs for indexing, right?

 - take no more space. all these items are generated each time we do
   a commit already, but as *separate* certs. the certs aren't free:
   generally there are about 300 extra bytes of cryptographic data
   along for the ride on each one. that makes a commit cost about
   1500 bytes in crypto; this data object would probably weigh no more
   than that, possibly even less.


I don't remember whether I brought this up before, but I think that having a way to bundle certs together is quite important. These "cert bundles" would contain several properties, and be timestamped and signed as a whole. There are a number of reasons I'd like this, the least important of which is that it would reduce the signature overhead.
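
As a rough back-of-the-envelope check on the overhead point, using the figures quoted above (about 300 bytes of crypto per cert, about 1500 bytes per commit, which implies five certs), and assuming a bundle would carry a single signature of the same size:

```python
SIGNATURE_OVERHEAD = 300  # approximate bytes of crypto data per signed object (quoted estimate)
CERTS_PER_COMMIT = 5      # implied by the quoted 1500-byte figure

separate = CERTS_PER_COMMIT * SIGNATURE_OVERHEAD  # each cert signed on its own
bundled = 1 * SIGNATURE_OVERHEAD                  # assumption: one signature per bundle
saved = separate - bundled
```

Under those assumptions a bundle would cost about 300 bytes of crypto instead of 1500, before counting the properties themselves.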

 - there would be a certain distinction between "core" and "auxiliary"
   metadata: the stuff mentioned in the context will have a seeming
   primacy over additional, 3rd party certs hung on the side. the
   experience so far seems to suggest that nobody ever sticks 3rd party
   author, date, or rename certs on a manifest anyways, so I'm not sure
   how much would be lost there.


I think an awful lot would be lost in flexibility and simplicity. I can think of a whole lot of custom certs I'd like to add myself at commit time. I'd certainly mourn the loss of a consistent approach to metadata.

Jerome

(Graydon: Sorry about the bad quoting in my last email, I was a little overexcited)



