[Monotone-devel] file ids

From:	Nathaniel Smith
Subject:	[Monotone-devel] file ids
Date:	Wed, 17 Aug 2005 22:29:52 -0700
User-agent:	Mutt/1.5.9i

Y'all responding to my last post are way ahead of me -- here's the
proposal I didn't have time to finish writing until now :-).
                                                                 
As the subject suggests, I'm going to make some argument for file
ids, though, not exactly the file ids people might be thinking of.
Arch and friends have explicit, globally unique file ids as part of
their history representation.  We've historically argued this isn't
the best approach, and I'm not going to change that here.

What I'm arguing for instead is "just" some local caching of
file-related metadata.  The idea is _not_ to make it part of the external
representations (the stuff that gets hashed); instead, it's purely a
cache, generated entirely locally.  This makes life much easier with
regard to both flexibility (nothing breaks if we change things) and
efficiency (it isn't extra remote data whose correctness we need to
validate).

So, here's the idea: for each revision, store a big manifest-like
file, except instead of storing each file's hash, we store each
file's metadata.  The metadata in question is:
  -- a unique (local) id identifying the logical file
  -- a note recording where the filename was last modified (say, as a
     revision string) -- i.e., the *() of the filename
  -- similar *() notes for all the other scalars attached to a file
     (see the tree merging note)
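
To make the list above concrete, here is a minimal sketch of what one
such record could look like.  All the names here (FileMetadata,
file_id, name_mark, scalar_marks) are made up for illustration and are
not existing monotone types; I'm also assuming each mark is a single
revision id string, which may be too simple.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FileMetadata:
    # Local-only id for the logical file; never hashed or sent over the wire.
    file_id: int
    # Revision id in which the filename was last modified (assumed: one id).
    name_mark: str
    # Same kind of mark for every other scalar attached to the file,
    # keyed by a hypothetical scalar name such as "content".
    scalar_marks: Dict[str, str] = field(default_factory=dict)


# The per-revision, manifest-like cache: path -> metadata.
revision_metadata: Dict[str, FileMetadata] = {
    "src/main.c": FileMetadata(
        file_id=1,
        name_mark="rev_a",
        scalar_marks={"content": "rev_b"},
    ),
}
```

Since this is purely a local cache, the layout can change later
without affecting anything that gets hashed.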

Why do this?  Well... while I did work out how to do tree merging
without this data, it... sucked :-).  The result is certainly doable,
but it requires a lot of elaborate, very finicky code.

OTOH, if we use this data, merging (and other things, like diff)
become extremely simple code, and probably much, much faster.  It's
very easy to calculate this data given a new revision, if we already
have the data for its parents, so this is a net win on code size and
complexity.

It's also much more flexible -- if we enhance things in the future
(e.g., if some really annoying person like, say, me, comes up with a
better (= more complex) merge algorithm, or we manage to support file
resurrection, or suturing, ...), then the metadata-free code would
probably have to be completely rethought and rewritten, with all the
careful analysis performed again from scratch.

And perhaps the best reason, that tipped me over from musing about
such things to actively supporting it -- check_sane_history has to
check for file suturing, and will continue to need this even if we
make all other sanity checking saner.  Having file ids should make
this faster by an order of magnitude.


So, that's why we might want to do it... how do we do it efficiently?
The idea is that we already have this more-or-less generic delta
storage mechanism that we use for files and manifests; what we can do
is simply define a new internal file format that contains all the
above info for a single revision, and delta-store these files using
the same logic we already have.  Any improvements there will
automatically apply.

(Another variation is to stop storing manifests per se on disk at
all; instead store some file that has both manifest info and extra
annotations in it.  We don't actually need to _store_ manifests; the
only reason they're really important is that we can generate their
hashes, as an end-to-end integrity check on revision application.
I'm not sure what the right trade-off here is.)

Whenever we write a revision to a db, we generate its metadata and
store that as well; as mentioned, the above metadata is pretty easy
to generate if we already have the metadata for the parents.
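
As a rough illustration of "easy to generate from the parents", here
is a sketch for the single-parent case only.  It tracks just the file
id and the filename mark; merges (two parents) and the other scalars
are left out.  The function name, its arguments, and the shape of the
change lists are all assumptions, not monotone's actual changeset
representation.

```python
import itertools
from typing import Dict, List, NamedTuple, Tuple


class FileMeta(NamedTuple):
    file_id: int    # local id of the logical file
    name_mark: str  # revision in which the filename last changed


# Local id allocator; ids only need to be unique within this database.
_next_id = itertools.count(1)


def child_metadata(
    parent: Dict[str, FileMeta],
    rev_id: str,
    added: List[str],
    deleted: List[str],
    renamed: List[Tuple[str, str]],
) -> Dict[str, FileMeta]:
    """Derive a child revision's metadata from its single parent's."""
    child = dict(parent)  # untouched files keep their id and marks
    for path in deleted:
        del child[path]
    # Pop every rename source before inserting targets, so that swaps
    # such as a -> b, b -> a do not clobber each other.
    moved = [(new, child.pop(old)) for old, new in renamed]
    for new, meta in moved:
        # Same logical file, but its filename changed in this revision.
        child[new] = meta._replace(name_mark=rev_id)
    for path in added:
        # A brand new logical file gets a fresh local id.
        child[path] = FileMeta(next(_next_id), rev_id)
    return child


root = child_metadata({}, "rev0", ["a.txt", "b.txt"], [], [])
rev1 = child_metadata(root, "rev1", [], ["b.txt"], [("a.txt", "c.txt")])
```

Here rev1 maps "c.txt" to the same file_id that "a.txt" had in root,
with its name_mark moved to "rev1".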

The metadata files are indexed by revision_id, and we should also
store a separate hash of the metadata files, just for integrity
checking.
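
A sketch of that indexing, assuming a SQLite table keyed by
revision_id with a SHA-1 stored alongside the serialized metadata.
The table name, column names, and JSON serialization are all
hypothetical, and this version stores full texts rather than going
through the delta storage described above.

```python
import hashlib
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE revision_metadata ("
        " revision_id TEXT PRIMARY KEY,"
        " data TEXT NOT NULL,"
        " hash TEXT NOT NULL)"
    )
    return db


def _digest(data: str) -> str:
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


def put_metadata(db: sqlite3.Connection, revision_id: str, metadata: dict) -> None:
    # sort_keys makes the serialization (and thus the hash) deterministic.
    data = json.dumps(metadata, sort_keys=True)
    db.execute(
        "INSERT INTO revision_metadata VALUES (?, ?, ?)",
        (revision_id, data, _digest(data)),
    )


def get_metadata(db: sqlite3.Connection, revision_id: str) -> dict:
    row = db.execute(
        "SELECT data, hash FROM revision_metadata WHERE revision_id = ?",
        (revision_id,),
    ).fetchone()
    if row is None:
        raise KeyError(revision_id)
    data, stored_hash = row
    # Integrity check: the stored hash must match the stored text.
    if _digest(data) != stored_hash:
        raise ValueError("metadata for %s failed its hash check" % revision_id)
    return json.loads(data)
```

The hash here only guards against local corruption; it is not part of
any revision or manifest hash.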

'db check' should regenerate metadata starting from the root(s) of the
DAG, and compare what it gets against what's on disk.
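
The check could look roughly like the sketch below: walk the DAG in
topological order so every parent is regenerated before its children,
and compare each result with the stored copy.  check_metadata and the
derive callback are hypothetical names; derive stands in for whatever
routine computes a revision's metadata from its parents' metadata.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Sequence


def check_metadata(
    parents: Dict[str, Sequence[str]],
    stored: Dict[str, Any],
    derive: Callable[[str, List[Any]], Any],
) -> List[str]:
    """Return the revisions whose stored metadata differs from a
    fresh regeneration starting at the root(s).

    parents maps each revision id to its parent ids (roots map to []).
    """
    regenerated: Dict[str, Any] = {}
    mismatched: List[str] = []
    # static_order() yields every revision after all of its parents.
    for rev in TopologicalSorter(parents).static_order():
        parent_meta = [regenerated[p] for p in parents.get(rev, ())]
        # Always build on regenerated data, never on the stored copy,
        # so one bad entry does not hide or spread to its descendants.
        regenerated[rev] = derive(rev, parent_meta)
        if stored.get(rev) != regenerated[rev]:
            mismatched.append(rev)
    return mismatched
```

A missing entry in stored is reported the same way as a wrong one.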

I think that's all.


Any comments?

-- Nathaniel

-- 
So let us espouse a less contested notion of truth and falsehood, even
if it is philosophically debatable (if we listen to philosophers, we
must debate everything, and there would be no end to the discussion).
  -- Serendipities, Umberto Eco
