Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.


From: Kevin Wolf
Subject: Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
Date: Tue, 2 Jul 2013 16:54:46 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 02.07.2013 at 16:42, Stefan Hajnoczi wrote:
> On Thu, Jun 20, 2013 at 04:26:09PM +0200, Benoît Canet wrote:
> > ---
> >  docs/specs/qcow2.txt |   42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> > 
> > diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> > index 36a559d..a4ffc85 100644
> > --- a/docs/specs/qcow2.txt
> > +++ b/docs/specs/qcow2.txt
> > @@ -350,3 +350,45 @@ Snapshot table entry:
> >          variable:   Unique ID string for the snapshot (not null terminated)
> >  
> >          variable:   Name of the snapshot (not null terminated)
> > +
> > +== Journal ==
> > +
> > +QCOW2 can use one or more instance of a metadata journal.
> 
> s/instance/instances/
> 
> Is there a reason to use multiple journals rather than a single journal
> for all entry types?  The single journal area avoids seeks.
> 
> > +
> > +A journal is a sequential log of journal entries appended on a previously
> > +allocated and reseted area.
> 
> I think you say "previously reset area" instead of "reseted".  Another
> option is "initialized area".
> 
> > +A journal is designed like a linked list with each entry pointing to the next
> > +so it's easy to iterate over entries.
> > +
> > +A journal uses the following constants to denote the type of each entry
> > +
> > +TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
> > +TYPE_END  = 1         the entry ends a journal cluster and point to the next
> > +                      cluster
> > +TYPE_HASH = 2         the entry contains a deduplication hash
> > +
> > +QCOW2 journal entry:
> > +
> > +    Byte 0         :    Size of the entry: size = 2 + n with size <= 254
> 
> This is not clear.  I'm wondering if the +2 is included in the byte
> value or not.  I'm also wondering what a byte value of zero means and
> what a byte value of 255 means.
> 
> Please include an example to illustrate how this field works.
> 
> > +
> > +         1         :    Type of the entry
> > +
> > +         2 - size  :    The optional n bytes structure carried by entry
> > +
> > +A journal is divided into clusters and no journal entry can be spilled on two
> > +clusters. This avoid having to read more than one cluster to get a single entry.
> > +
> > +For this purpose an entry with the end type is added at the end of a journal
> > +cluster before starting to write in the next cluster.
> > +The size of such an entry is set so the entry points to the next cluster.
> > +
> > +As any journal cluster must be ended with an end entry the size of regular
> > +journal entries is limited to 254 bytes in order to always left room for an end
> > +entry which mimimal size is two bytes.
> > +
> > +The only cases where size > 254 are none entries where size = 255.
> > +
> > +The replay of a journal stop when the first end none entry is reached.
> 
> s/stop/stops/
> 
> > +The journal cluster size is 4096 bytes.
> 
> Questions about this layout:
> 
> 1. Journal entries have no integrity mechanism, which is especially
>    important if they span physical sectors where cheap disks may perform
>    a partial write.  This would leave a corrupt journal.  If the last
>    bytes are a checksum then you can get some confidence that the entry
>    was fully written and is valid.
> 
>    Did I miss something?

Adding a checksum sounds like a good idea.
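
To make the idea concrete, here is a rough sketch of an entry with a
trailing checksum. Nothing in the patch defines this: the 4-byte CRC32
trailer, counting it in the size byte, and the helper names encode_entry
and decode_entry are all assumptions for illustration. Big-endian is used
only because the rest of qcow2 is big-endian.

```python
import struct
import zlib

TYPE_HASH = 2   # from the proposed spec
CRC_LEN = 4     # assumption: 4-byte CRC32 trailer, not part of the patch


def encode_entry(entry_type, payload):
    """Build size | type | payload | crc32 (hypothetical layout)."""
    size = 2 + len(payload) + CRC_LEN
    if size > 254:
        raise ValueError("entry does not fit the 254-byte limit")
    body = bytes([size, entry_type]) + payload
    # The checksum covers the size byte, the type byte and the payload.
    return body + struct.pack(">I", zlib.crc32(body))


def decode_entry(buf):
    """Return (type, payload), or None if the entry is torn or corrupt."""
    if len(buf) < 2 + CRC_LEN:
        return None
    size = buf[0]
    if size < 2 + CRC_LEN or size > len(buf):
        return None
    body = buf[:size - CRC_LEN]
    trailer = buf[size - CRC_LEN:size]
    if struct.pack(">I", zlib.crc32(body)) != trailer:
        return None
    return buf[1], bytes(body[2:])
```

With something like this, replay could stop at the first entry whose
checksum does not match instead of trusting a partially written one.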

> 2. Byte-granularity means that read-modify-write is necessary to append
>    entries to the journal.  Therefore a failure could destroy previously
>    committed entries.
> 
>    Any ideas how existing journals handle this?

You commit only whole blocks. So in this case we can consider a block
committed only once a TYPE_END entry has been written (and after that
we won't touch it any more until the journalled changes have been
flushed to disk).
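
A minimal sketch of that rule on the replay side, assuming the entry
layout from the patch (size byte, type byte, payload) and a reset area
filled with 0xFF. The function name committed_entries and the 2-byte
TYPE_END entry used in the example are assumptions, not part of the
proposal.

```python
CLUSTER_SIZE = 4096   # journal block size from the proposed spec
TYPE_NONE = 0xFF      # byte value of a reset journal area
TYPE_END = 1
TYPE_HASH = 2


def committed_entries(block):
    """Return the entries of a block, or None if it is not committed.

    A block only counts as committed if walking its entries reaches a
    TYPE_END entry before hitting reset (0xFF) bytes or a bogus size.
    """
    entries = []
    pos = 0
    while pos + 2 <= len(block):
        size, etype = block[pos], block[pos + 1]
        if etype == TYPE_END:
            return entries
        if etype == TYPE_NONE or size < 2 or pos + size > len(block):
            return None
        entries.append((etype, bytes(block[pos + 2:pos + size])))
        pos += size
    return None


# A block with one entry but no TYPE_END yet is ignored on replay.
block = bytearray(b"\xff" * CLUSTER_SIZE)
block[0:4] = bytes([4, TYPE_HASH, 0x12, 0x34])
print(committed_entries(block))

# Once TYPE_END is written, the whole block becomes visible.
block[4:6] = bytes([2, TYPE_END])
print(committed_entries(block))
```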

There's one "interesting" case: cache=writethrough. I'm not entirely
sure yet what to do with it, but it's slow anyway, so using one block
per entry and therefore flushing the journal very often might actually
not be totally unreasonable.
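
Roughly, the one-block-per-entry variant would look like the sketch
below: every append writes a full block that already contains its
TYPE_END entry and then flushes. The helper names, the 2-byte TYPE_END
entry and the use of a plain file descriptor are assumptions made for
the sake of the example, not how the qcow2 code would do it.

```python
import os

CLUSTER_SIZE = 4096   # journal block size from the proposed spec
TYPE_END = 1


def single_entry_block(entry):
    """Return a full block holding one entry followed by TYPE_END."""
    if len(entry) + 2 > CLUSTER_SIZE:
        raise ValueError("entry does not fit in one block")
    block = bytearray(b"\xff" * CLUSTER_SIZE)   # reset state is 0xFF
    block[:len(entry)] = entry
    block[len(entry):len(entry) + 2] = bytes([2, TYPE_END])
    return bytes(block)


def append_writethrough(fd, block_index, entry):
    """Write one committed block per entry, then flush it to disk."""
    os.pwrite(fd, single_entry_block(entry), block_index * CLUSTER_SIZE)
    os.fsync(fd)
```

This costs a 4k write plus a flush for every entry, which is the
trade-off mentioned above.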

Another thing I'm not sure about is whether a fixed 4k block is good or
if we should leave it configurable. I don't think making it an option
would hurt (not necessarily modifiable with qemu-img, but as a field
in the file format).

Kevin