Re: [Qemu-devel] [RFC] qcow2 journalling draft


From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] [RFC] qcow2 journalling draft
Date: Wed, 4 Sep 2013 10:03:52 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
> @@ -103,7 +107,11 @@ in the description of a field.
>                      write to an image with unknown auto-clear features if it
>                      clears the respective bits from this field first.
>  
> -                    Bits 0-63:  Reserved (set to 0)
> +                    Bit 0:      Journal valid bit. This bit indicates that the
> +                                image contains a valid main journal starting at
> +                                journal_offset.

Whether the journal is used can be determined from the journal_offset
value (header length must be large enough and journal offset must be
valid).

Why do we need this autoclear bit?

> +Journals are used to allow safe updates of metadata without impacting
> +performance by requiring flushes to order updates to different parts of the
> +metadata.

This sentence is hard to parse.  Maybe something shorter like this:

Journals allow safe metadata updates without the need for carefully
ordering and flushing between update steps.

> +They consist of transactions, which in turn contain operations that
> +are effectively executed atomically. A qcow2 image can have a main image
> +journal that deals with cluster management operations, and additional specific
> +journals can be used by other features like data deduplication.

I'm not sure if multiple journals will work in practice.  Doesn't this
re-introduce the need to order update steps and flush between them?

> +A journal is organised in journal blocks, all of which have a reference count
> +of exactly 1. It starts with a block containing the following journal header:
> +
> +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
> +
> +          8 - 11:   Journal size in bytes, including the header
> +
> +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
> +                    The block size must be at least 512 bytes and must not
> +                    exceed the cluster size.
> +
> +         16 - 19:   Journal block index of the descriptor for the last
> +                    transaction that has been synced, starting with 1 for the
> +                    journal block after the header. 0 is used for empty
> +                    journals.
> +
> +         20 - 23:   Sequence number of the last transaction that has been
> +                    synced. 0 is recommended as the initial value.
> +
> +         24 - 27:   Sequence number of the last transaction that has been
> +                    committed. When replaying a journal, all transactions
> +                    after the last synced one up to the last commit one must be
> +                    synced. Note that this may include a wraparound of sequence
> +                    numbers.
> +
> +         28 -  31:  Checksum (one's complement of the sum of all bytes in the
> +                    header journal block except those of the checksum field)
> +
> +         32 - 511:  Reserved (set to 0)

I'm not sure if these fields are necessary.  They require updates (and
maybe a flush) after every commit and sync.

The fewer metadata updates, the better, not just for performance but
also to reduce the risk of data loss.  If any metadata required to
access the journal is corrupted, the image will be unavailable.

It should be possible to determine this information by scanning the
journal transactions.
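
For reference, the header as drafted could be read back roughly as
follows.  This is only a minimal sketch, not QEMU code.  It assumes
big-endian fields (as in the rest of the qcow2 format; the quoted text
does not say), a checksum truncated to 32 bits, and sequence numbers that
increase by one per transaction.  All function names are made up for
illustration.

```python
import struct

# Bytes 0-31 of the draft header: magic, journal size, block size order,
# last synced block index, last synced seq, last committed seq, checksum.
# Big-endian is an assumption; the draft text does not state byte order.
HEADER = struct.Struct(">8sIIIIII")
CSUM_START, CSUM_END = 28, 32


def header_checksum(block: bytes) -> int:
    """One's complement of the sum of all bytes except the checksum field.

    Truncating to 32 bits is an assumption based on the 4-byte field width.
    """
    total = sum(block[:CSUM_START]) + sum(block[CSUM_END:])
    return ~total & 0xFFFFFFFF


def build_journal_header(size, order, synced_idx, synced_seq, committed_seq):
    """Hypothetical helper: build a header block of 1 << order bytes."""
    block = bytearray(1 << order)
    struct.pack_into(">8sIIIII", block, 0, b"qjournal", size, order,
                     synced_idx, synced_seq, committed_seq)
    struct.pack_into(">I", block, CSUM_START, header_checksum(bytes(block)))
    return bytes(block)


def parse_journal_header(block: bytes) -> dict:
    """Hypothetical helper: validate and decode the header block."""
    (magic, size, order, synced_idx,
     synced_seq, committed_seq, csum) = HEADER.unpack_from(block)
    if magic != b"qjournal":
        raise ValueError("bad journal magic")
    if csum != header_checksum(block):
        raise ValueError("journal header checksum mismatch")
    if (1 << order) < 512:
        raise ValueError("journal block size below 512 bytes")
    return {
        "size": size,
        "block_size": 1 << order,
        "synced_index": synced_idx,
        "synced_seq": synced_seq,
        "committed_seq": committed_seq,
    }


def pending_transactions(synced_seq: int, committed_seq: int) -> int:
    """Transactions to replay, allowing for 32-bit sequence wraparound."""
    return (committed_seq - synced_seq) & 0xFFFFFFFF
```

Every commit and sync would have to rewrite bytes 16-27 plus the checksum
in this block, which is the extra metadata traffic questioned above.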

> +A wraparound may not occur in the middle of a single transaction, but only
> +between two transactions. For the necessary padding an empty descriptor with
> +any number of data blocks can be used as the last entry of the ring.

Why have this limitation?

> +All descriptors start with a common part:
> +
> +    Byte  0 -  1:   Descriptor type
> +                        0 - No-op descriptor
> +                        1 - Write data block
> +                        2 - Copy data
> +                        3 - Revoke
> +                        4 - Deduplication hash insertion
> +                        5 - Deduplication hash deletion
> +
> +          2 -  3:   Size of the descriptor in bytes

Data blocks are not included in the descriptor size?  I just want to
make sure that we won't be limited to 64 KB for the actual data.
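
To illustrate the size question, here is a minimal sketch of reading the
common part.  It assumes big-endian fields and that the size field covers
only the descriptor itself, with data blocks stored elsewhere in the
journal (one reading of the draft, not something it states explicitly).
parse_common is a hypothetical helper name.

```python
import struct

# Common part of every descriptor: 2-byte type, 2-byte size.
# Big-endian is an assumption; the draft text does not state byte order.
COMMON = struct.Struct(">HH")

# Largest descriptor size that a 2-byte field can express.
MAX_DESCRIPTOR_SIZE = 0xFFFF

TYPE_NAMES = {
    0: "no-op",
    1: "write data block",
    2: "copy data",
    3: "revoke",
    4: "dedup hash insertion",
    5: "dedup hash deletion",
}


def parse_common(buf: bytes, offset: int = 0):
    """Return (type name, descriptor size, type-specific bytes)."""
    dtype, size = COMMON.unpack_from(buf, offset)
    if size < COMMON.size:
        raise ValueError("descriptor smaller than its common part")
    if offset + size > len(buf):
        raise ValueError("descriptor runs past the end of the buffer")
    payload = bytes(buf[offset + COMMON.size:offset + size])
    return TYPE_NAMES.get(dtype, "unknown"), size, payload
```

Under that reading the descriptor itself tops out at 65535 bytes, while
the 4-byte length field of a write descriptor could still describe much
more data than that.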

> +
> +          4 -  n:   Type-specific data
> +
> +The following section specifies the purpose (i.e. the action that is to be
> +performed when syncing) and type-specific data layout of each descriptor type:
> +
> +  * No-op descriptor: No action is to be performed when syncing this descriptor
> +
> +          4 -  n:   Ignored
> +
> +  * Write data block: Write literal data associated with this transaction from
> +    the journal to a given offset.
> +
> +          4 -  7:   Length of the data to write in bytes
> +
> +          8 - 15:   Offset in the image file to write the data to
> +
> +         16 - 19:   Index of the journal block at which the data to write
> +                    starts. The data must be stored sequentially and be fully
> +                    contained in the data blocks associated with the
> +                    transaction.
> +
> +    The type-specific data can be repeated, specifying multiple chunks of data
> +    to be written in one operation. This means the size of the descriptor must
> +    be 4 + 16 * n.

Why is this necessary?  Multiple data descriptors could be used instead;
is it worth the additional logic and testing?
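
For comparison, this is roughly what a parser has to do for the repeated
form.  It is a minimal sketch assuming big-endian fields; the helper
names are hypothetical.

```python
import struct

# One chunk of a write data block descriptor: 4-byte data length,
# 8-byte image offset, 4-byte journal block index (16 bytes in total).
# Big-endian is an assumption; the draft text does not state byte order.
CHUNK = struct.Struct(">IQI")
COMMON_SIZE = 4          # 2-byte type + 2-byte size
TYPE_WRITE_DATA = 1


def build_write_descriptor(chunks) -> bytes:
    """Hypothetical helper: pack (length, image_offset, block_index) tuples."""
    size = COMMON_SIZE + CHUNK.size * len(chunks)
    body = b"".join(CHUNK.pack(*chunk) for chunk in chunks)
    return struct.pack(">HH", TYPE_WRITE_DATA, size) + body


def parse_write_descriptor(desc: bytes):
    """Return the list of (length, image_offset, block_index) chunks."""
    dtype, size = struct.unpack_from(">HH", desc)
    if dtype != TYPE_WRITE_DATA:
        raise ValueError("not a write data block descriptor")
    # The draft requires size == 4 + 16 * n.
    if size != len(desc) or (size - COMMON_SIZE) % CHUNK.size != 0:
        raise ValueError("descriptor size is not 4 + 16 * n")
    return [CHUNK.unpack_from(desc, off)
            for off in range(COMMON_SIZE, size, CHUNK.size)]
```

With one chunk per descriptor the loop and the modulo check would go
away, at the cost of 4 extra bytes of common part per chunk.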

> +
> +  * Copy data: Copy data from one offset in the image to another one. This can
> +    be used for journalling copy-on-write operations.

This reminds me to ask what the plan is for journal scope: metadata only
or also data?  For some operations like dedupe it seems that full data
journalling may be necessary.  But for an image without dedupe it would
not be necessary to journal the rewrites to an already allocated
cluster, for example.

> +          4 -  7:   Length of the data to write in bytes
> +
> +          8 - 15:   Target offset in the image file
> +
> +         16 - 23:   Source offset in the image file

Source and target cannot overlap?
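
If overlap is meant to be forbidden, the check a replay implementation
would need is small.  A minimal sketch, with a hypothetical function
name:

```python
def copy_ranges_overlap(source: int, target: int, length: int) -> bool:
    """True if [source, source + length) and [target, target + length) intersect."""
    if length == 0:
        return False
    return source < target + length and target < source + length
```

If overlap is allowed instead, the spec would have to say whether the
copy behaves like memmove(), since replaying the same copy twice after a
crash could otherwise give a different result.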

> +
> +    The type-specific data can be repeated, specifying multiple chunks of data
> +    to be copied in one operation. This means the size of the descriptor must
> +    be 4 + 20 * n.
> +
> +  * Revoke: Marks operations on a given range in the imag file invalid for all

s/imag/image/

> +    earlier transactions (this does not include the transaction containing the
> +    revoke). They must not be executed on a sync operation (e.g. because the
> +    range in question has been freed and may have been reused for other, not
> +    journalled data structures that must not be overwritten with stale data).
> +    Note that this may mean that operations are to be executed partially.

Example scenario?
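
One scenario that would fit the wording, purely as a guess: an earlier
transaction writes 128 KB spanning two 64 KB clusters, the second cluster
is later freed and revoked, and on replay only the first 64 KB may still
be written.  A minimal sketch of that range arithmetic follows; the
function name is hypothetical and the revoke descriptor layout is not
modelled.

```python
def remaining_after_revoke(op_start, op_len, rev_start, rev_len):
    """Return the (start, length) pieces of an operation not covered by a revoke."""
    op_end = op_start + op_len
    rev_end = rev_start + rev_len
    pieces = []
    # Part of the operation that lies before the revoked range.
    head_end = min(op_end, rev_start)
    if op_start < head_end:
        pieces.append((op_start, head_end - op_start))
    # Part of the operation that lies after the revoked range.
    tail_start = max(op_start, rev_end)
    if tail_start < op_end:
        pieces.append((tail_start, op_end - tail_start))
    return pieces
```

For the scenario above, remaining_after_revoke(0, 131072, 65536, 65536)
leaves only the first cluster, (0, 65536), to be written.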


