qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification


From: Kevin Wolf
Subject: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
Date: Mon, 11 Oct 2010 17:50:26 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.12) Gecko/20100907 Fedora/3.0.7-1.fc12 Thunderbird/3.0.7

Am 11.10.2010 17:30, schrieb Stefan Hajnoczi:
> On Mon, Oct 11, 2010 at 03:58:07PM +0200, Kevin Wolf wrote:
>> Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
>>> Signed-off-by: Stefan Hajnoczi <address@hidden>
>>> ---
>>>  docs/specs/qed_spec.txt |   94 
>>> +++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 files changed, 94 insertions(+), 0 deletions(-)
>>>  create mode 100644 docs/specs/qed_spec.txt
>>>
>>> diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
>>> new file mode 100644
>>> index 0000000..c942b8e
>>> --- /dev/null
>>> +++ b/docs/specs/qed_spec.txt
>>> @@ -0,0 +1,94 @@
>>> +=Specification=
>>> +
>>> +The file format looks like this:
>>> +
>>> + +----------+----------+----------+-----+
>>> + | cluster0 | cluster1 | cluster2 | ... |
>>> + +----------+----------+----------+-----+
>>> +
>>> +The first cluster begins with the '''header'''.  The header contains 
>>> information about where regular clusters start; this allows the header to 
>>> be extensible and store extra information about the image file.  A regular 
>>> cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 
>>> and L2 tables are composed of one or more contiguous clusters.
>>> +
>>> +Normally the file size will be a multiple of the cluster size.  If the 
>>> file size is not a multiple, extra information after the last cluster may 
>>> not be preserved if data is written.  Legitimate extra information should 
>>> use space between the header and the first regular cluster.
>>> +
>>> +All fields are little-endian.
>>> +
>>> +==Header==
>>> + Header {
>>> +     uint32_t magic;               /* QED\0 */
>>> + 
>>> +     uint32_t cluster_size;        /* in bytes */
>>> +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
>>> +     uint32_t header_size;         /* in clusters */
>>> + 
>>> +     uint64_t features;            /* format feature bits */
>>> +     uint64_t compat_features;     /* compat feature bits */
>>> +     uint64_t l1_table_offset;     /* in bytes */
>>> +     uint64_t image_size;          /* total logical image size, in bytes */
>>> + 
>>> +     /* if (features & QED_F_BACKING_FILE) */
>>> +     uint32_t backing_filename_offset; /* in bytes from start of header */
>>> +     uint32_t backing_filename_size;   /* in bytes */
>>> + 
>>> +     /* if (compat_features & QED_CF_BACKING_FORMAT) */
>>> +     uint32_t backing_fmt_offset;  /* in bytes from start of header */
>>> +     uint32_t backing_fmt_size;    /* in bytes */
>>
>> It was discussed before, but I don't think we came to a conclusion. Are
>> there any circumstances under which you don't want to set the
>> QED_CF_BACKING_FORMAT flag?
> 
> I suggest the following:
> 
> QED_CF_BACKING_FORMAT_RAW = 0x1
> 
> When set, the backing file is a raw image and should not be probed for
> its file format.  The default (unset) means that the backing image file
> format may be probed.

And it must be set for raw image formats because we'll want to avoid
considering raw for the probing.

Works for me.

>>> + }
>>> +
>>> +Field descriptions:
>>> +* cluster_size must be a power of 2 in range [2^12, 2^26].
>>> +* table_size must be a power of 2 in range [1, 16].
>>
>> Is there a reason why this must be a power of two?
> 
> The power of two makes logical-to-cluster offset translation easy and
> cheap:
> 
> l2_table = get_l2_table(l1_table[(logical >> l2_shift) & l2_mask])
> cluster = l2_table[logical >> l1_shift] + (logical & cluster_mask)

Right, good point.

>>> +* image_size is the block device size seen by the guest and must be a 
>>> multiple of cluster_size.
>>
>> So there are image sizes that can't be accurately represented in QED? I
>> think that's a bad idea. Even more so because I can't see how it greatly
>> simplifies implementation (you save the operation for rounding up on
>> open/create, that's it) - it looks like a completely arbitrary restriction.
> 
> Good point.  I will try to lift this restriction in v3.

Thanks.

>>> +* backing_filename and backing_fmt are both strings in (byte offset, byte 
>>> size) form.  They are not NUL-terminated and do not have alignment 
>>> constraints.
>>
>> A description of the meaning of these strings is missing.
> 
> Update:
> "The backing filename string is given in the
> backing_filename_{offset,size} fields and may be an absolute path or
> relative to the image file."

Looks good.

>>
>>> +
>>> +Feature bits:
>>> +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
>>> +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
>>> +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file 
>>> format stored.
>>
>> I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
>> twice is confusing at first sight.
> 
> Updated, thanks.
> 
>>
>>> +
>>> +==Tables==
>>> +
>>> +Tables provide the translation from logical offsets in the block device to 
>>> cluster offsets in the file.
>>> +
>>> + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
>>> +  
>>> + Table {
>>> +     uint64_t offsets[TABLE_NOFFSETS];
>>> + }
>>> +
>>> +The tables are organized as follows:
>>> +
>>> +                    +----------+
>>> +                    | L1 table |
>>> +                    +----------+
>>> +               ,------'  |  '------.
>>> +          +----------+   |    +----------+
>>> +          | L2 table |  ...   | L2 table |
>>> +          +----------+        +----------+
>>> +      ,------'  |  '------.
>>> + +----------+   |    +----------+
>>> + |   Data   |  ...   |   Data   |
>>> + +----------+        +----------+
>>> +
>>> +A table is made up of one or more contiguous clusters.  The table_size 
>>> header field determines table size for an image file.  For example, 
>>> cluster_size=64 KB and table_size=4 results in 256 KB tables.
>>> +
>>> +The logical image size must be less than or equal to the maximum possible 
>>> size of clusters rooted by the L1 table:
>>> + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
>>> +
>>> +Logical offsets are translated into cluster offsets as follows:
>>> +
>>> +  table_bits table_bits    cluster_bits
>>> +  <--------> <--------> <--------------->
>>> + +----------+----------+-----------------+
>>> + | L1 index | L2 index |     byte offset |
>>> + +----------+----------+-----------------+
>>> + 
>>> +       Structure of a logical offset
>>> +
>>> + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
>>> +   l2_offset = l1_table[l1_index]
>>> +   l2_table = load_table(l2_offset)
>>> +   cluster_offset = l2_table[l2_index]
>>> +   return cluster_offset + byte_offset
>>
>> Should we reserve some bits in the table entries in case we need some
>> flags later? Also, I suppose all table entries must be cluster aligned?
> 
> Yes, let's do that.  At least for sparse zero cluster tracking we need a
> bit.  The minimum 4k cluster size gives us 12 bits to play with.

Yes, 12 bits should be enough for a while.

>> What happened to the other sections that older versions of the spec
>> contained? For example, this version doesn't specify any more what the
>> semantics of unallocated clusters and backing files is.
> 
> I removed them because they don't describe the on-disk layout and were
> more of a way to think through the implementation than a format
> specification.  It was more a decision to focus my effort on improving
> the on-disk layout specification than anything else.
> 
> Do you want the semantics in the specification, or is it okay to leave
> that part on the wiki only?

I think, at least we need to describe what an unallocated table entry
looks like (it has a zero cluster offset - but may it have flags?) and
what this means for the virtual disk content (read from the backing
file, zero if there is no backing file). This definitely belongs in the
specification.

Other parts that describe qemu's implementation or best practices can be
left on the wiki only.

Kevin



reply via email to

[Prev in Thread] Current Thread [Next in Thread]