From: Eric Blake
Subject: Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
Date: Mon, 4 Apr 2016 17:32:34 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.7.1

On 04/04/2016 05:08 PM, Wouter Verhelst wrote:
> On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
>> saying about dirtiness, we would soon come to the fact, that
>> we can have several dirtiness states regarding different
>> lines of incremental backups. This complexity is hidden
>> inside QEMU and it would be very difficult to publish and
>> reuse it.
> How about this then.
> A reply to GET_BLOCK_STATUS containing chunks of this:
> 32-bit length
> 32-bit "snapshot status"
> if bit 0 in the latter field is set, that means the block is allocated
>   on the original device
> if bit 1 is set, that means the block is allocated on the first-level
>   snapshot
> if bit 2 is set, that means the block is allocated on the second-level
>   snapshot

The idea of allocation is orthogonal to the idea of reads as zeroes.
That is, a client may usefully guarantee that something reads as zeroes,
whether or not it is allocated (but knowing whether it is a hole or
allocated will determine whether future writes to that area will cause
file system fragmentation or be at risk of ENOSPC on thin-provisioning).
If we want to expose the notion of depth (and I'm not sure about that
yet), we may want to reserve bit 0 for 'reads as zero' and bits 1-30 as
'allocated at depth "bit-1"' (and bit 31 as 'allocated at depth 30 or

I don't know if the idea of depth of allocation is useful enough to
expose in this manner; qemu could certainly advertise depth if the
protocol calls it out, but I'm still not sure whether knowing depth
helps any algorithms.
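
To make the layout above concrete, here is a minimal C sketch of how a
client might decode such a 32-bit status word. The macro and function
names are invented for illustration and are not part of any NBD
specification. It assumes bit 0 means 'reads as zero' and bits 1-30
mean 'allocated at depth bit-1'; bit 31 is deliberately left undecoded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; not part of any NBD specification. */
#define STATUS_READS_ZERO   (UINT32_C(1) << 0)
#define STATUS_DEPTH_FIRST  1   /* bit 1 means allocated at depth 0 */
#define STATUS_DEPTH_LAST   30  /* bit 30 means allocated at depth 29 */

/* True if the extent is guaranteed to read as zeroes, allocated or not. */
static bool status_reads_zero(uint32_t status)
{
    return (status & STATUS_READS_ZERO) != 0;
}

/*
 * Return the shallowest depth at which the extent is allocated, using
 * depth = bit - 1 for bits 1..30, or -1 if none of those bits is set.
 * Bit 31 is deliberately left undecoded in this sketch.
 */
static int status_shallowest_depth(uint32_t status)
{
    for (int bit = STATUS_DEPTH_FIRST; bit <= STATUS_DEPTH_LAST; bit++) {
        if (status & (UINT32_C(1) << bit)) {
            return bit - 1;
        }
    }
    return -1;
}
```

Because the two helpers look at disjoint bits, a status word can report
'reads as zero' with or without any allocation bit, which is the
orthogonality argued for here.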

> If all flags are cleared, that means the block is not allocated (i.e.,
> is a hole) and MUST read as zeroes.

That's too strong.  NBD_CMD_TRIM says that we can create holes whose
data does not necessarily read as zeroes (and SCSI definitely has
semantics like this - not all devices guarantee zero reads when you
UNMAP; and WRITE_SAME has an UNMAP flag to control whether you are okay
with the faster unmapping operation at the expense of bad reads, or
slower explicit writes).  Hence my complaint that we have to treat
'reads as zero' as an orthogonal bit to 'allocated at depth X'.

> If a flag is set at a particular level X, that means the device is dirty
> at the Xth-level snapshot.
> If at least one flag is set for a region, that means the data may read
> as "not zero".
> The protocol does not define what it means to have multiple levels of
> snapshots, other than:
> - Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
>   Xth level flag if the Yth level flag is not also cleared at the same
>   time, for any Y > X
> - Any write (as above) MAY set or clear multiple levels of flags at the
>   same time, as long as the above holds
> Having a 32-bit snapshot status field allows for 32 levels of snapshots.
> We could switch length and flags to 64 bits so that things continue to
> align nicely, and then we have a maximum of 64 levels of snapshots.

64 bits may not lay out as nicely (a 12-byte struct is not as efficient
to copy between the wire and a C array as an 8-byte struct).
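
A rough C sketch of the layout concern follows. It assumes the 12-byte
case means a 32-bit length followed by a 64-bit status field; the struct
and function names are hypothetical. NBD sends integers in network byte
order, so fields are decoded as big-endian. The point is that the 12-byte
wire stride generally does not match the in-memory stride of the natural
C struct, so records have to be decoded one field at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory descriptors; names are illustrative only. */
struct extent32 { uint32_t length; uint32_t status; };  /* 8 bytes on the wire */
struct extent64 { uint32_t length; uint64_t status; };  /* 12 bytes on the wire */

/* The 8-byte form has no padding, so wire size and array stride agree. */
_Static_assert(sizeof(struct extent32) == 8, "no padding expected");

/* Load a big-endian 32-bit value from an unaligned buffer. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Load a big-endian 64-bit value from an unaligned buffer. */
static uint64_t load_be64(const uint8_t *p)
{
    return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

/*
 * Decode n 12-byte wire records field by field. The wire stride (12)
 * generally differs from sizeof(struct extent64), which is 16 on ABIs
 * that align uint64_t to 8 bytes.
 */
static void decode_extent64(const uint8_t *wire, size_t n,
                            struct extent64 *out)
{
    for (size_t i = 0; i < n; i++) {
        const uint8_t *rec = wire + i * 12;
        out[i].length = load_be32(rec);
        out[i].status = load_be64(rec + 4);
    }
}
```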

> (I'm not going to write this up formally at this time of night, but you
> get the general idea)

The idea may make it possible to expose dirty information as a layer of
depth (from the qemu perspective, each qcow2 file would occupy 2 layers
of depth: one if dirty, and another if allocated; then deeper layers are
determined by backing files).  But I'm also worried that it may be more
complicated than the original question at hand (qemu wants to know, in
advance of a read, which portions of a file are worth reading, because
they are either allocated, or because they are dirty; but doesn't care
to what depth the server has to go to actually perform the reads).
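
For comparison, the simpler use case can be sketched in C as below. This
is only an illustration under assumed names: the two flags, the structs,
and the function are hypothetical and do not come from any NBD
specification. Given a list of consecutive extents, it collects the byte
ranges that are either allocated or dirty, merging neighbours, with no
notion of depth at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags for illustration; not from any NBD specification. */
#define EXT_ALLOCATED (UINT32_C(1) << 0)
#define EXT_DIRTY     (UINT32_C(1) << 1)

struct extent { uint32_t length; uint32_t flags; };
struct range  { uint64_t offset; uint64_t length; };

/*
 * Walk consecutive extents beginning at byte 'start' and record the
 * ranges worth reading (allocated or dirty), merging adjacent extents.
 * At most max_out ranges are stored; later ranges are silently dropped.
 * Returns the number of ranges written to 'out'.
 */
static size_t ranges_worth_reading(const struct extent *ext, size_t n,
                                   uint64_t start,
                                   struct range *out, size_t max_out)
{
    uint64_t offset = start;
    size_t count = 0;
    int in_run = 0;

    for (size_t i = 0; i < n; i++) {
        int wanted = (ext[i].flags & (EXT_ALLOCATED | EXT_DIRTY)) != 0;

        if (wanted && in_run) {
            out[count - 1].length += ext[i].length;  /* extend current range */
        } else if (wanted && count < max_out) {
            out[count].offset = offset;              /* open a new range */
            out[count].length = ext[i].length;
            count++;
            in_run = 1;
        } else {
            in_run = 0;                              /* hole, or out is full */
        }
        offset += ext[i].length;
    }
    return count;
}
```

A caller would then issue reads only for the returned ranges and treat
everything else as not worth fetching.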

Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
