Re: [PATCH v5 1/4] block: support compressed write at generic layer


From: Max Reitz
Subject: Re: [PATCH v5 1/4] block: support compressed write at generic layer
Date: Tue, 22 Oct 2019 14:56:41 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.1.1

On 22.10.19 14:23, Vladimir Sementsov-Ogievskiy wrote:
> 22.10.2019 14:31, Max Reitz wrote:
>> On 22.10.19 12:46, Vladimir Sementsov-Ogievskiy wrote:
>>> 22.10.2019 13:21, Andrey Shinkevich wrote:
>>>>
>>>> On 22/10/2019 12:28, Max Reitz wrote:
>>>>> On 20.10.19 22:37, Andrey Shinkevich wrote:
>>>>>> To inform the block layer that all data is to be written compressed, we
>>>>>> introduce the 'compress' command line option. Based on that option,
>>>>>> written data will be aligned to the cluster size at the generic layer.
>>>>>>
>>>>>> Suggested-by: Roman Kagan <address@hidden>
>>>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>>>>>> Signed-off-by: Andrey Shinkevich <address@hidden>
>>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>>>>>> ---
>>>>>>     block.c                   | 20 +++++++++++++++++++-
>>>>>>     block/io.c                | 13 +++++++++----
>>>>>>     block/qcow2.c             |  4 ++++
>>>>>>     blockdev.c                |  9 ++++++++-
>>>>>>     include/block/block.h     |  1 +
>>>>>>     include/block/block_int.h |  2 ++
>>>>>>     qapi/block-core.json      |  5 ++++-
>>>>>>     qemu-options.hx           |  6 ++++--
>>>>>>     8 files changed, 51 insertions(+), 9 deletions(-)
>>>>>
>>>>> The problem with compression is that there are such tight constraints on
>>>>> it that it can really only work for narrowly defined use cases.  Those
>>>>> constraints are:
>>>>>
>>>>> - Only write whole clusters,
>>>>> - Clusters can be written to only once.
>>>>>
>>>>> The first point is addressed in this patch by setting request_alignment.
>>>>> But I don’t see how the second one can be addressed.  Well, maybe by
>>>>> allowing it in all drivers that support compression.  But if I just look
>>>>> at qcow2, that isn’t going to be trivial: You need to allocate a
>>>>> completely new cluster where you write the data (in case it grows), and
>>>>> thus you leave behind a hole, which kind of defeats the purpose of
>>>>> compression.
>>>>>
>>>>> (For demonstration:
>>>>>
>>>>> $ ./qemu-img create -f qcow2 test.qcow2 64M
>>>>> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
>>>>> lazy_refcounts=off refcount_bits=16
>>>>> $ x86_64-softmmu/qemu-system-x86_64 \
>>>>>        -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>>>>>                    'compress': true,
>>>>>        'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>>>>>        -monitor stdio
>>>>> QEMU 4.1.50 monitor - type 'help' for more information
>>>>> (qemu) qemu-io drv0 "write -P 42 0 64k"
>>>>> wrote 65536/65536 bytes at offset 0
>>>>> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
>>>>> (qemu) qemu-io drv0 "write -P 23 0 64k"
>>>>> write failed: Input/output error
>>>>>
>>>>> )
>>>>>
>>>>> Compression really only works when you fully write all of an image
>>>>> exactly once; i.e. as the target of qemu-img convert or of a backup.  For
>>>>> both cases we already have a compression option.  So I’m wondering where
>>>>> this new option is really useful.
>>>>>
>>>>> (You do add a test for stream, but I don’t know whether that’s really a
>>>>> good example, see my response there.)
>>>>>
>>>>> Max
>>>>>
>>>>
>>>> Thank you very much, Max, for your detailed response.
>>>>
>>>> 1) You are right that compression is mostly used with backup. The
>>>> backup-specific compression option would be replaced by the one at the
>>>> block layer, with no duplication. Also, it can be useful for NBD, for
>>>> instance:
>>>>
>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>> $ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>> 10+0 records in
>>>> 10+0 records out
>>>> 104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>> $ du -sh ./image.qcow2
>>>> 101M    ./image.qcow2
>>>>
>>>> and with the compression
>>>>
>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>> $ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>> 10+0 records in
>>>> 10+0 records out
>>>> 104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>> $ du -sh ./image.qcow2
>>>> 5,3M    ./image.qcow2
>>
>> That seems wrong to me.  Why not use qemu-img convert for this case?
>>
>> Attaching an NBD server to a compressed disk has exactly the same
>> problem as attaching a compressed disk to a VM.  It won’t work unless
>> the client/guest is aware of the limitations.
>>
>>>> The idea behind introducing the new 'compress' option is to have that
>>>> single option instead of many other ones of the same kind.
>>>>
>>>> 2) You are also right that "Compression can't overwrite anything..."
>>>> That can be seen in the message of commit
>>>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2.
>>>>
>>>> I am not sure whether data should be written compressed to the active
>>>> layer. I made the tests above to give examples of using the 'compress'
>>>> option, because passing an option around is a tricky thing in QEMU.
>>>> But the issue exists anyway whenever we want to rewrite allocated
>>>> clusters.
>>>> I would like to investigate the matter and make a patch that resolves
>>>> that issue.
>>>> Do you agree with that?
>>
>> What seems wrong to me is that this series just adds a generic compress
>> option without ensuring that it works generically.
>>
>> Either (1) it only works in well-defined cases, then either (1A) we have
>> to ensure that we only allow it then (as we do now, because only
>> qemu-img convert and the backup job have such an option; and those two
>> are safe), or (1B) we have to clearly give a big warning for the new
>> option that it doesn’t work correctly.  I don’t know whether such a
>> warning is even possible with just a generic node-level option.
>>
>> Or (2) we make it work in generic cases.  Well, that might be possible
>> for qcow2, but who’s going to make it work for VMDK’s streamOptimized
>> subformat?
>>
>> More on all of that below.
>>
>>> Yes, we want this option not to allow compressed writes for guests, but to
>>> allow
>>> - stream with compression (used to remove a compressed incremental backup:
>>> we need to merge it into the next incremental)
>>
>> Based on the assumption that one shouldn’t attach a compressed disk to a
>> VM, I don’t see how streaming makes sense then.  Well, I suppose
>> intermediate streaming would work.
>>
>>> - backup with compression (we have an option already, so it works)
>>> - backup to nbd server with compression: enable compression on nbd server
>>
>> The problem is clearly that if a generic client were to connect to the
>> NBD server, it wouldn’t work.  In this case, compression will work only
>> if the client understands the limitation.
>>
>> (The safe way would be to make the NBD server provide an option that
>> clients can see so they know this limitation and agree they’ll adhere to
>> it.  It’s also a stupid way.)
>>
>>> So instead of adding two options (for stream and for nbd), it seems better
>>> to add only one at the generic layer.
>>
>> I don’t know.  It doesn’t work generically, so I don’t know whether it
>> should be a generic option.
>>
>>> Then it becomes possible to run a guest on an image with compress=on.
>>> It's a side effect, but it should still work correctly.
>>
>> How so?  compress=on only works if every party involved only writes to
>> any cluster of the image exactly once.  That is just not the case for
>> guests unless they know this limitation, and even then I don’t see a use case.
>>
>>> I think the simplest thing is to just run a normal write if the compressed
>>> write failed because of reallocation. We should check that ENOTSUP is
>>> returned on that failure path and handle it for the compress=on option by
>>> falling back to a normal write in the generic block/io.c.
>>
>> It seems wrong to not compress data with compress=on.
> 
> We already fall back to a normal write if we can't compress in qcow2.

In a very specific case, namely where the compressed size is larger than
the uncompressed size.  It would be a whole different story to only
compress the first write to a given cluster but not any following one.
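
(For concreteness, I read the proposal as roughly the following fragment
of the generic write path; this is a sketch only, with approximate
function names, not actual code from this series:

    if (flags & BDRV_REQ_WRITE_COMPRESSED) {
        ret = bdrv_driver_pwritev_compressed(bs, offset, bytes, qiov);
        if (ret == -ENOTSUP) {
            /*
             * The cluster has been written before, so the driver
             * cannot compress it; retry as a plain uncompressed
             * write instead of failing the request.
             */
            ret = bdrv_driver_pwritev(bs, offset, bytes, qiov,
                                      flags & ~BDRV_REQ_WRITE_COMPRESSED);
        }
    }

)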

>>> (Note that in the stream case we write only to unallocated clusters)
>>
>> Yes, but if the stream job writes the cluster before the guest, the
>> guest gets an I/O error.
> 
> I don't think that compressed writes from a guest are a really usable thing.
> In our case with stream there is no guest (we just use the qemu binary to
> operate on the block layer).
> 
>>
>>
>> By the way, the other thing I wondered was whether this should be a
>> filter driver instead (if it makes sense at all).  Such a filter driver
>> would at least be sufficiently cumbersome to use that probably only
>> users who understand the implications would make use of it (1B above).
>>
>>
>> I’m not against adding a generic compress=on option if we ensure that it
>> actually works generically (2 above).  It doesn’t right now, so I don’t
>> think this is right.
>>
>> I’m already not sure whether it’s really possible to support generic
>> compressed writes in qcow2.  I suppose it’d be at least awkward.  In
>> theory it should work, it’s just that we can’t keep track of subcluster
>> allocations, which in the worst case means that some compressed clusters
>> take up a whole cluster of space.
>>
>> For VMDK...  I suppose we could consider a new flag for block drivers
>> that flags whether a driver supports arbitrary compressed writes or has
>> the old limitations.  compress=on could then refuse to work on any block
>> driver but the ones that support arbitrary compressed writes.
>>
>>
>> Our current model (1A) is simply to ensure that all compressed writes
>> can adhere to the limitations.  As I’ve said above, extending this to
>> NBD would mean adding some NBD negotiation so both client and server
>> agree on this limitation.
> 
> In this case, I'd prefer a simple cmdline option for qemu-nbd.
> 
>> That seems kind of reasonable from the qemu
>> side, but probably very unreasonable from an NBD side.  Why would NBD
>> bother to reproduce qemu’s limitations?
>>
>>
>> So these are the three ways (1A, 1B, 2) I can imagine.  But just adding
>> a generic option that unsuspecting users are not unlikely to use but
>> that simply doesn’t work generically doesn’t seem right to me.
>>
> 
> Firstly, 1A already doesn't work correctly: we have a compress option for
> backup.
> So it will not work if the backup target is not empty.

I’m not sure whether that qualifies because the user is simply
responsible for ensuring that the target is empty.

Otherwise, you could also argue that backup doesn’t necessarily create a
real backup because with sync=top it won’t copy unallocated clusters, so
if the target already has data in such a place, it won’t be overwritten.

Furthermore I’d argue this would hardly be an accident.  Both
drive-backup with mode=existing and qemu-img convert -n are most likely
not used blindly but only when the situation requires it.

My point is that using compress=on can be an accident because people see
that option and think “That sounds useful!” without reading the warning.

> So 1A is impossible, as it's already broken; adding separate options for
> stream and qemu-nbd is not better than just adding one generic option.

I disagree that 1A is broken, and I disagree a generic option would not
be worse.  I still believe a simple generic option is prone to
accidental misuse.

> I don't like (2), as it means a lot of effort to support an actually
> unneeded case.

Well, you say it isn’t needed.  I suppose if it were supported, people
would indeed use it.

> I don't really like the filter solution (as it seems too much to add a
> filter for a simple boolean option), but I think we can live with it.

First, it is not a simple boolean option.  This patch brings an
implementation with it that converts writes to compressed writes and
sets a request alignment.  I actually think it’s better to do things
like that in a dedicated block driver than in generic block layer code.

Second, we basically already have this for copy-on-read (which is also a
boolean option for -drive).

(FWIW, I’d prefer if detect-zeroes were a block driver instead of
generic block layer code.)
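
(To give an idea of the scale: the write path of such a compress filter
would essentially just forward each request with the compressed flag set.
A rough, untested sketch, assuming the usual filter structure around it:

    static int coroutine_fn
    compress_co_pwritev(BlockDriverState *bs, uint64_t offset,
                        uint64_t bytes, QEMUIOVector *qiov, int flags)
    {
        /* Forward the write to the filtered child, requesting compression */
        return bdrv_co_pwritev(bs->file, offset, bytes, qiov,
                               flags | BDRV_REQ_WRITE_COMPRESSED);
    }

The rest would be the usual filter boilerplate.)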

> 1B is OK for me, that is, just document what the option actually does.
> 
> Maybe name the option "compress-new-writes" instead of just "compress"?
> And document that writes to clusters which were not written before will be
> compressed?
> 
> Or make the option compress=<x>, where x may be 'no' or 'new-writes',
> reserving 'all-writes' for the future?

I don’t know whether the limitations can be captured in three words, but
it would at least make users pay closer attention to what the option is
really doing.
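
(If we go down that route, the QAPI side might look something like the
following; purely illustrative, and the enum name is made up:

    { 'enum': 'BdrvCompressMode',
      'data': [ 'no', 'new-writes' ] }

with 'all-writes' to be added only once drivers can actually support it.)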

OTOH, the only downsides to a dedicated block driver appear to me to be
that it’s more cumbersome to use (which I actually consider a benefit) and
that it requires the usual filter driver boilerplate – but that can be
copied from other drivers[1].

(The benefits are that we don’t need to modify generic code and that
people need to read into the filter and its caveats before they can use it.)

Max


[1] Actually, we could provide default implementations somewhere and
make block filters use them.  That should reduce the boilerplate size.
