Re: [Qemu-block] Performance impact of the qcow2 overlap checks


From: Alberto Garcia
Subject: Re: [Qemu-block] Performance impact of the qcow2 overlap checks
Date: Tue, 24 Jan 2017 17:43:44 +0100
User-agent: Notmuch/0.18.2 (http://notmuchmail.org) Emacs/24.4.1 (i586-pc-linux-gnu)

On Mon 23 Jan 2017 05:29:49 PM CET, Max Reitz wrote:
> Refcount data will only be queried when writing data to the image. If
> that data has been overwritten, we have a chance that it is being set
> to 0 (which is rather large because 0 generally has a higher
> probability of being a part of data, admittedly). But we also have a
> chance that it is set to something else, which generally will be
> greater than the number of internal snapshots (+ 1). Therefore, such
> corruption should be easily detectable before much data is wrongly
> overwritten.
>
> The drawbacks with this approach would be the following:
> (1) Is printing a warning enough to make the user shut down the VM as
> fast as possible and run qemu-img check?

> (2) It is legal to have a greater refcount than the number of internal
> snapshots plus one. qemu never produces such images, though (or does
> it?). Could there be existing images where users will be just annoyed by
> such warnings? Should we add a runtime option to disable them?

I don't think it's legal, or is there any reason why it would be?

I'll try to summarize my opinion:

- If the refcount method that you propose lets us guarantee that the
  image is corrupted, then that should clearly cause an I/O error,
  and I would prevent further writes to the image (if that's possible).

- If this method cannot guarantee that the image is corrupted and can
  only give us an indication that it might be, then I don't think I'd
  bother, and I'd simply keep the current overlap check.

- Printing a warning and expecting the user to see it doesn't seem like
  a good way to deal with data corruption.

> And of course another approach I already mentioned would be to scrap
> the overlap checks altogether once we have image locking (and I guess
> we can keep them around in their current form at least until then).

I think the overlap checks are fine; at least in my tests I only found
problems with one of them, and only in some scenarios(*). So if we
cannot optimize them easily I'd simply tell the user about the risks and
suggest disabling them. Maybe the only thing that we need is simply
good documentation. What are the chances of corrupted qcow2 images that
are not caused by the user messing up? Do we know how many cases of
those there are?

I think the most obvious candidate for optimization is refcount-block,
and as I said it's the check that would create the bottleneck in most
common scenarios. The optimization is simple: if the size of the qcow2
image is 7GB then you only need to check the first 4 entries in the
refcount table (with the default 64KB clusters and 16-bit refcounts,
each refcount block covers 2GB of the image file).
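
To make the arithmetic explicit, here is a minimal sketch of how the
number of entries could be computed. It assumes the qcow2 defaults of
64KB clusters and 16-bit refcounts, and that "size" means the length of
the image file on the host; the function name is made up for this
example and is not QEMU code.

```python
def refcount_table_entries_in_use(image_size, cluster_size=64 * 1024,
                                  refcount_bits=16):
    """Return how many leading refcount table entries can describe
    clusters inside an image file of image_size bytes.

    Hypothetical helper; the defaults are the usual qcow2 ones."""
    # A refcount block occupies one cluster and holds this many refcounts.
    refcounts_per_block = cluster_size * 8 // refcount_bits
    # Each refcount describes one cluster, so one block covers this many bytes.
    bytes_per_block = refcounts_per_block * cluster_size
    # Round up: a partially covered block still needs its table entry.
    return -(-image_size // bytes_per_block)


GiB = 1024 ** 3
print(refcount_table_entries_in_use(7 * GiB))  # prints 4
```

With these defaults one refcount block covers 32768 clusters, i.e. 2GB,
so a 7GB file needs ceil(7 / 2) = 4 entries.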

I can think of two problems with this, which is why I haven't sent a
patch yet:

(1) This needs the current size of the image every time we want to
    perform that check, and that means I/O.

(2) The unused entries that we're skipping in the refcount table should
    be 0, but what if they're not? That would be a sign of data
    corruption. But should we bother? Those entries will be checked
    before they're used if the image grows large enough.
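
A rough sketch of what the bounded check and the question in (2) could
look like. Everything here is an assumption made for illustration: the
refcount table is modelled as a plain list of host offsets (0 meaning
"no refcount block"), reserved bits are ignored, and both function names
are hypothetical rather than taken from QEMU.

```python
CLUSTER_SIZE = 64 * 1024  # assumed default qcow2 cluster size


def overlaps_refcount_block(write_offset, write_size, refcount_table,
                            entries_in_use, cluster_size=CLUSTER_SIZE):
    """Check a write against the first entries_in_use refcount blocks only."""
    write_end = write_offset + write_size
    for block_offset in refcount_table[:entries_in_use]:
        if block_offset == 0:
            continue  # entry has no refcount block allocated
        # Two half-open ranges overlap if each starts before the other ends.
        if write_offset < block_offset + cluster_size and block_offset < write_end:
            return True
    return False


def nonzero_unused_entries(refcount_table, entries_in_use):
    """Indices of skipped entries that are not 0 (a possible corruption sign)."""
    return [i for i, offset in enumerate(refcount_table)
            if i >= entries_in_use and offset != 0]


table = [0x10000, 0x20000, 0, 0x50000, 0, 0x90000]
print(overlaps_refcount_block(0x20000, 512, table, 4))  # inside an in-use block
print(overlaps_refcount_block(0x90000, 512, table, 4))  # only hits a skipped entry
print(nonzero_unused_entries(table, 4))
```

The second call shows the trade-off: a write that lands on a block
referenced only by a skipped entry goes unnoticed by the bounded check,
so something like nonzero_unused_entries() would have to run separately
(for example at open time) if we decide that we do care.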

(*) I actually noticed (I'm talking about a qcow2 image stored in RAM
now) that disabling the refcount-block check dramatically increases
(+90%) the number of IOPS when using virtio-blk, but doesn't seem to
have any effect (my tests even show a slightly negative effect!!) when
using virtio-scsi. Does that make sense? Am I hitting a SCSI limit, or
what would be the reason for this?

Berto


