
Re: [Qemu-devel] [PATCH 1/4] block: Correct bs->growable


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH 1/4] block: Correct bs->growable
Date: Fri, 05 Sep 2014 14:46:33 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.0

On 05.09.2014 12:01, Kevin Wolf wrote:
Am 04.09.2014 um 22:01 hat Max Reitz geschrieben:
On 20.08.2014 13:40, Kevin Wolf wrote:
Am 12.07.2014 um 00:23 hat Max Reitz geschrieben:
Currently, the field "growable" in a BDS is set iff the BDS is opened in
protocol mode (with BDRV_O_PROTOCOL). However, not every protocol block
driver allows growing: NBD, for instance, does not. On the other hand,
a non-protocol block driver may allow growing: The raw driver does.

Fix this by correcting the "growable" field in the driver-specific open
function for the BDS, if necessary.

Signed-off-by: Max Reitz <address@hidden>
I'm not sure I agree with bs->growable = true for raw. It's certainly
true that the backend can technically provide the functionality that
writes beyond EOF grow the file. That's not the point of bs->growable,
though.

The point of it was to _forbid_ it to grow even when it's technically
possible (non-file protocols weren't really a thing back then, apart
from vvfat, so the assumption was that it's always technically
possible). growable was introduced with bdrv_check_request(), which is
supposed to reject guest requests after the end of the virtual disk (and
this fixed a CVE, see commit 71d0770c). You're now disabling this check
for raw.
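The check being described can be sketched in miniature. This is an illustrative model only, assuming simplified stand-in names (`FakeBDS`, `fake_check_request`), not QEMU's actual `bdrv_check_request()` or BDS structure:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a BDS; only the two fields relevant here. */
typedef struct FakeBDS {
    int64_t total_bytes;   /* what bdrv_getlength() would report */
    bool growable;         /* if false, requests beyond EOF are rejected */
} FakeBDS;

/* Illustrative version of the range check that growable enables:
 * a non-growable BDS must reject any request outside 0..total_bytes. */
static bool fake_check_request(const FakeBDS *bs, int64_t offset,
                               int64_t bytes)
{
    if (offset < 0 || bytes < 0 || offset > INT64_MAX - bytes) {
        return false;               /* malformed or overflowing request */
    }
    if (bs->growable) {
        return true;                /* writes past EOF may grow the file */
    }
    return offset + bytes <= bs->total_bytes;
}
```

With growable = true for raw, the early return above short-circuits the bounds check entirely, which is the regression being pointed out.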

I think we need to make sure that bs->growable is only set if the BDS is
opened for an image whose driver has requires_growing_file set, and is
therefore not directly used by a guest.

Well, except that with node-name a guest will be able to use any image
in the chain... Might this mean that it's really a BlockBackend
property?
Okay, the more I sit at this problem, the harder it seems to get.
The only thing I currently know for sure is that I disagree with
Anthony's opinion in 71d0770c ("this patch makes the BlockDriver API
guarantee that all requests are within 0..bdrv_getlength() which to
me seems like a Good Thing").

The initial point was to range check guest requests. In 2009, it may
have been enough to statically check the BDS type (protocol or
format) to know whether the guest directly accesses it (format) or
not (protocol). However, this is no longer sufficient. Now, as far
as I know, guests can access basically any BDS (as you yourself
say). Therefore, it seems to me like it's impossible to determine
whether the device should be marked growable or not when opening it.
Instead, I think we have to check for each single request whether it
comes from the guest or from within the block layer and do range
checking only for the former case; but this should not be the task
of the block layer core, but of the block devices instead.
Theoretically, guests may write beyond the image end and grow it
that way all they want, but in practice this should be limited by
the block devices which have a fixed length.
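The proposal above — that each device model bounds-checks guest requests against its own fixed disk size before they reach the block layer — could look roughly like this. The names (`FakeDevice`, `device_request_ok`) are hypothetical, not actual QEMU device-emulation code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed-size device model: it knows its own virtual disk
 * size and rejects out-of-range guest requests itself, instead of
 * relying on bs->growable in the block layer core. */
typedef struct FakeDevice {
    int64_t disk_bytes;    /* fixed disk size exposed to the guest */
} FakeDevice;

static bool device_request_ok(const FakeDevice *dev, int64_t offset,
                              int64_t bytes)
{
    /* offset <= disk_bytes first, so the subtraction cannot go negative
     * or overflow. */
    return offset >= 0 && bytes >= 0 &&
           offset <= dev->disk_bytes &&
           bytes <= dev->disk_bytes - offset;
}
```

Under this scheme the block layer would no longer need to know whether a request originated from a guest; the guest-facing component enforces the bound.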
What about block jobs, built-in NBD server, etc.?

Well, I honestly think they should do the check themselves as well.

Also, we have many device models that I don't trust a bit and having one
central check is much easier to get right than n duplicates in each
device emulation.

That's why I (later) said that I do understand the concept of having a backup check.

I still think the block device drivers have to detect out-of-bounds accesses to be able to act differently than when just seeing an EIO from the block layer. The same applies to the NBD server; and then getting the range check into block jobs shouldn't hurt too much either.

I don't think all those guest interfaces (block jobs are not really a guest interface, imho, so I'd even keep them out of this) can just rely on the block layer to do the check for them.

Under this impression, I wanted to simply return to growable = false
for raw. However, this breaks test 071 which attaches blkdebug to a
raw BDS after qemu has been started. blkdebug detects raw is not
growable, therefore reports not to be growable as well, and because
qcow2 is on top of all that, the warning introduced by this series
is emitted. Okay, so we will need growable = true for raw in some
cases.
I don't want to make the shortcut more tempting, but if we come to the
conclusion that a fixed growable=false for raw is the right thing to do,
then we simply need to fix the test case.

As I wrote later: I do know that this is not a general use case but it should work anyway.

You create some block device which naturally has raw on top; then you put another block format on top of that. It's probably crazy and useless, but by no means wrong. When accessing the raw BDS from the guest, it should not be growable. When accessing it through the format BDS on top of it, it should be.

It's not trivial to determine whether the superordinate BDS of a
certain BDS has BlockDriver.requires_growing_file set or not. We
could introduce a new flag for bdrv_open(), but I'd rather avoid it.
In fact, I tried something like this, but I quickly ran into
problems because e.g. blkdebug does not have
requires_growing_file set, but decides whether its own BDS is
growable based on whether the underlying file is growable or not. So
if a blkdebug BDS is growable, the underlying file actually needs to
be growable as well. Therefore, we'd need a more sophisticated
requires_growing_file and maybe even propagation of the growable
requirement through the BDS layers, which quickly turns ugly once one
considers that a BDS may be used by multiple users.
Ugh. You're right, it's not a static per-driver property, but really a
per-BDS one. That's somewhat unfortunate.

Also, it's actually irrelevant whether requires_growing_file is set
or not. growable's current sole purpose apparently is enabling range
checks for guest-accessible images. If the BDS is only a subordinate
to another BDS, it doesn't matter whether that other BDS needs
growable set or not.
Hm... So in blockdev talk, what you're saying is that range checks are a
BlockBackend thing, and if we're on BDS level, we don't care? Perhaps
that's the right approach and gets us rid of bs->growable immediately.

Exactly.

So I scrapped that. Instead, we can just test whether
BDRV_O_PROTOCOL is given or not. If it is, the BDS is used from
within the block layer; if it isn't, it probably isn't, and even if
it is, the user just has to cope with activated range checks. That's
at least the idea.
Making any difference based on BDRV_O_PROTOCOL is a dead end in
blockdev-add times, where users are fully expected to build up a BDS
graph node by node with separate blockdev-add invocations.

Right, but that just means the protection is already broken.

But this doesn't work either: You can create a protocol BDS and then
pass it to the guest; on second thought, however, this is already
possible, so I wouldn't bother about this. But on the other hand,
this breaks 071 as well because 071 creates a (non-guest accessible,
but it could be) non-protocol BDS and then tries to put blkdebug on
top of that. I do know that this is not a general use case but it
should work anyway.

So, in my honest and very humble opinion, I'd reevaluate the
usefulness of BDS.growable regarding its original purpose and
instead change it to be what this patch does.
What use case is left if you don't use it for range checks any more?
Shouldn't we remove rather than change it?

Do your patches still need a per-BDS flag, or would a static per-driver
flag be enough?

Well, that's not so easy. See blkdebug (I know blkdebug is a rather bad example, because no one really uses it except for the iotests, but it's there) and raw (which is another bad example, because it's what we're talking about here and we don't really know how to handle it): A blkdebug or raw BDS is growable iff the underlying file BDS is growable. We could signal this through a special value in the BDS, but enter blkverify (just as bad as blkdebug, I know): It's growable iff both the file BDS and the tested BDS are growable; a similar thing applies to quorum (which I, for whatever reason, did not touch during this series; probably I should do that in v2).
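The dependencies just described could be expressed like this — again an illustrative sketch with made-up names, not the real drivers: blkdebug and raw inherit growable from their file, blkverify needs both of its children growable, and quorum would need all of them:

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a BDS, reduced to the one flag in question. */
typedef struct FakeBDS {
    bool growable;
} FakeBDS;

/* blkdebug/raw: growable iff the underlying file BDS is growable. */
static bool filter_growable(const FakeBDS *file)
{
    return file->growable;
}

/* blkverify: growable iff both the file BDS and the tested BDS are. */
static bool blkverify_growable(const FakeBDS *file, const FakeBDS *test)
{
    return file->growable && test->growable;
}

/* quorum: growable iff every child is. */
static bool quorum_growable(const FakeBDS **children, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!children[i]->growable) {
            return false;
        }
    }
    return true;
}
```

This is exactly why a static per-driver flag falls short: the value is a function of the children, i.e. a per-BDS property, as Kevin concedes below.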

Also, should qcow2 over host_device print a warning? It's a legitimate
setup that is frequently used, but a host_device isn't really growable
on its own. We rely on management taking care of it.

I most certainly think it should warn. But I'll keep that in mind and reformulate the message (strip the "please fix your configuration" and replace it with something like "write operations may fail").

Anthony argued that the block layer could very well do the range
checks. It can, but it cannot trivially know (at least in its
current state) whether a certain request comes from the guest or
not. In 2009, this may have been known when the BDS was created; but
this is absolutely not true today.

On the other hand, the block devices very well know that any request
coming to them has to fit in a certain range. They should do that
range checking, not the block layer. I understand the intent of
having a redundant fail-safe system, but it simply doesn't work
anymore. We cannot sometimes expect a raw BDS to grow (when in the
middle of a BDS stack) and sometimes not (when directly used by a
guest). On the other hand, the guest can simply be given a protocol
BDS and all the range checks are disabled.

Putting BDS.growable into BlockBackend may (and probably will) fix
this. But I really don't like doing the check in the block layer
when it's really the block devices who are responsible for it, even
if it's just a backup check.
I see, you've come to the same conclusion regarding BlockBackend.

And no, I don't agree that it belongs in the device emulation code.
There are more users of block devices than just those.

I still think it does belong there. But I agree that we should have a central backup check in the block layer, if possible and feasible.

The worst thing is, I can't even introduce a new field like
"writes_beyond_eof" to BDS to circumvent all of this. Again, take
the blkdebug-on-existing-raw example. With a separate field, the
block layer will not complain about opening that constellation. But
if you actually try to write something to the qcow2 BDS which would
make the image file grow, the range check breaks everything because
it only cares about growable. So in the end, the block layer should
actually have complained about the constellation. But on the other
hand, it shouldn't have, because the constellation should work. This
really is the heart of the problem: The raw BDS might be exposed to
the guest, so when the guest accesses it, range checks should be
done. But if it is used through qcow2+blkdebug, range checks should
be disabled. Using BlockBackend will fix this, but we don't have
that yet.


tl;dr, I see only two ways to go on: Either I wait until
BlockBackend exists; or I simply leave this patch as it is because
it's actually the block device driver's fault if an out-of-range
request comes in from the guest. Since I remember talking about the
former a year ago personally with you and Markus, I fear it'll still
take a considerable amount of time. Therefore, I'm strongly in favor
of the latter. If the block device drivers do their job, it won't
break anything. If they don't, they should be fixed (and at least
it'll be only raw that's broken).
Wait until BlockBackend exists. Markus seems to have switched from "do
everything sometime" to "do something now" and is working on it, so I
hope we'll be there soon (before KVM Forum) with the initial version.
That should be enough for your requirement.

Good.

Not sure if it's helpful in this context, but reading all of this
reminded me of something I suggested before: BDSes must be able to treat
requests differently depending on where they come from. We called it
something like BDSes exposing different "views" or "ports". One of the
use cases was qcow2 exposing read-only views of inactive snapshots,
another one was group throttling, and what you just wrote sounded in
parts like it could make use of the same.

Yay, sounds like shaders in the block layer to me. :-)

Max


