Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate seri

qemu-block

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate seri

From:	Vladimir Sementsov-Ogievskiy
Subject:	Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
Date:	Tue, 2 Jun 2020 20:08:49 +0300
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.1

02.06.2020 20:01, Vladimir Sementsov-Ogievskiy wrote:

02.06.2020 19:38, Max Reitz wrote:

On 02.06.20 18:16, Vladimir Sementsov-Ogievskiy wrote:

02.06.2020 18:46, Max Reitz wrote:

On 02.06.20 16:43, Vladimir Sementsov-Ogievskiy wrote:

01.11.2019 18:25, Max Reitz wrote:

Sorry for being late, I have some comments


Uh, well.  Reasonable, but I hope you don’t mind me having no longer
having this patch fresh on my mind.

The XFS kernel driver has a bug that may cause data corruption for
qcow2
images as of qemu commit c8bb23cbdbe32f.  We can work around it by
treating post-EOF fallocates as serializing up until infinity
(INT64_MAX
in practice).

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
    block/file-posix.c | 36 ++++++++++++++++++++++++++++++++++++
    1 file changed, 36 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 0b7e904d48..1f0f61a02b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2721,6 +2721,42 @@ raw_do_pwrite_zeroes(BlockDriverState *bs,
int64_t offset, int bytes,
        RawPosixAIOData acb;
        ThreadPoolFunc *handler;
    +#ifdef CONFIG_FALLOCATE
+    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
+        BdrvTrackedRequest *req;
+        uint64_t end;
+
+        /*
+         * This is a workaround for a bug in the Linux XFS driver,
+         * where writes submitted through the AIO interface will be
+         * discarded if they happen beyond a concurrently running
+         * fallocate() that increases the file length (i.e., both the
+         * write and the fallocate() happen beyond the EOF).
+         *
+         * To work around it, we extend the tracked request for this
+         * zero write until INT64_MAX (effectively infinity), and mark
+         * it as serializing.
+         *
+         * We have to enable this workaround for all filesystems and
+         * AIO modes (not just XFS with aio=native), because for
+         * remote filesystems we do not know the host configuration.
+         */
+
+        req = bdrv_co_get_self_request(bs);
+        assert(req);
+        assert(req->type == BDRV_TRACKED_WRITE);
+        assert(req->offset <= offset);
+        assert(req->offset + req->bytes >= offset + bytes);


Why these assertions?


Mostly to see that bdrv_co_get_self_request() (introduced by the same
series) actually got the right request.  (I suppose.)

TrackedRequest offset and bytes fields correspond
to the original request. When request is being expanded to satisfy
request_alignment, these fields are not updated.


Well, shrunk in this case, but OK.

So, maybe, we should assert overlap_offset and overlap_bytes?


Maybe, but would that have any benefit?  Especially after this patch
having been in qemu for over half a year?

(Also, intuitively off the top of my head I don’t see how it would make
more sense to check overlap_offset and overlap_bytes, if all the
assertions are for is to see that we got the right request.
overlap_offset and overlap_bytes may still not exactly match @offset or
@bytes, respectively.)

Your suggestion makes it sound a bit like you have a different purpose
in mind what these assertions might be useful for...?


No I just think it may have false-positives, when actual request is larger
than original.


Seems like a bug.  Why would we zero more than originally requested?


Hmm, you are right, seems it's not possible. We may expand the request to do 
COW,
but it will never produce write-zero request larger than original one.

(I really tried to reproduce to understand it :)

So offset may be < req->offset and req->offset +
req->bytes may be
less than offset + bytes. And we will crash. I should make a reproducer to
prove it, but it seems possible.


I’m definitely curious.

+
+        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
+        req->bytes = end - req->offset;


And I doubt that we should update req->bytes. We never updated it in
other places, it corresponds to original request. It's enough to update
overlap_bytes to achieve corresponding serialising.


Does it hurt?  If so, would you send a patch?

I assume you reply to this patch instead of writing a patch because you
have the same feeling of “It probably doesn’t really matter, so let’s
have a discussion first”.


1. yes, and
2. I probably don't see the full picture around tracked requests


Neither do I, that’s for sure.

My stance is: I don’t think it matters and this whole piece of code is a
hack that shouldn’t exist, obviously.  So I don’t really care how it
fits into all of our other code.

I would like to say I wouldn’t mind a patch to drop the req->bytes
assignment, but OTOH it would mean I’d have to review it and verify that
it’s indeed sufficient to set overlap_bytes.

If it’s in any way inconvenient for you that req->bytes is adjusted,
then of course please send one.

+        req->overlap_bytes = req->bytes;
+
+        bdrv_mark_request_serialising(req, bs->bl.request_alignment);


Not sure, how much should we care about request_alignment here, I think,
it's enough to just set req->overlap_bytes = INT64_MAX -
req->overlap_offest, but it doesn't really matter.


As long as req->bytes is adjusted, we have to care, or the overlap_bytes
calculation in bdrv_mark_request_serialising will overflow.

Well, one could argue that it doesn’t matter because the MAX() will
still do the right thing, but overflowing is never nice.


Hmm I think, if reduce it to just INT64_MAX, we should pass 1 as align
to bdrv_mark_request_serialising.


True.

(Of course, it probably doesn’t matter at all if we just wouldn’t touch
req->bytes.)


OK, thanks for the answer, I'll prepare a patch.


OK?  I’m not sure where the benefit is (apart from the perhaps failing
assertions).  So it still looks to me like putting too much energy into
a hack.


Yes, now when I see that assertions should not fail I don't want to care
more neither send a patch :)

(I think the original reason I set both req->bytes and
req->overlap_bytes was actually because I just wanted to be sure, and
didn’t want to have to look too hard whether either would be sufficient.)

((Please also note that I can’t guarantee I will review your patch in a
timely manner, for one thing because I can already rarely give that
promise (as you are probably painfully aware...); and now there’s also
two weeks of mail on top for me to wade through after PTO.  So if
there’s no reason to change anything apart from saving two LoC, well.
Failing assertions are a different matter altogether, though, of course.))

Max



--
Best regards,
Vladimir

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize, Vladimir Sementsov-Ogievskiy, 2020/06/02
- Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize, Max Reitz, 2020/06/02
  - Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize, Vladimir Sementsov-Ogievskiy, 2020/06/02
    - Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize, Max Reitz, 2020/06/02
    - Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize, Vladimir Sementsov-Ogievskiy, 2020/06/02
    - Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize, Vladimir Sementsov-Ogievskiy <=

Prev by Date: Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
Next by Date: Re: [PATCH v3 4/6] iotests: Dump bitmap directory info with qcow2.py
Previous by thread: Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
Next by thread: Re: [PATCH v6 00/20] nvme: small fixes, refactoring and cleanups
Index(es):
- Date
- Thread