qemu-devel

Re: [RFC 0/3] block/file-posix: Work around XFS bug


From: Max Reitz
Subject: Re: [RFC 0/3] block/file-posix: Work around XFS bug
Date: Mon, 28 Oct 2019 09:56:28 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.1.1

On 26.10.19 19:52, Vladimir Sementsov-Ogievskiy wrote:
> 26.10.2019 20:37, Nir Soffer wrote:
>> On Fri, Oct 25, 2019 at 1:11 PM Max Reitz <address@hidden> wrote:
>>>
>>> Hi,
>>>
>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>> I’ve explained here:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>
>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>> data corruption when using qcow2 images on XFS with aio=native.
>>>
>>> We can’t wait until the XFS kernel driver is fixed, we should work
>>> around the problem ourselves.
>>>
>>> This is an RFC for three reasons:
>>> (1) I don’t know whether this is the right way to address the issue,
>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>>      if so stop applying the workaround.
>>>      I don’t know how we would go about this, so this series doesn’t do
>>>      it.  (Hence it’s an RFC.)
>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>>      driver access and modify a BdrvTrackedRequest object.
>>>
>>> As for how we can address the issue, I see three ways:
>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>      them serializing and wait for other conflicting requests.
>>>
>>>      Advantages:
>>>      + Limits the impact to very specific cases
>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>        even when the XFS driver has been fixed)
>>>      + Works around the bug where it happens, namely in file-posix
>>>
>>>      Disadvantages:
>>>      - A bit complex
>>>      - A bit of a layering violation (should file-posix have access to
>>>        tracked requests?)
>>>
>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>>      becomes visible due to that function: I don’t think qcow2 writes
>>>      zeroes in any other I/O path, and raw images are fixed in size so
>>>      post-EOF writes won’t happen.
>>>
>>>      Advantages:
>>>      + Maybe simpler, depending on how difficult it is to handle the
>>>        layering violation
>>>      + Also fixes the performance problem of handle_alloc_space() being
>>>        slow on ppc64+XFS.
>>>
>>>      Disadvantages:
>>>      - Huge layering violation because qcow2 would need to know whether
>>>        the image is stored on XFS or not.
>>>      - We’d definitely want to skip this workaround when the XFS driver
>>>        has been fixed, so we need some method to find out whether it has
>>>
>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>      To my knowledge I’m the only one who has provided any benchmarks for
>>>      this commit, and even then I was a bit skeptical because it performs
>>>      well in some cases and bad in others.  I concluded that it’s
>>>      probably worth it because the “some cases” are more likely to occur.
>>>
>>>      Now we have this problem of corruption here (granted due to a bug in
>>>      the XFS driver), and another report of massively degraded
>>>      performance on ppc64
>>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>      private BZ; I hate that :-/  The report is about 40 % worse
>>>      performance for an in-guest fio write benchmark.)
>>>
>>>      So I have to ask the question about what the justification for
>>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>
>>>      Advantages:
>>>      + Trivial
>>>      + No layering violations
>>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>>        fixed or not
>>>      + Fixes the ppc64+XFS performance problem
>>>
>>>      Disadvantages:
>>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>        levels, whatever that means
>>
>> Correctness is more important than performance, so this is my
>> preference as a user.
>>
> 
> Hmm, still, it's XFS that is incorrect, not QEMU.  This bug may be triggered
> by other software, or perhaps by another scenario in QEMU (not sure).

I don’t see any other image format (besides qcow2, raw, and filters)
creating zero-writes, apart from the parallels format, which issues them
before writes can happen to that area.  So with parallels, it’s
impossible to get a data write behind an ongoing zero-write.

Raw and filters do not initiate zero-writes on their own.  So there must
be something else that requests them.  Block jobs and guests are limited
to generally fixed-size disks, so I don’t see a way to generate
zero-writes beyond the EOF there.

Which leaves us with qcow2.  To my knowledge the only place where qcow2
generates zero-writes in an I/O path (so they can occur concurrently to
data writes) is in handle_alloc_space().  Therefore, dropping it should
remove the only place where the XFS bug can be triggered.


That the bug is in XFS and not in qemu is a good argument from a moral
standpoint, but it isn’t very pragmatic or reasonable from a technical
standpoint.  There is no fix for XFS yet, so right now the situation is
this: there is a bug that causes guest data loss, qemu can prevent it,
and there is no other workaround or fix.

So we must work around it.

Of course, that doesn’t mean that we have to do everything in our power
to implement a workaround no matter the cost.  We do indeed have to
balance how far we’re willing to go to fix it against how much we
impact non-XFS workloads.


Honestly, I caught myself saying “Well, that’s too bad for XFS over
gluster if XFS is broken”.  But that just isn’t a reasonable thing to say.

I suppose what we could do is drop the is_xfs check in patch 3.  The
only consequence is that you can no longer issue data writes while a
zero-write past the EOF is in flight.  And as I’ve claimed above, this
only affects handle_alloc_space() from qcow2.  Would it be so bad if we
simply did that on every filesystem?

Max

