Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_D
From: Jan Kara
Subject: Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
Date: Thu, 12 Nov 2020 12:19:51 +0100
User-agent: Mutt/1.10.1 (2018-07-13)
[added some relevant people and lists to CC]
On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > clone of "starship_production"
>
> The git-publish destroyed the cover letter:
>
> For the reference this is for bz #1872633
>
> The issue is that the current kernel code that implements 'fallocate'
> on kernel block devices roughly works like this:
>
> 1. Flush the page cache on the range that is about to be discarded.
> 2. Issue the discard and wait for it to finish.
> (as far as I can see the discard doesn't go through the
> page cache).
>
> 3. Check if the page cache is dirty for this range,
> if it is dirty (meaning that someone wrote to it meanwhile)
> return -EBUSY.
>
> This means that if qemu (or qemu-img) issues a write, and then a
> discard to an area that shares a page with that write, -EBUSY can be
> returned by the kernel.
Indeed, if you don't submit PAGE_SIZE-aligned discards, you can get back
EBUSY, which seems wrong to me. IMO the kernel should handle this
gracefully, so we need to fix this.
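
Until the kernel handles this, a userspace caller could restrict each
discard to the whole pages inside the requested range. The following is
only a minimal sketch of that alignment arithmetic, not what qemu or the
kernel actually does. The struct and function names are hypothetical,
and the page size is assumed to be passed in by the caller (for example
from sysconf(_SC_PAGESIZE)).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: the byte range [start, start + len). */
struct byte_range {
    uint64_t start;
    uint64_t len;
};

/*
 * Shrink [start, start + len) inward to page boundaries.
 * Returns true and fills *out when at least one whole page remains,
 * false when the range does not cover any complete page.
 * Assumes page_size is non-zero and start + len does not overflow.
 */
static bool page_aligned_interior(uint64_t start, uint64_t len,
                                  uint64_t page_size,
                                  struct byte_range *out)
{
    assert(page_size != 0);
    assert(len <= UINT64_MAX - start);

    uint64_t end = start + len;

    /* Index of the first page that starts at or after 'start'. */
    uint64_t first = start / page_size;
    if (start % page_size != 0)
        first++;

    /* Index one past the last page that ends at or before 'end'. */
    uint64_t last = end / page_size;

    if (first >= last)
        return false;

    out->start = first * page_size;
    out->len = (last - first) * page_size;
    return true;
}
```

The unaligned head and tail that fall outside the returned range would
then have to be zeroed with ordinary writes instead of being discarded.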
> On the other hand, for example, the ext4 implementation of discard
> doesn't seem to be affected. It does take a lock on the inode to avoid
> concurrent IO and flushes O_DIRECT writers prior to doing the discard, though.
Well, filesystem hole punching is a somewhat different beast than block
device discard (at least implementation-wise).
> Doing fsync and retrying seems to resolve this issue, but it might be
> too big a hammer. Just retrying doesn't work, indicating that maybe the
> code that flushes the page cache in (1) doesn't do this correctly?
>
> It also can be racy unless special measures are taken to block IO from
> qemu during this fsync.
>
> This patch series contains two patches:
>
> The first patch just lets file-posix ignore the -EBUSY errors, which is
> technically enough to fall back to plain write in this case, but seems wrong.
>
> And the second patch adds an optimization to qemu-img to avoid such a
> fragmented write/discard in the first place.
>
> Both patches make the reproducer work for this particular bugzilla,
> but I don't think they are enough.
>
> What do you think?
So if the EBUSY error happens because something happened to the page cache
outside of the discarded range (like you describe above), that is a kernel
bug that needs to get fixed. EBUSY should really mean that someone wrote to
the discarded range while the discard was running, and the userspace app has
to deal with that depending on what it aims to do...
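
As one illustration of "dealing with that", here is a minimal sketch of
a policy for an application that wants the range to read back as
zeroes, so falling back to a plain write of zeroes on EBUSY is
acceptable to it. This is an assumption about one possible caller, not
a description of the posted patches. The enum and function names are
hypothetical, and err is assumed to be 0 or the positive errno value
left by a failed fallocate(2) call on the block device.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical set of follow-up actions after a discard attempt. */
enum discard_action {
    DISCARD_DONE,          /* discard succeeded, nothing more to do */
    DISCARD_WRITE_ZEROES,  /* fall back to writing zeroes explicitly */
    DISCARD_FAIL,          /* report the error to the caller */
};

/*
 * Map the result of a discard attempt to a follow-up action.
 * 'err' is 0 on success or a positive errno value on failure.
 */
static enum discard_action discard_action_for_errno(int err)
{
    assert(err >= 0);

    switch (err) {
    case 0:
        return DISCARD_DONE;
    case EBUSY:
        /* Someone wrote to the range while the discard was running. */
        return DISCARD_WRITE_ZEROES;
    default:
        return DISCARD_FAIL;
    }
}
```

An application with different goals (say, one that only discards to
free space and does not care about the contents) might just as well
treat EBUSY as success or retry later.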
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR