
Strange data corruption issue with gluster (libgfapi) and ZFS


From: Stefan Ring
Subject: Strange data corruption issue with gluster (libgfapi) and ZFS
Date: Thu, 20 Feb 2020 09:56:58 +0100

Hi,

I have a very curious problem on an oVirt-like virtualization host
whose storage lives on gluster (as qcow2).

The problem is that some of the writes done by ZFS, whose sizes according
to blktrace are a mixture of 8, 16, 24, ..., 256 (512-byte) blocks, end up
with at least their first 4 KB, sometimes more, zeroed out when read back
later from storage. In my current test scenario, I write approx. 3 GB to
the guest machine, which takes roughly a minute. Actually it's around 35 GB,
which lz4 compresses down to 3 GB. Within that, I end up with close to 100
data errors when I read it back from storage afterwards (zpool scrub).

There are quite a few machines running on this host, and we have not
experienced other problems so far, so right now only ZFS is able to
trigger this for some reason. The guest has 8 virtual cores. I also
tried writing directly to the affected device from user space in
patterns mimicking what I see in blktrace (a sketch of such a test
follows below), but so far I have been unable to trigger the same issue
that way. Of the many ZFS knobs, I know at least one that makes a huge
difference: when I set zfs_vdev_async_write_max_active to 1 (as opposed
to 2 or 10), the error count goes through the roof (11,000). Curiously,
when I switch off ZFS compression, the amount of data written increases
almost 10-fold, while the absolute error count drops to close to, but
not entirely, zero, which I guess supports my suspicion that this must
somehow be related to timing.
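
Roughly, the kind of user-space test I mean looks like the sketch below
(simplified; the device path and parameters are placeholders, and a
sequential, single-threaded loop like this has not triggered the issue
for me so far):

/*
 * WARNING: overwrites the beginning of the target device/file.
 * Mimics the write sizes seen in blktrace: O_DIRECT writes of
 * 8, 16, ..., 256 sectors (512 bytes each) filled with a non-zero
 * pattern, then a read-back that flags any write whose leading
 * 4 KiB came back as all zeroes.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR 512
#define ROUNDS 4096                 /* number of writes to issue */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <block device or file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 256 * SECTOR)) return 1;

    static size_t lens[ROUNDS];
    srand(42);                      /* fixed seed, reproducible sizes */

    /* write phase: sizes of 8, 16, ..., 256 sectors, non-zero pattern */
    off_t off = 0;
    for (int i = 0; i < ROUNDS; i++) {
        lens[i] = (size_t)(8 + 8 * (rand() % 32)) * SECTOR;
        memset(buf, 0xA5, lens[i]);
        if (pwrite(fd, buf, lens[i], off) != (ssize_t)lens[i]) {
            perror("pwrite");
            return 1;
        }
        off += lens[i];
    }
    fsync(fd);

    /* read-back phase: look for writes whose first 4 KiB turned to zero */
    off = 0;
    int errors = 0;
    for (int i = 0; i < ROUNDS; i++) {
        if (pread(fd, buf, lens[i], off) != (ssize_t)lens[i]) {
            perror("pread");
            return 1;
        }
        const unsigned char *p = buf;
        int zeroed = 1;
        for (size_t j = 0; j < 4096; j++) {
            if (p[j] != 0) { zeroed = 0; break; }
        }
        if (zeroed) {
            printf("write %d at offset %lld: first 4 KiB zeroed\n",
                   i, (long long)off);
            errors++;
        }
        off += lens[i];
    }
    printf("%d of %d writes corrupted\n", errors, ROUNDS);
    free(buf);
    close(fd);
    return errors != 0;
}

The real trigger presumably also depends on timing and on having several
requests in flight (cf. the zfs_vdev_async_write_max_active effect), which
a strictly sequential loop like this does not exercise.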

Switching the guest storage driver between scsi and virtio does not
make a difference.

Switching the storage backend to a file on a glusterfs FUSE mount does
make a difference: the problem disappears.

Any hints? I'm still trying to investigate a few things, but what bugs
me most is that only ZFS seems to trigger this behavior, although I am
almost sure that ZFS is not really at fault here.

Software versions used:

Host
kernel 3.10.0-957.12.1.el7.x86_64
qemu-kvm-ev-2.12.0-18.el7_6.3.1.x86_64
glusterfs-api-5.6-1.el7.x86_64

Guest
kernel 3.10.0-1062.12.1.el7.x86_64
kmod-zfs-0.8.3-1.el7.x86_64 (from the official ZoL binaries)


