qemu-block

Re: [PULL 18/20] block/nbd: drop connection_co


From: Hanna Reitz
Subject: Re: [PULL 18/20] block/nbd: drop connection_co
Date: Wed, 2 Feb 2022 15:21:40 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0

On 02.02.22 14:53, Eric Blake wrote:
> On Wed, Feb 02, 2022 at 12:49:36PM +0100, Fabian Ebner wrote:
>> On 27.09.21 at 23:55, Eric Blake wrote:
>>> From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> OK, that's a big rewrite of the logic.
>>>
>>> Pre-patch we have an always-running coroutine - connection_co. It
>>> does reply receiving and reconnecting, and it leads to a lot of
>>> difficult and unobvious code around drained sections and context
>>> switching. We also abuse the bs->in_flight counter, which is
>>> increased for connection_co and temporarily decreased at points
>>> where we want to allow a drained section to begin. One of these
>>> places is even in another file: in nbd_read_eof() in nbd/client.c.
>>>
>>> We also cancel reconnects and requests waiting for a reconnect on
>>> drained begin, which is not correct. This patch fixes that as well.
>>>
>>> Let's finally drop this always-running coroutine and go another
>>> way: do both reconnecting and receiving in the request coroutines.
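
To make the design shift concrete, here is a rough sketch of the two
models. This is illustrative code only, not the actual patch: the
helpers marked "hypothetical" are made up for the example, while
connection_co, s->receive_mutex and bs->in_flight follow the commit
message.

/* Sketch only: pre-patch, one long-lived coroutine owns the receive
 * side of the connection for its whole lifetime. */
static coroutine_fn void connection_co_sketch(BDRVNBDState *s)
{
    for (;;) {
        /* bs->in_flight stays elevated for this coroutine and is only
         * dropped temporarily (e.g. in nbd_read_eof()) so that a
         * drained section can begin. */
        if (nbd_receive_one_reply(s) < 0) {       /* hypothetical */
            nbd_co_reconnect(s);                  /* hypothetical */
        }
        wake_request_coroutine(s);                /* hypothetical */
    }
}

/* Sketch only: post-patch, each request coroutine receives replies
 * itself; whichever request reaches the socket first reads on behalf
 * of all of them, serialized by a coroutine mutex. */
static coroutine_fn int request_co_sketch(BDRVNBDState *s,
                                          uint64_t handle)
{
    qemu_co_mutex_lock(&s->receive_mutex);
    while (!reply_ready_for(s, handle)) {         /* hypothetical */
        nbd_receive_one_reply(s);   /* may also drive the reconnect */
    }
    qemu_co_mutex_unlock(&s->receive_mutex);
    return 0;
}

The upshot is that nothing has to run outside of request context any
more, so the in_flight juggling around drained sections disappears.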

>> Hi,
>>
>> While updating our stack to 6.2, one of our live-migration tests
>> stopped working (backtrace below), and bisecting led me to this
>> patch.
>>
>> The VM has a single qcow2 disk (converting to raw doesn't make a
>> difference), and the issue only appears when using an iothread (for
>> both virtio-scsi-pci and virtio-blk-pci).
>>
>> Reverting 1af7737871fb3b66036f5e520acb0a98fc2605f7 (which lives on
>> top) and 4ddb5d2fde6f22b2cf65f314107e890a7ca14fcf (the commit
>> corresponding to this patch) in v6.2.0 makes the migration work
>> again.

>> Backtrace:
>>
>> Thread 1 (Thread 0x7f9d93458fc0 (LWP 56711) "kvm"):
>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #1  0x00007f9d9d6bc537 in __GI_abort () at abort.c:79
>> #2  0x00007f9d9d6bc40f in __assert_fail_base (fmt=0x7f9d9d825128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5579153763f8 "qemu_get_current_aio_context() == qemu_coroutine_get_aio_context(co)", file=0x5579153764f9 "../io/channel.c", line=483, function=<optimized out>) at assert.c:92
>
> Given that this assertion is about which aio context is set, I wonder
> if the conversation at
> https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg00096.html
> is relevant; if so, Vladimir may already be working on the patch.

It should be exactly that patch:

https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06222.html

(From the discussion it appears that for v1 I need to ensure the reconnect delay timer is deleted immediately once reconnecting succeeds; with that fixed, the patch should be ready to move out of RFC state.)
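
To picture what that means in code, here is a guess at the shape of
the fix, not the actual v2; both helpers are placeholders made up for
the sketch.

static coroutine_fn void nbd_reconnect_attempt_sketch(BDRVNBDState *s)
{
    /* Placeholder for the actual reconnect attempt. */
    if (nbd_do_reconnect_sketch(s) == 0) {        /* hypothetical */
        /* Connected again: delete the reconnect delay timer right
         * away so it cannot fire after reconnecting has already
         * succeeded. */
        reconnect_delay_timer_del_sketch(s);      /* hypothetical */
    }
}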

Basically, I expect qemu to crash every time you try to use an NBD block device in an I/O thread (unless you don’t do any I/O). For example, this is the simplest reproducer I know of:

$ qemu-nbd --fork -k /tmp/nbd.sock -f raw null-co://

$ qemu-system-x86_64 \
    -object iothread,id=iothr0 \
    -device virtio-scsi,id=vscsi,iothread=iothr0 \
    -blockdev '{
        "driver": "nbd",
        "node-name": "nbd",
        "server": {
            "type": "unix",
            "path": "/tmp/nbd.sock"
        } }' \
    -device scsi-hd,bus=vscsi.0,drive=nbd
qemu-system-x86_64: ../qemu-6.2.0/io/channel.c:483: qio_channel_restart_read: Assertion `qemu_get_current_aio_context() == qemu_coroutine_get_aio_context(co)' failed.
qemu-nbd: Disconnect client, due to: Unable to read from socket: Connection reset by peer
[1]    108747 abort (core dumped)  qemu-system-x86_64 -object iothread,id=iothr0 -device  -blockdev  -device
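
For context, the assertion itself is easy to picture: the fd handler
that restarts a coroutine parked in qio_channel_yield() insists on
running in the AioContext that the coroutine is attached to. The
sketch below paraphrases that handler; it is not verbatim io/channel.c.

static void restart_read_sketch(void *opaque)    /* paraphrase */
{
    QIOChannel *ioc = opaque;
    Coroutine *co = ioc->read_coroutine;

    /* The assertion from the backtrace: with the NBD node in an
     * iothread, this handler can fire in the main context while the
     * request coroutine now lives in the iothread's context, so the
     * check trips. */
    assert(qemu_get_current_aio_context() ==
           qemu_coroutine_get_aio_context(co));

    aio_co_wake(co);   /* only safe when the contexts match */
}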



