From: Fabian Ebner
Subject: Re: [PULL 18/20] block/nbd: drop connection_co
Date: Thu, 3 Feb 2022 09:49:16 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0

On 02.02.22 at 15:21, Hanna Reitz wrote:
> On 02.02.22 14:53, Eric Blake wrote:
>> On Wed, Feb 02, 2022 at 12:49:36PM +0100, Fabian Ebner wrote:
>>> On 27.09.21 at 23:55, Eric Blake wrote:
>>>> From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>
>>>> OK, that's a big rewrite of the logic.
>>>>
>>>> Pre-patch, we have an always-running coroutine, connection_co, which
>>>> handles both reply receiving and reconnecting. This leads to a lot of
>>>> difficult and unobvious code around drained sections and context
>>>> switching. We also abuse the bs->in_flight counter, which is increased
>>>> for connection_co and temporarily decreased at points where we want to
>>>> allow a drained section to begin. One of these places is even in
>>>> another file: nbd_read_eof() in nbd/client.c.
>>>>
>>>> We also cancel reconnects and requests waiting for a reconnect when a
>>>> drained section begins, which is not correct. This patch fixes that.
>>>>
>>>> Let's finally drop this always-running coroutine and take another
>>>> approach: do both reconnecting and receiving in the request coroutines.
>>>>
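For illustration, the pre-patch shape described in the commit message is
roughly the following (a simplified, hypothetical sketch using QEMU
block-layer names; receive_one_reply() and do_reconnect() are invented
stand-ins, not the actual block/nbd.c code):

    /* hypothetical helpers, bodies omitted */
    static int coroutine_fn receive_one_reply(BDRVNBDState *s);
    static void coroutine_fn do_reconnect(BDRVNBDState *s);

    /* Pre-patch (sketch): one always-running coroutine owns reply
     * reception and reconnection, and is accounted for in
     * bs->in_flight. */
    static void coroutine_fn nbd_connection_entry(void *opaque)
    {
        BDRVNBDState *s = opaque;

        bdrv_inc_in_flight(s->bs);      /* drain now waits for us */
        for (;;) {
            /* The blocking read must not stall drain forever, so
             * in_flight is temporarily dropped around it -- this is
             * the special case that leaks into nbd_read_eof() in
             * nbd/client.c. */
            bdrv_dec_in_flight(s->bs);
            int ret = receive_one_reply(s);
            bdrv_inc_in_flight(s->bs);
            if (ret < 0) {
                do_reconnect(s);
            }
        }
    }

Post-patch, both jobs move into the request coroutines themselves, so
neither the extra coroutine nor the in_flight juggling is needed.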
>>> Hi,
>>>
>>> while updating our stack to 6.2, one of our live-migration tests
>>> stopped working (backtrace below), and bisecting led me to this patch.
>>>
>>> The VM has a single qcow2 disk (converting it to raw doesn't make a
>>> difference), and the issue only appears when using an iothread (with
>>> both virtio-scsi-pci and virtio-blk-pci).
>>>
>>> Reverting 1af7737871fb3b66036f5e520acb0a98fc2605f7 (which sits on top)
>>> and 4ddb5d2fde6f22b2cf65f314107e890a7ca14fcf (the commit corresponding
>>> to this patch) in v6.2.0 makes the migration work again.
>>>
>>> Backtrace:
>>>
>>> Thread 1 (Thread 0x7f9d93458fc0 (LWP 56711) "kvm"):
>>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>> #1  0x00007f9d9d6bc537 in __GI_abort () at abort.c:79
>>> #2  0x00007f9d9d6bc40f in __assert_fail_base (fmt=0x7f9d9d825128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5579153763f8 "qemu_get_current_aio_context() == qemu_coroutine_get_aio_context(co)", file=0x5579153764f9 "../io/channel.c", line=483, function=<optimized out>) at assert.c:92
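
For context, the failing assertion lives in the fd handler that wakes a
coroutine waiting on a channel. Schematically, it looks something like
this (reconstructed from the assertion text above and the function name
in the reproducer output further down; not verbatim QEMU source):

    static void qio_channel_restart_read(void *opaque)
    {
        QIOChannel *ioc = opaque;
        Coroutine *co = ioc->read_coroutine;

        /* The handler must run in the AioContext that the waiting
         * coroutine belongs to; with an iothread, the NBD request
         * coroutine can end up being woken from a different context,
         * and then this assertion fires. */
        assert(qemu_get_current_aio_context() ==
               qemu_coroutine_get_aio_context(co));
        aio_co_wake(co);
    }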
>> Given that this assertion is about which aio context is set, I wonder
>> if the conversation at
>> https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg00096.html is
>> relevant; if so, Vladimir may already be working on the patch.
> 
> It should be exactly that patch:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06222.html
> 
> (From the discussion it appears that for v1 I need to ensure the
> reconnection timer is deleted immediately once reconnecting succeeds,
> and then that should be good to move out of the RFC state.)
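
A minimal sketch of what that timer change could look like in
block/nbd.c (hypothetical, not the posted patch;
nbd_reconnect_succeeded() is an invented name, while
reconnect_delay_timer_del() is assumed to be the existing helper there):

    /* Hypothetical: once a reconnect attempt succeeds, delete the
     * reconnect delay timer right away instead of letting it fire
     * later, possibly from the wrong AioContext. */
    static void nbd_reconnect_succeeded(BDRVNBDState *s)
    {
        reconnect_delay_timer_del(s);
    }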

Thanks for the quick responses and happy to hear you're already working
on it! With the RFC, the issue is gone for me.

> 
> Basically, I expect qemu to crash every time you try to use an NBD
> block device in an I/O thread (unless you don’t do any I/O); for
> example, this is the simplest reproducer I know of:
> 
> $ qemu-nbd --fork -k /tmp/nbd.sock -f raw null-co://
> 
> $ qemu-system-x86_64 \
>     -object iothread,id=iothr0 \
>     -device virtio-scsi,id=vscsi,iothread=iothr0 \
>     -blockdev '{
>         "driver": "nbd",
>         "node-name": "nbd",
>         "server": {
>             "type": "unix",
>             "path": "/tmp/nbd.sock"
>         } }' \
>     -device scsi-hd,bus=vscsi.0,drive=nbd
> qemu-system-x86_64: ../qemu-6.2.0/io/channel.c:483:
> qio_channel_restart_read: Assertion `qemu_get_current_aio_context() ==
> qemu_coroutine_get_aio_context(co)' failed.
> qemu-nbd: Disconnect client, due to: Unable to read from socket:
> Connection reset by peer
> [1]    108747 abort (core dumped)  qemu-system-x86_64 -object
> iothread,id=iothr0 -device  -blockdev  -device
> 
> 

Interestingly, the reproducer didn't crash the very first time I tried
it. I did get the same error after ^C-ing though, and on subsequent
tries it mostly crashed immediately, but very occasionally it didn't.



