qemu-devel

Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt
Date: Fri, 24 Jul 2020 13:21:52 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

20.07.2020 12:00, Vladimir Sementsov-Ogievskiy wrote:
It should be safe to reenter qio_channel_yield() on the io/channel
read/write path, so it's safe to reduce in_flight and allow attaching a
new aio context. And there is no problem with allowing drain itself: a
connection attempt is not a guest request. Moreover, if the remote
server is down, we can hang in negotiation, blocking the drain section
and provoking a deadlock.

How to reproduce the deadlock:

1. Create nbd-fault-injector.conf with the following contents:

[inject-error "mega1"]
event=data
io=readwrite
when=before

2. In one terminal run nbd-fault-injector in a loop, like this:

n=1; while true; do
     echo $n; ((n++));
     ./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
done

3. In another terminal run qemu-io in a loop, like this:

n=1; while true; do
     echo $n; ((n++));
     ./qemu-io -c 'read 0 512' nbd+tcp://127.0.0.1:10000;
done

After some time, qemu-io will hang trying to drain, with a backtrace
like this:

  #3 aio_poll (ctx=0x55f006bdd890, blocking=true) at
     util/aio-posix.c:600
  #4 bdrv_do_drained_begin (bs=0x55f006bea710, recursive=false,
     parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:427
  #5 bdrv_drained_begin (bs=0x55f006bea710) at block/io.c:433
  #6 blk_drain (blk=0x55f006befc80) at block/block-backend.c:1710
  #7 blk_unref (blk=0x55f006befc80) at block/block-backend.c:498
  #8 bdrv_open_inherit (filename=0x7fffba1563bc
     "nbd+tcp://127.0.0.1:10000", reference=0x0, options=0x55f006be86d0,
     flags=24578, parent=0x0, child_class=0x0, child_role=0,
     errp=0x7fffba154620) at block.c:3491
  #9 bdrv_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
     reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
     block.c:3513
  #10 blk_new_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
     reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
     block/block-backend.c:421

And the connection_co stack looks like this:

  #0 qemu_coroutine_switch (from_=0x55f006bf2650, to_=0x7fe96e07d918,
     action=COROUTINE_YIELD) at util/coroutine-ucontext.c:302
  #1 qemu_coroutine_yield () at util/qemu-coroutine.c:193
  #2 qio_channel_yield (ioc=0x55f006bb3c20, condition=G_IO_IN) at
     io/channel.c:472
  #3 qio_channel_readv_all_eof (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
     niov=1, errp=0x7fe96d729eb0) at io/channel.c:110
  #4 qio_channel_readv_all (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
     niov=1, errp=0x7fe96d729eb0) at io/channel.c:143
  #5 qio_channel_read_all (ioc=0x55f006bb3c20, buf=0x7fe96d729d28
     "\300.\366\004\360U", buflen=8, errp=0x7fe96d729eb0) at
     io/channel.c:247
  #6 nbd_read (ioc=0x55f006bb3c20, buffer=0x7fe96d729d28, size=8,
     desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
     /work/src/qemu/master/include/block/nbd.h:365
  #7 nbd_read64 (ioc=0x55f006bb3c20, val=0x7fe96d729d28,
     desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
     /work/src/qemu/master/include/block/nbd.h:391
  #8 nbd_start_negotiate (aio_context=0x55f006bdd890,
     ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
     outioc=0x55f006bf19f8, structured_reply=true,
     zeroes=0x7fe96d729dca, errp=0x7fe96d729eb0) at nbd/client.c:904
  #9 nbd_receive_negotiate (aio_context=0x55f006bdd890,
     ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
     outioc=0x55f006bf19f8, info=0x55f006bf1a00, errp=0x7fe96d729eb0) at
     nbd/client.c:1032
  #10 nbd_client_connect (bs=0x55f006bea710, errp=0x7fe96d729eb0) at
     block/nbd.c:1460
  #11 nbd_reconnect_attempt (s=0x55f006bf19f0) at block/nbd.c:287
  #12 nbd_co_reconnect_loop (s=0x55f006bf19f0) at block/nbd.c:309
  #13 nbd_connection_entry (opaque=0x55f006bf19f0) at block/nbd.c:360
  #14 coroutine_trampoline (i0=113190480, i1=22000) at
     util/coroutine-ucontext.c:173

Note that the hang may also be triggered by another bug, so the whole
case is fixed only together with commit "block/nbd: on shutdown
terminate connection attempt".

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
  block/nbd.c | 11 +++++++++++
  1 file changed, 11 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 65a4f56924..49254f1c3c 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -280,7 +280,18 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
         s->ioc = NULL;
     }
+    bdrv_dec_in_flight(s->bs);
      s->connect_status = nbd_client_connect(s->bs, &local_err);
+    s->wait_drained_end = true;
+    while (s->drained) {
+        /*
+         * We may be entered once from nbd_client_attach_aio_context_bh
+         * and then from nbd_client_co_drain_end. So here is a loop.
+         */
+        qemu_coroutine_yield();
+    }
+    bdrv_inc_in_flight(s->bs);

My next version of non-blocking connect will need nbd_establish_connection() to 
be above bdrv_dec_in_flight(). So, I want to resend this anyway.

+
      error_free(s->connect_err);
      s->connect_err = NULL;
      error_propagate(&s->connect_err, local_err);



--
Best regards,
Vladimir


