[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt

From: Eric Blake
Subject: Re: [PATCH 1/3] block/nbd: allow drain during reconnect attempt
Date: Thu, 23 Jul 2020 13:47:58 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

On 7/20/20 4:00 AM, Vladimir Sementsov-Ogievskiy wrote:
It should be to reenter qio_channel_yield() on io/channel read/write
path, so it's safe to reduce in_flight and allow attaching new aio
context. And no problem to allow drain itself: connection attempt is
not a guest request. Moreover, if remote server is down, we can hang
in negotiation, blocking drain section and provoking a dead lock.

How to reproduce the dead lock:

I tried to reproduce this; but in the several minutes it has taken me to write this email, it still has not hung. Still, your stack trace is fairly good evidence of the problem, where adding a temporary sleep or running it under gdb with a breakpoint can probably make reproduction easier.

1. Create nbd-fault-injector.conf with the following contents:

[inject-error "mega1"]

2. In one terminal run nbd-fault-injector in a loop, like this:

n=1; while true; do
     echo $n; ((n++));

Bashism, but not a problem for the commit message.

     ./nbd-fault-injector.py nbd-fault-injector.conf;

3. In another terminal run qemu-io in a loop, like this:

n=1; while true; do
     echo $n; ((n++));
     ./qemu-io -c 'read 0 512' nbd+tcp://;

I prefer the spelling nbd:// for TCP connections, but also inconsequential.

Note, that the hang may be
triggered by another bug, so the whole case is fixed only together with
commit "block/nbd: on shutdown terminate connection attempt".

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  block/nbd.c | 11 +++++++++++
  1 file changed, 11 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 65a4f56924..49254f1c3c 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -280,7 +280,18 @@ static coroutine_fn void 
nbd_reconnect_attempt(BDRVNBDState *s)
          s->ioc = NULL;
+ bdrv_dec_in_flight(s->bs);
      s->connect_status = nbd_client_connect(s->bs, &local_err);
+    s->wait_drained_end = true;
+    while (s->drained) {
+        /*
+         * We may be entered once from nbd_client_attach_aio_context_bh
+         * and then from nbd_client_co_drain_end. So here is a loop.
+         */
+        qemu_coroutine_yield();
+    }
+    bdrv_inc_in_flight(s->bs);

This is very similar to the code in nbd_co_reconnect_loop. Does that function still need to wait on drained, since it calls nbd_reconnect_attempt which is now doing the same loop? But off-hand, I'm not seeing a problem with keeping both places.

Reviewed-by: Eric Blake <eblake@redhat.com>

As a bug fix, I'll be including this in my NBD pull request for the next -rc build.

Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

reply via email to

[Prev in Thread] Current Thread [Next in Thread]