qemu-devel

Re: [PATCH v2 5/5] vhost: add device started check in migration set log


From: Dima Stepanov
Subject: Re: [PATCH v2 5/5] vhost: add device started check in migration set log
Date: Mon, 11 May 2020 12:43:39 +0300
User-agent: Mutt/1.5.24 (2015-08-30)

On Sun, May 10, 2020 at 08:03:39PM -0400, Raphael Norwitz wrote:
> On Thu, May 7, 2020 at 11:35 AM Dima Stepanov <address@hidden> wrote:
> >
> > What do you think?
> >
> 
> Apologies - I tripped over the if (dev->started && r < 0) check.
> Never mind my point about race conditions and failing migrations.
> 
> Rather than modifying vhost_dev_set_log(), it may be clearer to put a
> check after vhost_dev_log_resize()? Something like:
> 
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -829,11 +829,22 @@ static int vhost_migration_log(MemoryListener *listener, int enable)
>          vhost_log_put(dev, false);
>      } else {
>          vhost_dev_log_resize(dev, vhost_get_log_size(dev));
> +        /*
> +         * A device can be stopped because of backend disconnect inside
> +         * vhost_dev_log_resize(). In this case we should mark logging
> +         * enabled and return without attempting to set the backend
> +         * logging state.
> +         */
> +        if (!dev->started) {
> +            goto out_success;
> +        }
>          r = vhost_dev_set_log(dev, true);
>          if (r < 0) {
>              return r;
>          }
>      }
> +
> +out_success:
>      dev->log_enabled = enable;
>      return 0;
>  }
This patch will not fix all the issues. Consider the case where you hit
the disconnect inside vhost_dev_set_log(), for instance for the 3rd
virtqueue, in the following call:
  vhost_virtqueue_set_addr(...)
Maybe I didn't explain the problem very clearly. The problem I've tried
to fix is only for vhost-user-blk devices. The issue can be hit during
the VHOST_USER command "handshake". If we hit a disconnect at any step
of this "handshake", we will try to clean up twice:
1. First during the disconnect cleanup (unix socket backend).
2. Second as a rollback of the initialization.
If this is the case, then we shouldn't run step 2, as everything was
already cleaned up in step 1. The complicated thing is that there are
several VHOST_USER commands and we have to consider the state after each
of them. Moreover, initialization could fail for some other reason and
we could hit the disconnect inside the rollback cleanup; in that case we
should complete the cleanup in the disconnect function and stop rolling
back.
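
To illustrate the spot I mean, the loop in vhost_dev_set_log() looks
roughly like this (a simplified sketch from memory, error rollback
omitted and the virtqueue index handling approximated):

    static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
    {
        int r, i;

        for (i = 0; i < dev->nvqs; ++i) {
            /* A backend disconnect inside this call (e.g. while handling
             * the 3rd virtqueue) stops and cleans up the device, but the
             * loop and the caller continue as if it were still started. */
            r = vhost_virtqueue_set_addr(dev, dev->vqs + i,
                                         dev->vq_index + i, enable_log);
            if (r < 0) {
                return r;
            }
        }
        return 0;
    }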

Hope it helps ).

> 
> This seems harmless enough to me, and I see how it fixes your
> particular crash, but I would still prefer we worked towards a more
> robust solution. In particular I think we could handle this inside
> vhost-user-blk if we let the device state persist between connections
> (i.e. call vhost_dev_cleanup() inside vhost_user_blk_connect() before
> vhost_dev_init() on reconnect). This should also fix some of the
> crashes Li Feng has hit, and probably others which haven’t been
> reported yet. What do you think?
Yes, this looks like a good direction, because all my patches are only
workarounds and I believe there can be other issues which haven't been
reported yet or will be introduced ).
I still think these patches are worth submitting now, and we can work on
the more complicated refactoring solution as the next step.
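
Just to make sure I understand the direction: something roughly like the
following in vhost_user_blk_connect() on the reconnect path (only a
sketch of the idea, not tested; the exact fields and surrounding code
may differ):

    /* Sketch: reconnect path in hw/block/vhost-user-blk.c */
    if (s->connected) {
        return 0;
    }
    /*
     * If the previous connection left vhost_dev state behind (because
     * the disconnect happened mid-handshake and the rollback was
     * skipped), drop it here so vhost_dev_init() always starts from a
     * clean vhost_dev instead of relying on the disconnect path having
     * done a full cleanup.
     */
    vhost_dev_cleanup(&s->dev);
    ret = vhost_dev_init(&s->dev, &s->vhost_user, VHOST_BACKEND_TYPE_USER, 0);
    if (ret < 0) {
        return ret;
    }
    s->connected = true;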

> 
> If that’s unworkable I guess we will need to add these vhost level
> checks.
At least for now, I don't think it's unworkable; I just think it will
take some time to figure out how to refactor it properly. But the
SIGSEGV issue is real.

> In that case I would still prefer we add a “disconnected” flag
> in struct vhost_dev struct, and make sure it isn’t cleared by
> vhost_dev_cleanup(). That way we don’t conflate stopping a device with
> backend disconnect at the vhost level and potentially regress behavior
> for other device types.
That is also possible, but it should be analyzed and properly tested.
So, as I said, it will take some time to figure out how to refactor it
properly.
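
For the record, the flag you describe would look something like this
(a rough sketch only; the field name and placement are my guess and
nothing here is tested):

    /* include/hw/virtio/vhost.h, sketch */
    struct vhost_dev {
        /* ... existing fields ... */
        bool started;
        /*
         * Set by the backend disconnect path and checked in the
         * migration-log / set_log paths instead of "started".
         * Intentionally NOT cleared by vhost_dev_cleanup(), so that a
         * stopped device and a disconnected backend are not conflated
         * and other device types keep their current behaviour.
         */
        bool disconnected;
    };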


