From: Marc-André Lureau
Subject: Re: [Qemu-devel] [PATCH v2 1/1] qemu-char: socket backend: disconnect on write error
Date: Fri, 24 Feb 2017 14:46:01 +0000

Hi

On Fri, Feb 24, 2017 at 5:42 PM Markus Armbruster <address@hidden> wrote:

> Marc-André Lureau <address@hidden> writes:
>
> > Hi
> >
> > On Thu, Feb 2, 2017 at 5:20 PM Denis V. Lunev <address@hidden> wrote:
> >
> >> From: Anton Nefedov <address@hidden>
> >>
> >> Socket backend read handler should normally perform a disconnect;
> >> however, the read handler may not get a chance to run if the frontend
> >> is not ready (qemu_chr_be_can_write() == 0).
> >>
> >> This means that in the virtio-serial frontend case, if
> >>  - the host has disconnected (giving EPIPE on socket write)
> >>  - and the guest has disconnected (-> frontend not ready -> backend
> >>    will not read)
> >>  - and there is still data (frontend->backend) to flush (it takes
> >>    really tricky timing, but nevertheless we have observed the case
> >>    in production)
> >>
> >> then virtio-serial keeps trying to flush this data continuously,
> >> forming a busy loop.
> >>
> >> Solution: react on a write error in the socket write handler.
> >> errno is not reliable after qio_channel_writev_full(), so we may not
> >> get the exact EPIPE; instead, disconnect on any error except
> >> QIO_CHANNEL_ERR_BLOCK, which io_channel_send_full() converts to
> >> errno EAGAIN.
> >> We must not disconnect right away though; there still may be data to
> >> read (see 4bf1cb0).
> >>
> >> Signed-off-by: Anton Nefedov <address@hidden>
> >> Signed-off-by: Denis V. Lunev <address@hidden>
> >> CC: Paolo Bonzini <address@hidden>
> >> CC: Daniel P. Berrange <address@hidden>
> >> CC: Marc-André Lureau <address@hidden>
> >> ---
> >> Changes from v1:
> >> - we do not rely on EPIPE anymore. The socket should be closed on all
> >>   errors except EAGAIN
> >>
> >>  chardev/char-socket.c | 10 ++++++++++
> >>  1 file changed, 10 insertions(+)
> >>
> >> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> >> index 4068dc5..e2137aa 100644
> >> --- a/chardev/char-socket.c
> >> +++ b/chardev/char-socket.c
> >> @@ -97,6 +97,9 @@ static gboolean tcp_chr_accept(QIOChannel *chan,
> >>                                 GIOCondition cond,
> >>                                 void *opaque);
> >>
> >> +static int tcp_chr_read_poll(void *opaque);
> >> +static void tcp_chr_disconnect(CharDriverState *chr);
> >> +
> >>  /* Called with chr_write_lock held.  */
> >>  static int tcp_chr_write(Chardev *chr, const uint8_t *buf, int len)
> >>  {
> >> @@ -114,6 +117,13 @@ static int tcp_chr_write(Chardev *chr, const uint8_t *buf, int len)
> >>              s->write_msgfds_num = 0;
> >>          }
> >>
> >> +        if (ret < 0 && errno != EAGAIN) {
> >> +            if (tcp_chr_read_poll(chr) <= 0) {
> >> +                tcp_chr_disconnect(chr);
> >> +                return len;
> >>
> >
> > This change breaks a number of assumptions in the vhost-user code.
> > Until now, a vhost-user function could assume that dev->vhost_ops
> > would remain set as long as the function is running, so it may call
> > vhost_ops callbacks several times, which may eventually fail to do
> > I/O, but no crash would occur. The disconnection would be handled
> > later by the HUP handler. Now, vhost_ops may be cleared during a
> > write (chr_disconnect -> CHR_EVENT_CLOSED in net_vhost_user_event).
> > This can be randomly reproduced with vhost-user-test -p
> > /x86_64/vhost-user/flags-mismatch/subprocess (broken pipe, qemu
> > crashes).
>
> It hangs for me with annoying frequency.  I'm glad you found the culprit!
>
>
Yeah, when reverted, it seems it never hangs here. But the test exercises
the code in interesting ways, with some unpredictable situations (device
initialization while the backend disconnects/reconnects), so I wouldn't be
surprised if we find more hidden issues. (vhost-user disconnect handling
still isn't close to being safe.)
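
To make the failure mode concrete, here is a minimal, made-up sketch (VDev,
send_msg, start_queues are invented names, this is not the actual vhost-user
code) of the pattern that breaks: a helper loops over vhost_ops calls and
tolerates I/O errors, assuming vhost_ops stays set until the HUP handler runs
later; if the failing write itself triggers the disconnect and clears
vhost_ops, the next iteration dereferences NULL.

/* Hypothetical, simplified sketch -- not the actual QEMU code. */

#include <stddef.h>

typedef struct VDev VDev;

typedef struct VOps {
    int (*send_msg)(VDev *dev, int req);   /* may write to the chardev */
} VOps;

struct VDev {
    const VOps *vhost_ops;                 /* cleared on disconnect */
};

/* Previously this ran only from the HUP handler; with
 * disconnect-on-write-error it can run from inside send_msg() itself
 * (chr_disconnect -> CHR_EVENT_CLOSED). */
static void handle_closed(VDev *dev)
{
    dev->vhost_ops = NULL;
}

/* Stand-in for a message send that hits a write error and disconnects. */
static int failing_send(VDev *dev, int req)
{
    (void)req;
    handle_closed(dev);
    return -1;
}

static int start_queues(VDev *dev, int nvqs)
{
    for (int i = 0; i < nvqs; i++) {
        /* Old assumption: even if one send fails, vhost_ops stays set,
         * so continuing the loop is harmless.  If the failing send
         * disconnected and cleared vhost_ops, the next iteration is a
         * NULL pointer dereference. */
        if (dev->vhost_ops->send_msg(dev, i) < 0) {
            /* error tolerated; cleanup used to happen later */
        }
    }
    return 0;
}

int main(void)
{
    static const VOps ops = { .send_msg = failing_send };
    VDev dev = { .vhost_ops = &ops };
    return start_queues(&dev, 2);   /* second iteration crashes */
}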

> > I am trying to fix this, but it isn't simple (races, many code paths,
> > etc), so I just wanted to inform you about it.
>
> Can we revert this patch until we have a fix?
>

That or disable the test for now.

But I wonder if this patch is a good solution to the problem. And I don't
like the socket backend behaving differently from other backends.
-- 
Marc-André Lureau

