qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 2/5] test-bdrv-drain: avoid race with BH in IOThread drain


From: Stefan Hajnoczi
Subject: Re: [PATCH v3 2/5] test-bdrv-drain: avoid race with BH in IOThread drain test
Date: Thu, 14 Sep 2023 08:01:41 -0400

On Wed, Sep 13, 2023 at 11:08:54AM -0500, Eric Blake wrote:
> On Tue, Sep 12, 2023 at 07:10:34PM -0400, Stefan Hajnoczi wrote:
> > This patch fixes a race condition in test-bdrv-drain that is difficult
> > to reproduce. test-bdrv-drain sometimes fails without an error message
> > on the block pull request sent by Kevin Wolf on Sep 4, 2023. I was able
> > to reproduce it locally and found that "block-backend: process I/O in
> > the current AioContext" (in this patch series) is the first commit where
> > it reproduces.
> > 
> > I do not know why "block-backend: process I/O in the current AioContext"
> > exposes this bug. It might be related to the fact that the test's preadv
> > request runs in the main thread instead of IOThread a after my commit.
> 
> In reading the commit message before the impacted code, my first
> thought was that you had a typo of an extra word (that is, something
> to fix by s/a //), but reading further, a better fix would be calling
> attention to the fact that you are referencing a specific named
> thread, as in s/IOThread a/IOThread A/...
> 
> > That might simply change the timing of the test.
> > 
> > Now on to the race condition in test-bdrv-drain. The main thread
> > schedules a BH in IOThread a and then drains the BDS:
> 
> ...and another spot with the same parse issue...
> 
> > 
> >   aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data);
> > 
> >   /* The request is running on the IOThread a. Draining its block device
> 
> ...but here you were quoting from the existing code base, which is
> where I finally realized it was more than just your commit message.
> 
> >    * will make sure that it has completed as far as the BDS is concerned,
> >    * but the drain in this thread can continue immediately after
> >    * bdrv_dec_in_flight() and aio_ret might be assigned only slightly
> >    * later. */
> >   do_drain_begin(drain_type, bs);
> > 
> > If the BH completes before do_drain_begin() then there is nothing to
> > worry about.
> > 
> > If the BH invokes bdrv_flush() before do_drain_begin(), then
> > do_drain_begin() waits for it to complete.
> > 
> > The problematic case is when do_drain_begin() runs before the BH enters
> > bdrv_flush(). Then do_drain_begin() misses the BH and the drain
> > mechanism has failed in quiescing I/O.
> > 
> > Fix this by incrementing the in_flight counter so that do_drain_begin()
> > waits for test_iothread_main_thread_bh().
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  tests/unit/test-bdrv-drain.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
> > index ccc453c29e..67a79aa3f0 100644
> > --- a/tests/unit/test-bdrv-drain.c
> > +++ b/tests/unit/test-bdrv-drain.c
> > @@ -512,6 +512,7 @@ static void test_iothread_main_thread_bh(void *opaque)
> >       * executed during drain, otherwise this would deadlock. */
> >      aio_context_acquire(bdrv_get_aio_context(data->bs));
> >      bdrv_flush(data->bs);
> > +    bdrv_dec_in_flight(data->bs); /* incremented by test_iothread_common() 
> > */
> >      aio_context_release(bdrv_get_aio_context(data->bs));
> >  }
> >  
> > @@ -583,6 +584,13 @@ static void test_iothread_common(enum drain_type 
> > drain_type, int drain_thread)
> >              aio_context_acquire(ctx_a);
> >          }
> >  
> > +        /*
> > +         * Increment in_flight so that do_drain_begin() waits for
> > +         * test_iothread_main_thread_bh(). This prevents the race between
> > +         * test_iothread_main_thread_bh() in IOThread a and 
> > do_drain_begin() in
> > +         * this thread. test_iothread_main_thread_bh() decrements 
> > in_flight.
> > +         */
> > +        bdrv_inc_in_flight(bs);
> >          aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, 
> > &data);
> >  
> >          /* The request is running on the IOThread a. Draining its block 
> > device
> 
> and indeed, your commit message is consistent with the current code's
> naming convention.  If you have reason to respin, a pre-req patch to
> change the case before adding more references might be nice, but I
> won't insist.
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>

Sorry about that. It is confusing.

Stefan

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]