qemu-block

Re: bdrv_drained_begin deadlock with io-threads


From: Kevin Wolf
Subject: Re: bdrv_drained_begin deadlock with io-threads
Date: Thu, 2 Apr 2020 16:25:24 +0200
User-agent: Mutt/1.12.1 (2019-06-15)

On 02.04.2020 at 14:14, Kevin Wolf wrote:
> On 02.04.2020 at 11:10, Dietmar Maurer wrote:
> > > It seems to fix it, yes. Now I don't get any hangs any more. 
> > 
> > I just tested using your configuration, and a recent centos8 image
> > running dd loop inside it:
> > 
> > # while dd if=/dev/urandom of=testfile.raw bs=1M count=100; do sync; done
> > 
> > With that, I am unable to trigger the bug.
> > 
> > Would you mind running the test using a Debian Buster image running
> > "stress-ng -d 5" inside?
> > I (and two other people here) can trigger the bug quite reliably with that.
> > 
> > On Debian, you can easily install stress-ng using apt:
> > 
> > # apt update
> > # apt install stress-ng
> > 
> > It seems stress-ng uses a different write pattern which can trigger the bug
> > more reliably.
> 
> I was going to, just give me some time...

Can you reproduce the problem with my script, but pointing it to your
Debian image and running stress-ng instead of dd? If so, how long does
it take to reproduce for you?

I was just about to write that I couldn't reproduce it in my first attempt
(still with the image on tmpfs as in my script, and therefore without
O_DIRECT or Linux AIO) when it finally did hang. However, this is still
while completing a job, not while starting it:

(gdb) bt
#0  0x00007f8b6b4e9526 in ppoll () at /lib64/libc.so.6
#1  0x00005619fc090919 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x00005619fc090919 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:335
#3  0x00005619fc0930f1 in fdmon_poll_wait (ctx=0x5619fe79ae00, ready_list=0x7fff4006cf58, timeout=-1) at util/fdmon-poll.c:79
#4  0x00005619fc0926d7 in aio_poll (ctx=0x5619fe79ae00, blocking=blocking@entry=true) at util/aio-posix.c:589
#5  0x00005619fbfefd83 in bdrv_do_drained_begin (poll=<optimized out>, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x5619fe81e490) at block/io.c:429
#6  0x00005619fbfefd83 in bdrv_do_drained_begin (bs=0x5619fe81e490, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at block/io.c:395
#7  0x00005619fbfe0ce7 in blk_drain (blk=0x5619ffd35c00) at block/block-backend.c:1617
#8  0x00005619fbfe18cd in blk_unref (blk=0x5619ffd35c00) at block/block-backend.c:473
#9  0x00005619fbf9b185 in block_job_free (job=0x5619ffd0b800) at blockjob.c:89
#10 0x00005619fbf9c769 in job_unref (job=0x5619ffd0b800) at job.c:378
#11 0x00005619fbf9c769 in job_unref (job=0x5619ffd0b800) at job.c:370
#12 0x00005619fbf9d57d in job_exit (opaque=0x5619ffd0b800) at job.c:892
#13 0x00005619fc08eea5 in aio_bh_call (bh=0x7f8b5406f410) at util/async.c:164
#14 0x00005619fc08eea5 in aio_bh_poll (ctx=ctx@entry=0x5619fe79ae00) at util/async.c:164
#15 0x00005619fc09252e in aio_dispatch (ctx=0x5619fe79ae00) at util/aio-posix.c:380
#16 0x00005619fc08ed8e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:298
#17 0x00007f8b6df5606d in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#18 0x00005619fc091798 in glib_pollfds_poll () at util/main-loop.c:219
#19 0x00005619fc091798 in os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:242
#20 0x00005619fc091798 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:518
#21 0x00005619fbd07559 in qemu_main_loop () at /home/kwolf/source/qemu/softmmu/vl.c:1664
#22 0x00005619fbbf093e in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/kwolf/source/qemu/softmmu/main.c:49

It does look more like your case because I now have bs.in_flight == 0,
and the BlockBackend of the scsi-hd device has in_flight == 8. Of
course, this still doesn't answer why it happens, and I'm not sure if we
can tell without adding some debug code.
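
For anyone following along, that state matches the shape of the drain wait:
bdrv_do_drained_begin() keeps calling aio_poll() until neither the node nor
any of its parents report requests in flight, and a BlockBackend parent
reports busy as long as its own in_flight counter is nonzero. Below is a
minimal, purely illustrative C model of that wait (hypothetical names, not
the actual QEMU code); with blk_in_flight stuck at 8 and nothing on the
polling thread able to complete those requests, the real loop never
terminates.

/* Illustrative model only: hypothetical names, not the real QEMU code.
 * It mimics the shape of the drained_begin wait: the main loop keeps
 * polling until neither the node nor its parents have requests in flight.
 * If the 8 requests tracked by the BlockBackend never complete, the
 * unbounded version of the loop below never returns. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int bs_in_flight  = 0;  /* like bs.in_flight (observed: 0)  */
static atomic_int blk_in_flight = 8;  /* like blk in_flight (observed: 8) */

/* Parent (BlockBackend) drain poll: busy while its own counter is nonzero. */
static bool blk_drained_poll(void)
{
    return atomic_load(&blk_in_flight) > 0;
}

/* Node-level drain poll: node requests or any busy parent keep us waiting. */
static bool drain_poll(void)
{
    return atomic_load(&bs_in_flight) > 0 || blk_drained_poll();
}

/* Stand-in for one blocking aio_poll() iteration; in the hung process this
 * is where frame #4 of the backtrace sits. Here it just reports the state. */
static void fake_aio_poll(void)
{
    printf("aio_poll: bs_in_flight=%d blk_in_flight=%d -> keep waiting\n",
           atomic_load(&bs_in_flight), atomic_load(&blk_in_flight));
}

int main(void)
{
    /* drained_begin-style wait, bounded here so the demo exits; the real
     * loop has no bound and hangs if the completions never arrive. */
    for (int i = 0; i < 3 && drain_poll(); i++) {
        fake_aio_poll();
    }
    if (drain_poll()) {
        printf("still draining: nothing on this thread completes the "
               "8 requests, so the unbounded loop would spin forever\n");
    }
    return 0;
}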

I'm testing on my current block branch with Stefan's fixes on top.

Kevin



