qemu-devel

Re: bdrv_drained_begin deadlock with io-threads


From: Kevin Wolf
Subject: Re: bdrv_drained_begin deadlock with io-threads
Date: Wed, 1 Apr 2020 20:35:44 +0200
User-agent: Mutt/1.12.1 (2019-06-15)

Am 01.04.2020 um 20:12 hat Kevin Wolf geschrieben:
> Am 01.04.2020 um 17:37 hat Dietmar Maurer geschrieben:
> > > > Is really nobody else able to reproduce this (has somebody already
> > > > tried to reproduce it)?
> > > 
> > > I can get hangs, but that's for job_completed(), not for starting the
> > > job. Also, my hangs have a non-empty bs->tracked_requests, so it looks
> > > like a different case to me.
> > 
> > Please can you post the command line args of your VM? I use something like
> > 
> > ./x86_64-softmmu/qemu-system-x86_64 -chardev
> > 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait' -mon
> > 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/101.pid  -m
> > 1024 -object 'iothread,id=iothread-virtioscsi0' -device
> > 'virtio-scsi-pci,id=virtioscsi0,iothread=iothread-virtioscsi0' -drive
> > 'file=/backup/disk3/debian-buster.raw,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on'
> > -device
> > 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0'
> > -machine "type=pc,accel=kvm"
> > 
> > Do you also run "stress-ng -d 5" inside the VM?
> 
> I'm not using the exact same test case, but something that I thought
> would be similar enough. Specifically, I run the script below, which
> boots from a RHEL 8 CD and in the rescue shell, I'll do 'dd if=/dev/zero
> of=/dev/sda' while the script keeps starting and cancelling backup jobs
> in the background.
> 
> Anyway, I finally managed to bisect my problem now (did it wrong the
> first time) and got this result:
> [...]

So back on current git master, my deadlock is between main thread and
the iothread. For some reason, the main thread holds the thread pool
mutex of the iothread's thread pool. This means that the iothread can't
complete its requests and the drain operation in the main thread can't
make progress.

I think there is no reason why the main thread should ever take the
mutex of another thread's thread pool, so I'm not sure how it ends up
holding it. But maybe that backup commit changed something in the way
nodes are moved between AioContexts that causes this to happen.

Kevin


Thread 3 (Thread 0x7f53c7438700 (LWP 3967)):
#0  0x00007f53d23449ad in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007f53d233dd94 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2  0x000055dcc331bdb3 in qemu_mutex_lock_impl (mutex=0x55dcc4ff1c80, 
file=0x55dcc3512bff "util/async.c", line=557) at util/qemu-thread-posix.c:78
#3  0x000055dcc33167ae in thread_pool_completion_bh (opaque=0x7f53b8003120) at 
util/thread-pool.c:167
#4  0x000055dcc331597e in aio_bh_call (bh=0x55dcc5b94680) at util/async.c:117
#5  0x000055dcc331597e in aio_bh_poll (ctx=ctx@entry=0x55dcc4ff1c20) at 
util/async.c:117
#6  0x000055dcc3318ee7 in aio_poll (ctx=0x55dcc4ff1c20, 
blocking=blocking@entry=true) at util/aio-posix.c:638
#7  0x000055dcc2ff7df0 in iothread_run (opaque=opaque@entry=0x55dcc4cfdac0) at 
iothread.c:75
#8  0x000055dcc331bbba in qemu_thread_start (args=<optimized out>) at 
util/qemu-thread-posix.c:519
#9  0x00007f53d233b4aa in start_thread () at /lib64/libpthread.so.0
#10 0x00007f53d226b3f3 in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7f53c7dab680 (LWP 3962)):
#0  0x00007f53d2260526 in ppoll () at /lib64/libc.so.6
#1  0x000055dcc33171c9 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized 
out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x000055dcc33171c9 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized 
out>, timeout=timeout@entry=-1) at util/qemu-timer.c:335
#3  0x000055dcc33199a1 in fdmon_poll_wait (ctx=0x55dcc4f78470, 
ready_list=0x7fffb3506768, timeout=-1) at util/fdmon-poll.c:79
#4  0x000055dcc3318f87 in aio_poll (ctx=0x55dcc4f78470, 
blocking=blocking@entry=true) at util/aio-posix.c:589
#5  0x000055dcc3276763 in bdrv_do_drained_begin (poll=<optimized out>, 
ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x55dcc5b66010) at 
block/io.c:429
#6  0x000055dcc3276763 in bdrv_do_drained_begin (bs=0x55dcc5b66010, 
recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, 
poll=<optimized out>) at block/io.c:395
#7  0x000055dcc3291422 in bdrv_backup_top_drop (bs=0x55dcc5b66010) at 
block/backup-top.c:273
#8  0x000055dcc328cb4c in backup_clean (job=0x55dcc64ab800) at 
block/backup.c:132
#9  0x000055dcc322324d in job_clean (job=0x55dcc64ab800) at job.c:656
#10 0x000055dcc322324d in job_finalize_single (job=0x55dcc64ab800) at job.c:672
#11 0x000055dcc322324d in job_finalize_single (job=0x55dcc64ab800) at job.c:660
#12 0x000055dcc3223baa in job_completed_txn_abort (job=<optimized out>) at 
job.c:748
#13 0x000055dcc3223db2 in job_completed (job=0x55dcc64ab800) at job.c:842
#14 0x000055dcc3223db2 in job_completed (job=0x55dcc64ab800) at job.c:835
#15 0x000055dcc3223f60 in job_exit (opaque=0x55dcc64ab800) at job.c:863
#16 0x000055dcc331597e in aio_bh_call (bh=0x7f53b8010eb0) at util/async.c:117
#17 0x000055dcc331597e in aio_bh_poll (ctx=ctx@entry=0x55dcc4f78470) at 
util/async.c:117
#18 0x000055dcc3318dde in aio_dispatch (ctx=0x55dcc4f78470) at 
util/aio-posix.c:380
#19 0x000055dcc331585e in aio_ctx_dispatch (source=<optimized out>, 
callback=<optimized out>, user_data=<optimized out>) at util/async.c:260
#20 0x00007f53d4ccd06d in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#21 0x000055dcc3318048 in glib_pollfds_poll () at util/main-loop.c:219
#22 0x000055dcc3318048 in os_host_main_loop_wait (timeout=<optimized out>) at 
util/main-loop.c:242
#23 0x000055dcc3318048 in main_loop_wait (nonblocking=nonblocking@entry=0) at 
util/main-loop.c:518
#24 0x000055dcc2f8e3f9 in qemu_main_loop () at 
/home/kwolf/source/qemu/softmmu/vl.c:1665
#25 0x000055dcc2e7793e in main (argc=<optimized out>, argv=<optimized out>, 
envp=<optimized out>) at /home/kwolf/source/qemu/softmmu/main.c:49

(gdb) thread 3
[Switching to thread 3 (Thread 0x7f53c7438700 (LWP 3967))]
#0  0x00007f53d23449ad in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) up
#1  0x00007f53d233dd94 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb)
#2  0x000055dcc331bdb3 in qemu_mutex_lock_impl (mutex=0x55dcc4ff1c80, 
file=0x55dcc3512bff "util/async.c", line=557) at util/qemu-thread-posix.c:78
78          err = pthread_mutex_lock(&mutex->lock);
(gdb) p *mutex
$1 = {lock = {__data = {__lock = 2, __count = 1, __owner = 3962, __nusers = 1, 
__kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = 
"\002\000\000\000\001\000\000\000z\017\000\000\001\000\000\000\001", '\000' 
<repeats 22 times>, __align = 4294967298}, initialized = true}
(gdb) info threads
  Id   Target Id                                          Frame
  1    Thread 0x7f53c7dab680 (LWP 3962) "qemu-system-x86" 0x00007f53d2260526 in 
ppoll () from /lib64/libc.so.6
  2    Thread 0x7f53c7c39700 (LWP 3966) "qemu-system-x86" 0x00007f53d2265c4d in 
syscall () from /lib64/libc.so.6
* 3    Thread 0x7f53c7438700 (LWP 3967) "qemu-system-x86" 0x00007f53d23449ad in 
__lll_lock_wait () from /lib64/libpthread.so.0
  4    Thread 0x7f53c61b2700 (LWP 3970) "qemu-system-x86" 0x00007f53d2260431 in 
poll () from /lib64/libc.so.6
  5    Thread 0x7f53c59b1700 (LWP 3972) "qemu-system-x86" 0x00007f53d23449ad in 
__lll_lock_wait () from /lib64/libpthread.so.0
  6    Thread 0x7f53c49ff700 (LWP 3974) "gmain"           0x00007f53d2260431 in 
poll () from /lib64/libc.so.6
  7    Thread 0x7f536ebff700 (LWP 3975) "gdbus"           0x00007f53d2260431 in 
poll () from /lib64/libc.so.6



