qemu-block

Re: iotest 030 still occasionally intermittently failing


From: Vladimir Sementsov-Ogievskiy
Subject: Re: iotest 030 still occasionally intermittently failing
Date: Fri, 20 Nov 2020 15:48:08 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0

20.11.2020 13:34, Philippe Mathieu-Daudé wrote:
Cc'ing Stefan/Paolo

On 11/20/20 10:51 AM, Vladimir Sementsov-Ogievskiy wrote:
19.11.2020 23:31, Vladimir Sementsov-Ogievskiy wrote:
19.11.2020 22:31, Vladimir Sementsov-Ogievskiy wrote:
19.11.2020 22:30, Vladimir Sementsov-Ogievskiy wrote:
19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
16.11.2020 20:59, Peter Maydell wrote:
On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com>
wrote:
Do you know if there is a core dump or stack trace available?

Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
produce, so if you want more diagnostics on failures you have to
arrange for the test harness to produce them...

thanks
-- PMM


Hi!

After some iterations I've reproduced a SIGABRT:

#0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
#1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
#2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
#3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
#4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00,
new_bs=0x0) at ../block.c:2648
#5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00)
at ../block.c:2777
#6  0x000055a93374fd9c in bdrv_root_unref_child
(child=0x55a9363a3a00) at ../block.c:2789
#7  0x000055a933722e8b in block_job_remove_all_bdrv
(job=0x55a935f4f4b0) at ../blockjob.c:191
#8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at
../blockjob.c:88
#9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at
../job.c:380
#10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at
../job.c:894
#11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at
../util/async.c:136
#12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at
../util/async.c:164
#13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at
../util/aio-posix.c:381
#14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0,
callback=0x0, user_data=0x0)
      at ../util/async.c:306
#15 0x00007feb71349ecd in g_main_context_dispatch () at
/lib64/libglib-2.0.so.0
#16 0x000055a933840300 in glib_pollfds_poll () at
../util/main-loop.c:221
#17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at
../util/main-loop.c:244
#18 0x000055a933840482 in main_loop_wait (nonblocking=0) at
../util/main-loop.c:520
#19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
#20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138,
envp=0x7ffd48c311e0)

(gdb) fr 4
#4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00,
new_bs=0x0) at ../block.c:2648
2648            assert(tighten_restrictions == false);
(gdb) list
2643            int ret;
2644
2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
2646            ret = bdrv_check_perm(old_bs, NULL, perm,
shared_perm, NULL,
2647                                  &tighten_restrictions, NULL);
2648            assert(tighten_restrictions == false);
2649            if (ret < 0) {
2650                /* We only tried to loosen restrictions, so
errors are not fatal */
2651                bdrv_abort_perm_update(old_bs);
2652            } else {
(gdb) p tighten_restrictions
$1 = true



I've modified the code a bit to crash when we actually want to set
tighten_restrictions to true (a sketch of the hack follows the backtrace
below), and got the following backtrace:
#0  0x00007f6dbb49ee35 in raise () at /lib64/libc.so.6
#1  0x00007f6dbb489895 in abort () at /lib64/libc.so.6
#2  0x000055b9174104d7 in bdrv_check_perm
      (bs=0x55b918f09720, q=0x0, cumulative_perms=1,
cumulative_shared_perms=21, ignore_children=0x55b918a57b20 = {...},
tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
errp=0x0) at ../block.c:2009
#3  0x000055b917410ec0 in bdrv_check_update_perm
      (bs=0x55b918f09720, q=0x0, new_used_perm=1, new_shared_perm=21,
ignore_children=0x55b918a57b20 = {...},
tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
errp=0x0) at ../block.c:2280
#4  0x000055b917410f38 in bdrv_child_check_perm
      (c=0x55b91921fcf0, q=0x0, perm=1, shared=21,
ignore_children=0x55b918a57b20 = {...},
tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
errp=0x0) at ../block.c:2294
#5  0x000055b91741078c in bdrv_check_perm
      (bs=0x55b918defd90, q=0x0, cumulative_perms=1,
cumulative_shared_perms=21, ignore_children=0x0,
tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
errp=0x0) at ../block.c:2076
#6  0x000055b91741194e in bdrv_replace_child (child=0x55b919cf6200,
new_bs=0x0) at ../block.c:2666
#7  0x000055b917411f1d in bdrv_detach_child (child=0x55b919cf6200)
at ../block.c:2798
#8  0x000055b917411f5f in bdrv_root_unref_child
(child=0x55b919cf6200) at ../block.c:2810
#9  0x000055b9173e4d88 in block_job_remove_all_bdrv
(job=0x55b918f06a60) at ../blockjob.c:191
#10 0x000055b9173e4aaf in block_job_free (job=0x55b918f06a60) at
../blockjob.c:88
#11 0x000055b917437aca in job_unref (job=0x55b918f06a60) at
../job.c:380
#12 0x000055b917438c76 in job_exit (opaque=0x55b918f06a60) at
../job.c:894
#13 0x000055b917522a57 in aio_bh_call (bh=0x55b919a2b3b0) at
../util/async.c:136
#14 0x000055b917522b61 in aio_bh_poll (ctx=0x55b918a866f0) at
../util/async.c:164
#15 0x000055b9174f3bf7 in aio_dispatch (ctx=0x55b918a866f0) at
../util/aio-posix.c:381
#16 0x000055b917522f92 in aio_ctx_dispatch (source=0x55b918a866f0,
callback=0x0, user_data=0x0)
      at ../util/async.c:306
#17 0x00007f6dbc62decd in g_main_context_dispatch () at
/lib64/libglib-2.0.so.0
#18 0x000055b9175029d9 in glib_pollfds_poll () at
../util/main-loop.c:221
#19 0x000055b917502a53 in os_host_main_loop_wait (timeout=0) at
../util/main-loop.c:244
#20 0x000055b917502b5b in main_loop_wait (nonblocking=0) at
../util/main-loop.c:520
#21 0x000055b9172c5979 in qemu_main_loop () at ../softmmu/vl.c:1678
#22 0x000055b916e52046 in main (argc=20, argv=0x7fff7f81f208,
envp=0x7fff7f81f2b0)
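
For reference, the hack looks roughly like this (just a sketch, not the
exact diff; abort_on_set_to_true is a marker from my local patch, it does
not exist upstream):

/* global marker passed as the tighten_restrictions pointer */
static bool abort_on_set_to_true;

/* in bdrv_replace_child(), instead of the local bool: */
    bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
    ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
                          &abort_on_set_to_true, NULL);

/* in bdrv_check_perm() and friends, at every place that would report
 * tightening: */
    if (tighten_restrictions) {
        if (tighten_restrictions == &abort_on_set_to_true) {
            abort();   /* crash right where restrictions get tightened */
        }
        *tighten_restrictions = true;
    }

This way the backtrace points at the node and caller that actually tighten
the permissions, instead of at the assertion in bdrv_replace_child().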


A picture of the graph taken at the moment of the abort (it is the same as
just before the bdrv_replace_child call) is attached. So it looks like the
graph is already corrupted (you can see that the backing permissions are
not propagated to the node2-node0 child).

How the graph got corrupted is still the question...


Additional info: it all happens during test_stream_commit_1 in iotest 030.


OK, after some debugging and looking at block-graph dumps I tend to
think that this is a race between the finalization (.prepare) of the
mirror and block-stream jobs. Both do graph updates. Nothing prevents
interleaving of graph-updating operations (note that
bdrv_replace_child_noperm may do aio_poll), and nothing protects two
concurrent graph-update processes from intersecting.

Any ideas?


Wow, I get the following:

#0  0x00007fa84bcf97b6 in ppoll () at /lib64/libc.so.6
#1  0x000055b33e46321b in qemu_poll_ns (fds=0x55b340d28b60, nfds=1,
timeout=600000000000) at ../util/qemu-timer.c:349
#2  0x000055b33e495c33 in fdmon_poll_wait (ctx=0x55b340d026f0,
ready_list=0x7ffcc0af5f30, timeout=600000000000) at ../util/fdmon-poll.c:80
#3  0x000055b33e4575b0 in aio_poll (ctx=0x55b340d026f0, blocking=true)
at ../util/aio-posix.c:607
#4  0x000055b33e302c41 in mirror_exit_common (job=0x55b341e94d20) at
../block/mirror.c:648
#5  0x000055b33e3031e3 in mirror_prepare (job=0x55b341e94d20) at
../block/mirror.c:776
#6  0x000055b33e39ba6b in job_prepare (job=0x55b341e94d20) at ../job.c:781
#7  0x000055b33e39a31f in job_txn_apply (job=0x55b341e94d20,
fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
#8  0x000055b33e39baf4 in job_do_finalize (job=0x55b341e94d20) at
../job.c:798
#9  0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341e94d20)
at ../job.c:852
#10 0x000055b33e39bd5b in job_completed (job=0x55b341e94d20) at
../job.c:865
#11 0x000055b33e39bdae in job_exit (opaque=0x55b341e94d20) at ../job.c:885
#12 0x000055b33e485cfc in aio_bh_call (bh=0x7fa830004100) at
../util/async.c:136
#13 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at
../util/async.c:164
#14 0x000055b33e4577a3 in aio_poll (ctx=0x55b340d026f0, blocking=true)
at ../util/aio-posix.c:659
#15 0x000055b33e2fa4d9 in bdrv_poll_co (s=0x7ffcc0af62e0) at
/work/src/qemu/master/block/block-gen.h:44
#16 0x000055b33e2faff4 in bdrv_pwritev (child=0x55b340ed0f60, offset=0,
bytes=65536, qiov=0x7ffcc0af6370, flags=0) at block/block-gen.c:408
#17 0x000055b33e34be9b in bdrv_pwrite (child=0x55b340ed0f60, offset=0,
buf=0x55b341d10000, bytes=65536) at ../block/io.c:984
#18 0x000055b33e386c6f in qcow2_update_header (bs=0x55b341023020) at
../block/qcow2.c:3037
#19 0x000055b33e386e53 in qcow2_change_backing_file (bs=0x55b341023020,
backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img",
backing_fmt=0x55b33e613afc "raw") at ../block/qcow2.c:3074
#20 0x000055b33e379908 in bdrv_change_backing_file (bs=0x55b341023020,
backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img",
backing_fmt=0x55b33e613afc "raw", warn=false) at ../block.c:4768
#21 0x000055b33e3b2377 in stream_prepare (job=0x55b341030620) at
../block/stream.c:88
#22 0x000055b33e39ba6b in job_prepare (job=0x55b341030620) at ../job.c:781
#23 0x000055b33e39a31f in job_txn_apply (job=0x55b341030620,
fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
#24 0x000055b33e39baf4 in job_do_finalize (job=0x55b341030620) at
../job.c:798
#25 0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341030620)
at ../job.c:852
#26 0x000055b33e39bd5b in job_completed (job=0x55b341030620) at
../job.c:865
#27 0x000055b33e39bdae in job_exit (opaque=0x55b341030620) at ../job.c:885
#28 0x000055b33e485cfc in aio_bh_call (bh=0x55b340ead840) at
../util/async.c:136
#29 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at
../util/async.c:164
#30 0x000055b33e456e9c in aio_dispatch (ctx=0x55b340d026f0) at
../util/aio-posix.c:381
#31 0x000055b33e486237 in aio_ctx_dispatch (source=0x55b340d026f0,
callback=0x0, user_data=0x0) at ../util/async.c:306
#32 0x00007fa84cdceecd in g_main_context_dispatch () at
/lib64/libglib-2.0.so.0
#33 0x000055b33e465c7e in glib_pollfds_poll () at ../util/main-loop.c:221
#34 0x000055b33e465cf8 in os_host_main_loop_wait (timeout=0) at
../util/main-loop.c:244
#35 0x000055b33e465e00 in main_loop_wait (nonblocking=0) at
../util/main-loop.c:520
#36 0x000055b33e228979 in qemu_main_loop () at ../softmmu/vl.c:1678
#37 0x000055b33ddb5046 in main (argc=20, argv=0x7ffcc0af6a88,
envp=0x7ffcc0af6b30) at ../softmmu/main.c:50


The aio_poll at the start of mirror_exit_common is my addition. But anyway,
the problem is here: we end up calling mirror_prepare inside stream_prepare!
stream_prepare updates the qcow2 header through bdrv_pwrite, which does a
nested aio_poll; that dispatches the pending job_exit BH of the mirror job,
so the mirror's prepare/exit runs in the middle of the stream's.

----

OK, moving further: I moved stream_clean, stream_prepare,
mirror_exit_common and mirror_complete into coroutine context and
protected them all with a new global coroutine mutex. This helps; the bug
no longer reproduces.
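
Roughly, the workaround looks like this (a sketch of my local patch, not
anything upstream; the mutex name and the wrapper are illustrative):

/* one global coroutine mutex serializing graph-changing completion paths */
static CoMutex graph_update_mutex;   /* qemu_co_mutex_init() at startup */

static void coroutine_fn bdrv_co_stream_prepare(void *opaque)
{
    StreamBlockJob *s = opaque;

    qemu_co_mutex_lock(&graph_update_mutex);
    /* ... the original stream_prepare() body: bdrv_set_backing_hd(),
     *     bdrv_change_backing_file(), ... */
    qemu_co_mutex_unlock(&graph_update_mutex);
}

stream_prepare() then just enters the coroutine and polls until it is done;
stream_clean, mirror_exit_common and mirror_complete get the same wrapping,
all taking the same mutex, so their graph updates cannot interleave through
nested aio_poll.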

Still, I'll try to finally understand what's going on. But for now I think
that:

- graph modification code (permissions, drains, child replacements, etc.)
is not safe to call in parallel, for example in different coroutines or
through nested aio_poll;

- but all of these do different kinds of drains, I/O operations, etc.,
which lead to nested aio_polls, and everything breaks.



My debugging log of the problem:

stream_start start
../block/stream.c:287 setro 0
bdrv_reopen_multiple enter
bdrv_reopen_multiple exit
stream_start end
../block/mirror.c:1889 setro 0
bdrv_reopen_multiple enter
bdrv_reopen_multiple exit
mirror_start_job start
mirror_start_job end
job_exit
bdrv_co_stream_prepare start
bdrv_co_stream_prepare end
bdrv_co_mirror_complete start
bdrv_co_mirror_complete end
stream_clean start
../block/stream.c:117 setro 1
bdrv_reopen_multiple enter
job_exit
bdrv_co_mirror_exit_common start
set no perm for mirror_top
replace mirror_top
replaceD mirror_top
bdrv_co_mirror_exit_common end
job_exit end
bdrv_reopen_multiple exit
stream_clean exit


So, we start bdrv_reopen_multiple in stream_clean and switch to
mirror_exit_common. Note that mirror_exit_common at some point drops
permissions on the mirror_top node, which is still in the graph, so the
empty permission may propagate down the whole backing chain. Either way, we
get a nested permission update, which leads to unpredictable results.

I don't want to dig any further; it doesn't seem to make much more sense.


So, to summarize:

1. The bug can be worked around by moving the graph-changing operations in
block jobs into coroutine context and using a coroutine mutex. Still, it's
not a generic solution.

2. The core problem is that graph modification functions may trigger a
context switch due to nested aio_polls, which leads to (for example) nested
permission updates.
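
To make point 2 concrete: the nesting could at least be detected with a
depth counter around any graph/permission update (purely hypothetical
helpers, not existing code and not a fix):

static int graph_update_depth;

static void graph_update_begin(void)
{
    /* trips when a nested aio_poll re-enters a graph update */
    assert(graph_update_depth == 0);
    graph_update_depth++;
}

static void graph_update_end(void)
{
    graph_update_depth--;
}

Bracketing e.g. bdrv_replace_child() or the bdrv_check_perm() callers with
these two would abort exactly at the point of reentrancy.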


I have a big series (not yet sent) updating graph modifications (mostly
around permissions) to solve some other problems and drop the .active field
workaround in block jobs. But it doesn't solve the aio_poll problem.

See also my recent "question about bdrv_replace_node" mail; it's about the
same thing.


--
Best regards,
Vladimir


