qemu-devel

Re: iotest 030 still occasionally intermittently failing


From: Philippe Mathieu-Daudé
Subject: Re: iotest 030 still occasionally intermittently failing
Date: Fri, 20 Nov 2020 11:34:16 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.4.0

Cc'ing Stefan/Paolo

On 11/20/20 10:51 AM, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2020 23:31, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2020 22:31, Vladimir Sementsov-Ogievskiy wrote:
>>> 19.11.2020 22:30, Vladimir Sementsov-Ogievskiy wrote:
>>>> 19.11.2020 19:11, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 16.11.2020 20:59, Peter Maydell wrote:
>>>>>> On Mon, 16 Nov 2020 at 17:34, Alberto Garcia <berto@igalia.com>
>>>>>> wrote:
>>>>>>> Do you know if there is a core dump or stack trace available?
>>>>>>
>>>>>> Nope, sorry. What you get is what the 'vm-build-netbsd' etc targets
>>>>>> produce, so if you want more diagnostics on failures you have to
>>>>>> arrange for the test harness to produce them...
>>>>>>
>>>>>> thanks
>>>>>> -- PMM
>>>>>>
>>>>>
>>>>> Hi!
>>>>>
>>>>> After some iterations I've reproduced a SIGABRT:
>>>>>
>>>>> #0  0x00007feb701bae35 in raise () at /lib64/libc.so.6
>>>>> #1  0x00007feb701a5895 in abort () at /lib64/libc.so.6
>>>>> #2  0x00007feb701a5769 in _nl_load_domain.cold () at /lib64/libc.so.6
>>>>> #3  0x00007feb701b3566 in annobin_assert.c_end () at /lib64/libc.so.6
>>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00,
>>>>> new_bs=0x0) at ../block.c:2648
>>>>> #5  0x000055a93374fd5a in bdrv_detach_child (child=0x55a9363a3a00)
>>>>> at ../block.c:2777
>>>>> #6  0x000055a93374fd9c in bdrv_root_unref_child
>>>>> (child=0x55a9363a3a00) at ../block.c:2789
>>>>> #7  0x000055a933722e8b in block_job_remove_all_bdrv
>>>>> (job=0x55a935f4f4b0) at ../blockjob.c:191
>>>>> #8  0x000055a933722bb2 in block_job_free (job=0x55a935f4f4b0) at
>>>>> ../blockjob.c:88
>>>>> #9  0x000055a9337755fa in job_unref (job=0x55a935f4f4b0) at
>>>>> ../job.c:380
>>>>> #10 0x000055a9337767a6 in job_exit (opaque=0x55a935f4f4b0) at
>>>>> ../job.c:894
>>>>> #11 0x000055a93386037e in aio_bh_call (bh=0x55a9352e16b0) at
>>>>> ../util/async.c:136
>>>>> #12 0x000055a933860488 in aio_bh_poll (ctx=0x55a9351366f0) at
>>>>> ../util/async.c:164
>>>>> #13 0x000055a93383151e in aio_dispatch (ctx=0x55a9351366f0) at
>>>>> ../util/aio-posix.c:381
>>>>> #14 0x000055a9338608b9 in aio_ctx_dispatch (source=0x55a9351366f0,
>>>>> callback=0x0, user_data=0x0)
>>>>>      at ../util/async.c:306
>>>>> #15 0x00007feb71349ecd in g_main_context_dispatch () at
>>>>> /lib64/libglib-2.0.so.0
>>>>> #16 0x000055a933840300 in glib_pollfds_poll () at
>>>>> ../util/main-loop.c:221
>>>>> #17 0x000055a93384037a in os_host_main_loop_wait (timeout=0) at
>>>>> ../util/main-loop.c:244
>>>>> #18 0x000055a933840482 in main_loop_wait (nonblocking=0) at
>>>>> ../util/main-loop.c:520
>>>>> #19 0x000055a933603979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>>>> #20 0x000055a933190046 in main (argc=20, argv=0x7ffd48c31138,
>>>>> envp=0x7ffd48c311e0)
>>>>>
>>>>> (gdb) fr 4
>>>>> #4  0x000055a93374f7d3 in bdrv_replace_child (child=0x55a9363a3a00,
>>>>> new_bs=0x0) at ../block.c:2648
>>>>> 2648            assert(tighten_restrictions == false);
>>>>> (gdb) list
>>>>> 2643            int ret;
>>>>> 2644
>>>>> 2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
>>>>> 2646            ret = bdrv_check_perm(old_bs, NULL, perm,
>>>>> shared_perm, NULL,
>>>>> 2647                                  &tighten_restrictions, NULL);
>>>>> 2648            assert(tighten_restrictions == false);
>>>>> 2649            if (ret < 0) {
>>>>> 2650                /* We only tried to loosen restrictions, so
>>>>> errors are not fatal */
>>>>> 2651                bdrv_abort_perm_update(old_bs);
>>>>> 2652            } else {
>>>>> (gdb) p tighten_restrictions
>>>>> $1 = true
>>>>>
>>>>>
>>>>
>>>> I've modified the code a bit to crash at the point where we actually
>>>> want to set tighten_restrictions to true, and got the following backtrace:
>>>> #0  0x00007f6dbb49ee35 in raise () at /lib64/libc.so.6
>>>> #1  0x00007f6dbb489895 in abort () at /lib64/libc.so.6
>>>> #2  0x000055b9174104d7 in bdrv_check_perm
>>>>      (bs=0x55b918f09720, q=0x0, cumulative_perms=1,
>>>> cumulative_shared_perms=21, ignore_children=0x55b918a57b20 = {...},
>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>> errp=0x0) at ../block.c:2009
>>>> #3  0x000055b917410ec0 in bdrv_check_update_perm
>>>>      (bs=0x55b918f09720, q=0x0, new_used_perm=1, new_shared_perm=21,
>>>> ignore_children=0x55b918a57b20 = {...},
>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>> errp=0x0) at ../block.c:2280
>>>> #4  0x000055b917410f38 in bdrv_child_check_perm
>>>>      (c=0x55b91921fcf0, q=0x0, perm=1, shared=21,
>>>> ignore_children=0x55b918a57b20 = {...},
>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>> errp=0x0) at ../block.c:2294
>>>> #5  0x000055b91741078c in bdrv_check_perm
>>>>      (bs=0x55b918defd90, q=0x0, cumulative_perms=1,
>>>> cumulative_shared_perms=21, ignore_children=0x0,
>>>> tighten_restrictions=0x55b917b044f8 <abort_on_set_to_true>,
>>>> errp=0x0) at ../block.c:2076
>>>> #6  0x000055b91741194e in bdrv_replace_child (child=0x55b919cf6200,
>>>> new_bs=0x0) at ../block.c:2666
>>>> #7  0x000055b917411f1d in bdrv_detach_child (child=0x55b919cf6200)
>>>> at ../block.c:2798
>>>> #8  0x000055b917411f5f in bdrv_root_unref_child
>>>> (child=0x55b919cf6200) at ../block.c:2810
>>>> #9  0x000055b9173e4d88 in block_job_remove_all_bdrv
>>>> (job=0x55b918f06a60) at ../blockjob.c:191
>>>> #10 0x000055b9173e4aaf in block_job_free (job=0x55b918f06a60) at
>>>> ../blockjob.c:88
>>>> #11 0x000055b917437aca in job_unref (job=0x55b918f06a60) at
>>>> ../job.c:380
>>>> #12 0x000055b917438c76 in job_exit (opaque=0x55b918f06a60) at
>>>> ../job.c:894
>>>> #13 0x000055b917522a57 in aio_bh_call (bh=0x55b919a2b3b0) at
>>>> ../util/async.c:136
>>>> #14 0x000055b917522b61 in aio_bh_poll (ctx=0x55b918a866f0) at
>>>> ../util/async.c:164
>>>> #15 0x000055b9174f3bf7 in aio_dispatch (ctx=0x55b918a866f0) at
>>>> ../util/aio-posix.c:381
>>>> #16 0x000055b917522f92 in aio_ctx_dispatch (source=0x55b918a866f0,
>>>> callback=0x0, user_data=0x0)
>>>>      at ../util/async.c:306
>>>> #17 0x00007f6dbc62decd in g_main_context_dispatch () at
>>>> /lib64/libglib-2.0.so.0
>>>> #18 0x000055b9175029d9 in glib_pollfds_poll () at
>>>> ../util/main-loop.c:221
>>>> #19 0x000055b917502a53 in os_host_main_loop_wait (timeout=0) at
>>>> ../util/main-loop.c:244
>>>> #20 0x000055b917502b5b in main_loop_wait (nonblocking=0) at
>>>> ../util/main-loop.c:520
>>>> #21 0x000055b9172c5979 in qemu_main_loop () at ../softmmu/vl.c:1678
>>>> #22 0x000055b916e52046 in main (argc=20, argv=0x7fff7f81f208,
>>>> envp=0x7fff7f81f2b0)
>>>>
>>>>
>>>> and the picture taken at the moment of abort (it is the same as at
>>>> the moment before the bdrv_replace_child call) is attached. So it
>>>> looks like the graph is already corrupted (you can see that backing
>>>> permissions are not propagated to the node2-node0 child).
>>>>
>>>> How the graph got corrupted is still the question..
>>>>
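A rough sketch of the debugging hack mentioned above ("modified the code a
bit to crash at the point where we actually want to set
tighten_restrictions to true"), judging from the abort_on_set_to_true
symbol visible in the backtrace: pass a pointer to a sentinel global into
bdrv_check_perm() and abort the moment anything would write true through
it. The helper below is invented for illustration and is not the actual
change:

    /* Sketch only, not the real patch; set_tighten_restrictions() is an
     * invented helper, abort_on_set_to_true matches the symbol seen in
     * the backtrace above. */
    static bool abort_on_set_to_true;

    static void set_tighten_restrictions(bool *tighten_restrictions, bool value)
    {
        if (value && tighten_restrictions == &abort_on_set_to_true) {
            abort();   /* stop right where restrictions would be tightened */
        }
        if (tighten_restrictions) {
            *tighten_restrictions = value;
        }
    }

    /* ... and bdrv_replace_child() would pass &abort_on_set_to_true
     * instead of a local bool, so the abort fires inside
     * bdrv_check_perm() while the interesting stack is still live. */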
>>>
>>> additional info: it all happens during test_stream_commit_1 in iotest 030
>>>
>>
>> OK, after some debugging and looking at block-graph dumps I tend to
>> think that this is a race between the finishing (.prepare) of mirror and
>> block-stream. Both do graph updates. Nothing prevents interleaving of
>> graph-updating operations (note that bdrv_replace_child_noperm may do
>> aio_poll), and nothing protects two graph-update processes from
>> intersecting each other.
>>
>> Any ideas?
>>
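To make the interleaving hazard described above concrete, here is a small
self-contained toy model (plain C, not QEMU code; all names are invented).
A completion callback that re-enters the dispatch loop, the way graph-update
code can re-enter aio_poll via bdrv_replace_child_noperm, lets a second
pending completion observe the shared state while the first update is only
half applied, and its consistency check aborts, much like the
assert(tighten_restrictions == false) above:

    #include <assert.h>
    #include <stdio.h>

    typedef void (*BH)(void);        /* stands in for a QEMU bottom half */

    static BH pending[4];
    static int npending;

    static void dispatch(void)       /* stands in for aio_bh_poll() */
    {
        while (npending > 0) {
            pending[--npending]();
        }
    }

    /* Shared "graph" state with the invariant a == b between updates. */
    static int a, b;

    static void job2_exit(void)
    {
        assert(a == b);              /* fires: job1's update is half applied */
        a++; b++;
    }

    static void job1_exit(void)
    {
        a++;                         /* first half of a two-step update */
        dispatch();                  /* nested poll runs job2_exit here */
        b++;                         /* second half */
    }

    int main(void)
    {
        pending[npending++] = job2_exit;
        pending[npending++] = job1_exit;   /* popped first by dispatch() */
        dispatch();                        /* aborts in job2_exit() */
        printf("a=%d b=%d\n", a, b);       /* not reached */
        return 0;
    }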
> 
> Wow, I get the following:
> 
> #0  0x00007fa84bcf97b6 in ppoll () at /lib64/libc.so.6
> #1  0x000055b33e46321b in qemu_poll_ns (fds=0x55b340d28b60, nfds=1,
> timeout=600000000000) at ../util/qemu-timer.c:349
> #2  0x000055b33e495c33 in fdmon_poll_wait (ctx=0x55b340d026f0,
> ready_list=0x7ffcc0af5f30, timeout=600000000000) at ../util/fdmon-poll.c:80
> #3  0x000055b33e4575b0 in aio_poll (ctx=0x55b340d026f0, blocking=true)
> at ../util/aio-posix.c:607
> #4  0x000055b33e302c41 in mirror_exit_common (job=0x55b341e94d20) at
> ../block/mirror.c:648
> #5  0x000055b33e3031e3 in mirror_prepare (job=0x55b341e94d20) at
> ../block/mirror.c:776
> #6  0x000055b33e39ba6b in job_prepare (job=0x55b341e94d20) at ../job.c:781
> #7  0x000055b33e39a31f in job_txn_apply (job=0x55b341e94d20,
> fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
> #8  0x000055b33e39baf4 in job_do_finalize (job=0x55b341e94d20) at
> ../job.c:798
> #9  0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341e94d20)
> at ../job.c:852
> #10 0x000055b33e39bd5b in job_completed (job=0x55b341e94d20) at
> ../job.c:865
> #11 0x000055b33e39bdae in job_exit (opaque=0x55b341e94d20) at ../job.c:885
> #12 0x000055b33e485cfc in aio_bh_call (bh=0x7fa830004100) at
> ../util/async.c:136
> #13 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at
> ../util/async.c:164
> #14 0x000055b33e4577a3 in aio_poll (ctx=0x55b340d026f0, blocking=true)
> at ../util/aio-posix.c:659
> #15 0x000055b33e2fa4d9 in bdrv_poll_co (s=0x7ffcc0af62e0) at
> /work/src/qemu/master/block/block-gen.h:44
> #16 0x000055b33e2faff4 in bdrv_pwritev (child=0x55b340ed0f60, offset=0,
> bytes=65536, qiov=0x7ffcc0af6370, flags=0) at block/block-gen.c:408
> #17 0x000055b33e34be9b in bdrv_pwrite (child=0x55b340ed0f60, offset=0,
> buf=0x55b341d10000, bytes=65536) at ../block/io.c:984
> #18 0x000055b33e386c6f in qcow2_update_header (bs=0x55b341023020) at
> ../block/qcow2.c:3037
> #19 0x000055b33e386e53 in qcow2_change_backing_file (bs=0x55b341023020,
> backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img",
> backing_fmt=0x55b33e613afc "raw") at ../block/qcow2.c:3074
> #20 0x000055b33e379908 in bdrv_change_backing_file (bs=0x55b341023020,
> backing_file=0x55b341bf3c80 "/ramdisk/xx/img-0.img",
> backing_fmt=0x55b33e613afc "raw", warn=false) at ../block.c:4768
> #21 0x000055b33e3b2377 in stream_prepare (job=0x55b341030620) at
> ../block/stream.c:88
> #22 0x000055b33e39ba6b in job_prepare (job=0x55b341030620) at ../job.c:781
> #23 0x000055b33e39a31f in job_txn_apply (job=0x55b341030620,
> fn=0x55b33e39ba2e <job_prepare>) at ../job.c:158
> #24 0x000055b33e39baf4 in job_do_finalize (job=0x55b341030620) at
> ../job.c:798
> #25 0x000055b33e39bcbb in job_completed_txn_success (job=0x55b341030620)
> at ../job.c:852
> #26 0x000055b33e39bd5b in job_completed (job=0x55b341030620) at
> ../job.c:865
> #27 0x000055b33e39bdae in job_exit (opaque=0x55b341030620) at ../job.c:885
> #28 0x000055b33e485cfc in aio_bh_call (bh=0x55b340ead840) at
> ../util/async.c:136
> #29 0x000055b33e485e06 in aio_bh_poll (ctx=0x55b340d026f0) at
> ../util/async.c:164
> #30 0x000055b33e456e9c in aio_dispatch (ctx=0x55b340d026f0) at
> ../util/aio-posix.c:381
> #31 0x000055b33e486237 in aio_ctx_dispatch (source=0x55b340d026f0,
> callback=0x0, user_data=0x0) at ../util/async.c:306
> #32 0x00007fa84cdceecd in g_main_context_dispatch () at
> /lib64/libglib-2.0.so.0
> #33 0x000055b33e465c7e in glib_pollfds_poll () at ../util/main-loop.c:221
> #34 0x000055b33e465cf8 in os_host_main_loop_wait (timeout=0) at
> ../util/main-loop.c:244
> #35 0x000055b33e465e00 in main_loop_wait (nonblocking=0) at
> ../util/main-loop.c:520
> #36 0x000055b33e228979 in qemu_main_loop () at ../softmmu/vl.c:1678
> #37 0x000055b33ddb5046 in main (argc=20, argv=0x7ffcc0af6a88,
> envp=0x7ffcc0af6b30) at ../softmmu/main.c:50
> 
> 
> The aio_poll at the start of mirror_exit_common is my addition. But
> anyway the problem is here: we end up calling mirror_prepare inside of
> stream_prepare!
> 
> ----
> 
> OK, moving further: I moved stream_clean, stream_prepare,
> mirror_exit_common and mirror_complete into coroutine context and
> protected them all with a new global coroutine mutex. This helps; the
> bug no longer reproduces.
> 
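For reference, a minimal sketch of the shape of that experiment, not the
actual patch: the lock name and wrapper below are invented, and it assumes
the completion callbacks really do run in coroutine context as described
(CoMutex, qemu_co_mutex_lock/unlock and coroutine_fn are the existing QEMU
primitives):

    /* Sketch only. qemu_co_mutex_init(&graph_update_lock) would have to
     * run once at startup. */
    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"
    #include "qemu/job.h"

    static CoMutex graph_update_lock;   /* hypothetical global lock */

    static void coroutine_fn stream_prepare_locked(Job *job)
    {
        qemu_co_mutex_lock(&graph_update_lock);
        /* ... original stream_prepare() body: bdrv_change_backing_file(),
         * bdrv_set_backing_hd(), ... using 'job' as before ... */
        qemu_co_mutex_unlock(&graph_update_lock);
    }

    /* mirror_exit_common(), mirror_complete() and stream_clean() would be
     * wrapped the same way, so only one graph-changing completion path can
     * run at a time even if one of them re-enters aio_poll. */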
> Still, I'll try to understand fully what's going on. But for now I
> think that:
> 
> - graph modification code (permissions, drains, child replacements,
> etc.) is not safe to call in parallel, for example in different
> coroutines or through nested aio_poll;
> 
> - but it all does different kinds of drains, I/O operations, etc.,
> which lead to nested aio polls, and everything is broken.
> 
> 



