From: Christian Borntraeger
Subject: Re: [Qemu-devel] [PULL for-2.5 2/4] block: Don't wait serialising for non-COR read requests
Date: Tue, 8 Dec 2015 14:28:51 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0

On 12/08/2015 01:30 PM, Christian Borntraeger wrote:
> On 12/08/2015 01:00 PM, Cornelia Huck wrote:
>> On Tue, 8 Dec 2015 10:59:54 +0100
>> Kevin Wolf <address@hidden> wrote:
>>
>>> Am 07.12.2015 um 17:42 hat Cornelia Huck geschrieben:
>>>> On Mon, 7 Dec 2015 11:02:51 +0100
>>>> Cornelia Huck <address@hidden> wrote:
>>>>
>>>>> On Thu,  3 Dec 2015 13:00:00 +0800
>>>>> Stefan Hajnoczi <address@hidden> wrote:
>>>>>
>>>>>> From: Fam Zheng <address@hidden>
>>>>>>
>>>>>> The assertion problem was noticed in 06c3916b35a, but it wasn't
>>>>>> completely fixed, because even though the req is not marked as
>>>>>> serialising, it still gets serialised by wait_serialising_requests
>>>>>> against other serialising requests, which could lead to the same
>>>>>> assertion failure.
>>>>>>
>>>>>> Fix it by skipping serialisation even more explicitly for this
>>>>>> specific case.
>>>>>>
>>>>>> Signed-off-by: Fam Zheng <address@hidden>
>>>>>> Message-id: address@hidden
>>>>>> Signed-off-by: Stefan Hajnoczi <address@hidden>
>>>>>> ---
>>>>>>  block/backup.c        |  2 +-
>>>>>>  block/io.c            | 12 +++++++-----
>>>>>>  include/block/block.h |  4 ++--
>>>>>>  trace-events          |  2 +-
>>>>>>  4 files changed, 11 insertions(+), 9 deletions(-)
>>>>>
>>>>> This one causes segfaults for me:
>>>>>
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> bdrv_is_inserted (bs=0x800000000000) at /data/git/yyy/qemu/block.c:3071
>>>>> 3071          if (!drv) {
>>>>>
>>>>> (gdb) bt
>>>>> #0  bdrv_is_inserted (bs=0x800000000000) at 
>>>>> /data/git/yyy/qemu/block.c:3071
>>>
>>> This looks like some kind of memory corruption that hit blk->bs. It's
>>> most definitely not a valid pointer anyway.
>>>
>>>>> #1  0x0000000080216974 in blk_is_inserted (blk=<optimized out>)
>>>>>     at /data/git/yyy/qemu/block/block-backend.c:986
>>>>> #2  0x00000000802169c6 in blk_is_available (address@hidden)
>>>>>     at /data/git/yyy/qemu/block/block-backend.c:991
>>>>> #3  0x0000000080216d12 in blk_check_byte_request (address@hidden, 
>>>>>     address@hidden, size=16384)
>>>>>     at /data/git/yyy/qemu/block/block-backend.c:558
>>>>> #4  0x0000000080216df2 in blk_check_request (address@hidden, 
>>>>>     address@hidden, address@hidden)
>>>>>     at /data/git/yyy/qemu/block/block-backend.c:589
>>>>> #5  0x0000000080217ee8 in blk_aio_readv (blk=0x3ffb17e7960, sector_num=
>>>>>     9626888, iov=0x8098c658, nb_sectors=<optimized out>, cb=
>>>>>     0x80081150 <virtio_blk_rw_complete>, opaque=0x80980620)
>>>>>     at /data/git/yyy/qemu/block/block-backend.c:727
>>>>> #6  0x000000008008186e in submit_requests (niov=<optimized out>, 
>>>>>     num_reqs=<optimized out>, start=<optimized out>, mrb=<optimized out>, 
>>>>>     blk=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:366
>>>>> #7  virtio_blk_submit_multireq (mrb=<optimized out>, blk=<optimized out>)
>>>>>     at /data/git/yyy/qemu/hw/block/virtio-blk.c:444
>>>>> #8  virtio_blk_submit_multireq (blk=0x3ffb17e7960, mrb=0x3ffffffeb58)
>>>>>     at /data/git/yyy/qemu/hw/block/virtio-blk.c:389
>>>>> #9  0x00000000800823ee in virtio_blk_handle_output (vdev=<optimized out>, 
>>>>>     vq=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:615
>>>>> #10 0x00000000801e367e in aio_dispatch (ctx=0x80918520)
>>>>>     at /data/git/yyy/qemu/aio-posix.c:326
>>>>> #11 0x00000000801d28b0 in aio_ctx_dispatch (source=<optimized out>, 
>>>>>     callback=<optimized out>, user_data=<optimized out>)
>>>>>     at /data/git/yyy/qemu/async.c:231
>>>>> #12 0x000003fffd36a05a in g_main_context_dispatch ()
>>>>>    from /lib64/libglib-2.0.so.0
>>>>> #13 0x00000000801e0ffa in glib_pollfds_poll ()
>>>>>     at /data/git/yyy/qemu/main-loop.c:211
>>>>> #14 os_host_main_loop_wait (timeout=<optimized out>)
>>>>>     at /data/git/yyy/qemu/main-loop.c:256
>>>>> #15 main_loop_wait (nonblocking=<optimized out>)
>>>>>     at /data/git/yyy/qemu/main-loop.c:504
>>>>> #16 0x00000000800148a6 in main_loop () at /data/git/yyy/qemu/vl.c:1923
>>>>> #17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized 
>>>>> out>)
>>>>>     at /data/git/yyy/qemu/vl.c:4684
>>>>>
>>>>> Relevant part of command line:
>>>>>
>>>>> -drive 
>>>>> file=/dev/sda,if=none,id=drive-virtio-disk0,format=raw,serial=ccwzfcp1,cache=none
>>>>>  -device 
>>>>> virtio-blk-ccw,devno=fe.0.0001,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,scsi=off
>>>>
>>>> I played around a bit. The main part of this change seems to be calling
>>>> wait_serialising_requests() conditionally; reverting this makes the
>>>> guest boot again.
>>>>
>>>> I then tried to find out when wait_serialising_requests() was NOT
>>>> called and added fprintfs: well, it was _always_ called. I then added a
>>>> fprintf for flags at the beginning of the function: this produced a
>>>> segfault no matter whether wait_serialising_requests() was called
>>>> conditionally or unconditionally. Weird race?
>>>>
>>>> Anything further I can do? I guess this patch fixes a bug for someone,
>>>> but it means insta-death for my setup...
>>>
>>> If it happens immediately, perhaps running under valgrind is possible
>>> and could give some hints about what happened with blk->bs?
>>
>> Just a quick update: This triggers on a qemu built with a not-so-fresh
>> gcc 4.7.2 (and it seems to depend on compiler optimizations as well).
>> We can't trigger it with newer gccs. No hints from valgrind, sadly.
>> Investigation ongoing.
>>
>>
> 
> Some updates after looking at Cornelia's system. It seems to be related to
> the F18 gcc that is still on that test system.
> 
> The problem happens when hw/block/virtio-blk.c is compiled
> with -O2 and goes away with -O1 and -O0 (I narrowed that down to
> -fexpensive-optimizations).
> 
> The system calls virtio_blk_submit_multireq 26 times and then it messes
> up the blk pointer:
> 
> (gdb) display blk->name
> 1: blk->name = 0x80894300 "drive-virtio-disk0"
> (gdb) next
> 419               if (niov + req->qiov.niov > IOV_MAX) {
> 1: blk->name = 0x80894300 "drive-virtio-disk0"
> (gdb) 
> 424               if (req->qiov.size / BDRV_SECTOR_SIZE + nb_sectors > 
> max_xfer_len) {
> 1: blk->name = 0x80894300 "drive-virtio-disk0"
> (gdb) 
> 419               if (niov + req->qiov.niov > IOV_MAX) {
> 1: blk->name = 0x80894300 "drive-virtio-disk0"
> (gdb) 
> 429               if (sector_num + nb_sectors != req->sector_num) {
> 1: blk->name = 0x80894300 "drive-virtio-disk0"
> (gdb) 
> 434                   submit_requests(blk, mrb, start, num_reqs, niov);
> 1: blk->name = 0x80894300 "drive-virtio-disk0"
> (gdb) 
> 413       for (i = 0; i < mrb->num_reqs; i++) {
> 1: blk->name = 0x8089ae40 ""
> 
> 
> Uninlining submit_requests makes the problem go away, so this might be a
> gcc bug in Fedora 18. I am now going to look at the disassembly to be
> sure.

Not a compiler bug. gcc uses floating point register 8, which is
call-saved, to spill the blk pointer. submit_requests later calls
qemu_coroutine_enter, and after returning from qemu_coroutine_enter,
fpr8 contains junk. Not sure yet what happened.

Christian
