Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

From: Laszlo Ersek
Subject: Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me
Date: Fri, 17 Jul 2015 15:57:20 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
On 07/17/15 15:48, Marc Zyngier wrote:
> On Fri, 17 Jul 2015 14:39:55 +0100
> Laszlo Ersek <address@hidden> wrote:
>
>> On 07/17/15 15:28, Marc Zyngier wrote:
>>> On Fri, 17 Jul 2015 10:30:38 +0100
>>> Paolo Bonzini <address@hidden> wrote:
>>>
>>>>
>>>>
>>>> On 17/07/2015 06:44, Paolo Bonzini wrote:
>>>>>
>>>>>
>>>>> On 16/07/2015 21:05, Richard W.M. Jones wrote:
>>>>>>
>>>>>> Sorry to spoil things, but I'm still seeing this bug, although it is
>>>>>> now a lot less frequent with your patch. I would estimate it happens
>>>>>> more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
>>>>>> with qemu.git + the v2 patch series.
>>>>>>
>>>>>> It's the exact same hang in both cases.
>>>>>>
>>>>>> Is it possible that this patch doesn't completely close any race?
>>>>>>
>>>>>> Still, it is an improvement, so there is that.
>>>>>
>>>>> Would seem at first glance like a different bug.
>>>>>
>>>>> Interestingly, adding some "tracing" (qemu_clock_get_ns) makes the bug
>>>>> more likely: now it reproduces in about 10 tries. Of course :) adding
>>>>> other kinds of tracing instead make it go away again (>50 tries).
>>>>>
>>>>> Perhaps this:
>>>>>
>>>>> i/o thread         vcpu thread                   worker thread
>>>>> ---------------------------------------------------------------------
>>>>> lock_iothread
>>>>> notify_me = 1
>>>>> ...
>>>>> unlock_iothread
>>>>>                    lock_iothread
>>>>>                    notify_me = 3
>>>>>                    ppoll
>>>>>                    notify_me = 1
>>>>>                                                  bh->scheduled = 1
>>>>>                                                  event_notifier_set
>>>>>                    event_notifier_test_and_clear
>>>>> ppoll
>>>>>   ^^ hang
>>>>>
>>>>> In the exact shape above, it doesn't seem too likely to happen, but
>>>>> perhaps there's another simpler case. Still, the bug exists.
>>>>>
>>>>> The above is not really related to notify_me. Here the notification is
>>>>> not being optimized away! So I wonder if this one has been there forever.
>>>>>
>>>>> Fam suggested putting the event_notifier_test_and_clear before
>>>>> aio_bh_poll(), but it does not work. I'll look more closely.
>>>>>
>>>>> However, an unconditional event_notifier_test_and_clear is pretty
>>>>> expensive. On one hand, obviously correctness comes first. On the
>>>>> other hand, an expensive operation at the wrong place can mask the race
>>>>> very easily; I'll let the fix run for a while, but I'm not sure if a
>>>>> successful test really says anything useful.
>>>>
>>>> So it may not be useful, but a successful test is still a successful test. :)
>>>> The following patch, which also includes the delta between v2 and v3
>>>> of this series, survived 674 reboots before hitting a definitely
>>>> unrelated problem:
>>>>
>>>> error: kvm run failed Function not implemented
>>>> PC=00000000bf671210  SP=00000000c00001f0
>>>> X00=000000000a003e70 X01=0000000000000000 X02=00000000bf680548 X03=0000000000000030
>>>> X04=00000000bbb5fc18 X05=00000000004b7770 X06=00000000bf721930 X07=000000000000009a
>>>> X08=00000000bf716858 X09=0000000000000090 X10=0000000000000000 X11=0000000000000046
>>>> X12=00000000a007e03a X13=0000000000000000 X14=0000000000000000 X15=0000000000000000
>>>> X16=00000000bf716df0 X17=0000000000000000 X18=0000000000000000 X19=00000000bf6f5f18
>>>> X20=0000000000000000 X21=0000000000000000 X22=0000000000000000 X23=0000000000000000
>>>> X24=0000000000000000 X25=0000000000000000 X26=0000000000000000 X27=0000000000000000
>>>> X28=0000000000000000 X29=0000000000000000 X30=0000000000000000
>>>> PSTATE=60000305 (flags -ZC-)
>>>>
>>>> For the record, this is the kvm_run struct:
>>>>
>>>> $6 = {request_interrupt_window = 0 '\000', padding1 = "\000\000\000\000\000\000",
>>>>   exit_reason = 0, ready_for_interrupt_injection = 0 '\000', if_flag = 0 '\000',
>>>>   flags = 0, cr8 = 0, apic_base = 0,
>>>>   {hw = {hardware_exit_reason = 150994968},
>>>>    fail_entry = {hardware_entry_failure_reason = 150994968},
>>>>    ex = {exception = 150994968, error_code = 0},
>>>>    io = {direction = 24 '\030', size = 0 '\000', port = 2304, count = 0,
>>>>      data_offset = 144},
>>>>    debug = {arch = {<No data fields>}},
>>>>    mmio = {phys_addr = 150994968, data = "\220\000\000\000\000\000\000",
>>>>      len = 4, is_write = 0 '\000'},
>>>>    hypercall = {nr = 150994968, args = {144, 4, 0, 0, 0, 0}, ret = 0,
>>>>      longmode = 0, pad = 0},
>>>>    tpr_access = {rip = 150994968, is_write = 144, pad = 0},
>>>>    s390_sieic = {icptcode = 24 '\030', ipa = 2304, ipb = 0},
>>>>    s390_reset_flags = 150994968,
>>>>    s390_ucontrol = {trans_exc_code = 150994968, pgm_code = 144},
>>>>    dcr = {dcrn = 150994968, data = 0, is_write = 144 '\220'},
>>>>    internal = {suberror = 150994968, ndata = 0,
>>>>      data = {144, 4, 0 <repeats 14 times>}},
>>>>    osi = {gprs = {150994968, 144, 4, 0 <repeats 29 times>}},
>>>>    papr_hcall = {nr = 150994968, ret = 144, args = {4, 0, 0, 0, 0, 0, 0, 0, 0}},
>>>>    s390_tsch = {subchannel_id = 24, subchannel_nr = 2304, io_int_parm = 0,
>>>>      io_int_word = 144, ipb = 0, dequeued = 4 '\004'},
>>>>    epr = {epr = 150994968},
>>>>    system_event = {type = 150994968, flags = 144},
>>>>    s390_stsi = {addr = 150994968, ar = 144 '\220', reserved = 0 '\000',
>>>>      fc = 0 '\000', sel1 = 0 '\000', sel2 = 0},
>>>>    padding = "\030\000\000\t\000\000\000\000\220\000\000\000\000\000\000\000\004",
>>>>      '\000' <repeats 238 times>},
>>>>   kvm_valid_regs = 0, kvm_dirty_regs = 0,
>>>>   s = {regs = {<No data fields>}, padding = '\000' <repeats 2047 times>}}
>>>>
>>>> Marc, does it ring any bell?
>>>
>>> Well, this is an example of a guest accessing non-memory using an
>>> instruction that we cannot safely emulate - not an IO accessor (load
>>> multiple, for example).
>>>
>>> In this case, we kill the guest (we could as well inject a fault).
>>>
>>> This vcpu seems to be accessing 0x9000018 (the mmio structure points
>>> there), but I can't immediately relate it to the content of the
>>> registers.
>>
>> [VIRT_UART] = { 0x09000000, 0x00001000 },
>>
>
> Still: there is nothing in the registers that remotely points to that
> area. X0 is the closest, but it'd take a big negative offset to get
> there.
>
> Is that a Linux kernel? or something else?
Assuming I managed to decipher the nesting of the contexts, the register
dump was pasted by Paolo. IIRC Paolo has been using the guest firmware
(the ArmVirtPkg/ArmVirtQemu.dsc platform build in edk2, sometimes also
known as "AAVMF") to reproduce the issue.
Thanks
Laszlo