qemu-devel


From: Blue Swirl
Subject: Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
Date: Fri, 28 May 2010 20:06:45 +0000

2010/5/28 Gleb Natapov <address@hidden>:
> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>> 2010/5/27 Gleb Natapov <address@hidden>:
>> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <address@hidden> wrote:
>> >> > Blue Swirl wrote:
>> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <address@hidden> wrote:
>> >> >>> Anthony Liguori wrote:
>> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<address@hidden>  wrote:
>> >> >>>>>
>> >> >>>>>> From: Jan Kiszka<address@hidden>
>> >> >>>>>>
>> >> >>>>>> This allows communicating potential IRQ coalescing during delivery
>> >> >>>>>> from the sink back to the source. Targets that support IRQ coalescing
>> >> >>>>>> workarounds need to register handlers that return the appropriate
>> >> >>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> >> >>>>>> apply its workaround. If multiple sinks exist, the source may only
>> >> >>>>>> consider an IRQ coalesced if all other sinks either report
>> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>> >> >>>>>>
>> >> >>>>> No real devices are interested whether any of their output lines are
>> >> >>>>> even connected. This would introduce a new signal type, bidirectional
>> >> >>>>> multi-level, which is not correct.
>> >> >>>>>
>> >> >>>> I don't think it's really an issue of correctness, but I wouldn't
>> >> >>>> disagree to a suggestion that we ought to introduce a new signal type
>> >> >>>> for this type of bidirectional feedback.  Maybe it's qemu_coalesced_irq
>> >> >>>> and has a similar interface as qemu_irq.
>> >> >>> A separate type would complicate the delivery of the feedback value
>> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>> >> >>>
>> >> >>>>> I think the real solution to coalescing is to put the logic inside one
>> >> >>>>> device, in this case APIC because it has the information about irq
>> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>> >> >>>>> information and whether they get delivered or not. If not, an internal
>> >> >>>>> timer is installed which injects the lost irqs.
>> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
>> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
>> >> >>> interrupt controller about the fact that a particular IRQ should be
>> >> >>> delivered at a specific regular rate. For this, you also need a generic
>> >> >>> interface - nothing really "won".
>> >> >>
>> >> >> OK, let's simplify: just reinject at next possible chance. No need to
>> >> >> monitor or tell anything.
>> >> >
>> >> > There are guests that won't like this (I know of one in-house, but
>> >> > others may even have more examples), specifically if you end up firing
>> >> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>> >> > spreads the reinjection according to the current rate.
>> >>
>> >> Then reinject with a constant delay, or next CPU exit. Such buggy
>> > If guest's time frequency is the same as host time frequency you can't
>> > reinject with constant delay. That is why current code mixes two
>> > approaches: reinject M interrupts in a row then delay.
>>
>> This approach can be also used by APIC-only version.
>>
> I don't know what APIC-only version you are talking about. I haven't
> seen the code and I don't understand hand waving, sorry.

There is no code, because we're still at the architecture design stage.

>> >> guests could also be assisted with special handling (like win2k
>> >> install hack), for example guest instructions could be counted
>> >> (approximately, for example using TB size or TSC) and only inject
>> >> after at least N instructions have passed.
>> > Guest instructions cannot be easily counted in KVM (it can be done more
>> > or less reliably using perf counters, maybe).
>>
>> Aren't there any debug registers or perf counters, which can generate
>> an interrupt after some number of instructions have been executed?
> Don't think debug registers have something like that and they are
> available for guest use anyway. Perf counters differ greatly from CPU
> to CPU (even between two CPUs of the same manufacturer), and we want to
> keep using them for profiling guests. And I don't see what problem it
> will solve anyway that can be solved by simple delay between irq
> reinjection.

This would allow counting the executed instructions and limiting their
number. Thus we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
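
To illustrate the arithmetic only (this is not existing QEMU or KVM code): a
minimal sketch of an instruction budget per host time slice. The names
InsnLimiter, insn_budget and limiter_account are hypothetical, and the sketch
assumes roughly one guest instruction per emulated cycle, which is a
simplification. With a 1 ms slice, a 500MHz guest would get a budget of
500,000 instructions, while a 2GHz host could execute about 2,000,000 in the
same time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VCPU limiter state (not a QEMU structure). */
typedef struct {
    uint64_t budget;   /* instructions allowed in the current slice */
    uint64_t executed; /* instructions counted so far in this slice */
} InsnLimiter;

/* Instructions a guest running at guest_hz may execute in slice_ns,
 * assuming one instruction per emulated cycle (a simplification). */
static uint64_t insn_budget(uint64_t guest_hz, uint64_t slice_ns)
{
    assert(slice_ns <= 1000000000ULL); /* slices of at most one second */
    return guest_hz * slice_ns / 1000000000ULL;
}

/* Account for n executed instructions (counted approximately, e.g. by
 * TB size); returns true once the slice budget is used up and the VCPU
 * should stop executing until the next slice. */
static bool limiter_account(InsnLimiter *l, uint64_t n)
{
    l->executed += n;
    return l->executed >= l->budget;
}
```

The caller would reset executed to zero at the start of each slice. How the
instructions are counted under KVM is exactly the open question above.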

>>
>> >>
>> >> > And even if the rate did not matter, the APIC would still have to know
>> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> > as such for a certain interval. This really does not sound like
>> >> > simplifying things or even make them cleaner.
>> >>
>> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> would probably eventually spread to many other devices. The logical
>> >> conclusion of that would be a system where all devices would be
>> >> careful not to disturb the guest at wrong moment because that would
>> >> trigger a bug.
>> >>
>> > This voodoo will be so complex and unreliable that it will make RTC hack
>> > pale in comparison (and I still don't see how you are going to make it
>> > actually work).
>>
>> Implement everything inside APIC: only coalescing and reinjection.
> APIC has zero info needed to implement reinjection correctly as was
> shown to you several times in this thread and you simply keep ignoring
> it.

On the contrary, APIC is actually the only source of the IRQ ack
information. RTC hack would not work without APIC (or the
bidirectional IRQ) passing this info to RTC.
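
For reference, the feedback rule from the quoted patch description (a source
with several sinks may treat an IRQ as coalesced only if every sink reports
QEMU_IRQ_COALESCED or QEMU_IRQ_MASKED) could be sketched as below.
QEMU_IRQ_COALESCED and QEMU_IRQ_MASKED are the names from the patch text;
QEMU_IRQ_DELIVERED, the enum layout and combine_feedback() are assumptions
made for this illustration, not the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Delivery feedback codes. The numeric values and QEMU_IRQ_DELIVERED
 * are assumptions for this sketch. */
typedef enum {
    QEMU_IRQ_DELIVERED,
    QEMU_IRQ_COALESCED,
    QEMU_IRQ_MASKED
} IrqResult;

/* Combine the feedback of n sinks into what the source should see:
 * - any sink delivered the IRQ           -> DELIVERED
 * - otherwise, any sink coalesced it     -> COALESCED
 * - otherwise (all sinks masked)         -> MASKED */
static IrqResult combine_feedback(const IrqResult *results, size_t n)
{
    bool any_coalesced = false;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        if (results[i] == QEMU_IRQ_DELIVERED) {
            return QEMU_IRQ_DELIVERED;
        }
        if (results[i] == QEMU_IRQ_COALESCED) {
            any_coalesced = true;
        }
    }
    return any_coalesced ? QEMU_IRQ_COALESCED : QEMU_IRQ_MASKED;
}
```

Every IRQ redirection between the RTC and the APIC would have to pass such a
code back unchanged.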

What APIC doesn't have now is the timer frequency or period info. This
is known by RTC and also higher levels managing the clocks.

I keep ignoring the idea that the current model, where both RTC and
APIC must somehow work together to make coalescing work, is the only
possible one just because it is committed and it happens to work in some
cases. It would be much better to concentrate this in one place, the APIC
or preferably a higher level where it may benefit other timers too.
Provided of course that the other models can be made to work.
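
Purely as an illustration of the single-place idea (again, no such code
exists), the bookkeeping could look roughly like this: count IRQs that arrive
while the previous one is still unacked, and on each ack decide whether to
reinject at once or after a delay, using the "M in a row, then delay" scheme
mentioned earlier in the thread. All names here (CoalesceState, irq_raise,
irq_ack, ReinjectAction) are hypothetical, and the delay length is left open
because that is the period information the APIC does not have today.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for one periodic IRQ line. */
typedef struct {
    bool pending;        /* raised (or scheduled) but not yet acked */
    unsigned backlog;    /* coalesced IRQs still owed to the guest */
    unsigned burst_left; /* immediate reinjections left before a delay */
    unsigned burst_max;  /* "M": reinjections allowed in a row */
} CoalesceState;

typedef enum {
    REINJECT_NONE,       /* nothing owed */
    REINJECT_NOW,        /* reinject one lost IRQ immediately */
    REINJECT_AFTER_DELAY /* reinject one lost IRQ once a timer fires */
} ReinjectAction;

/* The source raises the line. Returns true if the IRQ is delivered,
 * false if it was coalesced with a still-pending one. */
static bool irq_raise(CoalesceState *s)
{
    if (s->pending) {
        s->backlog++;
        return false;
    }
    s->pending = true;
    return true;
}

/* The guest acks the IRQ. Tells the caller how to handle the backlog. */
static ReinjectAction irq_ack(CoalesceState *s)
{
    assert(s->pending);
    s->pending = false;
    if (s->backlog == 0) {
        s->burst_left = s->burst_max; /* caught up: allow a new burst */
        return REINJECT_NONE;
    }
    s->backlog--;
    s->pending = true; /* the owed IRQ becomes the pending one */
    if (s->burst_left > 0) {
        s->burst_left--;
        return REINJECT_NOW;
    }
    s->burst_left = s->burst_max;
    return REINJECT_AFTER_DELAY;
}
```

With burst_max = 2 and a backlog of four, successive acks would yield NOW,
NOW, AFTER_DELAY, NOW and finally NONE.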

>> Maybe that version would not bend backwards as much as the current to
>> cater for buggy hosts.
>>
> You mean "buggy guests"?

Yes, sorry.

> What guests are not buggy in your opinion?
> Linux tries hard to be smart and as a result the only way to have stable
> clock with it is to go paravirt.

I'm not an OS designer, but I think an OS should never crash, even if
a burst of IRQs is received. Reprogramming the timer should consider
the pending IRQ situation (0 or 1 with real HW). Those bugs are one
cause of the problem.

>> > The fact is that timer device is not "just like any
>> > other device" in virtual world. Any other device is easy: you just
>> > implement spec as close as possible and everything works. For time
>> > source device this is not enough. You can implement RTC+HPET to the
>> > letter and your guest will drift like crazy.
>>
>> It's doable: a cycle accurate emulator will not cause any drift,
>> without any voodoo. The interrupts would come after executing the same
>> instruction as the real HW. For emulating any sufficiently buggy
>> guests in any sufficiently desperate low resource conditions, this may
>> be the only option that will always work.
>>
> Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> to be one. On the contrary KVM runs at native host CPU speed most of the
> time, so any emulation done between two instructions is theoretically
> noticeable for a guest. TSC is bypassed directly to a guest too, so
> keeping all time sources in perfect sync is also impossible.

That is actually another cause of the problem. KVM gives the guest an
illusion that the VCPU speed is equal to host speed. When they don't
match, especially in critical code, there can be problems. It would be
better to tell the guest a lower speed, which also can be guaranteed.

Maybe we should also offload the device emulation to another host CPU
with threading. A load from a device will always be much slower than
on real HW though.


