From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
Date: Mon, 21 May 2012 15:18:24 +0300

On Mon, May 21, 2012 at 09:45:58PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 13:31 +0300, Michael S. Tsirkin wrote:
> 
> > > IE. Just making the default accessors ordered means that all devices
> > > written with the assumption that the guest will see accesses in the
> > > order they are written in the emulated device will be correct, which
> > > means pretty much all of them (well, almost).
> > > 
> > >  --> It actually fixes a real bug that affects almost all devices
> > >      that do DMA today in qemu
> > 
> > In theory fine but practical examples that affect x86?
> > We might want to at least document some of them.
> 
> x86 I don't know; I suspect mostly none have actually been hit, but
> I could be wrong, I'm not familiar enough with it.
> 
> I have practical examples that affect power though :-) And I'm
> reasonably confident they'll affect ARM as soon as people start doing
> serious SMP on it etc...
> 
> IE. The code is provably wrong without barriers.
> 
> > wmb and rmb are nops so there's no bug in practice.
> > So the only actual rule which might be violated by qemu is that
> > read flushes out writes.
> > It's unlikely you will find real examples where this matters
> > but I'm interested to hear otherwise.
> 
> wmb and rmb are not nops on powerpc and arm to name a couple.
> 
> mb is more than just "read flushes writes" (besides, it's not a statement
> about flushing, it's a statement about ordering; whether it has a
> flushing side effect on x86 is a separate issue, it doesn't on power for
> example).

I referred to reads not bypassing writes on PCI.
This is the real argument in my eyes: that we
should behave the way real hardware does.

> Really flushing out writes matters very much in real life in two very
> different contexts that tend not to affect emulation in qemu as much.
> 
> One is flushing writes in the opposite direction (or rather, having the
> read response queued up behind those writes), which is critical to
> ensuring proper completion of DMAs after an LSI from a guest driver
> perspective on real HW, typically.
> 
> The other classic case is to flush posted MMIO writes in order to ensure
> that a subsequent delay is respected.
> 
> Most of those don't actually matter when doing emulation. Besides, a
> barrier won't provide you the second guarantee; you need a nastier
> construct, at least on some architectures like power.

Exactly. This is what I was saying too.

> However, we do need to ensure that read and writes are properly ordered
> vs. each other (regardless of any "flush" semantic) or things could go
> very wrong on out-of-order architectures (here too, my understanding of x86 is
> limited).

Right. Here's a compromise:
- add smp_rmb() on any DMA read
- add smp_wmb() on any DMA write
This is almost zero cost on x86 at least.
So we are not regressing existing setups.

Are there any places where devices do read after write?
My preferred way is to find them and have them call pci_dma_flush(),
which invokes smp_mb(). If there is such a case it's likely on the
datapath anyway, so we do care.

But I can also live with a global flag "latest_dma_read"
and on read we could do
        if (unlikely(latest_dma_read))
                smp_mb();

if you really insist on it,
though I do think it's inelegant.
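
Roughly, something like the following sketch. The helper names, the flag
name and where exactly this lives are illustrative, not an actual API
proposal; it assumes qemu's smp_mb()/smp_wmb() plus an smp_rmb() along
the same lines, and the flag here records that the previous DMA access
was a write, which is the case where the following read needs the full
barrier:

    /* Sketch: ordered DMA helpers with a cheap read-after-write check. */
    static bool dma_last_was_write;

    static void dma_read_ordered(target_phys_addr_t addr, void *buf, int len)
    {
        if (unlikely(dma_last_was_write)) {
            smp_mb();                  /* read after write: full barrier */
            dma_last_was_write = false;
        } else {
            smp_rmb();                 /* order against earlier DMA reads */
        }
        cpu_physical_memory_read(addr, buf, len);
    }

    static void dma_write_ordered(target_phys_addr_t addr,
                                  const void *buf, int len)
    {
        smp_wmb();                     /* order against earlier DMA writes */
        cpu_physical_memory_write(addr, buf, len);
        dma_last_was_write = true;
    }

The branch is on the fast path, but branches are pretty cheap, and on
x86 smp_rmb()/smp_wmb() are just compiler barriers, so the common case
stays almost free.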

> > I also note that guests do use write-combining e.g. for vga.
> > One wonders whether stronger barriers are needed
> > because of that?
> 
> 
> What I'm trying to address here is really to ensure that loads and stores
> issued by qemu emulated devices "appear" in the right order with respect
> to guest driver code running simultaneously on different CPUs (both P
> and V in this context).
> 
> I've somewhat purposefully left alone for now the problem of ordering
> "across" transfer directions, which in the case of PCI (and most "sane"
> busses) means read flushing writes in the other direction. (Here too
> it's not really a flush, it's just an ordering statement between the
> write and the read response).
> 
> If you want to look at that other problem more closely, it breaks down
> into two parts:
> 
>  - Guest writes vs. qemu reads. My gut feeling is that this should take
> care of itself for the most part, though you might want to bracket PIO
> operations with barriers for extra safety. But we might want to dig
> more.
> 
>  - qemu writes vs. guest reads. Here too, this only matters as long as
> the guest can observe the cat inside the box. This means the guest gets
> to "observe" the writes vs. some other event that might need to be
> synchronized with the read, such as an interrupt. Here my gut feeling is
> that we might be safer by having a barrier before any interrupt gets shot
> to the guest (and a syscall/ioctl is not a memory barrier, though at
> least with kvm on power, a guest entry is so we should be fine) but here
> too, it might be a good idea to dig more.
> 
> However, both of these can (and probably should) be treated separately
> and are rather unlikely to cause problems in practice. In most cases
> these have to do with flushing non-coherent DMA buffers in intermediary
> bridges, and while those could be considered as akin to a store queue
> that hasn't yet reached coherency on a core, in practice I think we are
> fine.
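
For the interrupt case the intent would be something like the sketch
below (hypothetical device, made-up names; only qemu_set_irq() and
cpu_physical_memory_write() are existing helpers). The point is just
that the completion record has to be visible before the guest can
observe the IRQ:

    typedef struct FakeNICState {
        qemu_irq irq;                     /* line raised towards the guest  */
        target_phys_addr_t status_addr;   /* guest-visible completion word  */
    } FakeNICState;

    static void fake_nic_dma_complete(FakeNICState *s, uint32_t status)
    {
        uint32_t val = cpu_to_le32(status);

        cpu_physical_memory_write(s->status_addr, &val, sizeof(val));
        smp_wmb();                  /* publish the status before the IRQ */
        qemu_set_irq(s->irq, 1);
    }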
> 
> So let's go back to the problem at hand, which is to make sure that loads
> and stores done by emulated devices get observed by the guest code in
> the right order.
> 
> My preference is to do it in the dma_* accessors (possibly providing
> __relaxed versions), which means "low level" stuff using cpu_*
> directly such as virtio doesn't pay the price (or rather is responsible
> for doing its own barriers, which is good since typically that's our
> real perf sensitive stuff).
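
In other words a split roughly like this (sketch only; the _relaxed
name and the signatures are illustrative, not the final API):

    /* Default accessor: ordered, what most devices should use. */
    static void dma_memory_read(target_phys_addr_t addr, void *buf, int len)
    {
        smp_rmb();          /* pay for ordering by default */
        cpu_physical_memory_read(addr, buf, len);
    }

    /* Opt-out for perf-critical devices that do their own barriers. */
    static void dma_memory_read_relaxed(target_phys_addr_t addr,
                                        void *buf, int len)
    {
        cpu_physical_memory_read(addr, buf, len);
    }

virtio, which uses cpu_physical_* / its own ring accessors directly,
never pays for barriers it doesn't need.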
> 
> Anthony earlier preferred going all the way down into cpu_physical_*, which
> is what my latest round of patches did. But I can understand that this
> makes people uncomfortable.
> 
> In any case, we have yet to get some measurements of the actual cost of
> those barriers on x86. I can try to give it a go on my laptop next week,
> but I'm not an x86 expert and don't have much different x86 HW
> around (such as large SMP machines where the cost might differ or
> different generations of x86 processors etc...).
> 
> > > Then, fine-tuning performance-critical ones by selectively removing
> > > barriers allows us to improve performance where it would otherwise
> > > be harmed.
> > 
> > So that breaks attempts to bisect performance regressions.
> > Not good.
> 
> Disagreed. In all cases safety trumps performance.

You said above x86 is unaffected. This is portability, not safety.

> Almost all our
> devices were written without any thought given to ordering, so they
> basically can and should all be considered broken.

Problem is, a lot of code is likely broken even after you sprinkle
barriers around. For example, qemu might write A then B where the guest
driver expects to see B written before A.

> Since thinking
> about ordering is something that, by experience, very few programmers can
> do and get right, the default should absolutely be fully ordered.

Give it bus ordering. That is not fully ordered.

> Performance regressions aren't a big deal to bisect in that case: If
> there's a regression for a given driver and it points to this specific
> patch adding the barriers then we know precisely where the regression
> come from, and we can get some insight about how this specific driver
> can be improved to use more relaxed accessors.
> 
> I don't see the problem.
> 
> One thing that might be worth looking at is if indeed mb() is so much
> more costly than just wmb/rmb, in which case we could have some
> smarts in the accessors to make them skip the full mb based on knowledge
> of previous access direction, though here too I would be tempted to only
> do that if absolutely necessary (ie if we can't instead just fix the
> sensitive driver to use explicitly relaxed accessors).

We did this in virtio and yes it is measurable.
Branches are pretty cheap though.

> > > So on that I will not compromise.
> > > 
> > > However, I think it might be better to leave the barriers in the dma
> > > accessors since that's how you also get iommu transparency etc... so it's
> > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > lower level device drivers which are thus responsible also for dealing
> > > with ordering if they have to.
> > > 
> > > Cheers,
> > > Ben.
> > 
> > You claim to understand what matters for all devices; I doubt that.
> 
> It's pretty obvious that anything that does DMA using a classic
> descriptor + buffers structure is broken without appropriate ordering.
> 
> And yes, I claim to have a fairly good idea of the problem, but I don't
> think throwing credentials around is going to be helpful.
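
To make that pattern concrete (hypothetical device, illustrative names):
a descriptor-ring device model typically writes the payload and then a
"done" status, and without a barrier in between a guest on a weakly
ordered host (power, arm) can observe the status before the data it
describes:

    #define DESC_DONE 0x80000000

    static void fake_rx_complete(target_phys_addr_t buf_addr,
                                 target_phys_addr_t desc_addr,
                                 const uint8_t *pkt, uint32_t len)
    {
        uint32_t done = cpu_to_le32(DESC_DONE | len);

        cpu_physical_memory_write(buf_addr, pkt, len);      /* payload first */
        smp_wmb();                                          /* then order    */
        cpu_physical_memory_write(desc_addr, &done, sizeof(done)); /* status */
    }

This is the write-write ordering that barriers in the accessors (or in
the device itself) are meant to provide.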
>  
> > Why don't we add safe APIs, then go over devices and switch over?
> > I counted 97 pci_dma_ accesses.
> > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > 
> > Let maintainers decide where speed matters.
> 
> No. Let's fix the general bug first. Then let people who know the
> individual drivers intimately and understand their access patterns make
> the call as to when things can/should be relaxed.
> 
> Ben.

As the maintainer of a device, if you send me a patch I can review it.
If you change core APIs and create performance regressions,
I don't even know what to review without wasting time debugging
and bisecting.

According to what you said, you want to fix kvm on powerpc.
Good. Find a way that looks non-intrusive on x86, please.

-- 
MST


