
Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device

From: Avi Kivity
Subject: Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
Date: Fri, 26 Feb 2010 10:47:19 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100120 Fedora/3.0.1-1.fc12 Thunderbird/3.0.1

On 02/25/2010 09:55 PM, Anthony Liguori wrote:
On 02/25/2010 11:33 AM, Avi Kivity wrote:
On 02/25/2010 07:15 PM, Anthony Liguori wrote:
I agree. Further, once we fine-grain device threading, the iothread essentially disappears and is replaced by device-specific threads. There's no "idle" anymore.

That's a nice idea, but how is I/O dispatch handled? Is everything synchronous or do we continue to program asynchronously?

Simple stuff can be kept asynchronous, complex stuff (like qcow2) ought to be made synchronous (it uses threads anyway, so we don't lose anything). Stuff like vnc can go either way.

We've discussed this before and I still contend that threads do not make qcow2 any simpler.

qcow2 is still not fully asynchronous. All the other format drivers (except raw) are fully synchronous. If we had a threaded infrastructure, we could convert them all in a day. As it is, you can only use the other block format drivers in 'qemu-img convert'.

Each such thread could run the same loop as the iothread. Any pollable fd or timer would be associated with a thread, so things continue as normal more or less. Unassociated objects continue with the main iothread.

Is the point latency or increasing available CPU resources?


If the device models are re-entrant, that reduces a ton of the demand on the qemu_mutex, which means that the IO thread can run uncontended. While we have evidence that the VCPU threads and IO threads are competing with each other today, I don't think we have any evidence to suggest that the IO thread is starving itself with long-running events.

I agree we have no evidence and that this is all speculation. But consider a 64-vcpu guest: it has a 1:64 ratio of vcpu time (initiations) to iothread time (completions). If each vcpu generates 5000 initiations per second, the iothread needs to handle 320,000 completions per second. At that rate you will see some internal competition. That thread will also have a hard time shuffling data, since every completion's data will reside in the wrong CPU cache.

Note, an alternative to multiple iothreads is to move completion handling back to vcpus, provided we can steer the handler close to the guest completion handler.

With the device model, I'd like to see us move toward a very well defined API for each device to use. Part of the reason for this is to limit the scope of the devices in such a way that we can enforce this at compile time. Then we can introduce locking within devices with some level of guarantee that we've covered the API devices are actually consuming.

Yes. On the other hand, the shape of the API will be influenced by the locking model, so we'll have to take iterative steps, unless someone comes out with a brilliant design.

For host services though, it's much more difficult to isolate them like this.

What do you mean by host services?

I'm not necessarily claiming that this will never be the right thing to do, but I don't think we really have the evidence today to suggest that we should focus on this in the short term.

Agreed. We will start to see evidence (one way or the other) as fully loaded 64-vcpu guests are benchmarked. Another driver may be real-time guests; if a timer can be deferred by some block device initiation or completion, then we can say goodbye to any real-time guarantees we want to make.

Do not meddle in the internals of kernels, for they are subtle and quick to panic.
