From: Avi Kivity
Subject: Re: [Qemu-devel] [PATCH -V3 09/32] virtio-9p: Implement P9_TWRITE/ Thread model in QEMU
Date: Tue, 30 Mar 2010 16:28:35 +0300
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100301 Fedora/3.0.3-1.fc12 Thunderbird/3.0.3

On 03/30/2010 04:13 PM, Anthony Liguori wrote:
On 03/30/2010 05:24 AM, Avi Kivity wrote:
On 03/30/2010 12:23 AM, Anthony Liguori wrote:
It's not sufficient. If you have a single thread that runs both live migrations and timers, then timers will be backlogged behind live migration, or you'll have to yield often. This is regardless of the locking model (and of course having threads without fixing the locking is insufficient as well, live migration accesses guest memory so it needs the big qemu lock).

But what's the solution? Sending every timer in a separate thread? We'll hit the same problem if we implement an arbitrary limit to number of threads.

A completion that's expected to take a couple of microseconds at most can live in the iothread. A completion that's expected to take a couple of milliseconds wants its own thread. We'll have to think about anything in between.

vnc and migration can perform large amounts of work in a single completion; they're limited only by the socket send rate and our internal rate-limiting which are both outside our control. Most device timers are O(1). virtio completions probably fall into the annoying "have to think about it" department.

I think it may make more sense to have vcpu completions vs. io thread completions and make vcpu completions target short lived operations.

vcpu completions make sense when you can tell that a completion will cause an interrupt injection and you have a good idea which cpu will be interrupted.

What I'm skeptical of, is whether converting virtio-9p or qcow2 to handle each request in a separate thread is really going to improve things.

Currently qcow2 isn't even fully asynchronous, so it can't fail to improve things.

Unless it introduces more data corruptions which is my concern with any significant change to qcow2.

It's possible to move qcow2 to a thread without any significant change to it (simply run the current code in its own thread, protected by a mutex). Further changes would be very incremental.
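To make the "run the current code in its own thread" idea concrete, here is a rough sketch using plain pthreads rather than any QEMU API. Every name in it (ToyWorker, ToyReq, toy_submit, toy_run) is hypothetical, and the "driver" is reduced to a counter; the only point is that the submitter queues a request and returns, while the possibly blocking code runs on a dedicated thread.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical request: a real one would carry buffers and a callback. */
typedef struct ToyReq {
    int nb_sectors;
    struct ToyReq *next;
} ToyReq;

/* Hypothetical worker state; lock protects the queue and the stop flag. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    ToyReq *head, *tail;
    int stopping;
    long sectors_done;      /* touched only by the worker thread */
} ToyWorker;

/* Dedicated thread: the unmodified, possibly blocking code runs here. */
static void *toy_worker_main(void *opaque)
{
    ToyWorker *w = opaque;

    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (!w->head && !w->stopping) {
            pthread_cond_wait(&w->cond, &w->lock);
        }
        if (!w->head) {
            break;          /* asked to stop and the queue is drained */
        }
        ToyReq *r = w->head;
        w->head = r->next;
        if (!w->head) {
            w->tail = NULL;
        }
        pthread_mutex_unlock(&w->lock);
        w->sectors_done += r->nb_sectors;   /* stand-in for blocking I/O */
        pthread_mutex_lock(&w->lock);
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Called from the submitting thread: queue the request and return. */
static void toy_submit(ToyWorker *w, ToyReq *r)
{
    r->next = NULL;
    pthread_mutex_lock(&w->lock);
    if (w->tail) {
        w->tail->next = r;
    } else {
        w->head = r;
    }
    w->tail = r;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

/* Queue n requests of 8 sectors each, stop the worker, return the total. */
static long toy_run(int n)
{
    ToyWorker w = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                    NULL, NULL, 0, 0 };
    ToyReq reqs[16];
    pthread_t tid;

    if (n < 0 || n > 16) {
        return -1;
    }
    if (pthread_create(&tid, NULL, toy_worker_main, &w) != 0) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        reqs[i].nb_sectors = 8;
        toy_submit(&w, &reqs[i]);
    }
    pthread_mutex_lock(&w.lock);
    w.stopping = 1;
    pthread_cond_signal(&w.cond);
    pthread_mutex_unlock(&w.lock);
    pthread_join(tid, NULL);
    return w.sectors_done;
}
```

Requests are still handled one at a time here, which matches the "no significant change" first step; nothing in the sketch attempts finer-grained concurrency.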

But that offers no advantage over what we have, which fails the proof-by-example that threading makes the situation better.

It has an advantage, qcow2 is currently synchronous in parts:

block/qcow2-cluster.c: ret = bdrv_write(s->hd, (cluster_offset >> 9) + n_start,
block/qcow2.c: bdrv_write(s->hd, (meta.cluster_offset >> 9) + num - 1, buf, 1);
block/qcow2.c: bdrv_write(bs, sector_num, buf, s->cluster_sectors);
block/qcow2-cluster.c: ret = bdrv_read(bs->backing_hd, sector_num, buf, n1);
block/qcow2-cluster.c: ret = bdrv_read(s->hd, coffset >> 9, s->cluster_data, nb_csectors);

To convert qcow2 to be threaded, I think you would have to wrap the whole thing in a lock, then convert the current asynchronous functions to synchronous (that's the whole point, right). At this point, you've regressed performance because you can only handle one read/write outstanding at a given time. So now you have to make the locking more granular but because we do layered block devices, you've got to make most of the core block driver functions thread safe.

Not at all. The first conversion will be to keep the current code as is, operating asynchronously, but running in its own thread. It will still support multiple outstanding requests using the current state machine code; the synchronous parts will remain synchronous relative to the block device, but async relative to everything else. The second stage will convert the state machine code to threaded code. This is more difficult but not overly so - turn every dependency list into a mutex.

Once you get basic data operations concurrent, which I expect won't be so bad, to get an improvement over the current code, you have to allow simultaneous access to metadata which is where I think the vast majority of the complexity will come from.

I have no plans to do that, all I want is qcow2 not to block vcpus. btw, I don't think it's all that complicated, it's simple to lock individual L2 blocks and the L1 block.
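For illustration only, per-table locking could look roughly like the sketch below. It is a toy model, not qcow2 code: the table sizes, the ToyTables layout and the toy_* functions are all assumptions made up for the example. One mutex covers the L1 table and each L2 table has its own, so updates that land in different L2 tables do not contend.

```c
#include <pthread.h>
#include <stdint.h>

#define TOY_L1_SIZE 4   /* number of L2 tables (hypothetical) */
#define TOY_L2_SIZE 8   /* entries per L2 table (hypothetical) */

typedef struct {
    pthread_mutex_t l1_lock;                 /* protects l1_present[] */
    pthread_mutex_t l2_lock[TOY_L1_SIZE];    /* one lock per L2 table */
    int l1_present[TOY_L1_SIZE];             /* is this L2 table in use? */
    uint64_t l2[TOY_L1_SIZE][TOY_L2_SIZE];   /* cluster -> host offset */
} ToyTables;

static void toy_tables_init(ToyTables *t)
{
    pthread_mutex_init(&t->l1_lock, NULL);
    for (int i = 0; i < TOY_L1_SIZE; i++) {
        pthread_mutex_init(&t->l2_lock[i], NULL);
        t->l1_present[i] = 0;
        for (int j = 0; j < TOY_L2_SIZE; j++) {
            t->l2[i][j] = 0;
        }
    }
}

/* Record a mapping; returns 0 on success, -1 if the cluster is out of range. */
static int toy_map_cluster(ToyTables *t, unsigned cluster, uint64_t host_offset)
{
    unsigned i1 = cluster / TOY_L2_SIZE;
    unsigned i2 = cluster % TOY_L2_SIZE;

    if (i1 >= TOY_L1_SIZE) {
        return -1;
    }
    /* The L1 lock is held only while marking the L2 table as in use. */
    pthread_mutex_lock(&t->l1_lock);
    t->l1_present[i1] = 1;
    pthread_mutex_unlock(&t->l1_lock);

    /* Only the one affected L2 table is locked for the entry update. */
    pthread_mutex_lock(&t->l2_lock[i1]);
    t->l2[i1][i2] = host_offset;
    pthread_mutex_unlock(&t->l2_lock[i1]);
    return 0;
}

/* Look up a mapping; returns 0 for unmapped or out-of-range clusters. */
static uint64_t toy_lookup(ToyTables *t, unsigned cluster)
{
    unsigned i1 = cluster / TOY_L2_SIZE;
    unsigned i2 = cluster % TOY_L2_SIZE;
    uint64_t off;
    int present;

    if (i1 >= TOY_L1_SIZE) {
        return 0;
    }
    pthread_mutex_lock(&t->l1_lock);
    present = t->l1_present[i1];
    pthread_mutex_unlock(&t->l1_lock);
    if (!present) {
        return 0;
    }
    pthread_mutex_lock(&t->l2_lock[i1]);
    off = t->l2[i1][i2];
    pthread_mutex_unlock(&t->l2_lock[i1]);
    return off;
}
```

The toy never frees an L2 table, which is why it can drop the L1 lock before taking the L2 lock; a real driver that evicts or reallocates tables would need to think harder about that ordering.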

You could argue that we stick qcow2 into a thread and stop there and that fixes the problems with synchronous data access. If that's the argument, then let's not even bother doing it at the qcow layer, let's just switch the block aio emulation to use a dedicated thread.

That's certainly the plan for vmdk and friends which are today useless. qcow2 deserves better treatment.

Sticking the VNC server in its own thread would be fine. Trying to make the VNC server multithreaded though would be problematic.

Why would it be problematic? Each client gets its own threads, they don't interact at all do they?

Dealing with locking of the core display which each client uses for rendering. Things like CopyRect will get ugly quickly.

Ultimately, this comes down to a question of lock granularity and thread granularity. I don't think it's a good idea to start with the assumption that we want extremely fine granularity. There's certainly very low hanging fruit with respect to threading.

Not familiar with the code, but doesn't vnc access the display core through an API? Slap a lock onto that.

I meant, exposing qemu core to the threads instead of pretending they aren't there. I'm not familiar with 9p so don't hold much of an opinion, but didn't you say you need threads in order to handle async syscalls? That may not be the deep threading we're discussing here.

btw, IIUC currently disk hotunplug will stall a guest, no? We need async aio_flush().

But aio_flush() never takes a very long time, right :-)

We had this discussion in the past re: live migration because we do an aio_flush() in the critical stage.

Live migration will stall a guest anyway. It doesn't matter if aio_flush blocks for a few ms, since the final stage will dominate it.

Do not meddle in the internals of kernels, for they are subtle and quick to 
