Re: [Qemu-devel] [libvirt] [RFC 0/5] block: File descriptor passing using -open-hook-fd
Mon, 09 Jul 2012 15:46:55 -0500
On 07/09/2012 03:29 PM, Eric Blake wrote:
On 07/09/2012 02:00 PM, Anthony Liguori wrote:
with the fd:name approach, the sequence is:
libvirt calls getfd:name1 over normal monitor
libvirt calls getfd:name2 over normal monitor
libvirt calls transaction around blockdev-snapshot-sync over normal
monitor, using fd:name1 and fd:name2
This general layout is true whether we rewrite all commands to
understand fd:nnn (proposal 1) or whether we add new magic parsing
(/dev/fd/nnn of proposal 3, or even /dev/fdset/nnn of proposal 5), all
as called out in these messages:
but with -open-hook-fd, the approach would be:
libvirt calls transaction
qemu calls open(file1) over hook
qemu calls open(file2) over hook
qemu responds to the original transaction
whereas this approach is quite different in semantics, but may indeed be
easier for qemu to implement, at the expense of some more complexity on
the part of libvirt.
At the high level, I think both approaches have one thing in common: by
refactoring all qemu code to go through qemu_open(), we can then
implement our desired complexity (whether fd:nn, /dev/fd/nnn,
/dev/fdset/nnn, or some other magic name parsing; or whether it is an
rpc call over a second socket in parallel to the monitor socket) in just
one location. Likewise, both approaches have to deal with libvirtd
restarts (magic name parsing by changing an 'inuse' flag when the
monitor detects EOF; rpc passing by failing a qemu_open() when the rpc
socket detects EOF).
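To make the "one location" point concrete, here is a minimal sketch of a single open wrapper that resolves a magic name to an fd that was passed in earlier. It is written in Python for brevity rather than QEMU's C, and the names open_dispatch and passed_fds are hypothetical, not QEMU's actual interface; it also ignores the 'inuse' tracking needed across libvirtd restarts.

```python
import os

# Hypothetical registry: name -> fd previously handed over by the management layer.
passed_fds = {}

FDSET_PREFIX = "/dev/fdset/"


def open_dispatch(path, flags=os.O_RDONLY):
    """Resolve a magic name to an already-passed fd, else open the path normally."""
    if path.startswith(FDSET_PREFIX):
        key = path[len(FDSET_PREFIX):]
        if key not in passed_fds:
            raise FileNotFoundError(path)
        # Duplicate so the caller owns its own descriptor and may close it freely.
        return os.dup(passed_fds[key])
    return os.open(path, flags)
```

Every caller goes through the one wrapper, so swapping the name parsing for an RPC over a second socket would only touch this function.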
The 'transaction' operation is thus blocked by the time it takes to do
two intermediate opens over a second channel, which kind of defeats the
purpose of making the transaction take effect with minimal guest downtime.
How are you defining "guest down time"?
It's important to note that code running in QEMU does not equate to
guest visible down time unless QEMU does an explicit vm_stop() which is
not happening here.
Instead, a VCPU may become blocked *if* it attempts to acquire qemu_mutex
while QEMU is holding it.
If your concern is qemu_mutex being held while waiting for libvirt, it
would be fairly easy to implement a qemu_open_async() that allowed
dropping back to the main loop and then calling a callback when
the open completes.
It would be pretty trivial to convert qmp_transaction to use such a
function.
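As a rough illustration of the asynchronous open idea, here is a sketch in Python using asyncio as a stand-in for QEMU's main loop. The function open_async is hypothetical and only mirrors the shape described above: start the open, return to the loop, and invoke a callback when the open completes.

```python
import asyncio
import os


def open_async(loop, path, flags, callback):
    """Run the blocking open off the main loop; call callback(result) on the loop.

    result is the new fd on success, or the exception raised by os.open.
    """
    future = loop.run_in_executor(None, os.open, path, flags)

    def _done(fut):
        exc = fut.exception()
        callback(exc if exc is not None else fut.result())

    # The done callback is scheduled on the event loop, not on the worker thread.
    future.add_done_callback(_done)
    return future
```

The caller does not block in open_async itself, so the loop stays free to service other work while the open is outstanding.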
In other words, remembering that transactions are divided into phases:
phase 1 - prepare: obtain all needed fds (whether by pre-opening them
via 'pass-fd' or other new 'getfd' relative, or whether by RPC calls);
no guest downtime, and with cleanup that avoids any leaks on any failures
phase 2 - commit: flush all devices and actually make the changes in
qemu state to use the fds obtained in phase 1
and where the guest downtime (if any) is more likely due to flushing
changes in phase 2
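The two phases above can be sketched as follows, in Python for brevity. The helper run_transaction and its commit argument are hypothetical; the point is only that every fd is obtained before anything is committed, and that a failure in the prepare phase closes whatever was already opened.

```python
import os


def run_transaction(paths, commit):
    """Phase 1 opens every path; phase 2 hands the fds to commit().

    If any open fails, the fds opened so far are closed and the error is
    re-raised, so a failed prepare phase leaks nothing.
    """
    fds = []
    try:
        for path in paths:
            fds.append(os.open(path, os.O_RDONLY))
    except OSError:
        for fd in fds:
            os.close(fd)
        raise
    # Phase 2: only reached once every fd is in hand.
    return commit(fds)
```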
Not quite. A synchronous flush can cause lock contention. We need to separate
out the problem of lock contention from guest down time.
Also, there's no obvious need to move the flushes before opens. The main issue
is that we use qemu_mutex to effectively create a write queue.
You can imagine a simple write queueing mechanism that would obviate the need
for this such that we could flush, queue upcoming writes, and drop
qemu_mutex to sleep waiting for libvirt to send us our fds.
But this is all speculative. There's no reason to believe that an RPC
would have a noticeable guest visible latency unless you assume there's
lock contention. I would strongly suspect that the bdrv_flush() is going
to be a much greater source of lock contention than the RPC would be.
An RPC is only bound by scheduler latency whereas synchronous disk I/O
is bound by spinning a platter.
And libvirt code becomes a lot trickier to deal with the fact
that two channels are in use, and that the channel that issued the
'transaction' command must block while the other channel for handling
hooks must be responsive.
All libvirt needs to do is listen on a socket and delegate access
according to a white list. Whatever is providing fd's needs to have no
knowledge of anything other than what the guest is allowed to access
which shouldn't depend on an executing command.
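A minimal sketch of such a listener, in Python (3.9 or later on Unix, for socket.send_fds). The function serve_one_open, the "ok"/"denied" replies, and the one-path-per-message framing are assumptions made for illustration; they are not libvirt's or QEMU's actual protocol.

```python
import os
import socket


def serve_one_open(conn, whitelist):
    """Handle one open request: send back an fd only if the path is whitelisted.

    The request is assumed to be a single path arriving in one message;
    real framing would need to be more careful.
    """
    path = conn.recv(4096).decode()
    if path not in whitelist:
        conn.sendall(b"denied")
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # The fd travels as SCM_RIGHTS ancillary data over the Unix socket.
        socket.send_fds(conn, [b"ok"], [fd])
    finally:
        # The receiver gets its own copy, so the server's copy can be closed.
        os.close(fd)
    return True
```

The server only consults the whitelist, so it needs no knowledge of which monitor command triggered the request.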
That's not quite accurate. What the guest is allowed to access should
indeed change depending on the executing command. That is, if I start a
I should have spoken more clearly. libvirt may change the white list for various
reasons dynamically. But there shouldn't be a direct dependency on whatever is
serving up fd's and whatever is changing the white list.
Basically, you just need a shared hash table for each guest. It should be quite
Maybe the only reason that I'm still leaning towards a 'pass-fd'
solution instead of a hook fd solution is that libvirt would have less
code to write to get it working. But it was originally Dan's complaint
that an rpc solution has too much risk for deadlock or leaks;
The reason I came back to this is that after reading through the threads, I
started thinking about how to solve the leak problem.
You need clear ownership. Having QEMU "own" a file descriptor because it "asks"
for an fd allows QEMU to be the clear owner. OTOH, having libvirt give QEMU an
fd with a floating reference that QEMU may or may not pin ends up being
extremely complex in practice. I'm not sure you can really solve the reliable
closing problem either. If you did have a "kill all floating references"
command, that could introduce other problems (what about multiple clients?).
if we are
confident that we can avoid deadlock,
I don't think deadlocks are possible FWIW.
and that the idea of passing in
fds in response to an rpc involves less bookkeeping and speculation on
libvirt's part about which monitor commands will require opening an fd,
then maybe it really is a better technical solution, even if it costs
more libvirt code to implement.
I think the important part is that it allows libvirt to not have to have
intimate knowledge of how QEMU commands work. If we decide we need to change
the flags/perms on a file descriptor down the road, it's a lot easier to cope
with that than it is to cope with changing the order in which we open files.
Plus, once you implement this in libvirt, you don't have to worry about it for
future block commands. With fdsets, libvirt would need to deal with figuring out the
magic incantation of setfd commands for all future block commands.
Plus, making /dev/fdset be treated as not a valid file path is just asking for
trouble down the road...