qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?


From: Gleb Natapov
Subject: Re: [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?
Date: Wed, 16 Oct 2013 18:44:58 +0300

On Wed, Oct 16, 2013 at 02:59:47PM +0200, Christian Borntraeger wrote:
> Folks,
> 
> from time to time I update valgrind or qemu to work reasonably well
> with KVM.
> 
> Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an 
> in kernel
> kvm interrupt controller) with the following ioctl:
> 
> #define KVM_CREATE_DEVICE         _IOWR(KVMIO,  0xe0, struct 
> kvm_create_device)
> 
> qemu can then work on these devices with the ioctls 
> 
> /* ioctls for fds returned by KVM_CREATE_DEVICE */
> #define KVM_SET_DEVICE_ATTR       _IOW(KVMIO,  0xe1, struct kvm_device_attr)
> #define KVM_GET_DEVICE_ATTR       _IOW(KVMIO,  0xe2, struct kvm_device_attr)
> #define KVM_HAS_DEVICE_ATTR       _IOW(KVMIO,  0xe3, struct kvm_device_attr)
> 
> struct kvm_device_attr {
>         __u32   flags;          /* no flags currently defined */
>         __u32   group;          /* device-defined */
>         __u64   attr;           /* group-defined */
>         __u64   addr;           /* userspace address of attr data */
> };
> 
> to communicate with the in-kvm devices to exchange data. There is an 
> interesting
> problem here: Properly defined ioctls allow to deduct the written or read area
> of a ioctl. This is useful, e.g. for valgrind. For unknown ioctls, valgrind 
> will
> decode the ioctl number according to the _IOxx prefixes and deducts the memory
> area that the kernels reads/writes. This is very helpful for definedness 
> checkins.
> 
And is not very helpful if you need to change structure size or it
changes by itself because of 32bit->64bit move.

> For specific or more complex ioctls valgrind has special handling. The 
> problem is
> now, that looking at the current implementations of kvm device attr, all use 
> 0,1,2,
> etc for group/attr. So by inspecting the ioctl, valgrind cannot find out what 
> device
> it is talking to without tracking the original create ioctl.
> 
> So it seems that by using a multiplexing ioctls we actually make the interface
> less well defined and more error prone than individual ioctls.
I do not see why it is more error prone. Yes, it is harder for valgrind
to track it, but this by itself does not make it error prone.

> 
> Another hint to look at are system calls. Older archs (like i386) have an IPC 
> system
> call that multiplexes several things. But time has shown that having a system 
> for each
> and every subfunctions is better that a multiplexer. (so newer archs like 
> amd64 dont
> have an ipc system call they have semop, semget etc. This also makes tools 
> like strace
> easier to implement.
> 
The difference is in amount of those subfunctions. We can define
separate ioctl for each cpu/device register of each architecture and it
will be nicely straceable, but how much new ioctls this will create? It
will easily reach 1000s quickly. So at some point you start multiplex.
ONE_REG interface is multiplexer, DEVICE_ATTR is same.

> Whats even more puzzling to me: The main complaint about ioctls is the 
> possibility to
> create arbitrary interfaces into the kernel. Now with the IORW family, we 
> actually
> had some limited form of control. With an additional uncoordinated group/attr 
> parameter we dropped all of that.
How this control is actually used in the kernel?

> 
> 
> So how to process from here? 
> a) leave it as is and accept false positives from valgrind and undecoded 
> straces
strace still cannot decode kvm ioctls after all this years. I am not
concerned about straces at all, it is less then useless to debug KVM.

> b) We could enhance the device_addr group/attr values with size macros, e.g
> the same as _IOWR (do we need direction?) for future users
I am fine with that. But even in KVM this is not the first interface
that has this drawback. See KVM_GET_DIRTY_LOG for instance.

> c) Avoid the device api for new code and go back to individual ioctls in KVM
Individual ioctls were a mess. People are adding devices without even
knowing beforehand how final interface will look like. To accommodate
such "design" strategy devices need to have as fine grained interfaces as
possible otherwise ioctls will be deprecated faster than added (that was
justification for ONE_REG too BTW). If you want fine grained interface
you need to multiplex at some point or add hundreds of ioctls, if you
need to multiplex anyway multiplexing at device level is logical thing
to do. I understand valgrind sentiment, but it should not be a central
point of interface design.

> 
> The reason for KVM_CREATE_DEVICE seems to be that in qemu we want to model 
> everything 
> as a device. Having an in-kernel KVM device seemed logical. The problem is, 
> that we
> should really have a 2nd look at the Linux system call ABI. Does it have to 
> follow the
> device model? especially if the interface has obvious draw backs.
> 
Try to add new Linux system call. It's very hard and for a good reason,
you do not want to have 1000s of them where most of then is one-trick
pony and mostly deprecated (but you cannot remove them or reuse them
anyway). If we apply the same constrains to kvm ioctl additions as Linux
has for system calls, KVM development will stall and people will be less
then happy.

--
                        Gleb.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]