qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated dev


From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated devices
Date: Thu, 13 Oct 2016 08:36:07 -0600

On Thu, 13 Oct 2016 14:52:09 +0530
Kirti Wankhede <address@hidden> wrote:

> On 10/13/2016 3:14 AM, Alex Williamson wrote:
> > On Thu, 13 Oct 2016 00:32:48 +0530
> > Kirti Wankhede <address@hidden> wrote:
> >   
> >> On 10/12/2016 9:29 PM, Alex Williamson wrote:  
> >>> On Wed, 12 Oct 2016 20:43:48 +0530
> >>> Kirti Wankhede <address@hidden> wrote:
> >>>     
> >>>> On 10/12/2016 7:22 AM, Tian, Kevin wrote:    
> >>>>>> From: Kirti Wankhede [mailto:address@hidden
> >>>>>> Sent: Wednesday, October 12, 2016 4:45 AM      
> >>>>>>>> +* mdev_supported_types:
> >>>>>>>> +    List of current supported mediated device types and its details 
> >>>>>>>> are added
> >>>>>>>> +in this directory in following format:
> >>>>>>>> +
> >>>>>>>> +|- <parent phy device>
> >>>>>>>> +|--- Vendor-specific-attributes [optional]
> >>>>>>>> +|--- mdev_supported_types
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|     |   |--- create
> >>>>>>>> +|     |   |--- name
> >>>>>>>> +|     |   |--- available_instances
> >>>>>>>> +|     |   |--- description /class
> >>>>>>>> +|     |   |--- [devices]
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|     |   |--- create
> >>>>>>>> +|     |   |--- name
> >>>>>>>> +|     |   |--- available_instances
> >>>>>>>> +|     |   |--- description /class
> >>>>>>>> +|     |   |--- [devices]
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|          |--- create
> >>>>>>>> +|          |--- name
> >>>>>>>> +|          |--- available_instances
> >>>>>>>> +|          |--- description /class
> >>>>>>>> +|          |--- [devices]
> >>>>>>>> +
> >>>>>>>> +[TBD : description or class is yet to be decided. This will 
> >>>>>>>> change.]      
> >>>>>>>
> >>>>>>> I thought that in previous discussions we had agreed to drop
> >>>>>>> the <type id> concept and use the name as the unique identifier.
> >>>>>>> When reporting these types in libvirt we won't want to report
> >>>>>>> the type id values - we'll want the name strings to be unique.
> >>>>>>>      
> >>>>>>
> >>>>>> The 'name' might not be unique but type_id will be. For example that 
> >>>>>> Neo
> >>>>>> pointed out in earlier discussion, virtual devices can come from two
> >>>>>> different physical devices, end user would be presented with what they
> >>>>>> had selected but there will be internal implementation differences. In
> >>>>>> that case 'type_id' will be unique.
> >>>>>>      
> >>>>>
> >>>>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
> >>>>> string (if you still called it <type id>), and then no need of 
> >>>>> additional
> >>>>> 'name' field which can be put inside 'description' field. See below 
> >>>>> quote:
> >>>>>       
> >>>>
> >>>> We had internal discussions about this within NVIDIA and found that
> >>>> 'name' might not be unique where as 'type_id' would be unique. I'm
> >>>> refering to Neo's mail after that, where Neo do pointed that out.
> >>>>
> >>>> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html    
> >>>
> >>> Everyone not privy to those internal discussions, including me, seems to
> >>> think we dropped type_id and that if a vendor does not have a stable
> >>> name, they can compose some sort of stable type description based on the
> >>> name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
> >>> haven't managed to kill off type_id yet.  No matter what internal
> >>> representation each vendor driver has of "type_id" it seems possible
> >>> for it to come up with stable string to define a given configuration.    
> >>
> >>
> >> The 'type_id' is unique and the 'name' are not, the name is just a
> >> virtual device name/ human readable name. Because at this moment Intel
> >> can't define a proper GPU class, we have to add a 'description' field
> >> there as well to represent the features of this virtual device, once we
> >> have all agreed with the GPU class and its mandatory attributes, the
> >> 'description' field can be removed. Here is an example,
> >> type_id/type_name = NVIDIA_11,
> >> name=M60-M0Q,
> >> description=2560x1600, 2 displays, 512MB"
> >>
> >> Neo's previous comment only applies to the situation where we will have
> >> the GPU class or optional attributes defined and recognized by libvirt,
> >> since that is not going to happen any time soon, we will have to have
> >> the new 'description' field, and we don't want to have it mixed up with
> >> 'name' field.
> >>
> >> We can definitely have something like name+id as Alex recommended to
> >> remove the 'name' field, but it will just require libvirt to have more
> >> logic to parse that string.  
> > 
> > Let's use the mtty example driver provided in patch 5 so we can all
> > more clearly see how the interfaces work.  I'll start from the
> > beginning of my experience and work my way to the type/name thing.
> >   
> 
> Thanks for looking into it and getting feel of it. And I hope this helps
> to understand that 'name' and 'type_id' are different.
> 
> 
> > (please add a modules_install target to the Makefile)
> >  
> 
> This is an example and I feel it should not be installed in
> /lib/modules/../build path. This should be used to understand the
> interface and the flow of mdev device management life cycle. User can
> use insmod to load driver:
> 
> # insmod mtty.ko

It's not built by default, that's sufficient.  Providing a
modules_install target makes it more accessible for testing and allows
easier testing of module dependencies with modprobe.  insmod does not
exercise the automatic module dependency loading.

> > # modprobe mtty
> > 
> > Now what?  It seems like I need to have prior knowledge that this
> > drivers supports mdev devices and I need to go hunt for them.  We need
> > to create a class (ex. /sys/class/mdev/) where a user can find all the
> > devices that participate in this mediated device infrastructure.  That
> > would point me to /sys/devices/mtty.
> >   
> 
> You can find devices registered to mdev framework by searching for
> 'mdev_supported_types' directory at the leaf nodes of devices in
> /sys/devices directory. Yes, we can have 'mdev' class and links to
> devices which are registered to mdev framework in /sys/class/mdev/.
> 
> 
> > # tree /sys/devices/mtty
> > /sys/devices/mtty
> > |-- mdev_supported_types
> > |   `-- mtty1
> > |       |-- available_instances (1)
> > |       |-- create
> > |       |-- devices
> > |       `-- name ("Dual-port-serial")
> > |-- mtty_dev
> > |   `-- sample_mtty_dev ("This is phy device")
> > |-- power
> > |   |-- async
> > |   |-- autosuspend_delay_ms
> > |   |-- control
> > |   |-- runtime_active_kids
> > |   |-- runtime_active_time
> > |   |-- runtime_enabled
> > |   |-- runtime_status
> > |   |-- runtime_suspended_time
> > |   `-- runtime_usage
> > `-- uevent
> > 
> > Ok, but that was boring, we really need to have at least 2 supported
> > types to validate the interface, so without changing the actual device
> > backing, I pretended to have a single port vs dual port:
> > 
> > /sys/devices/mtty
> > |-- mdev_supported_types
> > |   |-- mtty1
> > |   |   |-- available_instances (24)
> > |   |   |-- create
> > |   |   |-- devices
> > |   |   `-- name (Single-port-serial)
> > |   `-- mtty2
> > |       |-- available_instances (12)
> > |       |-- create
> > |       |-- devices
> > |       `-- name (Dual-port-serial)
> > [snip]
> > 
> > I arbitrarily decides I have 24 ports and each single port uses 1 port
> > and each dual port uses 2 ports.
> > 
> > Before I start creating devices, what are we going to key the libvirt
> > XML on?  Can we do anything to prevent vendors from colliding or do we
> > have any way to suggest meaningful and unique type_ids?   
> 
> Libvirt would have parent and type_id in XML. No two vendors can own
> same parent device. So I don't think vendors would collide even having
> same type_id, since <parent, type_id> pair would always be unique.


We have a goal of supporting migration with mdev devices, Intel has
already shown this is possible.  Tying the XML representation of an
mdev device to a parent device is directly contradictory to that goal.
libvirt needs a token which is unique across vendor to be able to
instantiate an mdev device.  <parent, type_id> is unacceptable.
 
>  Presumably if
> > we had a PCI device hosting this, we would be rooted at that parent
> > device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
> > the type_id should automatically be prefixed by the vendor module name,
> > ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
> > deterministically creating a "XYZ" device and knowing exactly what that
> > means and finding a parent device that supports it.
> >   
> 
> We can prefix type_id with module name, i.e using dev->driver->name, but
> <parent, type_id> pair is unique so I don't see much benefit in doing
> that.
> 
> 
> > Let's get to mdev creating...
> > 
> > # uuidgen > mdev_supported_types/mtty2/create
> > # tree /sys/devices/mtty
> > /sys/devices/mtty
> > |-- e68189be-700e-41f7-93a3-b5351e79c470
> > |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> > |   |-- iommu_group -> ../../../kernel/iommu_groups/63
> > |   |-- mtty2 -> ../mdev_supported_types/mtty2
> > |   |-- power
> > |   |   |-- async
> > |   |   |-- autosuspend_delay_ms
> > |   |   |-- control
> > |   |   |-- runtime_active_kids
> > |   |   |-- runtime_active_time
> > |   |   |-- runtime_enabled
> > |   |   |-- runtime_status
> > |   |   |-- runtime_suspended_time
> > |   |   `-- runtime_usage
> > |   |-- remove
> > |   |-- subsystem -> ../../../bus/mdev
> > |   |-- uevent
> > |   `-- vendor
> > |       `-- sample_mdev_dev ("This is MDEV 
> > e68189be-700e-41f7-93a3-b5351e79c470")
> > |-- mdev_supported_types
> > |   |-- mtty1
> > |   |   |-- available_instances (22)
> > |   |   |-- create
> > |   |   |-- devices
> > |   |   `-- name
> > |   `-- mtty2
> > |       |-- available_instances (11)
> > |       |-- create
> > |       |-- devices
> > |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> 
> > ../../../e68189be-700e-41f7-93a3-b5351e79c470
> > |       `-- name
> > 
> > The mdev device was created directly under the parent, which seems like
> > it's going to get messy to me (ie. imagine dropping a bunch of uuids
> > into a PCI parent device's sysfs directory, how does a user know what
> > they are?).
> >   
> 
> That is the way devices are placed in sysfs. For example below devices:
> 
> 80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 1a (rev 07)
> 80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 2a (rev 07)
> 80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 3a in PCI Express Mode (rev 07)
> 80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 0 (rev 07)
> 80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 1 (rev 07)
> 80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 2 (rev 07)
> 80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 3 (rev 07)
> 80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 4 (rev 07)
> 80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 5 (rev 07)
> 80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 6 (rev 07)
> 80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 7 (rev 07)
> 80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address
> Map, VTd_Misc, System Management (rev 07)
> 80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control
> Status and Global Errors (rev 07)
> 80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
> 
> In sysfs, those are in same parent folder of its parent root port:
> 
> # ls /sys/devices/pci0000\:80/ -l
> total 0
> drwxr-xr-x 8 root root    0 Oct 13 12:08 0000:80:01.0
> drwxr-xr-x 7 root root    0 Oct 13 12:08 0000:80:02.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:03.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.1
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.2
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.3
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.4
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.5
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.6
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.7
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.0
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.2
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.4
> lrwxrwxrwx 1 root root    0 Oct 13 13:25 firmware_node ->
> ../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01
> drwxr-xr-x 3 root root    0 Oct 13 12:08 pci_bus
> drwxr-xr-x 2 root root    0 Oct 13 12:08 power
> -rw-r--r-- 1 root root 4096 Oct 13 13:25 uevent

I agree, that's also messy, but a PCI device has a standard PCI address
format, so just by looking at those you know it's a child PCI device.
We can write code that understands how to parse that and understands
what it is by the address.  On the other hand we're dropping uuids into
a directory.  We can write code that understands how to parse it, but
how do we know that it's a child device vs some other attribute for the
parent?

> > Under the device we have "mtty2", shouldn't that be
> > "mdev_supported_type", which then links to mtty2?  Otherwise a user
> > needs to decode from the link what this attribute is.
> >   
> 
> I thought it should show type, so that by looking at 'ls' output user
> should be able to find type_id.

The type_id should be shown by actually reading the link, not by the
link name itself, the same way that the iommu_group link for a device
isn't the group number, it links to the group number but uses a
standard link name.

> > Also here's an example of those vendor sysfs entries per device.  So
> > long as the vendor never expects a tool like libvirt to manipulate
> > attributes there, I can see how that could be pretty powerful.
> >   
> 
> Yes, it is good to have vendor specific entries, libvirt might not
> report/use it. That would be more useful for system admin to get extra
> information manually that libvirt doesn't report.
> 
> 
> > Moving down to the mdev_supported_types, I've updated mtty so that it
> > actually adjusts available instance, and we can now see a link under
> > the devices for mtty2.
> > 
> > Also worth noting is that a link for the device appears
> > in /sys/bus/mdev/devices.
> > 
> > BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
> > option comes into play:
> > 
> > -device
> > vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470
> > 
> > Which raises another question, we can tell through the vfio interfaces
> > that this is exposes as a PCI device, by creating a container
> > (open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
> > adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
> > getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
> > getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
> > flag bit that says the API is PCI.  That's a long path to go and has
> > stumbling blocks like the type of iommu that's available for the given
> > platform.  How do we make that manageable?   
> 
> Do you want device type to be expressed in sysfs? Then that should be
> done from vendor driver. vfio_mdev module is now a shim layer, so mdev
> core or vfio_mdev module don't know what device type flag vendor driver
> had set.

Right, the vendor driver would need to expose this, the mdev layers are
device agnostic, they don't know or care which device API is being
exposed.  The other question is whether it needs to be part of the
initial implementation or can we assume pci for now and add something
later.  I guess we already have our proof to the contrary with the
IBM ccw device that libvirt can't simply assume pci.  I see that many
devices in sysfs have a subsystem link, which seems rather appropriate,
but we're not creating a real pci device, so linking to /sys/bus/pci
or /sys/class/pci_bus both seem invalid.  Is that a dead end?  We could
always expose vfio_device_info.flags, but that seems pretty ugly as
well, plus the sysfs mdev interface is not vfio specific.  What if we
had a "device_api" attribute which the vendor driver would show as
"vfio-pci"?  Therefore the mdev interface is not tied to vfio, but we
clearly show that a given type_id clearly exports a vfio-pci
interface.  Thanks,

Alex



reply via email to

[Prev in Thread] Current Thread [Next in Thread]