RE: [RFC v4 01/21] vfio-user: introduce vfio-user protocol specification


From: Thanos Makatos
Subject: RE: [RFC v4 01/21] vfio-user: introduce vfio-user protocol specification
Date: Mon, 14 Feb 2022 13:10:42 +0000


> -----Original Message-----
> From: Qemu-devel <qemu-devel-
> bounces+thanos.makatos=nutanix.com@nongnu.org> On Behalf Of John
> Johnson
> Sent: 12 January 2022 00:44
> To: qemu-devel@nongnu.org
> Subject: [RFC v4 01/21] vfio-user: introduce vfio-user protocol specification
> 
> From: Thanos Makatos <thanos.makatos@nutanix.com>
> 
> This patch introduces the vfio-user protocol specification (formerly
> known as VFIO-over-socket), which is designed to allow devices to be
> emulated outside QEMU, in a separate process. vfio-user reuses the
> existing VFIO defines, structs and concepts.
> 
> It has been earlier discussed as an RFC in:
> "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> 
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
> Signed-off-by: John Levon <john.levon@nutanix.com>
> ---
>  docs/devel/index.rst     |    1 +
>  docs/devel/vfio-user.rst | 1810 ++++++++++++++++++++++++++++++++++++++++++++++
>  MAINTAINERS              |    6 +
>  3 files changed, 1817 insertions(+)
>  create mode 100644 docs/devel/vfio-user.rst
> 
> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> index afd9375..23d2c30 100644
> --- a/docs/devel/index.rst
> +++ b/docs/devel/index.rst
> @@ -48,3 +48,4 @@ modifying QEMU's source code.
>     trivial-patches
>     submitting-a-patch
>     submitting-a-pull-request
> +   vfio-user
> diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
> new file mode 100644
> index 0000000..97a7506
> --- /dev/null
> +++ b/docs/devel/vfio-user.rst
> @@ -0,0 +1,1810 @@
> +.. include:: <isonum.txt>
> +********************************
> +vfio-user Protocol Specification
> +********************************
> +
> +--------------
> +Version_ 0.9.1
> +--------------
> +
> +.. contents:: Table of Contents
> +
> +Introduction
> +============
> +vfio-user is a protocol that allows a device to be emulated in a separate
> +process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
> +of a generic VFIO device type, living inside the VMM, which we call the client,
> +and the core device implementation, living outside the VMM, which we call the
> +server.
> +
> +The vfio-user specification is partly based on the
> +`Linux VFIO ioctl interface
> +<https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_.
> +
> +VFIO is a mature and stable API, backed by an extensively used framework. The
> +existing VFIO client implementation in QEMU (``qemu/hw/vfio/``) can be largely
> +re-used, though there is nothing in this specification that requires that
> +particular implementation. None of the VFIO kernel modules are required for
> +supporting the protocol, on either the client or server side. Some source
> +definitions in VFIO are re-used for vfio-user.
> +
> +The main idea is to allow a virtual device to function in a separate process in
> +the same host over a UNIX domain socket. A UNIX domain socket (``AF_UNIX``) is
> +chosen because file descriptors can be trivially sent over it, which in turn
> +allows:
> +
> +* Sharing of client memory for DMA with the server.
> +* Sharing of server memory with the client for fast MMIO.
> +* Efficient sharing of eventfds for triggering interrupts.
> +
> +Other socket types could be used which allow the server to run in a separate
> +guest in the same host (``AF_VSOCK``) or remotely (``AF_INET``). Theoretically
> +the underlying transport does not necessarily have to be a socket, however we do
> +not examine such alternatives. In this protocol version we focus on using a UNIX
> +domain socket and introduce basic support for the other two types of sockets
> +without considering performance implications.
> +
> +While passing of file descriptors is desirable for performance reasons, support
> +is not necessary for either the client or the server in order to implement the
> +protocol. There is always an in-band, message-passing fall back mechanism.
> +
> +Overview
> +========
> +
> +VFIO is a framework that allows a physical device to be securely passed through
> +to a user space process; the device-specific kernel driver does not drive the
> +device at all.  Typically, the user space process is a VMM and the device is
> +passed through to it in order to achieve high performance. VFIO provides an API
> +and the required functionality in the kernel. QEMU has adopted VFIO to allow a
> +guest to directly access physical devices, instead of emulating them in
> +software.
> +
> +vfio-user reuses the core VFIO concepts defined in its API, but implements them
> +as messages to be sent over a socket. It does not change the kernel-based VFIO
> +in any way; in fact, none of the VFIO kernel modules need to be loaded to use
> +vfio-user. It is also possible for the client to concurrently use the current
> +kernel-based VFIO for one device, and vfio-user for another device.
> +
> +VFIO Device Model
> +-----------------
> +
> +A device under VFIO presents a standard interface to the user process. Many of
> +the VFIO operations in the existing interface use the ``ioctl()`` system call, and
> +references to the existing interface are called the ``ioctl()`` implementation in
> +this document.
> +
> +The following sections describe the set of messages that implement the vfio-user
> +interface over a socket. In many cases, the messages are analogous to data
> +structures used in the ``ioctl()`` implementation. Messages derived from the
> +``ioctl()`` will have a name derived from the ``ioctl()`` command name.  E.g., the
> +``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a
> +``VFIO_USER_DEVICE_GET_INFO`` message.  The purpose of this reuse is to share as
> +much code as feasible with the ``ioctl()`` implementation.
> +
> +Connection Initiation
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +After the client connects to the server, the initial client message is
> +``VFIO_USER_VERSION`` to propose a protocol version and set of capabilities to
> +apply to the session. The server replies with a compatible version and set of
> +capabilities it supports, or closes the connection if it cannot support the
> +advertised version.
> +
> +Device Information
> +^^^^^^^^^^^^^^^^^^
> +
> +The client uses a ``VFIO_USER_DEVICE_GET_INFO`` message to query the server for
> +information about the device. This information includes:
> +
> +* The device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
> +* the number of device regions, and
> +* the number of interrupt types the device supports.
> +
> +Region Information
> +^^^^^^^^^^^^^^^^^^
> +
> +The client uses ``VFIO_USER_DEVICE_GET_REGION_INFO`` messages to query the
> +server for information about the device's regions. This information describes:
> +
> +* Read and write permissions, whether it can be memory mapped, and whether it
> +  supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
> +* Region index, size, and offset.
> +
> +When a device region can be mapped by the client, the server provides a file
> +descriptor which the client can ``mmap()``. The server is responsible for
> +polling for client updates to memory mapped regions.
> +
> +Region Capabilities
> +"""""""""""""""""""
> +
> +Some regions have additional capabilities that cannot be described adequately
> +by the region info data structure. These capabilities are returned in the
> +region info reply in a list similar to PCI capabilities in a PCI device's
> +configuration space.
> +
> +Sparse Regions
> +""""""""""""""
> +A region can be memory-mappable in whole or in part. When only a subset of a
> +region can be mapped by the client, a ``VFIO_REGION_INFO_CAP_SPARSE_MMAP``
> +capability is included in the region info reply. This capability describes
> +which portions can be mapped by the client.
> +
> +.. Note::
> +   For example, in a virtual NVMe controller, sparse regions can be used so
> +   that accesses to the NVMe registers (found in the beginning of BAR0) are
> +   trapped (an infrequent event), while allowing direct access to the doorbells
> +   (an extremely frequent event as every I/O submission requires a write to
> +   BAR0), found in the next page after the NVMe registers in BAR0.
> +
> +Device-Specific Regions
> +"""""""""""""""""""""""
> +
> +A device can define regions additional to the standard ones (e.g. PCI indexes
> +0-8). This is achieved by including a ``VFIO_REGION_INFO_CAP_TYPE`` capability
> +in the region info reply of a device-specific region. Such regions are reflected
> +in ``struct vfio_user_device_info.num_regions``. Thus, for PCI devices this
> +value can be equal to, or higher than, ``VFIO_PCI_NUM_REGIONS``.
> +
> +Region I/O via file descriptors
> +-------------------------------
> +
> +For unmapped regions, region I/O from the client is done via
> +``VFIO_USER_REGION_READ/WRITE``.  As an optimization, ioeventfds or ioregionfds
> +may be configured for sub-regions of some regions. A client may request
> +information on these sub-regions via ``VFIO_USER_DEVICE_GET_REGION_IO_FDS``; by
> +configuring the returned file descriptors as ioeventfds or ioregionfds, the
> +server can be directly notified of I/O (for example, by KVM) without taking a
> +trip through the client.
> +
> +Interrupts
> +^^^^^^^^^^
> +
> +The client uses ``VFIO_USER_DEVICE_GET_IRQ_INFO`` messages to query the server
> +for the device's interrupt types. The interrupt types are specific to the bus
> +the device is attached to, and the client is expected to know the capabilities
> +of each interrupt type. The server can signal an interrupt by directly injecting
> +interrupts into the guest via an event file descriptor. The client configures
> +how the server signals an interrupt with ``VFIO_USER_DEVICE_SET_IRQS`` messages.
> +
> +Device Read and Write
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +When the guest executes load or store operations to an unmapped device region,
> +the client forwards these operations to the server with
> +``VFIO_USER_REGION_READ`` or ``VFIO_USER_REGION_WRITE`` messages. The server
> +will reply with data from the device on read operations or an acknowledgement on
> +write operations. See `Read and Write Operations`_.
> +
> +Client memory access
> +--------------------
> +
> +The client uses ``VFIO_USER_DMA_MAP`` and ``VFIO_USER_DMA_UNMAP`` messages to
> +inform the server of the valid DMA ranges that the server can access on behalf
> +of a device (typically, VM guest memory). DMA memory may be accessed by the
> +server via ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages over the
> +socket. In this case, the "DMA" part of the naming is a misnomer.
> +
> +Actual direct memory access of client memory from the server is possible if the
> +client provides file descriptors the server can ``mmap()``. Note that ``mmap()``
> +privileges cannot be revoked by the client, therefore file descriptors should
> +only be exported in environments where the client trusts the server not to
> +corrupt guest memory.
> +
> +See `Read and Write Operations`_.
> +
> +Client/server interactions
> +==========================
> +
> +Socket
> +------
> +
> +A server can serve:
> +
> +1) one or more clients, and/or
> +2) one or more virtual devices, belonging to one or more clients.
> +
> +The current protocol specification requires a dedicated socket per
> +client/server connection. It is a server-side implementation detail whether a
> +single server handles multiple virtual devices from the same or multiple
> +clients. The location of the socket is implementation-specific. Multiplexing
> +clients, devices, and servers over the same socket is not supported in this
> +version of the protocol.
> +
> +Authentication
> +--------------
> +
> +For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files,
> +therefore it is up to the management layer to set up the socket as required.
> +Socket types that span guests or hosts will require a proper authentication
> +mechanism. Defining that mechanism is deferred to a future version of the
> +protocol.
> +
> +Command Concurrency
> +-------------------
> +
> +A client may pipeline multiple commands without waiting for previous command
> +replies.  The server will process commands in the order they are received.  A
> +consequence of this is if a client issues a command with the *No_reply* bit,
> +then subsequently issues a command without *No_reply*, the older command will
> +have been processed before the reply to the younger command is sent by the
> +server.  The client must be aware of the device's capability to process
> +concurrent commands if pipelining is used.  For example, pipelining allows
> +multiple client threads to concurrently access device regions; the client must
> +ensure these accesses obey device semantics.
> +
> +An example is a frame buffer device, where the device may allow concurrent
> +access to different areas of video memory, but may have indeterminate behavior
> +if concurrent accesses are performed to command or status registers.
> +
> +Note that unrelated messages sent from the server to the client can appear in
> +between a client to server request/reply and vice versa.
> +
> +Implementers should be prepared for certain commands to exhibit potentially
> +unbounded latencies.  For example, ``VFIO_USER_DEVICE_RESET`` may take an
> +arbitrarily long time to complete; clients should take care not to block
> +unnecessarily.
> +
> +Socket Disconnection Behavior
> +-----------------------------
> +The server and the client can disconnect from each other, either intentionally
> +or unexpectedly. Both the client and the server need to know how to handle such
> +events.
> +
> +Server Disconnection
> +^^^^^^^^^^^^^^^^^^^^
> +A server disconnecting from the client may indicate that:
> +
> +1) A virtual device has been restarted, either intentionally (e.g. because of a
> +   device update) or unintentionally (e.g. because of a crash).
> +2) A virtual device has been shut down with no intention to be restarted.
> +
> +It is impossible for the client to know whether or not a failure is
> +intermittent or innocuous and should be retried, therefore the client should
> +reset the VFIO device when it detects the socket has been disconnected.
> +Error recovery will be driven by the guest's device error handling
> +behavior.
> +
> +Client Disconnection
> +^^^^^^^^^^^^^^^^^^^^
> +The client disconnecting from the server primarily means that the client
> +has exited. Currently, this means that the guest is shut down so the device is
> +no longer needed, and therefore the server can automatically exit. However, there
> +can be cases where a client disconnection should not result in a server exit:
> +
> +1) A single server serving multiple clients.
> +2) A multi-process QEMU upgrading itself step by step, which is not yet
> +   implemented.
> +
> +Therefore in order for the protocol to be forward compatible, the server should
> +respond to a client disconnection as follows:
> +
> + - all client memory regions are unmapped and cleaned up (including closing any
> +   passed file descriptors)
> + - all IRQ file descriptors passed from the old client are closed
> + - the device state should otherwise be retained
> +
> +The expectation is that when a client reconnects, it will re-establish IRQ and
> +client memory mappings.
> +
> +If anything happens to the client (such as QEMU actually exiting), the control
> +stack will know about it and can clean up resources accordingly.
> +
> +Security Considerations
> +-----------------------
> +
> +Speaking generally, vfio-user clients should not trust servers, and vice versa.
> +Standard tools and mechanisms should be used on both sides to validate input and
> +protect against denial of service scenarios, buffer overflow, etc.
> +
> +Request Retry and Response Timeout
> +----------------------------------
> +A failed command is a command that has been successfully sent and has been
> +responded to with an error code. Failure to send the command in the first place
> +(e.g. because the socket is disconnected) is a different type of error examined
> +earlier in the disconnect section.
> +
> +.. Note::
> +   QEMU's VFIO retries certain operations if they fail. While this makes sense
> +   for real HW, we don't know for sure whether it makes sense for virtual
> +   devices.
> +
> +Defining a retry and timeout scheme is deferred to a future version of the
> +protocol.
> +
> +Message sizes
> +-------------
> +
> +Some requests have an ``argsz`` field. In a request, it defines the maximum
> +expected reply payload size, which should be at least the size of the fixed
> +reply payload headers defined here. The *request* payload size is defined by the
> +usual ``msg_size`` field in the header, not the ``argsz`` field.
> +
> +In a reply, the server sets the ``argsz`` field to the size needed for a full
> +payload. This may be less than the requested maximum size. It may also be
> +larger than the requested maximum size: in that case, the full payload is not
> +included in the reply, but the ``argsz`` field in the reply indicates the needed
> +size, allowing a client to allocate a larger buffer for holding the reply before
> +trying again.
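
A minimal sketch of how a client might act on the ``argsz`` value in a reply,
in case it helps readers of this section. The helper name and signature are
hypothetical and not part of the specification; only the comparison between
the requested and returned sizes comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if the reply carried the full payload.
 * Otherwise it stores in *retry_argsz the size to request on the next attempt.
 */
static bool reply_is_complete(uint32_t requested_argsz, uint32_t reply_argsz,
                              uint32_t *retry_argsz)
{
    if (reply_argsz <= requested_argsz) {
        return true;            /* the full payload fit in the requested size */
    }
    *retry_argsz = reply_argsz; /* the server reported how much it needs */
    return false;
}
```

A caller would allocate a buffer of ``*retry_argsz`` bytes and reissue the
same command when this returns false.
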
> +
> +In addition, during negotiation (see `Version`_), the client and server may
> +each specify a ``max_data_xfer_size`` value; this defines the maximum data that
> +may be read or written via one of the ``VFIO_USER_DMA/REGION_READ/WRITE``
> +messages; see `Read and Write Operations`_.
> +
> +Protocol Specification
> +======================
> +
> +To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
> +with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the
> +endianness of the host system, although this may be relaxed in future
> +revisions in cases where the client and server run on different hosts
> +with different endianness.
> +
> +Unless otherwise specified, all sizes should be presumed to be in bytes.
> +
> +.. _Commands:
> +
> +Commands
> +--------
> +The following table lists the VFIO message command IDs, and whether the
> +message command is sent from the client or the server.
> +
> +======================================  =========  =================
> +Name                                    Command    Request Direction
> +======================================  =========  =================
> +``VFIO_USER_VERSION``                   1          client -> server
> +``VFIO_USER_DMA_MAP``                   2          client -> server
> +``VFIO_USER_DMA_UNMAP``                 3          client -> server
> +``VFIO_USER_DEVICE_GET_INFO``           4          client -> server
> +``VFIO_USER_DEVICE_GET_REGION_INFO``    5          client -> server
> +``VFIO_USER_DEVICE_GET_REGION_IO_FDS``  6          client -> server
> +``VFIO_USER_DEVICE_GET_IRQ_INFO``       7          client -> server
> +``VFIO_USER_DEVICE_SET_IRQS``           8          client -> server
> +``VFIO_USER_REGION_READ``               9          client -> server
> +``VFIO_USER_REGION_WRITE``              10         client -> server
> +``VFIO_USER_DMA_READ``                  11         server -> client
> +``VFIO_USER_DMA_WRITE``                 12         server -> client
> +``VFIO_USER_DEVICE_RESET``              13         client -> server
> +``VFIO_USER_DIRTY_PAGES``               14         client -> server
> +======================================  =========  =================
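
For reference, the table above could be written as a C enum along these
lines. This is only a sketch: the enumerator names and values are taken from
the table, while the enum tag and the direction helper are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Command IDs as listed in the Commands table. */
enum vfio_user_command {
    VFIO_USER_VERSION                  = 1,
    VFIO_USER_DMA_MAP                  = 2,
    VFIO_USER_DMA_UNMAP                = 3,
    VFIO_USER_DEVICE_GET_INFO          = 4,
    VFIO_USER_DEVICE_GET_REGION_INFO   = 5,
    VFIO_USER_DEVICE_GET_REGION_IO_FDS = 6,
    VFIO_USER_DEVICE_GET_IRQ_INFO      = 7,
    VFIO_USER_DEVICE_SET_IRQS          = 8,
    VFIO_USER_REGION_READ              = 9,
    VFIO_USER_REGION_WRITE             = 10,
    VFIO_USER_DMA_READ                 = 11,
    VFIO_USER_DMA_WRITE                = 12,
    VFIO_USER_DEVICE_RESET             = 13,
    VFIO_USER_DIRTY_PAGES              = 14,
};

/* Hypothetical helper: true for the commands marked server -> client. */
static bool vfio_user_cmd_from_server(enum vfio_user_command cmd)
{
    return cmd == VFIO_USER_DMA_READ || cmd == VFIO_USER_DMA_WRITE;
}
```
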
> +
> +Header
> +------
> +
> +All messages, both command messages and reply messages, are preceded by a
> +16-byte header that contains basic information about the message. The header is
> +followed by message-specific data described in the sections below.
> +
> ++----------------+--------+-------------+
> +| Name           | Offset | Size        |
> ++================+========+=============+
> +| Message ID     | 0      | 2           |
> ++----------------+--------+-------------+
> +| Command        | 2      | 2           |
> ++----------------+--------+-------------+
> +| Message size   | 4      | 4           |
> ++----------------+--------+-------------+
> +| Flags          | 8      | 4           |
> ++----------------+--------+-------------+
> +|                | +-----+------------+ |
> +|                | | Bit | Definition | |
> +|                | +=====+============+ |
> +|                | | 0-3 | Type       | |
> +|                | +-----+------------+ |
> +|                | | 4   | No_reply   | |
> +|                | +-----+------------+ |
> +|                | | 5   | Error      | |
> +|                | +-----+------------+ |
> ++----------------+--------+-------------+
> +| Error          | 12     | 4           |
> ++----------------+--------+-------------+
> +| <message data> | 16     | variable    |
> ++----------------+--------+-------------+
> +
> +* *Message ID* identifies the message, and is echoed in the command's reply
> +  message. Message IDs belong entirely to the sender, can be re-used (even
> +  concurrently) and the receiver must not make any assumptions about their
> +  uniqueness.
> +* *Command* specifies the command to be executed, listed in Commands_. It is
> +  also set in the reply header.
> +* *Message size* contains the size of the entire message, including the header.
> +* *Flags* contains attributes of the message:
> +
> +  * The *Type* bits indicate the message type.
> +
> +    *  *Command* (value 0x0) indicates a command message.
> +    *  *Reply* (value 0x1) indicates a reply message acknowledging a previous
> +       command with the same message ID.
> +  * *No_reply* in a command message indicates that no reply is needed for this
> +    command.  This is commonly used when multiple commands are sent, and only
> +    the last needs acknowledgement.
> +  * *Error* in a reply message indicates the command being acknowledged had
> +    an error. In this case, the *Error* field will be valid.
> +
> +* *Error* in a reply message is an optional UNIX errno value. It may be zero
> +  even if the Error bit is set in Flags. It is reserved in a command message.
> +
> +Each command message in Commands_ must be replied to with a reply message,
> +unless the message sets the *No_reply* bit.  The reply consists of the header
> +with the *Reply* bit set, plus any additional data.
> +
> +If an error occurs, the reply message must only include the reply header.
> +
> +As the header is standard in both requests and replies, it is not included in
> +the command-specific specifications below; each message definition should be
> +appended to the standard header, and the offsets are given from the end of the
> +standard header.
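
To make the layout concrete, here is a sketch of the 16-byte header as a C
struct, with a helper that builds a reply header. The struct, macro, and
function names are hypothetical. The sketch also assumes that bit 0 in the
Flags table is the least significant bit, which the text does not state
explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only the field layout comes from the header table. */
struct vfio_user_header {
    uint16_t msg_id;   /* offset 0: Message ID, echoed in the reply */
    uint16_t command;  /* offset 2: command ID from the Commands table */
    uint32_t msg_size; /* offset 4: size of the whole message, header included */
    uint32_t flags;    /* offset 8: Type, No_reply and Error bits */
    uint32_t error;    /* offset 12: errno value, meaningful in replies */
};

static_assert(sizeof(struct vfio_user_header) == 16, "header must be 16 bytes");

/* Assumption: bit 0 of Flags is the least significant bit. */
#define HDR_TYPE_MASK    0xFu
#define HDR_TYPE_COMMAND 0x0u
#define HDR_TYPE_REPLY   0x1u
#define HDR_NO_REPLY     (1u << 4)
#define HDR_ERROR        (1u << 5)

/*
 * Build the header of a reply to `req`. On error the reply carries only the
 * header, so the payload size is ignored.
 */
static struct vfio_user_header
make_reply_header(const struct vfio_user_header *req, uint32_t payload_size,
                  uint32_t error)
{
    struct vfio_user_header r = {
        .msg_id   = req->msg_id,
        .command  = req->command,
        .msg_size = (uint32_t)sizeof(struct vfio_user_header) +
                    (error ? 0u : payload_size),
        .flags    = HDR_TYPE_REPLY | (error ? HDR_ERROR : 0u),
        .error    = error,
    };
    return r;
}
```
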
> +
> +``VFIO_USER_VERSION``
> +---------------------
> +
> +.. _Version:
> +
> +This is the initial message sent by the client after the socket connection is
> +established; the same format is used for the server's reply.
> +
> +Upon establishing a connection, the client must send a ``VFIO_USER_VERSION``
> +message proposing a protocol version and a set of capabilities. The server
> +compares these with the versions and capabilities it supports and sends a
> +``VFIO_USER_VERSION`` reply according to the following rules.
> +
> +* The major version in the reply must be the same as proposed. If the server
> +  does not support the proposed major, it closes the connection.
> +* The minor version in the reply must be equal to or less than the minor
> +  version proposed.
> +* The capability list must be a subset of those proposed. If the server
> +  requires a capability the client did not include, it closes the connection.
> +
> +The protocol major version will only change when incompatible protocol changes
> +are made, such as changing the message format. The minor version may change
> +when compatible changes are made, such as adding new messages or capabilities.
> +Both the client and server must support all minor versions less than the
> +maximum minor version they support. E.g., an implementation that supports
> +version 1.3 must also support 1.0 through 1.2.
> +
> +When making a change to this specification, the protocol version number must
> +be included in the form "added in version X.Y".
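
The major/minor rules above can be sketched as a small server-side check. The
type and function names are hypothetical, and capability matching is left out.
Picking the lower of the two minor versions is one way to satisfy the "equal
to or less than" rule, not the only one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation of the two fixed version fields. */
struct vfio_user_ver {
    uint16_t ver_major;
    uint16_t ver_minor;
};

/*
 * Returns true and fills *reply if the server can accept the proposal.
 * A false return means the server should close the connection.
 */
static bool negotiate_version(struct vfio_user_ver proposed,
                              struct vfio_user_ver server_max,
                              struct vfio_user_ver *reply)
{
    if (proposed.ver_major != server_max.ver_major) {
        return false; /* major versions must match exactly */
    }
    reply->ver_major = proposed.ver_major;
    /* Reply with a minor version no greater than the one proposed. */
    reply->ver_minor = proposed.ver_minor < server_max.ver_minor
                           ? proposed.ver_minor
                           : server_max.ver_minor;
    return true;
}
```
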
> +
> +Request
> +^^^^^^^
> +
> +==============  ======  ====
> +Name            Offset  Size
> +==============  ======  ====
> +version major   0       2
> +version minor   2       2
> +version data    4       variable (including terminating NUL). Optional.
> +==============  ======  ====
> +
> +The version data is an optional UTF-8 encoded JSON byte array with the following
> +format:
> +
> ++--------------+--------+-----------------------------------+
> +| Name         | Type   | Description                       |
> ++==============+========+===================================+
> +| capabilities | object | Contains common capabilities that |
> +|              |        | the sender supports. Optional.    |
> ++--------------+--------+-----------------------------------+
> +
> +Capabilities:
> +
> ++--------------------+--------+------------------------------------------------+
> +| Name               | Type   | Description                                    |
> ++====================+========+================================================+
> +| max_msg_fds        | number | Maximum number of file descriptors that can be |
> +|                    |        | received by the sender in one message.         |
> +|                    |        | Optional. If not specified then the receiver   |
> +|                    |        | must assume a value of ``1``.                  |
> ++--------------------+--------+------------------------------------------------+
> +| max_data_xfer_size | number | Maximum ``count`` for data transfer messages;  |
> +|                    |        | see `Read and Write Operations`_. Optional,    |
> +|                    |        | with a default value of 1048576 bytes.         |
> ++--------------------+--------+------------------------------------------------+
> +| migration          | object | Migration capability parameters. If missing    |
> +|                    |        | then migration is not supported by the sender. |
> ++--------------------+--------+------------------------------------------------+
> +
> +The migration capability contains the following name/value pairs:
> +
> ++--------+--------+-----------------------------------------------+
> +| Name   | Type   | Description                                   |
> ++========+========+===============================================+
> +| pgsize | number | Page size of dirty pages bitmap. The smallest |
> +|        |        | between the client and the server is used.    |
> ++--------+--------+-----------------------------------------------+
> +
> +Reply
> +^^^^^
> +
> +The same message format is used in the server's reply with the semantics
> +described above.
> +
> +``VFIO_USER_DMA_MAP``
> +---------------------
> +
> +This command message is sent by the client to the server to inform it of the
> +memory regions the server can access. It must be sent before the server can
> +perform any DMA to the client. It is normally sent directly after the version
> +handshake is completed, but may also occur when memory is added to the client,
> +or if the client uses a vIOMMU.
> +
> +Request
> +^^^^^^^
> +
> +The request payload for this message is a structure of the following format:
> +
> ++-------------+--------+-------------+
> +| Name        | Offset | Size        |
> ++=============+========+=============+
> +| argsz       | 0      | 4           |
> ++-------------+--------+-------------+
> +| flags       | 4      | 4           |
> ++-------------+--------+-------------+
> +|             | +-----+------------+ |
> +|             | | Bit | Definition | |
> +|             | +=====+============+ |
> +|             | | 0   | readable   | |
> +|             | +-----+------------+ |
> +|             | | 1   | writeable  | |
> +|             | +-----+------------+ |
> ++-------------+--------+-------------+
> +| offset      | 8      | 8           |
> ++-------------+--------+-------------+
> +| address     | 16     | 8           |
> ++-------------+--------+-------------+
> +| size        | 24     | 8           |
> ++-------------+--------+-------------+
> +
> +* *argsz* is the size of the above structure. Note there is no reply payload,
> +  so this field differs from other message types.
> +* *flags* contains the following region attributes:
> +
> +  * *readable* indicates that the region can be read from.
> +
> +  * *writeable* indicates that the region can be written to.
> +
> +* *offset* is the file offset of the region with respect to the associated file
> +  descriptor, or zero if the region is not mappable.
> +* *address* is the base DMA address of the region.
> +* *size* is the size of the region.
> +
> +This structure is 32 bytes in size, so the message size is 16 + 32 bytes.
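
As an illustration, the request payload above maps onto a C struct as
follows. The struct name, flag macros, and helper are hypothetical; the field
offsets, the flag bit positions, and the 16 + 32 total come from the text. As
with the header, bit 0 is assumed to be the least significant bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_SIZE          16u       /* standard header size */
#define DMA_MAP_READABLE  (1u << 0) /* flags bit 0 */
#define DMA_MAP_WRITEABLE (1u << 1) /* flags bit 1 */

/* Hypothetical name; layout follows the request table. */
struct vfio_user_dma_map {
    uint32_t argsz;   /* offset 0: size of this structure */
    uint32_t flags;   /* offset 4: readable / writeable bits */
    uint64_t offset;  /* offset 8: file offset, or zero if not mappable */
    uint64_t address; /* offset 16: base DMA address */
    uint64_t size;    /* offset 24: size of the region */
};

static_assert(sizeof(struct vfio_user_dma_map) == 32, "payload must be 32 bytes");

/* Fill in a request payload for a region. */
static struct vfio_user_dma_map
make_dma_map(uint64_t address, uint64_t size, uint64_t fd_offset,
             uint32_t flags)
{
    struct vfio_user_dma_map m = {
        .argsz   = (uint32_t)sizeof(struct vfio_user_dma_map),
        .flags   = flags,
        .offset  = fd_offset,
        .address = address,
        .size    = size,
    };
    return m;
}
```
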
> +
> +If the DMA region being added can be directly mapped by the server, a file
> +descriptor must be sent as part of the message meta-data. The region can be
> +mapped via the ``mmap()`` system call. On ``AF_UNIX`` sockets, the file descriptor
> +must be passed as ``SCM_RIGHTS`` type ancillary data.  Otherwise, if the DMA
> +region cannot be directly mapped by the server, a file descriptor must not be sent
> +as part of the message meta-data and the DMA region can be accessed by the
> +server using ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages,
> +explained in `Read and Write Operations`_. A command to map over an existing
> +region must be failed by the server with ``EEXIST`` set in error field in the
> +reply.
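
For readers unfamiliar with ``SCM_RIGHTS``, this is roughly how a message and
an optional file descriptor can be sent together on an ``AF_UNIX`` socket
using the standard ``sendmsg()`` ancillary data interface. The function name
is hypothetical, and the sketch assumes the caller has already serialized the
header and payload into ``buf``.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Send `len` bytes from `buf` on `sock`. If `fd` is non-negative it is
 * attached as SCM_RIGHTS ancillary data; pass -1 when the region is not
 * mappable and no descriptor accompanies the message.
 * Returns the sendmsg() result.
 */
static ssize_t send_with_fd(int sock, const void *buf, size_t len, int fd)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of the buffer */
    } ctrl;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    if (fd >= 0) {
        memset(&ctrl, 0, sizeof(ctrl));
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = sizeof(ctrl.buf);

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    }
    return sendmsg(sock, &msg, 0);
}
```

The receiving side would use ``recvmsg()`` with a control buffer of the same
size and read the descriptor back out of ``CMSG_DATA()``.
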
> +
> +Reply
> +^^^^^
> +
> +There is no payload in the reply message.
> +
> +``VFIO_USER_DMA_UNMAP``
> +-----------------------
> +
> +This command message is sent by the client to the server to inform it that a
> +DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command
> +message, is no longer available for DMA. It typically occurs when memory is
> +removed from the client or if the client uses a vIOMMU. The DMA region is
> +described by the following structure:
> +
> +Request
> +^^^^^^^
> +
> +The request payload for this message is a structure of the following format:
> +
> ++--------------+--------+------------------------+
> +| Name         | Offset | Size                   |
> ++==============+========+========================+
> +| argsz        | 0      | 4                      |
> ++--------------+--------+------------------------+
> +| flags        | 4      | 4                      |
> ++--------------+--------+------------------------+
> +|              | +-----+-----------------------+ |
> +|              | | Bit | Definition            | |
> +|              | +=====+=======================+ |
> +|              | | 0   | get dirty page bitmap | |
> +|              | +-----+-----------------------+ |
> ++--------------+--------+------------------------+
> +| address      | 8      | 8                      |
> ++--------------+--------+------------------------+
> +| size         | 16     | 8                      |
> ++--------------+--------+------------------------+
> +
> +* *argsz* is the maximum size of the reply payload.
> +* *flags* contains the following DMA region attributes:
> +
> +  * *get dirty page bitmap* indicates that a dirty page bitmap must be
> +    populated before unmapping the DMA region. The client must provide a
> +    `VFIO Bitmap`_ structure, explained below, immediately following this
> +    entry.
> +
> +* *address* is the base DMA address of the DMA region.
> +* *size* is the size of the DMA region.
> +
> +The address and size of the DMA region being unmapped must match exactly a
> +previous mapping. The size of the request message depends on whether or not the
> +*get dirty page bitmap* bit is set in *flags*:
> +
> +* If not set, the size of the total request message is: 16 + 24.
> +
> +* If set, the size of the total request message is: 16 + 24 + 16.
> +
> +.. _VFIO Bitmap:
> +
> +VFIO Bitmap Format
> +""""""""""""""""""
> +
> ++--------+--------+------+
> +| Name   | Offset | Size |
> ++========+========+======+
> +| pgsize | 0      | 8    |
> ++--------+--------+------+
> +| size   | 8      | 8    |
> ++--------+--------+------+
> +
> +* *pgsize* is the page size for the bitmap, in bytes.
> +* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header.
> +
> +Reply
> +^^^^^
> +
> +Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is
> +mapped then the server must release all references to that DMA region before
> +replying, which potentially includes in-flight DMA transactions.
> +
> +The server responds with the original DMA entry in the request. If the
> +*get dirty page bitmap* bit is set in flags in the request, then
> +the server also includes the `VFIO Bitmap`_ structure sent in the request,
> +followed by the corresponding dirty page bitmap, where each bit represents
> +one page of size *pgsize* in `VFIO Bitmap`_ .
> +
> +The total size of the reply message is:
> +16 + 24 + (16 + *size* in `VFIO Bitmap`_ if *get dirty page bitmap* is set).
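
A minimal sketch of the size arithmetic above, offered as an illustration only. It assumes the leading 16 is the vfio-user message header and the 24 is the DMA unmap structure from the request table; the function and constant names are hypothetical.

```python
# Sizes taken from the tables above; the names are illustrative only.
HEADER_SIZE = 16          # assumption: the vfio-user message header
DMA_UNMAP_SIZE = 24       # argsz + flags + address + size
BITMAP_HDR_SIZE = 16      # VFIO bitmap: pgsize + size
GET_DIRTY_BITMAP = 1 << 0 # bit 0 of flags


def unmap_request_size(flags):
    """Total request size: header + unmap struct (+ bitmap header)."""
    total = HEADER_SIZE + DMA_UNMAP_SIZE
    if flags & GET_DIRTY_BITMAP:
        total += BITMAP_HDR_SIZE
    return total


def unmap_reply_size(flags, bitmap_size=0):
    """Total reply size: as the request, plus the bitmap itself if requested."""
    total = HEADER_SIZE + DMA_UNMAP_SIZE
    if flags & GET_DIRTY_BITMAP:
        total += BITMAP_HDR_SIZE + bitmap_size
    return total
```

Here ``bitmap_size`` corresponds to *size* in the VFIO bitmap structure, so it is ignored unless the dirty bitmap was requested.
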
> +
> +``VFIO_USER_DEVICE_GET_INFO``
> +-----------------------------
> +
> +This command message is sent by the client to the server to query for basic
> +information about the device.
> +
> +Request
> +^^^^^^^
> +
> ++-------------+--------+--------------------------+
> +| Name        | Offset | Size                     |
> ++=============+========+==========================+
> +| argsz       | 0      | 4                        |
> ++-------------+--------+--------------------------+
> +| flags       | 4      | 4                        |
> ++-------------+--------+--------------------------+
> +|             | +-----+-------------------------+ |
> +|             | | Bit | Definition              | |
> +|             | +=====+=========================+ |
> +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
> +|             | +-----+-------------------------+ |
> +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
> +|             | +-----+-------------------------+ |
> ++-------------+--------+--------------------------+
> +| num_regions | 8      | 4                        |
> ++-------------+--------+--------------------------+
> +| num_irqs    | 12     | 4                        |
> ++-------------+--------+--------------------------+
> +
> +* *argsz* is the maximum size of the reply payload
> +* all other fields must be zero.
> +
> +Reply
> +^^^^^
> +
> ++-------------+--------+--------------------------+
> +| Name        | Offset | Size                     |
> ++=============+========+==========================+
> +| argsz       | 0      | 4                        |
> ++-------------+--------+--------------------------+
> +| flags       | 4      | 4                        |
> ++-------------+--------+--------------------------+
> +|             | +-----+-------------------------+ |
> +|             | | Bit | Definition              | |
> +|             | +=====+=========================+ |
> +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
> +|             | +-----+-------------------------+ |
> +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
> +|             | +-----+-------------------------+ |
> ++-------------+--------+--------------------------+
> +| num_regions | 8      | 4                        |
> ++-------------+--------+--------------------------+
> +| num_irqs    | 12     | 4                        |
> ++-------------+--------+--------------------------+
> +
> +* *argsz* is the size required for the full reply payload (16 bytes today)
> +* *flags* contains the following device attributes.
> +
> +  * ``VFIO_DEVICE_FLAGS_RESET`` indicates that the device supports the
> +    ``VFIO_USER_DEVICE_RESET`` message.
> +  * ``VFIO_DEVICE_FLAGS_PCI`` indicates that the device is a PCI device.
> +
> +* *num_regions* is the number of memory regions that the device exposes.
> +* *num_irqs* is the number of distinct interrupt types that the device supports.
> +
> +This version of the protocol only supports PCI devices. Additional devices may
> +be supported in future versions.
> +
> +``VFIO_USER_DEVICE_GET_REGION_INFO``
> +------------------------------------
> +
> +This command message is sent by the client to the server to query for
> +information about device regions. The VFIO region info structure is defined in
> +``<linux/vfio.h>`` (``struct vfio_region_info``).
> +
> +Request
> +^^^^^^^
> +
> ++------------+--------+------------------------------+
> +| Name       | Offset | Size                         |
> ++============+========+==============================+
> +| argsz      | 0      | 4                            |
> ++------------+--------+------------------------------+
> +| flags      | 4      | 4                            |
> ++------------+--------+------------------------------+
> +| index      | 8      | 4                            |
> ++------------+--------+------------------------------+
> +| cap_offset | 12     | 4                            |
> ++------------+--------+------------------------------+
> +| size       | 16     | 8                            |
> ++------------+--------+------------------------------+
> +| offset     | 24     | 8                            |
> ++------------+--------+------------------------------+
> +
> +* *argsz* is the maximum size of the reply payload.
> +* *index* is the index of the memory region being queried; it is the only field
> +  that is required to be set in the command message.
> +* all other fields must be zero.
> +
> +Reply
> +^^^^^
> +
> ++------------+--------+------------------------------+
> +| Name       | Offset | Size                         |
> ++============+========+==============================+
> +| argsz      | 0      | 4                            |
> ++------------+--------+------------------------------+
> +| flags      | 4      | 4                            |
> ++------------+--------+------------------------------+
> +|            | +-----+-----------------------------+ |
> +|            | | Bit | Definition                  | |
> +|            | +=====+=============================+ |
> +|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
> +|            | +-----+-----------------------------+ |
> +|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
> +|            | +-----+-----------------------------+ |
> +|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
> +|            | +-----+-----------------------------+ |
> +|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
> +|            | +-----+-----------------------------+ |
> ++------------+--------+------------------------------+
> +| index      | 8      | 4                            |
> ++------------+--------+------------------------------+
> +| cap_offset | 12     | 4                            |
> ++------------+--------+------------------------------+
> +| size       | 16     | 8                            |
> ++------------+--------+------------------------------+
> +| offset     | 24     | 8                            |
> ++------------+--------+------------------------------+
> +
> +* *argsz* is the size required for the full reply payload (region info structure
> +  plus the size of any region capabilities)
> +* *flags* are attributes of the region:
> +
> +  * ``VFIO_REGION_INFO_FLAG_READ`` allows client read access to the region.
> +  * ``VFIO_REGION_INFO_FLAG_WRITE`` allows client write access to the region.
> +  * ``VFIO_REGION_INFO_FLAG_MMAP`` specifies the client can mmap() the region.
> +    When this flag is set, the reply will include a file descriptor in its
> +    meta-data. On ``AF_UNIX`` sockets, the file descriptors will be passed as
> +    ``SCM_RIGHTS`` type ancillary data.
> +  * ``VFIO_REGION_INFO_FLAG_CAPS`` indicates additional capabilities found in the
> +    reply.
> +
> +* *index* is the index of the memory region being queried; it is the only field
> +  that is required to be set in the command message.
> +* *cap_offset* describes where additional region capabilities can be found.
> +  cap_offset is relative to the beginning of the VFIO region info structure.
> +  The data structure it points to is a VFIO cap header defined in
> +  ``<linux/vfio.h>``.
> +* *size* is the size of the region.
> +* *offset* is the offset that should be given to the mmap() system call for
> +  regions with the MMAP attribute. It is also used as the base offset when
> +  mapping a VFIO sparse mmap area, described below.
> +
> +VFIO region capabilities
> +""""""""""""""""""""""""
> +
> +The VFIO region information can also include a capabilities list. This list is
> +similar to a PCI capability list - each entry has a common header that
> +identifies a capability and where the next capability in the list can be found.
> +The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct
> +vfio_info_cap_header``).
> +
> +VFIO cap header format
> +""""""""""""""""""""""
> +
> ++---------+--------+------+
> +| Name    | Offset | Size |
> ++=========+========+======+
> +| id      | 0      | 2    |
> ++---------+--------+------+
> +| version | 2      | 2    |
> ++---------+--------+------+
> +| next    | 4      | 4    |
> ++---------+--------+------+
> +
> +* *id* is the capability identity.
> +* *version* is a capability-specific version number.
> +* *next* specifies the offset of the next capability in the capability list. It
> +  is relative to the beginning of the VFIO region info structure.
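
A rough sketch of how a client might walk the capability list described above. It assumes little-endian fields and that a *next* value of 0 ends the chain (the convention used by Linux VFIO; the text above does not state it). ``walk_caps`` is a hypothetical helper, not an existing API.

```python
import struct

# Layouts from the tables above; little-endian is an assumption of this sketch.
REGION_INFO_FMT = "<IIIIQQ"  # argsz, flags, index, cap_offset, size, offset
CAP_HEADER_FMT = "<HHI"      # id, version, next
VFIO_REGION_INFO_FLAG_CAPS = 1 << 3


def walk_caps(reply):
    """Return (id, version, position) for each capability in a region info reply."""
    _argsz, flags, _index, cap_offset, _size, _offset = struct.unpack_from(
        REGION_INFO_FMT, reply, 0)
    caps = []
    if not flags & VFIO_REGION_INFO_FLAG_CAPS:
        return caps
    seen = set()
    pos = cap_offset  # relative to the start of the region info structure
    while pos != 0:   # assumption: next == 0 terminates the list
        if pos in seen or pos + struct.calcsize(CAP_HEADER_FMT) > len(reply):
            raise ValueError("malformed capability chain")
        seen.add(pos)
        cap_id, version, nxt = struct.unpack_from(CAP_HEADER_FMT, reply, pos)
        caps.append((cap_id, version, pos))
        pos = nxt
    return caps
```

The ``seen`` set guards against a chain that loops back on itself, which a client should not trust the server to avoid.
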
> +
> +VFIO sparse mmap cap header
> +"""""""""""""""""""""""""""
> +
> ++------------------+----------------------------------+
> +| Name             | Value                            |
> ++==================+==================================+
> +| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
> ++------------------+----------------------------------+
> +| version          | 0x1                              |
> ++------------------+----------------------------------+
> +| next             | <next>                           |
> ++------------------+----------------------------------+
> +| sparse mmap info | VFIO region info sparse mmap     |
> ++------------------+----------------------------------+
> +
> +This capability is defined when only a subrange of the region supports
> +direct access by the client via mmap(). The VFIO sparse mmap area is defined in
> +``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area`` and ``struct
> +vfio_region_info_cap_sparse_mmap``).
> +
> +VFIO region info cap sparse mmap
> +""""""""""""""""""""""""""""""""
> +
> ++----------+--------+------+
> +| Name     | Offset | Size |
> ++==========+========+======+
> +| nr_areas | 0      | 4    |
> ++----------+--------+------+
> +| reserved | 4      | 4    |
> ++----------+--------+------+
> +| offset   | 8      | 8    |
> ++----------+--------+------+
> +| size     | 16     | 8    |
> ++----------+--------+------+
> +| ...      |        |      |
> ++----------+--------+------+
> +
> +* *nr_areas* is the number of sparse mmap areas in the region.
> +* *offset* and *size* describe a single area that can be mapped by the client.
> +  There will be *nr_areas* pairs of offset and size. The offset will be added to
> +  the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` reply to form
> +  the offset argument of the subsequent mmap() call.
> +
> +The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
> +vfio_region_info_cap_sparse_mmap``).
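
To illustrate the offset arithmetic for sparse areas, here is a small sketch. It assumes little-endian fields and that ``cap_body`` holds the bytes following the 8-byte cap header, laid out as in the table above; ``sparse_mmap_offsets`` is a hypothetical name.

```python
import struct


def sparse_mmap_offsets(cap_body, region_offset):
    """Return (mmap_offset, size) for each sparse area.

    cap_body starts at nr_areas (i.e. after the cap header);
    region_offset is the *offset* field of the region info reply.
    """
    nr_areas, _reserved = struct.unpack_from("<II", cap_body, 0)
    areas = []
    for i in range(nr_areas):
        # Each area is a 16-byte (offset, size) pair starting at byte 8.
        area_off, area_size = struct.unpack_from("<QQ", cap_body, 8 + 16 * i)
        areas.append((region_offset + area_off, area_size))
    return areas
```

Each returned offset is what the client would pass as the offset argument of mmap() on the region's file descriptor.
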
> +
> +VFIO region type cap header
> +"""""""""""""""""""""""""""
> +
> ++------------------+---------------------------+
> +| Name             | Value                     |
> ++==================+===========================+
> +| id               | VFIO_REGION_INFO_CAP_TYPE |
> ++------------------+---------------------------+
> +| version          | 0x1                       |
> ++------------------+---------------------------+
> +| next             | <next>                    |
> ++------------------+---------------------------+
> +| region info type | VFIO region info type     |
> ++------------------+---------------------------+
> +
> +This capability is defined when a region is specific to the device.
> +
> +VFIO region info type cap
> +"""""""""""""""""""""""""
> +
> +The VFIO region info type is defined in ``<linux/vfio.h>``
> +(``struct vfio_region_info_cap_type``).
> +
> ++---------+--------+------+
> +| Name    | Offset | Size |
> ++=========+========+======+
> +| type    | 0      | 4    |
> ++---------+--------+------+
> +| subtype | 4      | 4    |
> ++---------+--------+------+
> +
> +The only device-specific region type and subtype supported by vfio-user are
> +``VFIO_REGION_TYPE_MIGRATION`` (3) and ``VFIO_REGION_SUBTYPE_MIGRATION`` (1).
> +
> +``VFIO_USER_DEVICE_GET_REGION_IO_FDS``
> +--------------------------------------
> +
> +Clients can access regions via ``VFIO_USER_REGION_READ/WRITE`` or, if
> +provided, by ``mmap()`` of a file descriptor provided by the server.
> +
> +``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` provides an alternative access
> +mechanism via file descriptors. This is an optional feature intended for
> +performance improvements where an underlying sub-system (such as KVM) supports
> +communication across such file descriptors to the vfio-user server, without
> +needing to round-trip through the client.
> +
> +The server returns an array of sub-regions for the requested region. Each
> +sub-region describes a span (offset and size) of a region, along with the
> +requested file descriptor notification mechanism to use.  Each sub-region in the
> +response message may choose to use a different method, as defined below. The
> +two mechanisms supported in this specification are ioeventfds and ioregionfds.
> +
> +The server in addition returns a file descriptor in the ancillary data; clients
> +are expected to configure each sub-region's file descriptor with the requested
> +notification method. For example, a client could configure KVM with the
> +requested ioeventfd via a ``KVM_IOEVENTFD`` ``ioctl()``.
> +
> +Request
> +^^^^^^^
> +
> ++-------------+--------+------+
> +| Name        | Offset | Size |
> ++=============+========+======+
> +| argsz       | 0      | 4    |
> ++-------------+--------+------+
> +| flags       | 4      | 4    |
> ++-------------+--------+------+
> +| index       | 8      | 4    |
> ++-------------+--------+------+
> +| count       | 12     | 4    |
> ++-------------+--------+------+
> +
> +* *argsz* is the maximum size of the reply payload
> +* *index* is the index of the memory region being queried
> +* all other fields must be zero
> +
> +The client must set ``flags`` to zero and specify the region being queried in
> +the ``index``.
> +
> +Reply
> +^^^^^
> +
> ++-------------+--------+------+
> +| Name        | Offset | Size |
> ++=============+========+======+
> +| argsz       | 0      | 4    |
> ++-------------+--------+------+
> +| flags       | 4      | 4    |
> ++-------------+--------+------+
> +| index       | 8      | 4    |
> ++-------------+--------+------+
> +| count       | 12     | 4    |
> ++-------------+--------+------+
> +| sub-regions | 16     | ...  |
> ++-------------+--------+------+
> +
> +* *argsz* is the size of the region IO FD info structure plus the
> +  total size of the sub-region array. Thus, each array entry "i" is at offset
> +  i * ((argsz - 16) / count) within the sub-region array. Note that currently
> +  each entry is 40 bytes for both IO FD types, but this is not to be relied on.
> +  As elsewhere, this indicates the full reply payload size needed.
> +* *flags* must be zero
> +* *index* is the index of memory region being queried
> +* *count* is the number of sub-regions in the array
> +* *sub-regions* is the array of Sub-Region IO FD info structures
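
A sketch of locating each sub-region entry without hard-coding the 40-byte entry size. It assumes little-endian fields and the 16-byte fixed part shown in the reply table (sub-regions starting at offset 16); ``parse_sub_regions`` and the ``extra`` key (datamatch or user_data, depending on type) are hypothetical.

```python
import struct

IO_FDS_HDR_FMT = "<IIII"     # argsz, flags, index, count
SUB_REGION_FMT = "<QQIIIIQ"  # offset, size, fd_index, type, flags, padding, extra


def parse_sub_regions(reply):
    """Split a GET_REGION_IO_FDS reply payload into sub-region dicts."""
    hdr_size = struct.calcsize(IO_FDS_HDR_FMT)   # 16 bytes
    min_entry = struct.calcsize(SUB_REGION_FMT)  # 40 bytes today
    argsz, _flags, _index, count = struct.unpack_from(IO_FDS_HDR_FMT, reply, 0)
    if count == 0:
        return []
    # Derive the entry stride from argsz rather than assuming 40 bytes.
    stride = (argsz - hdr_size) // count
    if stride < min_entry or hdr_size + stride * count > len(reply):
        raise ValueError("sub-region array does not fit the reply")
    entries = []
    for i in range(count):
        off, size, fd_index, typ, eflags, _pad, extra = struct.unpack_from(
            SUB_REGION_FMT, reply, hdr_size + i * stride)
        entries.append({"offset": off, "size": size, "fd_index": fd_index,
                        "type": typ, "flags": eflags, "extra": extra})
    return entries
```

Deriving the stride this way keeps a client working if a later protocol revision grows the entry structure.
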
> +
> +The reply message will additionally include at least one file descriptor in the
> +ancillary data. Note that more than one sub-region may share the same file
> +descriptor.
> +
> +Note that it is the client's responsibility to verify the requested values (for
> +example, that the requested offset does not exceed the region's bounds).
> +
> +Each sub-region given in the response has one of two possible structures,
> +depending on whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or
> +``VFIO_USER_IO_FD_TYPE_IOREGIONFD``:
> +
> +Sub-Region IO FD info format (ioeventfd)
> +""""""""""""""""""""""""""""""""""""""""
> +
> ++-----------+--------+------+
> +| Name      | Offset | Size |
> ++===========+========+======+
> +| offset    | 0      | 8    |
> ++-----------+--------+------+
> +| size      | 8      | 8    |
> ++-----------+--------+------+
> +| fd_index  | 16     | 4    |
> ++-----------+--------+------+
> +| type      | 20     | 4    |
> ++-----------+--------+------+
> +| flags     | 24     | 4    |
> ++-----------+--------+------+
> +| padding   | 28     | 4    |
> ++-----------+--------+------+
> +| datamatch | 32     | 8    |
> ++-----------+--------+------+
> +
> +* *offset* is the offset of the start of the sub-region within the region
> +  requested ("physical address offset" for the region)
> +* *size* is the length of the sub-region. This may be zero if the access size is
> +  not relevant, which may allow for optimizations
> +* *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
> +  notification; it may be shared.
> +* *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD``
> +* *flags* is any of:
> +
> +  * ``KVM_IOEVENTFD_FLAG_DATAMATCH``
> +  * ``KVM_IOEVENTFD_FLAG_PIO``
> +  * ``KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY`` (FIXME: makes sense?)
> +
> +* *datamatch* is the datamatch value if needed
> +
> +See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt, *4.59
> +KVM_IOEVENTFD* for further context on the ioeventfd-specific fields.
> +
> +Sub-Region IO FD info format (ioregionfd)
> +"""""""""""""""""""""""""""""""""""""""""
> +
> ++-----------+--------+------+
> +| Name      | Offset | Size |
> ++===========+========+======+
> +| offset    | 0      | 8    |
> ++-----------+--------+------+
> +| size      | 8      | 8    |
> ++-----------+--------+------+
> +| fd_index  | 16     | 4    |
> ++-----------+--------+------+
> +| type      | 20     | 4    |
> ++-----------+--------+------+
> +| flags     | 24     | 4    |
> ++-----------+--------+------+
> +| padding   | 28     | 4    |
> ++-----------+--------+------+
> +| user_data | 32     | 8    |
> ++-----------+--------+------+
> +
> +* *offset* is the offset of the start of the sub-region within the region
> +  requested ("physical address offset" for the region)
> +* *size* is the length of the sub-region. This may be zero if the access size is
> +  not relevant, which may allow for optimizations; ``KVM_IOREGION_POSTED_WRITES``
> +  must be set in *flags* in this case
> +* *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
> +  messages; it may be shared
> +* *type* is ``VFIO_USER_IO_FD_TYPE_IOREGIONFD``
> +* *flags* is any of:
> +
> +  * ``KVM_IOREGION_PIO``
> +  * ``KVM_IOREGION_POSTED_WRITES``
> +
> +* *user_data* is an opaque value passed back to the server via a message on the
> +  file descriptor
> +
> +For further information on the ioregionfd-specific fields, see:
> +https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
> +
> +(FIXME: update with final API docs.)
> +
> +``VFIO_USER_DEVICE_GET_IRQ_INFO``
> +---------------------------------
> +
> +This command message is sent by the client to the server to query for
> +information about device interrupt types. The VFIO IRQ info structure is
> +defined in ``<linux/vfio.h>`` (``struct vfio_irq_info``).
> +
> +Request
> +^^^^^^^
> +
> ++-------+--------+---------------------------+
> +| Name  | Offset | Size                      |
> ++=======+========+===========================+
> +| argsz | 0      | 4                         |
> ++-------+--------+---------------------------+
> +| flags | 4      | 4                         |
> ++-------+--------+---------------------------+
> +|       | +-----+--------------------------+ |
> +|       | | Bit | Definition               | |
> +|       | +=====+==========================+ |
> +|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
> +|       | +-----+--------------------------+ |
> +|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
> +|       | +-----+--------------------------+ |
> +|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
> +|       | +-----+--------------------------+ |
> +|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
> +|       | +-----+--------------------------+ |
> ++-------+--------+---------------------------+
> +| index | 8      | 4                         |
> ++-------+--------+---------------------------+
> +| count | 12     | 4                         |
> ++-------+--------+---------------------------+
> +
> +* *argsz* is the maximum size of the reply payload (16 bytes today)
> +* *index* is the index of the IRQ type being queried (e.g.
> +  ``VFIO_PCI_MSIX_IRQ_INDEX``)
> +* all other fields must be zero
> +
> +Reply
> +^^^^^
> +
> ++-------+--------+---------------------------+
> +| Name  | Offset | Size                      |
> ++=======+========+===========================+
> +| argsz | 0      | 4                         |
> ++-------+--------+---------------------------+
> +| flags | 4      | 4                         |
> ++-------+--------+---------------------------+
> +|       | +-----+--------------------------+ |
> +|       | | Bit | Definition               | |
> +|       | +=====+==========================+ |
> +|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
> +|       | +-----+--------------------------+ |
> +|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
> +|       | +-----+--------------------------+ |
> +|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
> +|       | +-----+--------------------------+ |
> +|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
> +|       | +-----+--------------------------+ |
> ++-------+--------+---------------------------+
> +| index | 8      | 4                         |
> ++-------+--------+---------------------------+
> +| count | 12     | 4                         |
> ++-------+--------+---------------------------+
> +
> +* *argsz* is the size required for the full reply payload (16 bytes today)
> +* *flags* defines IRQ attributes:
> +
> +  * ``VFIO_IRQ_INFO_EVENTFD`` indicates the IRQ type can support server eventfd
> +    signalling.
> +  * ``VFIO_IRQ_INFO_MASKABLE`` indicates that the IRQ type supports the ``MASK``
> +    and ``UNMASK`` actions in a ``VFIO_USER_DEVICE_SET_IRQS`` message.
> +  * ``VFIO_IRQ_INFO_AUTOMASKED`` indicates the IRQ type masks itself after being
> +    triggered, and the client must send an ``UNMASK`` action to receive new
> +    interrupts.
> +  * ``VFIO_IRQ_INFO_NORESIZE`` indicates ``VFIO_USER_DEVICE_SET_IRQS`` operations
> +    set up interrupts as a set, and new sub-indexes cannot be enabled without
> +    disabling the entire type.
> +
> +* *index* is the index of the IRQ type being queried
> +* *count* describes the number of interrupts of the queried type.
> +
> +``VFIO_USER_DEVICE_SET_IRQS``
> +-----------------------------
> +
> +This command message is sent by the client to the server to set actions for
> +device interrupt types. The VFIO IRQ set structure is defined in
> +``<linux/vfio.h>`` (``struct vfio_irq_set``).
> +
> +Request
> +^^^^^^^
> +
> ++-------+--------+------------------------------+
> +| Name  | Offset | Size                         |
> ++=======+========+==============================+
> +| argsz | 0      | 4                            |
> ++-------+--------+------------------------------+
> +| flags | 4      | 4                            |
> ++-------+--------+------------------------------+
> +|       | +-----+-----------------------------+ |
> +|       | | Bit | Definition                  | |
> +|       | +=====+=============================+ |
> +|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
> +|       | +-----+-----------------------------+ |
> +|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
> +|       | +-----+-----------------------------+ |
> +|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
> +|       | +-----+-----------------------------+ |
> +|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
> +|       | +-----+-----------------------------+ |
> +|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
> +|       | +-----+-----------------------------+ |
> +|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
> +|       | +-----+-----------------------------+ |
> ++-------+--------+------------------------------+
> +| index | 8      | 4                            |
> ++-------+--------+------------------------------+
> +| start | 12     | 4                            |
> ++-------+--------+------------------------------+
> +| count | 16     | 4                            |
> ++-------+--------+------------------------------+
> +| data  | 20     | variable                     |
> ++-------+--------+------------------------------+
> +
> +* *argsz* is the size of the VFIO IRQ set request payload, including any *data*
> +  field. Note there is no reply payload, so this field differs from other
> +  message types.
> +* *flags* defines the action performed on the interrupt range. The ``DATA``
> +  flags describe the data field sent in the message; the ``ACTION`` flags
> +  describe the action to be performed. The flags within each set are mutually
> +  exclusive.
> +
> +  * ``VFIO_IRQ_SET_DATA_NONE`` indicates there is no data field in the command.
> +    The action is performed unconditionally.
> +  * ``VFIO_IRQ_SET_DATA_BOOL`` indicates the data field is an array of boolean
> +    bytes. The action is performed if the corresponding boolean is true.
> +  * ``VFIO_IRQ_SET_DATA_EVENTFD`` indicates an array of event file descriptors
> +    was sent in the message meta-data. These descriptors will be signalled when
> +    the action defined by the action flags occurs. In ``AF_UNIX`` sockets, the
> +    descriptors are sent as ``SCM_RIGHTS`` type ancillary data.
> +    If no file descriptors are provided, this de-assigns the specified
> +    previously configured interrupts.
> +  * ``VFIO_IRQ_SET_ACTION_MASK`` indicates a masking event. It can be used with
> +    ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to mask an
> +    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when
> +    the guest masks the interrupt.
> +  * ``VFIO_IRQ_SET_ACTION_UNMASK`` indicates an unmasking event. It can be used
> +    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to unmask an
> +    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when
> +    the guest unmasks the interrupt.
> +  * ``VFIO_IRQ_SET_ACTION_TRIGGER`` indicates a triggering event. It can be used
> +    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to trigger an
> +    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when
> +    the server triggers the interrupt.
> +
> +* *index* is the index of the IRQ type being set up.
> +* *start* is the start of the sub-index being set.
> +* *count* describes the number of sub-indexes being set. As a special case, a
> +  count (and start) of 0, with data flags of ``VFIO_IRQ_SET_DATA_NONE`` disables
> +  all interrupts of the index.
> +* *data* is an optional field included when the
> +  ``VFIO_IRQ_SET_DATA_BOOL`` flag is present. It contains an array of booleans
> +  that specify whether the action is to be performed on the corresponding
> +  index. It's used when the action is only performed on a subset of the range
> +  specified.
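
A minimal sketch of building a ``VFIO_USER_DEVICE_SET_IRQS`` payload with the flag rules above. The bit positions come from the flags table; little-endian packing, the reading that exactly one ``DATA`` and one ``ACTION`` flag must be set, and the helper names are assumptions of this sketch.

```python
import struct

# Bit positions from the flags table above.
VFIO_IRQ_SET_DATA_NONE = 1 << 0
VFIO_IRQ_SET_DATA_BOOL = 1 << 1
VFIO_IRQ_SET_DATA_EVENTFD = 1 << 2
VFIO_IRQ_SET_ACTION_MASK = 1 << 3
VFIO_IRQ_SET_ACTION_UNMASK = 1 << 4
VFIO_IRQ_SET_ACTION_TRIGGER = 1 << 5

DATA_FLAGS = 0x07    # the three DATA bits
ACTION_FLAGS = 0x38  # the three ACTION bits
IRQ_SET_FMT = "<IIIII"  # argsz, flags, index, start, count


def exactly_one_bit(value):
    return value != 0 and (value & (value - 1)) == 0


def build_set_irqs(flags, index, start, count, bools=()):
    """Return the request payload; eventfds travel separately as ancillary data."""
    if not exactly_one_bit(flags & DATA_FLAGS):
        raise ValueError("exactly one DATA flag must be set")
    if not exactly_one_bit(flags & ACTION_FLAGS):
        raise ValueError("exactly one ACTION flag must be set")
    data = b""
    if flags & VFIO_IRQ_SET_DATA_BOOL:
        if len(bools) != count:
            raise ValueError("need one boolean byte per sub-index")
        data = bytes(1 if b else 0 for b in bools)
    # argsz covers the request payload itself, including the data field.
    argsz = struct.calcsize(IRQ_SET_FMT) + len(data)
    return struct.pack(IRQ_SET_FMT, argsz, flags, index, start, count) + data
```

For example, masking sub-indexes 0 and 2 of a three-entry range would pass ``VFIO_IRQ_SET_DATA_BOOL | VFIO_IRQ_SET_ACTION_MASK`` with ``bools=[True, False, True]``.
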
> +
> +Not all interrupt types support every combination of data and action flags.
> +The client must know the capabilities of the device and IRQ index before it
> +sends a ``VFIO_USER_DEVICE_SET_IRQS`` message.
> +
> +In typical operation, a specific IRQ may operate as follows:
> +
> +1. The client sends a ``VFIO_USER_DEVICE_SET_IRQS`` message with
> +   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_TRIGGER)`` along
> +   with an eventfd. This associates the IRQ with a particular eventfd on the
> +   server side.
> +
> +#. The client may send a ``VFIO_USER_DEVICE_SET_IRQS`` message with
> +   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_MASK/UNMASK)`` along
> +   with another eventfd. This associates the given eventfd with the
> +   mask/unmask state on the server side.
> +
> +#. The server may trigger the IRQ by writing 1 to the eventfd.
> +
> +#. The server may mask/unmask an IRQ which will write 1 to the corresponding
> +   mask/unmask eventfd, if there is one.
> +
> +5. A client may trigger a device IRQ itself, by sending a
> +   ``VFIO_USER_DEVICE_SET_IRQS`` message with
> +   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_TRIGGER)``.
> +
> +6. A client may mask or unmask the IRQ, by sending a
> +   ``VFIO_USER_DEVICE_SET_IRQS`` message with
> +   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_MASK/UNMASK)``.
> +
> +Reply
> +^^^^^
> +
> +There is no payload in the reply.
> +
> +.. _Read and Write Operations:
> +
> +Note that all of these operations must be supported by the client and/or server,
> +even if the corresponding memory or device region has been shared as mappable.
> +
> +The ``count`` field must not exceed the value of ``max_data_xfer_size`` of the
> +peer, for both reads and writes.
> +
> +``VFIO_USER_REGION_READ``
> +-------------------------
> +
> +If a device region is not mappable, it's not directly accessible by the client
> +via ``mmap()`` of the underlying file descriptor. In this case, a client can
> +read from a device region with this message.
> +
> +Request
> +^^^^^^^
> +
> ++--------+--------+----------+
> +| Name   | Offset | Size     |
> ++========+========+==========+
> +| offset | 0      | 8        |
> ++--------+--------+----------+
> +| region | 8      | 4        |
> ++--------+--------+----------+
> +| count  | 12     | 4        |
> ++--------+--------+----------+
> +
> +* *offset* into the region being accessed.
> +* *region* is the index of the region being accessed.
> +* *count* is the size of the data to be transferred.
> +
> +Reply
> +^^^^^
> +
> ++--------+--------+----------+
> +| Name   | Offset | Size     |
> ++========+========+==========+
> +| offset | 0      | 8        |
> ++--------+--------+----------+
> +| region | 8      | 4        |
> ++--------+--------+----------+
> +| count  | 12     | 4        |
> ++--------+--------+----------+
> +| data   | 16     | variable |
> ++--------+--------+----------+
> +
> +* *offset* into the region accessed.
> +* *region* is the index of the region accessed.
> +* *count* is the size of the data transferred.
> +* *data* is the data that was read from the device region.
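
As a reading aid for the two tables above, a minimal Python sketch (not part of
the patch) that packs the ``VFIO_USER_REGION_READ`` request payload and parses
the reply payload. Little-endian byte order and the helper names are
assumptions of this sketch; the common vfio-user message header is not shown.

```python
import struct

# offset (8 bytes), region (4 bytes), count (4 bytes), as in the tables above.
# Little-endian ("<") is an assumption of this sketch.
REGION_HDR = struct.Struct("<QII")

def pack_region_read_request(offset, region, count):
    """Build the 16-byte request payload."""
    return REGION_HDR.pack(offset, region, count)

def unpack_region_read_reply(payload):
    """Split a reply payload into (offset, region, count, data)."""
    offset, region, count = REGION_HDR.unpack_from(payload, 0)
    data = payload[REGION_HDR.size:REGION_HDR.size + count]
    if len(data) != count:
        raise ValueError("reply is shorter than its count field")
    return offset, region, count, data

req = pack_region_read_request(0x10, 2, 4)
# A pretend reply: the same header followed by four bytes of data.
reply = req + b"\xde\xad\xbe\xef"
```
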
> +
> +``VFIO_USER_REGION_WRITE``
> +--------------------------
> +
> +If a device region is not mappable, it's not directly accessible by the client
> +via ``mmap()`` of the underlying file descriptor. In this case, a client can
> +write to a device region with this message.
> +
> +Request
> +^^^^^^^
> +
> ++--------+--------+----------+
> +| Name   | Offset | Size     |
> ++========+========+==========+
> +| offset | 0      | 8        |
> ++--------+--------+----------+
> +| region | 8      | 4        |
> ++--------+--------+----------+
> +| count  | 12     | 4        |
> ++--------+--------+----------+
> +| data   | 16     | variable |
> ++--------+--------+----------+
> +
> +* *offset* into the region being accessed.
> +* *region* is the index of the region being accessed.
> +* *count* is the size of the data to be transferred.
> +* *data* is the data to write.
> +
> +Reply
> +^^^^^
> +
> ++--------+--------+----------+
> +| Name   | Offset | Size     |
> ++========+========+==========+
> +| offset | 0      | 8        |
> ++--------+--------+----------+
> +| region | 8      | 4        |
> ++--------+--------+----------+
> +| count  | 12     | 4        |
> ++--------+--------+----------+
> +
> +* *offset* into the region accessed.
> +* *region* is the index of the region accessed.
> +* *count* is the size of the data transferred.
> +
> +``VFIO_USER_DMA_READ``
> +-----------------------
> +
> +If the client has not shared mappable memory, the server can use this message to
> +read from guest memory.
> +
> +Request
> +^^^^^^^
> +
> ++---------+--------+----------+
> +| Name    | Offset | Size     |
> ++=========+========+==========+
> +| address | 0      | 8        |
> ++---------+--------+----------+
> +| count   | 8      | 8        |
> ++---------+--------+----------+
> +
> +* *address* is the client DMA memory address being accessed. This address must have
> +  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
> +* *count* is the size of the data to be transferred.
> +
> +Reply
> +^^^^^
> +
> ++---------+--------+----------+
> +| Name    | Offset | Size     |
> ++=========+========+==========+
> +| address | 0      | 8        |
> ++---------+--------+----------+
> +| count   | 8      | 8        |
> ++---------+--------+----------+
> +| data    | 16     | variable |
> ++---------+--------+----------+
> +
> +* *address* is the client DMA memory address being accessed.
> +* *count* is the size of the data transferred.
> +* *data* is the data read.
> +
> +``VFIO_USER_DMA_WRITE``
> +-----------------------
> +
> +If the client has not shared mappable memory, the server can use this message to
> +write to guest memory.
> +
> +Request
> +^^^^^^^
> +
> ++---------+--------+----------+
> +| Name    | Offset | Size     |
> ++=========+========+==========+
> +| address | 0      | 8        |
> ++---------+--------+----------+
> +| count   | 8      | 8        |
> ++---------+--------+----------+
> +| data    | 16     | variable |
> ++---------+--------+----------+
> +
> +* *address* is the client DMA memory address being accessed. This address must have
> +  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
> +* *count* is the size of the data to be transferred.
> +* *data* is the data to write.
> +
> +Reply
> +^^^^^
> +
> ++---------+--------+----------+
> +| Name    | Offset | Size     |
> ++=========+========+==========+
> +| address | 0      | 8        |
> ++---------+--------+----------+
> +| count   | 8      | 4        |
> ++---------+--------+----------+
> +
> +* *address* is the client DMA memory address being accessed.
> +* *count* is the size of the data transferred.
> +
> +``VFIO_USER_DEVICE_RESET``
> +--------------------------
> +
> +This command message is sent from the client to the server to reset the device.
> +Neither the request nor the reply has a payload.
> +
> +``VFIO_USER_DIRTY_PAGES``
> +-------------------------
> +
> +This command is analogous to ``VFIO_IOMMU_DIRTY_PAGES``. It is sent by the client
> +to the server in order to control logging of dirty pages, usually during a live
> +migration.
> +
> +Dirty page tracking is optional for server implementations; clients should not
> +rely on it.
> +
> +Request
> +^^^^^^^
> +
> ++-------+--------+-----------------------------------------+
> +| Name  | Offset | Size                                    |
> ++=======+========+=========================================+
> +| argsz | 0      | 4                                       |
> ++-------+--------+-----------------------------------------+
> +| flags | 4      | 4                                       |
> ++-------+--------+-----------------------------------------+
> +|       | +-----+----------------------------------------+ |
> +|       | | Bit | Definition                             | |
> +|       | +=====+========================================+ |
> +|       | | 0   | VFIO_IOMMU_DIRTY_PAGES_FLAG_START      | |
> +|       | +-----+----------------------------------------+ |
> +|       | | 1   | VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP       | |
> +|       | +-----+----------------------------------------+ |
> +|       | | 2   | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
> +|       | +-----+----------------------------------------+ |
> ++-------+--------+-----------------------------------------+
> +
> +* *argsz* is the size of the VFIO dirty bitmap info structure for
> +  ``START/STOP``; and for ``GET_BITMAP``, the maximum size of the reply payload
> +
> +* *flags* defines the action to be performed by the server:
> +
> +  * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` instructs the server to start logging
> +    pages it dirties. Logging continues until explicitly disabled by
> +    ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``.
> +
> +  * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP`` instructs the server to stop logging
> +    dirty pages.
> +
> +  * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP`` requests the server to return
> +    the dirty bitmap for a specific IOVA range. The IOVA range is specified by
> +    a "VFIO Bitmap Range" structure, which must immediately follow this
> +    "VFIO Dirty Pages" structure. See `VFIO Bitmap Range Format`_.
> +    This operation is only valid if logging of dirty pages has been previously
> +    started.
> +
> +  These flags are mutually exclusive.
> +
> +This part of the request is analogous to VFIO's ``struct
> +vfio_iommu_type1_dirty_bitmap``.
> +
> +.. _VFIO Bitmap Range Format:
> +
> +VFIO Bitmap Range Format
> +""""""""""""""""""""""""
> +
> ++--------+--------+------+
> +| Name   | Offset | Size |
> ++========+========+======+
> +| iova   | 0      | 8    |
> ++--------+--------+------+
> +| size   | 8      | 8    |
> ++--------+--------+------+
> +| bitmap | 16     | 24   |
> ++--------+--------+------+
> +
> +* *iova* is the IOVA offset
> +
> +* *size* is the size of the IOVA region
> +
> +* *bitmap* is the VFIO Bitmap explained in `VFIO Bitmap`_.
> +
> +This part of the request is analogous to VFIO's ``struct
> +vfio_iommu_type1_dirty_bitmap_get``.
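
A minimal Python sketch (not part of the patch) of the request layout described
above: the 8-byte dirty pages structure, followed by the 40-byte bitmap range
only for ``GET_BITMAP``. Flag values follow the bit positions in the table.
Little-endian packing, the helper name, and treating the 24-byte ``bitmap``
field as opaque bytes are assumptions of this sketch.

```python
import struct

# Bit positions taken from the flags table above.
FLAG_START = 1 << 0
FLAG_STOP = 1 << 1
FLAG_GET_BITMAP = 1 << 2

DIRTY_PAGES = struct.Struct("<II")      # argsz, flags
BITMAP_RANGE = struct.Struct("<QQ24s")  # iova, size, opaque VFIO Bitmap

def pack_dirty_pages(flags, argsz, bitmap_range=b""):
    """Build a VFIO_USER_DIRTY_PAGES request payload."""
    if flags not in (FLAG_START, FLAG_STOP, FLAG_GET_BITMAP):
        raise ValueError("exactly one flag must be set")
    if (flags == FLAG_GET_BITMAP) != bool(bitmap_range):
        raise ValueError("a bitmap range accompanies GET_BITMAP only")
    return DIRTY_PAGES.pack(argsz, flags) + bitmap_range

start_req = pack_dirty_pages(FLAG_START, DIRTY_PAGES.size)
rng = BITMAP_RANGE.pack(0x1000, 0x2000, bytes(24))
# argsz for GET_BITMAP is the maximum reply size; the extra 8 bytes of
# bitmap space here are an arbitrary illustrative choice.
get_req = pack_dirty_pages(FLAG_GET_BITMAP, 48 + 8, rng)
```
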
> +
> +Reply
> +^^^^^
> +
> +For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` or
> +``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``, there is no reply payload.
> +
> +For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``, the reply payload is as follows:
> +
> ++--------------+--------+-----------------------------------------+
> +| Name         | Offset | Size                                    |
> ++==============+========+=========================================+
> +| argsz        | 0      | 4                                       |
> ++--------------+--------+-----------------------------------------+
> +| flags        | 4      | 4                                       |
> ++--------------+--------+-----------------------------------------+
> +|              | +-----+----------------------------------------+ |
> +|              | | Bit | Definition                             | |
> +|              | +=====+========================================+ |
> +|              | | 2   | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | |
> +|              | +-----+----------------------------------------+ |
> ++--------------+--------+-----------------------------------------+
> +| bitmap range | 8      | 40                                      |
> ++--------------+--------+-----------------------------------------+
> +| bitmap       | 48     | variable                                |
> ++--------------+--------+-----------------------------------------+
> +
> +* *argsz* is the size required for the full reply payload (dirty pages structure
> +  + bitmap range structure + actual bitmap)
> +* *flags* is ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``
> +* *bitmap range* is the same bitmap range struct provided in the request, as
> +  defined in `VFIO Bitmap Range Format`_.
> +* *bitmap* is the actual dirty pages bitmap corresponding to the range request
> +
> +VFIO Device Migration Info
> +--------------------------
> +
> +A device may contain a migration region (of type
> +``VFIO_REGION_TYPE_MIGRATION``).  The beginning of the region must contain
> +``struct vfio_device_migration_info``, defined in ``<linux/vfio.h>``. This
> +subregion is accessed like any other part of a standard vfio-user region
> +using ``VFIO_USER_REGION_READ``/``VFIO_USER_REGION_WRITE``.
> +
> ++---------------+--------+-----------------------------+
> +| Name          | Offset | Size                        |
> ++===============+========+=============================+
> +| device_state  | 0      | 4                           |
> ++---------------+--------+-----------------------------+
> +|               | +-----+----------------------------+ |
> +|               | | Bit | Definition                 | |
> +|               | +=====+============================+ |
> +|               | | 0   | VFIO_DEVICE_STATE_RUNNING  | |
> +|               | +-----+----------------------------+ |
> +|               | | 1   | VFIO_DEVICE_STATE_SAVING   | |
> +|               | +-----+----------------------------+ |
> +|               | | 2   | VFIO_DEVICE_STATE_RESUMING | |
> +|               | +-----+----------------------------+ |
> ++---------------+--------+-----------------------------+
> +| reserved      | 4      | 4                           |
> ++---------------+--------+-----------------------------+
> +| pending_bytes | 8      | 8                           |
> ++---------------+--------+-----------------------------+
> +| data_offset   | 16     | 8                           |
> ++---------------+--------+-----------------------------+
> +| data_size     | 24     | 8                           |
> ++---------------+--------+-----------------------------+
> +
> +* *device_state* defines the state of the device:
> +
> +  The client initiates device state transition by writing the intended state.
> +  The server must respond only after it has successfully transitioned to the new
> +  state. If an error occurs then the server must respond to the
> +  ``VFIO_USER_REGION_WRITE`` operation with the Error field set accordingly and
> +  must remain at the previous state, or in case of internal error it must
> +  transition to the error state, defined as
> +  ``VFIO_DEVICE_STATE_RESUMING | VFIO_DEVICE_STATE_SAVING``. The client must
> +  re-read the device state in order to determine it afresh.
> +
> +  The following device states are defined:
> +
> +  +-----------+---------+----------+-----------------------------------+
> +  | _RESUMING | _SAVING | _RUNNING | Description                       |
> +  +===========+=========+==========+===================================+
> +  | 0         | 0       | 0        | Device is stopped.                |
> +  +-----------+---------+----------+-----------------------------------+
> +  | 0         | 0       | 1        | Device is running, default state. |
> +  +-----------+---------+----------+-----------------------------------+
> +  | 0         | 1       | 0        | Stop-and-copy state               |
> +  +-----------+---------+----------+-----------------------------------+
> +  | 0         | 1       | 1        | Pre-copy state                    |
> +  +-----------+---------+----------+-----------------------------------+
> +  | 1         | 0       | 0        | Resuming                          |
> +  +-----------+---------+----------+-----------------------------------+
> +  | 1         | 0       | 1        | Invalid state                     |
> +  +-----------+---------+----------+-----------------------------------+
> +  | 1         | 1       | 0        | Error state                       |
> +  +-----------+---------+----------+-----------------------------------+
> +  | 1         | 1       | 1        | Invalid state                     |
> +  +-----------+---------+----------+-----------------------------------+
> +
> +  Valid state transitions are shown in the following table:
> +
> +  +-------------------------+---------+---------+---------------+----------+----------+
> +  | |darr| From / To |rarr| | Stopped | Running | Stop-and-copy | Pre-copy | Resuming |
> +  +=========================+=========+=========+===============+==========+==========+
> +  | Stopped                 |    \-   |    1    |       0       |    0     |     0    |
> +  +-------------------------+---------+---------+---------------+----------+----------+
> +  | Running                 |    1    |    \-   |       1       |    1     |     1    |
> +  +-------------------------+---------+---------+---------------+----------+----------+
> +  | Stop-and-copy           |    1    |    1    |       \-      |    0     |     0    |
> +  +-------------------------+---------+---------+---------------+----------+----------+
> +  | Pre-copy                |    0    |    0    |       1       |    \-    |     0    |
> +  +-------------------------+---------+---------+---------------+----------+----------+
> +  | Resuming                |    0    |    1    |       0       |    0     |     \-   |
> +  +-------------------------+---------+---------+---------------+----------+----------+
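
The two tables above can be restated as a small lookup, shown here as a Python
sketch (not part of the patch). The bit values follow the ``device_state``
table; the names ``state_name`` and ``can_transition`` are hypothetical, and
the sketch assumes no transitions are valid out of the error state because the
table lists none.

```python
# Bit positions from the device_state table.
RUNNING = 1 << 0
SAVING = 1 << 1
RESUMING = 1 << 2

STATE_NAMES = {
    0: "stopped",
    RUNNING: "running",
    SAVING: "stop-and-copy",
    SAVING | RUNNING: "pre-copy",
    RESUMING: "resuming",
    RESUMING | SAVING: "error",
}

# Rows of the transition table: source state -> allowed destination states.
VALID_TRANSITIONS = {
    "stopped": {"running"},
    "running": {"stopped", "stop-and-copy", "pre-copy", "resuming"},
    "stop-and-copy": {"stopped", "running"},
    "pre-copy": {"stop-and-copy"},
    "resuming": {"running"},
}

def state_name(device_state):
    """Map a device_state value to its name, or 'invalid'."""
    return STATE_NAMES.get(device_state, "invalid")

def can_transition(src, dst):
    """True if the table marks src -> dst as a valid transition."""
    return state_name(dst) in VALID_TRANSITIONS.get(state_name(src), set())
```
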
> +
> +  A device is migrated to the destination as follows:
> +
> +  * The source client transitions the device state from the running state to
> +    the pre-copy state. This transition is optional for the client but must be
> +    supported by the server. The source server starts sending device state data
> +    to the source client through the migration region while the device is
> +    running.
> +
> +  * The source client transitions the device state from the running state or the
> +    pre-copy state to the stop-and-copy state. The source server stops the
> +    device, saves device state and sends it to the source client through the
> +    migration region.
> +
> +  The source client is responsible for sending the migration data to the
> +  destination client.

What happens when the device needs to dirty memory while transitioning to the
stop-and-copy state? I would expect the source client to request dirty pages
for mapped DMA regions (or use the "get dirty page bitmap" /
VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP bit when unmapping the DMA region) and
transmit those pages to the destination client. I've looked at
linux/include/uapi/linux/vfio.h and it's not clear to me.

> +
> +  A device is resumed on the destination as follows:
> +
> +  * The destination client transitions the device state from the running state
> +    to the resuming state. The destination server uses the device state data
> +    received through the migration region to resume the device.
> +
> +  * The destination client provides saved device state to the destination
> +    server and then transitions the device back to the running state.
> +
> +* *reserved* This field is reserved and any access to it must be ignored by the
> +  server.
> +
> +* *pending_bytes* Remaining bytes to be migrated by the server. This field is
> +  read only.
> +
> +* *data_offset* Offset in the migration region where the client must:
> +
> +  * read from, during the pre-copy or stop-and-copy state, or
> +
> +  * write to, during the resuming state.
> +
> +  This field is read only.
> +
> +* *data_size* Contains the size, in bytes, of the amount of data copied to:
> +
> +  * the source migration region by the source server during the pre-copy or
> +    stop-and-copy state, or
> +
> +  * the destination migration region by the destination client during the
> +    resuming state.
> +
> +Device-specific data must be stored at any position after
> +``struct vfio_device_migration_info``. Note that the migration region can be
> +memory mappable, even partially. In practice, only the migration data portion
> +can be memory mapped.
> +
> +The client processes device state data during the pre-copy and the
> +stop-and-copy state in the following iterative manner:
> +
> +  1. The client reads ``pending_bytes`` to mark a new iteration. Repeated reads
> +     of this field are idempotent. If there is no migration data
> +     to be consumed then the next step depends on the current device state:
> +
> +     * pre-copy: the client must try again.
> +
> +     * stop-and-copy: this procedure can end and the device can now start
> +       resuming on the destination.
> +
> +  2. The client reads ``data_offset``; at this point the server must make
> +     available a portion of migration data at this offset to be read by the
> +     client, which must happen *before* completing the read operation. The
> +     amount of data to be read must be stored in the ``data_size`` field, which
> +     the client reads next.
> +
> +  3. The client reads ``data_size`` to determine the amount of migration data
> +     available.
> +
> +  4. The client reads and processes the migration data.
> +
> +  5. Go to step 1.
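
The five steps above, for the stop-and-copy case, can be sketched in Python
(not part of the patch). ``FakeMigrationRegion`` is a hypothetical in-memory
stand-in for the server side; in a real client each ``read_*`` call would be a
``VFIO_USER_REGION_READ`` of the corresponding field or data area, and the
returned offset value here is arbitrary.

```python
class FakeMigrationRegion:
    """Hypothetical stand-in for a server's migration region."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._current = b""

    def read_pending_bytes(self):
        # Idempotent: repeated reads do not consume anything.
        return sum(len(c) for c in self._chunks)

    def read_data_offset(self):
        # The server stages the next portion before completing this read.
        self._current = self._chunks.pop(0)
        return 0x1000  # arbitrary offset for illustration

    def read_data_size(self):
        return len(self._current)

    def read_data(self, offset, size):
        return self._current[:size]


def drain_stop_and_copy(region):
    """Collect all migration data while the device is in stop-and-copy."""
    out = []
    while True:
        if region.read_pending_bytes() == 0:         # step 1
            break                                    # nothing left: done
        offset = region.read_data_offset()           # step 2
        size = region.read_data_size()               # step 3
        out.append(region.read_data(offset, size))   # step 4
    return b"".join(out)                             # step 5 is the loop
```

In the pre-copy state the client would retry at step 1 instead of breaking out
of the loop.
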
> +
> +Note that the client can transition the device from the pre-copy state to the
> +stop-and-copy state at any time; ``pending_bytes`` does not need to become zero.
> +
> +The client initializes the device state on the destination by setting the
> +device state to the resuming state and writing the migration data to the
> +destination migration region at ``data_offset`` offset. The client can write the
> +source migration data in an iterative manner and the server must consume this
> +data before completing each write operation, updating the ``data_offset`` field.
> +The server must apply the source migration data on the device resume state. The
> +client must write data in the same order and with the same transaction size as
> +it was read.
> +
> +If an error occurs then the server must fail the read or write operation. It is
> +an implementation detail of the client how to handle errors.
> +
> +Appendices
> +==========
> +
> +Unused VFIO ``ioctl()`` commands
> +--------------------------------
> +
> +The following VFIO commands do not have an equivalent vfio-user command:
> +
> +* ``VFIO_GET_API_VERSION``
> +* ``VFIO_CHECK_EXTENSION``
> +* ``VFIO_SET_IOMMU``
> +* ``VFIO_GROUP_GET_STATUS``
> +* ``VFIO_GROUP_SET_CONTAINER``
> +* ``VFIO_GROUP_UNSET_CONTAINER``
> +* ``VFIO_GROUP_GET_DEVICE_FD``
> +* ``VFIO_IOMMU_GET_INFO``
> +
> +However, once support for live migration for VFIO devices is finalized some
> +of the above commands may have to be handled by the client in their
> +corresponding vfio-user form. This will be addressed in a future protocol
> +version.
> +
> +VFIO groups and containers
> +^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The current VFIO implementation includes group and container idioms that
> +describe how a device relates to the host IOMMU. In the vfio-user
> +implementation, the IOMMU is implemented in SW by the client, and is not
> +visible to the server. The simplest idea would be that the client put each
> +device into its own group and container.
> +
> +Backend Program Conventions
> +---------------------------
> +
> +vfio-user backend program conventions are based on the vhost-user ones.
> +
> +* The backend program must not daemonize itself.
> +* No assumptions must be made as to what access the backend program has on the
> +  system.
> +* File descriptors 0, 1 and 2 must exist, must have regular
> +  stdin/stdout/stderr semantics, and can be redirected.
> +* The backend program must honor the SIGTERM signal.
> +* The backend program must accept the following command line options:
> +
> +  * ``--socket-path=PATH``: path to UNIX domain socket,
> +  * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with
> +    ``--socket-path``
> +
> +* The backend program must be accompanied by a JSON file stored under
> +  ``/usr/share/vfio-user``.
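
For the command line convention above, a minimal Python sketch (not part of
the patch) of how a backend might parse the two options and reject them when
given together. The function name is hypothetical, and requiring that one of
the two options be present is an assumption of this sketch, not something the
text states.

```python
import argparse

def parse_backend_args(argv):
    """Parse --socket-path / --fd, which are mutually exclusive."""
    parser = argparse.ArgumentParser(description="hypothetical vfio-user backend")
    # required=True is an assumption: the text only says the two options
    # are incompatible with each other.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--socket-path", metavar="PATH",
                       help="path to UNIX domain socket")
    group.add_argument("--fd", metavar="FDNUM", type=int,
                       help="file descriptor for UNIX domain socket")
    return parser.parse_args(argv)
```

Passing both options makes ``argparse`` print a usage error and exit.
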
> +
> +TODO add schema similar to docs/interop/vhost-user.json.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7543eb4..1258e11 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1900,6 +1900,12 @@ F: hw/vfio/ap.c
>  F: docs/system/s390x/vfio-ap.rst
>  L: qemu-s390x@nongnu.org
> 
> +vfio-user
> +M: John G Johnson <john.g.johnson@oracle.com>
> +M: Thanos Makatos <thanos.makatos@nutanix.com>
> +S: Supported
> +F: docs/devel/vfio-user.rst
> +
>  vhost
>  M: Michael S. Tsirkin <mst@redhat.com>
>  S: Supported
> --
> 1.8.3.1
> 

