qemu-devel

From: Yan Zhao
Subject: Re: [PATCH v14 Kernel 1/7] vfio: KABI for migration interface for device state
Date: Thu, 19 Mar 2020 23:06:56 -0400
User-agent: Mutt/1.9.4 (2018-02-28)

On Fri, Mar 20, 2020 at 10:34:40AM +0800, Alex Williamson wrote:
> On Thu, 19 Mar 2020 21:30:39 -0400
> Yan Zhao <address@hidden> wrote:
> 
> > On Thu, Mar 19, 2020 at 09:09:21PM +0800, Alex Williamson wrote:
> > > On Thu, 19 Mar 2020 01:05:54 -0400
> > > Yan Zhao <address@hidden> wrote:
> > >   
> > > > On Thu, Mar 19, 2020 at 11:49:26AM +0800, Alex Williamson wrote:  
> > > > > On Wed, 18 Mar 2020 21:17:03 -0400
> > > > > Yan Zhao <address@hidden> wrote:
> > > > >     
> > > > > > On Thu, Mar 19, 2020 at 03:41:08AM +0800, Kirti Wankhede wrote:    
> > > > > > > - Defined MIGRATION region type and sub-type.
> > > > > > > 
> > > > > > > - Defined vfio_device_migration_info structure which will be 
> > > > > > > placed at the
> > > > > > >   0th offset of migration region to get/set VFIO device related
> > > > > > >   information. Defined members of structure and usage on 
> > > > > > > read/write access.
> > > > > > > 
> > > > > > > - Defined device states and state transition details.
> > > > > > > 
> > > > > > > - Defined sequence to be followed while saving and resuming VFIO 
> > > > > > > device.
> > > > > > > 
> > > > > > > Signed-off-by: Kirti Wankhede <address@hidden>
> > > > > > > Reviewed-by: Neo Jia <address@hidden>
> > > > > > > ---
> > > > > > >  include/uapi/linux/vfio.h | 227 
> > > > > > > ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  1 file changed, 227 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > > > > index 9e843a147ead..d0021467af53 100644
> > > > > > > --- a/include/uapi/linux/vfio.h
> > > > > > > +++ b/include/uapi/linux/vfio.h
> > > > > > > @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
> > > > > > >  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> > > > > > >  #define VFIO_REGION_TYPE_GFX                    (1)
> > > > > > >  #define VFIO_REGION_TYPE_CCW                     (2)
> > > > > > > +#define VFIO_REGION_TYPE_MIGRATION              (3)
> > > > > > >  
> > > > > > >  /* sub-types for VFIO_REGION_TYPE_PCI_* */
> > > > > > >  
> > > > > > > @@ -379,6 +380,232 @@ struct vfio_region_gfx_edid {
> > > > > > >  /* sub-types for VFIO_REGION_TYPE_CCW */
> > > > > > >  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD        (1)
> > > > > > >  
> > > > > > > +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> > > > > > > +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * The structure vfio_device_migration_info is placed at the 0th 
> > > > > > > offset of
> > > > > > > + * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO 
> > > > > > > device related
> > > > > > > + * migration information. Field accesses from this structure are 
> > > > > > > only supported
> > > > > > > + * at their native width and alignment. Otherwise, the result is 
> > > > > > > undefined and
> > > > > > > + * vendor drivers should return an error.
> > > > > > > + *
> > > > > > > + * device_state: (read/write)
> > > > > > > + *      - The user application writes to this field to inform 
> > > > > > > the vendor driver
> > > > > > > + *        about the device state to be transitioned to.
> > > > > > > + *      - The vendor driver should take the necessary actions to 
> > > > > > > change the
> > > > > > > + *        device state. After successful transition to a given 
> > > > > > > state, the
> > > > > > > + *        vendor driver should return success on 
> > > > > > > write(device_state, state)
> > > > > > > + *        system call. If the device state transition fails, the 
> > > > > > > vendor driver
> > > > > > > + *        should return an appropriate -errno for the fault 
> > > > > > > condition.
> > > > > > > + *      - On the user application side, if the device state 
> > > > > > > transition fails,
> > > > > > > + *         that is, if write(device_state, state) returns an 
> > > > > > > error, read
> > > > > > > + *         device_state again to determine the current state of 
> > > > > > > the device from
> > > > > > > + *         the vendor driver.
> > > > > > > + *      - The vendor driver should return previous state of the 
> > > > > > > device unless
> > > > > > > + *        the vendor driver has encountered an internal error, 
> > > > > > > in which case
> > > > > > > + *        the vendor driver may report the device_state 
> > > > > > > VFIO_DEVICE_STATE_ERROR.
> > > > > > > + *      - The user application must use the device reset ioctl 
> > > > > > > to recover the
> > > > > > > + *        device from VFIO_DEVICE_STATE_ERROR state. If the 
> > > > > > > device is
> > > > > > > + *        indicated to be in a valid device state by reading 
> > > > > > > device_state, the
> > > > > > > + *        user application may attempt to transition the device 
> > > > > > > to any valid
> > > > > > > + *        state reachable from the current state or terminate 
> > > > > > > itself.
> > > > > > > + *
> > > > > > > + *      device_state consists of 3 bits:
> > > > > > > + *      - If bit 0 is set, it indicates the _RUNNING state. If 
> > > > > > > bit 0 is clear,
> > > > > > > + *        it indicates the _STOP state. When the device state is 
> > > > > > > changed to
> > > > > > > + *        _STOP, driver should stop the device before write() 
> > > > > > > returns.
> > > > > > > + *      - If bit 1 is set, it indicates the _SAVING state, which 
> > > > > > > means that the
> > > > > > > + *        driver should start gathering device state information 
> > > > > > > that will be
> > > > > > > + *        provided to the VFIO user application to save the 
> > > > > > > device's state.
> > > > > > > + *      - If bit 2 is set, it indicates the _RESUMING state, 
> > > > > > > which means that
> > > > > > > + *        the driver should prepare to resume the device. Data 
> > > > > > > provided through
> > > > > > > + *        the migration region should be used to resume the 
> > > > > > > device.
> > > > > > > + *      Bits 3 - 31 are reserved for future use. To preserve 
> > > > > > > them, the user
> > > > > > > + *      application should perform a read-modify-write operation 
> > > > > > > on this
> > > > > > > + *      field when modifying the specified bits.
> > > > > > > + *
> > > > > > > + *  +------- _RESUMING
> > > > > > > + *  |+------ _SAVING
> > > > > > > + *  ||+----- _RUNNING
> > > > > > > + *  |||
> > > > > > > + *  000b => Device Stopped, not saving or resuming
> > > > > > > + *  001b => Device running, which is the default state
> > > > > > > + *  010b => Stop the device & save the device state, 
> > > > > > > stop-and-copy state
> > > > > > > + *  011b => Device running and save the device state, pre-copy 
> > > > > > > state
> > > > > > > + *  100b => Device stopped and the device state is resuming
> > > > > > > + *  101b => Invalid state
> > > > > > > + *  110b => Error state
> > > > > > > + *  111b => Invalid state
> > > > > > > + *
> > > > > > > + * State transitions:
> > > > > > > + *
> > > > > > > + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy 
> > > > > > >   _STOP
> > > > > > > + *                (100b)     (001b)     (011b)        (010b)     
> > > > > > >   (000b)
> > > > > > > + * 0. Running or default state
> > > > > > > + *                             |
> > > > > > > + *
> > > > > > > + * 1. Normal Shutdown (optional)
> > > > > > > + *                             
> > > > > > > |------------------------------------->|
> > > > > > > + *
> > > > > > > + * 2. Save the state or suspend
> > > > > > > + *                             
> > > > > > > |------------------------->|---------->|
> > > > > > > + *
> > > > > > > + * 3. Save the state during live migration
> > > > > > > + *                             
> > > > > > > |----------->|------------>|---------->|
> > > > > > > + *
> > > > > > > + * 4. Resuming
> > > > > > > + *                  |<---------|
> > > > > > > + *
> > > > > > > + * 5. Resumed
> > > > > > > + *                  |--------->|
> > > > > > > + *
> > > > > > > > + * 0. Default state of VFIO device is _RUNNING when the user application starts.
> > > > > > > + * 1. During normal shutdown of the user application, the user 
> > > > > > > application may
> > > > > > > + *    optionally change the VFIO device state from _RUNNING to 
> > > > > > > _STOP. This
> > > > > > > + *    transition is optional. The vendor driver must support 
> > > > > > > this transition but
> > > > > > > + *    must not require it.
> > > > > > > + * 2. When the user application saves state or suspends the 
> > > > > > > application, the
> > > > > > > + *    device state transitions from _RUNNING to stop-and-copy 
> > > > > > > and then to _STOP.
> > > > > > > + *    On state transition from _RUNNING to stop-and-copy, driver 
> > > > > > > must stop the
> > > > > > > + *    device, save the device state and send it to the 
> > > > > > > application through the
> > > > > > > + *    migration region. The sequence to be followed for such 
> > > > > > > transition is given
> > > > > > > + *    below.
> > > > > > > + * 3. In live migration of user application, the state 
> > > > > > > transitions from _RUNNING
> > > > > > > + *    to pre-copy, to stop-and-copy, and to _STOP.
> > > > > > > + *    On state transition from _RUNNING to pre-copy, the driver 
> > > > > > > should start
> > > > > > > + *    gathering the device state while the application is still 
> > > > > > > running and send
> > > > > > > + *    the device state data to application through the migration 
> > > > > > > region.
> > > > > > > + *    On state transition from pre-copy to stop-and-copy, the 
> > > > > > > driver must stop
> > > > > > > + *    the device, save the device state and send it to the user 
> > > > > > > application
> > > > > > > + *    through the migration region.
> > > > > > > + *    Vendor drivers must support the pre-copy state even for 
> > > > > > > implementations
> > > > > > > + *    where no data is provided to the user before the 
> > > > > > > stop-and-copy state. The
> > > > > > > + *    user must not be required to consume all migration data 
> > > > > > > before the device
> > > > > > > + *    transitions to a new state, including the stop-and-copy 
> > > > > > > state.
> > > > > > > + *    The sequence to be followed for above two transitions is 
> > > > > > > given below.
> > > > > > > + * 4. To start the resuming phase, the device state should be 
> > > > > > > transitioned from
> > > > > > > + *    the _RUNNING to the _RESUMING state.
> > > > > > > + *    In the _RESUMING state, the driver should use the device 
> > > > > > > state data
> > > > > > > + *    received through the migration region to resume the device.
> > > > > > > + * 5. After providing saved device data to the driver, the 
> > > > > > > application should
> > > > > > > + *    change the state from _RESUMING to _RUNNING.
> > > > > > > + *
> > > > > > > + * reserved:
> > > > > > > + *      Reads on this field return zero and writes are ignored.
> > > > > > > + *
> > > > > > > + * pending_bytes: (read only)
> > > > > > > + *      The number of pending bytes still to be migrated from 
> > > > > > > the vendor driver.
> > > > > > > + *
> > > > > > > + * data_offset: (read only)
> > > > > > > + *      The user application should read data_offset in the 
> > > > > > > migration region
> > > > > > > + *      from where the user application should read the device 
> > > > > > > data during the
> > > > > > > + *      _SAVING state or write the device data during the 
> > > > > > > _RESUMING state. See
> > > > > > > + *      below for details of sequence to be followed.
> > > > > > > + *
> > > > > > > + * data_size: (read/write)
> > > > > > > + *      The user application should read data_size to get the 
> > > > > > > size in bytes of
> > > > > > > + *      the data copied in the migration region during the 
> > > > > > > _SAVING state and
> > > > > > > + *      write the size in bytes of the data copied in the 
> > > > > > > migration region
> > > > > > > + *      during the _RESUMING state.
> > > > > > > + *
> > > > > > > + * The format of the migration region is as follows:
> > > > > > > + *  
> > > > > > > ------------------------------------------------------------------
> > > > > > > + * |vfio_device_migration_info|    data section                  
> > > > > > >     |
> > > > > > > + * |                          |     
> > > > > > > ///////////////////////////////  |
> > > > > > > + * 
> > > > > > > ------------------------------------------------------------------
> > > > > > > + *   ^                              ^
> > > > > > > + *  offset 0-trapped part        data_offset
> > > > > > > + *
> > > > > > > + * The structure vfio_device_migration_info is always followed 
> > > > > > > by the data
> > > > > > > + * section in the region, so data_offset will always be nonzero. 
> > > > > > > The offset
> > > > > > > + * from where the data is copied is decided by the kernel 
> > > > > > > driver. The data
> > > > > > > + * section can be trapped, mapped, or partitioned, depending on 
> > > > > > > how the kernel
> > > > > > > + * driver defines the data section. The data section partition 
> > > > > > > can be defined
> > > > > > > + * as mapped by the sparse mmap capability. If mmapped, 
> > > > > > > data_offset should be
> > > > > > > + * page aligned, whereas initial section which contains the
> > > > > > > + * vfio_device_migration_info structure, might not end at the 
> > > > > > > offset, which is
> > > > > > > + * page aligned. The user is not required to access through mmap 
> > > > > > > regardless
> > > > > > > + * of the capabilities of the region mmap.
> > > > > > > + * The vendor driver should determine whether and how to 
> > > > > > > partition the data
> > > > > > > + * section. The vendor driver should return data_offset 
> > > > > > > accordingly.
> > > > > > > + *
> > > > > > > + * The sequence to be followed for the _SAVING|_RUNNING device 
> > > > > > > state or
> > > > > > > + * pre-copy phase and for the _SAVING device state or 
> > > > > > > stop-and-copy phase is as
> > > > > > > + * follows:
> > > > > > > + * a. Read pending_bytes, indicating the start of a new 
> > > > > > > iteration to get device
> > > > > > > + *    data. Repeated read on pending_bytes at this stage should 
> > > > > > > have no side
> > > > > > > + *    effects.
> > > > > > > + *    If pending_bytes == 0, the user application should not 
> > > > > > > iterate to get data
> > > > > > > + *    for that device.
> > > > > > > + *    If pending_bytes > 0, perform the following steps.
> > > > > > > + * b. Read data_offset, indicating that the vendor driver should 
> > > > > > > make data
> > > > > > > + *    available through the data section. The vendor driver 
> > > > > > > should return this
> > > > > > > + *    read operation only after data is available from (region + 
> > > > > > > data_offset)
> > > > > > > + *    to (region + data_offset + data_size).
> > > > > > > + * c. Read data_size, which is the amount of data in bytes 
> > > > > > > available through
> > > > > > > + *    the migration region.
> > > > > > > + *    Read on data_offset and data_size should return the offset 
> > > > > > > and size of
> > > > > > > + *    the current buffer if the user application reads 
> > > > > > > data_offset and
> > > > > > > + *    data_size more than once here.      
> > > > > > > If data region is mmaped, merely reading data_offset and data_size
> > > > > > > cannot let kernel know what are correct values to return.
> > > > > > > Consider to add a read operation which is trapped into kernel to let
> > > > > > > kernel exactly know it needs to move to the next offset and update
> > > > > > > data_size?
> > > > > 
> > > > > Both operations b. and c. above are to trapped registers, operation d.
> > > > > below may potentially be to an mmap'd area, which is why we have step
> > > > > f. which indicates to the vendor driver that the data has been
> > > > > consumed.  Does that address your concern?  Thanks,
> > > > >    
> > > > No. :)
> > > > the problem is about semantics of data_offset, data_size, and
> > > > pending_bytes.
> > > > b and c do not tell kernel that the data is read by user.
> > > > so, without knowing step d happen, kernel cannot update pending_bytes to
> > > > be returned in step f.  
> > > 
> > > Sorry, I'm still not understanding, I see step f. as the indicator
> > > you're looking for.  The user reads pending_bytes to indicate the data
> > > in the migration area has been consumed.  The vendor driver updates its
> > > internal state on that read and returns the updated value for
> > > pending_bytes.  Thanks,
> > >   
> > we could not regard reading of pending_bytes as an indicator of
> > migration data consumed.
> > 
> > for 1, in migration thread, read of pending_bytes is called every
> > iteration, but reads of data_size & data_offset are not (they are
> > skippable). so it's possible that the sequence is like
> > (1) reading of pending_bytes
> > (2) reading of pending_bytes
> > (3) reading of pending_bytes
> > (4) reading of data_offset & data_size
> > (5) reading of pending_bytes
> > 
> > for 2, it's not right to force kernel to understand qemu's sequence and
> > decide that only a read of pending_bytes after reads of data_offset &
> > data_size indicates data has been consumed.
> > 
> > Agree?
> 
> No, not really.  We're defining an API that enables the above sequence,
> but doesn't require the kernel to understand QEMU's sequence.
> Specifically, pending_bytes may be read without side-effects except for
> when data is queued to read through the data area of the region.  The
> user queues data to read by reading data_offset.  The user then reads
> data_size to determine the currently available data chunk size.  This
> is followed by consuming the data from the region offset + data_offset.
> Only after reading data_offset does the read of pending_bytes signal to
> the vendor driver that the user has consumed the data.
> 
> If the user were to re-read pending_bytes before consuming the data,
> then the data_offset and data_size they may have read is invalid and
> they've violated the defined protocol.  We do not, nor do I think we
> could, make this a fool proof interface.  The user must adhere to the
> protocol, but I believe the specific sequence you've identified is
> fully enabled here.  Please confirm.  Thanks,
> 
 c. Read data_size, which is the amount of data in bytes available through
  the migration region.
  Read on data_offset and data_size should return the offset and size of
  the current buffer if the user application reads data_offset and
  data_size more than once here.      

So, if the sequence is like this:
 (1) read pending_bytes
 (2) read data_offset & data_size
 (3) read data_offset & data_size
 (4) read data_offset & data_size
 (5) read pending_bytes
then reads (2)-(4) should return the same values (and different values are allowed).
In step (5), pending_bytes should equal the value from step (1) minus the
data_size from step (4).

Is this understanding right?

Thanks
Yan

> 
> > > > > > > + * d. Read data_size bytes of data from (region + data_offset) 
> > > > > > > from the
> > > > > > > + *    migration region.
> > > > > > > + * e. Process the data.
> > > > > > > + * f. Read pending_bytes, which indicates that the data from the 
> > > > > > > previous
> > > > > > > + *    iteration has been read. If pending_bytes > 0, go to step 
> > > > > > > b.
> > > > > > > + *
> > > > > > > + * If an error occurs during the above sequence, the vendor 
> > > > > > > driver can return
> > > > > > > + * an error code for next read() or write() operation, which 
> > > > > > > will terminate the
> > > > > > > + * loop. The user application should then take the next 
> > > > > > > necessary action, for
> > > > > > > + * example, failing migration or terminating the user 
> > > > > > > application.
> > > > > > > + *
> > > > > > > + * The user application can transition from the _SAVING|_RUNNING
> > > > > > > + * (pre-copy state) to the _SAVING (stop-and-copy) state 
> > > > > > > regardless of the
> > > > > > > + * number of pending bytes. The user application should iterate 
> > > > > > > in _SAVING
> > > > > > > + * (stop-and-copy) until pending_bytes is 0.
> > > > > > > + *
> > > > > > > + * The sequence to be followed while _RESUMING device state is 
> > > > > > > as follows:
> > > > > > > + * While data for this device is available, repeat the following 
> > > > > > > steps:
> > > > > > > + * a. Read data_offset from where the user application should 
> > > > > > > write data.
> > > > > > > + * b. Write migration data starting at the migration region + 
> > > > > > > data_offset for
> > > > > > > + *    the length determined by data_size from the migration 
> > > > > > > source.
> > > > > > > + * c. Write data_size, which indicates to the vendor driver that 
> > > > > > > data is
> > > > > > > + *    written in the migration region. Vendor driver should 
> > > > > > > apply the
> > > > > > > + *    user-provided migration region data to the device resume 
> > > > > > > state.
> > > > > > > + *
> > > > > > > + * For the user application, data is opaque. The user 
> > > > > > > application should write
> > > > > > > + * data in the same order as the data is received and the data 
> > > > > > > should be of
> > > > > > > + * same transaction size at the source.
> > > > > > > + */
> > > > > > > +
> > > > > > > +struct vfio_device_migration_info {
> > > > > > > + __u32 device_state;         /* VFIO device state */
> > > > > > > +#define VFIO_DEVICE_STATE_STOP      (0)
> > > > > > > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > > > > > > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > > > > > > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > > > > > > > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> > > > > > > > +                              VFIO_DEVICE_STATE_SAVING |  \
> > > > > > > > +                              VFIO_DEVICE_STATE_RESUMING)
> > > > > > > > +
> > > > > > > > +#define VFIO_DEVICE_STATE_VALID(state) \
> > > > > > > > + (state & VFIO_DEVICE_STATE_RESUMING ? \
> > > > > > > > + (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> > > > > > > > +
> > > > > > > > +#define VFIO_DEVICE_STATE_IS_ERROR(state) \
> > > > > > > > + ((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
> > > > > > > > +                                       VFIO_DEVICE_STATE_RESUMING))
> > > > > > > > +
> > > > > > > > +#define VFIO_DEVICE_STATE_SET_ERROR(state) \
> > > > > > > > + ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_SAVING | \
> > > > > > > > +                                      VFIO_DEVICE_STATE_RESUMING)
> > > > > > > +
> > > > > > > + __u32 reserved;
> > > > > > > + __u64 pending_bytes;
> > > > > > > + __u64 data_offset;
> > > > > > > + __u64 data_size;
> > > > > > > +} __attribute__((packed));
> > > > > > > +
> > > > > > >  /*
> > > > > > >   * The MSIX mappable capability informs that MSIX data of a BAR 
> > > > > > > can be mmapped
> > > > > > >   * which allows direct access to non-MSIX registers which 
> > > > > > > happened to be within
> > > > > > > -- 
> > > > > > > 2.7.0
> > > > > > >       
> > > > > >     
> > > > >     
> > > >   
> > >   
> > 
> 


