[Qemu-devel] host side todo list for virtio rdma
From: Michael S. Tsirkin
Date: Wed, 19 Jul 2017 05:05:37 +0300
Here are some thoughts on bits that are still missing to get a working
virtio-rdma, with some suggestions. These are very preliminary but I
feel I kept these in my head (and discussed offline) for too long. All
of the below is just my personal humble opinion.
Feature Requirements:
The basic requirement is to be able to do RDMA to/from
VM memory, with support for VM migration and/or memory
overcommit and/or autonuma and/or THP.
Why are migration/overcommit/autonuma required?
Without these, you can do RDMA with device passthrough,
with likely better performance.
Feature Non-requirements:
It's not a requirement to support RDMA without VM exits,
e.g. like with device passthrough. While avoiding exits improves
performance, that would be useful for more than just RDMA,
so there seems to be no reason to require it from RDMA when we
do not have it for e.g. network.
Assumptions:
It's OK to assume specific hardware capabilities at least initially.
High level architecture:
Follows the same lines as most other virtio devices:
+-----------------------------------
+
+ guest kernel
+ ^
+-------------|----------------------
+ v
+ host kernel (kvm, vhost)
+
+ ^
+-------------|----------------------
+ v
+
+ host userspace (QEMU, vhost-user)
+
+-----------------------------------
Each request is forwarded by host kernel to QEMU,
which executes it using the ibverbs library.
Most of this should be implementable host-side using existing
software. However, several issues remain and would need
infrastructure changes, as outlined below.
Host-side todo list for virtio-rdma support:
- Memory registration for guest userspace.
Register memory region verb accepts a single virtual address,
which supplies both the on-wire key for access and the
range of memory to access. Guest kernel turns this into a
list of pages (e.g. by get_user_pages); when forwarded to host this
turns into a s/g list of virtual addresses in QEMU address space.
Suggestion: add a new verb, along the lines of ibv_register_physical,
which splits these two parameters, accepting the on-wire VA key
and separately a list of userspace virtual address/size pairs.
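To make the suggested split concrete, here is a minimal sketch that models a region keyed by one on-wire VA but backed by a list of address/size pairs. It is only an illustration in Python: `SplitRegion` and `translate` are hypothetical names, not part of ibverbs, and a real implementation would live in the verbs library and the hardware translation tables.

```python
from bisect import bisect_right
from typing import List, Tuple


class SplitRegion:
    """Hypothetical model: one on-wire base VA backed by (host_va, size) segments."""

    def __init__(self, wire_va: int, segments: List[Tuple[int, int]]):
        self.wire_va = wire_va
        self.segments = list(segments)
        # starts[i] is the offset inside the region at which segment i begins.
        self.starts: List[int] = []
        offset = 0
        for _, size in self.segments:
            if size <= 0:
                raise ValueError("segment size must be positive")
            self.starts.append(offset)
            offset += size
        self.length = offset

    def translate(self, wire_addr: int) -> int:
        """Map an address as seen on the wire to the backing host virtual address."""
        offset = wire_addr - self.wire_va
        if not 0 <= offset < self.length:
            raise ValueError("address outside registered region")
        # Find the last segment whose start offset is <= offset.
        i = bisect_right(self.starts, offset) - 1
        host_va, _ = self.segments[i]
        return host_va + (offset - self.starts[i])
```

A remote peer keeps using the single guest VA, while the host side sees only the s/g list in QEMU address space.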
- Memory registration for guest kernels.
Another ability used by some in-kernel users is registering all of memory.
Ranges not actually present are never accessed - this is OK as
kernel users are trusted. Memory hotplug changes which ranges
are present.
Suggestion: add some throw-away memory and map all
non-present ranges there. Add ibv_reregister_physical_mr or similar
API to update mappings on guest memory hotplug/unplug.
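A rough sketch of the hole-filling idea, assuming the present ranges are known as sorted, non-overlapping (guest_pa, size, host_va) tuples. `fill_holes` is a hypothetical helper; its output is the kind of list one would hand to the proposed (not existing) ibv_reregister_physical_mr, and it would be recomputed on each hotplug/unplug event.

```python
from typing import List, Tuple

Range = Tuple[int, int, int]  # (guest_pa, size, host_va)


def fill_holes(present: List[Range], total_size: int,
               scratch_va: int, scratch_size: int) -> List[Range]:
    """Cover [0, total_size) fully, pointing non-present ranges at scratch memory."""
    if scratch_size <= 0:
        raise ValueError("scratch buffer must be non-empty")
    out: List[Range] = []

    def add_hole(start: int, end: int) -> None:
        # A hole larger than the scratch buffer maps onto it repeatedly.
        while start < end:
            chunk = min(scratch_size, end - start)
            out.append((start, chunk, scratch_va))
            start += chunk

    cursor = 0
    for pa, size, va in present:
        if size <= 0 or pa < cursor:
            raise ValueError("ranges must be sorted, non-empty and non-overlapping")
        add_hole(cursor, pa)
        out.append((pa, size, va))
        cursor = pa + size
    if cursor > total_size:
        raise ValueError("present range exceeds total_size")
    add_hole(cursor, total_size)
    return out
```

Since kernel users never access the non-present ranges, the contents of the scratch memory do not matter.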
- Memory overcommit/autonuma/THP.
This includes techniques such as swap, KSM, COW, and page migration.
All these rely on ability to move pages around without
breaking hardware access.
Suggestion: for hardware that supports it,
enabling on-demand paging for all registered memory seems
to address the issue more or less transparently to guests.
This isn't supported by all hardware but might be
at least a reasonable first step.
- Migration: memory tracking.
Migration requires detecting hardware access to pages
either on write (pre-copy) or any access (post-copy).
Post copy just requires ODP support to work with
userfaultfd properly.
Pre-copy would require a write-tracking API along
the lines of one exposed by KVM or vhost.
Each tracked page would be write-protected (causing faults on
hardware access); on a hardware write, a fault is generated
and recorded, and the page is made writeable.
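The pre-copy scheme above can be modelled as a small simulation. This is a sketch only: `WriteTracker` and its methods are hypothetical names, and the 4096-byte page size is an assumption. In a real implementation the protection and the faults would be handled by the hardware and driver, not by software like this.

```python
from typing import List


class WriteTracker:
    """Hypothetical model of write-protect based dirty page tracking."""

    PAGE_SIZE = 4096  # assumed page size

    def __init__(self, num_pages: int):
        self.writable = [True] * num_pages
        self.dirty = set()

    def start_tracking(self) -> None:
        # Write-protect every page so the first hardware write faults.
        self.writable = [False] * len(self.writable)
        self.dirty.clear()

    def device_write(self, addr: int, length: int) -> None:
        """Simulate a hardware write to [addr, addr + length)."""
        if length <= 0:
            return
        first = addr // self.PAGE_SIZE
        last = (addr + length - 1) // self.PAGE_SIZE
        if addr < 0 or last >= len(self.writable):
            raise ValueError("write outside tracked memory")
        for page in range(first, last + 1):
            if not self.writable[page]:
                # Write fault: record the page, then make it writeable.
                self.dirty.add(page)
                self.writable[page] = True

    def fetch_and_reset(self) -> List[int]:
        """Return dirty pages and re-protect them for the next migration pass."""
        pages = sorted(self.dirty)
        for page in pages:
            self.writable[page] = False
        self.dirty.clear()
        return pages
```

Each migration pass would call `fetch_and_reset()` and resend the returned pages, much like the dirty log exposed by KVM or vhost.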
- Migration: moving QP numbers.
QP numbers are exposed on the wire and so must move together
with the VM.
Suggestion: allow specifying QP number when creating a QP.
To avoid conflicts between multiple users, initial version can limit
library to a single user per device. Multiple VMs can simply
attach to distinct VFs.
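As an illustration of what "specifying QP number when creating a QP" implies for whoever owns the number space, here is a minimal sketch of an allocator that honours a requested number and fails on conflict. `QpnAllocator` is a hypothetical name and the number range is a parameter, not something taken from a real device.

```python
from typing import Optional


class QpnAllocator:
    """Hypothetical per-device QP number allocator for a single user."""

    def __init__(self, first: int, last: int):
        if first > last:
            raise ValueError("empty QP number range")
        self.first = first
        self.last = last
        self.in_use = set()

    def alloc(self, requested: Optional[int] = None) -> int:
        """Allocate the requested QPN (migration destination) or any free one."""
        if requested is not None:
            if not self.first <= requested <= self.last:
                raise ValueError("requested QPN out of range")
            if requested in self.in_use:
                raise ValueError("requested QPN already in use")
            self.in_use.add(requested)
            return requested
        for qpn in range(self.first, self.last + 1):
            if qpn not in self.in_use:
                self.in_use.add(qpn)
                return qpn
        raise RuntimeError("no free QP numbers")

    def free(self, qpn: int) -> None:
        self.in_use.remove(qpn)
```

With a single user per device (or one VF per VM), a conflict on the destination can only come from the migrated VM itself, which is what makes the simple version workable.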
- Migration: moving QP state.
When migrating the VM, a QP has to be torn down
on source and created on destination.
We have to migrate e.g. the current PSN - but what
should happen when a new packet arrives on source
after QP has been torn down?
Suggestion 1: move QP to a special state "suspended" and ignore
packets, or cause the remote sender to retransmit, e.g. by returning
an out of resources error. Retransmit counter might need to be
adjusted compared to what guest requested to account
for the extra need to retransmit.
Is there a good existing QP state that does this?
Suggestion 2: forward packets to destination somehow.
Might overload the fabric as we are crossing e.g.
PCI bus multiple times.
- Migration: network update
RoCE v1 and InfiniBand seem to tie connections to
hardware-specific GIDs which cannot be moved by software.
Suggestion: limit migration to RoCE v2 initially.
- Migration: packet loss recovery.
As a RoCE address moves across the network, the network has
to be updated, which takes time; meanwhile, packet loss seems
to be hard to avoid.
Suggestion: limit initial support to hardware that is
able to recover from occasional packet drops, with
some slowdown.
- Migration: suspend/resume API?
It might be easier to pack up the state of all resources,
such as all QP numbers and the state of all QPs,
in a single memory buffer, migrate it, then unpack it on destination.
Removes need for 2 separate APIs for suspended state and
for specifying QPN on creation.
This creates a format for serialization that will have to
be maintained in a compatible way - it is not clear that
the maintenance overhead is worth the potential
simplification, if any.
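To give a feel for the compatibility burden, here is a minimal sketch of such a serialization format. Everything in it is an assumption made for illustration: the magic value, the version field, and the per-QP record (QP number, state, send and receive PSN) are hypothetical and far smaller than what real hardware state would need.

```python
import struct
from typing import List, NamedTuple

MAGIC = b"VRDM"   # hypothetical format tag
VERSION = 1       # bumped on any incompatible layout change
HEADER = struct.Struct("<4sHH")   # magic, version, number of QP records
RECORD = struct.Struct("<IBII")   # qpn, state, send PSN, receive PSN


class QpState(NamedTuple):
    qpn: int
    state: int
    sq_psn: int
    rq_psn: int


def pack_qps(qps: List[QpState]) -> bytes:
    """Suspend side: serialize all QP state into one buffer."""
    buf = bytearray(HEADER.pack(MAGIC, VERSION, len(qps)))
    for qp in qps:
        buf += RECORD.pack(qp.qpn, qp.state, qp.sq_psn, qp.rq_psn)
    return bytes(buf)


def unpack_qps(buf: bytes) -> List[QpState]:
    """Resume side: validate the buffer and rebuild the QP list."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer too short")
    magic, version, count = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if version != VERSION:
        raise ValueError("unsupported format version")
    if len(buf) != HEADER.size + count * RECORD.size:
        raise ValueError("buffer length does not match record count")
    return [
        QpState(*RECORD.unpack_from(buf, HEADER.size + i * RECORD.size))
        for i in range(count)
    ]
```

The version check is where the maintenance cost shows up: every layout change needs either a new version with conversion code or a refusal to migrate between mismatched hosts.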
That's it - I hope this helps, feel free to discuss, preferably copying
virtio-dev (subscription required for now, people are looking into
fixing this, sorry about that).
Thanks!
--
MST