
From: Hannes Reinecke
Subject: Re: [Qemu-devel] 答复: Re: [RFC] virtio-fc: draft idea of virtual fibre channel HBA
Date: Tue, 16 May 2017 08:34:44 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.0

On 05/15/2017 07:21 PM, Paolo Bonzini wrote:
> Thread necromancy after doing my homework and studying a bunch of specs...
>>>> I'd propose to update
>>>> struct virtio_scsi_config
>>>> with a field 'u8 initiator_id[8]'
>>>> and
>>>> struct virtio_scsi_req_cmd
>>>> with a field 'u8 target_id[8]'
>>>> and do away with the weird LUN remapping qemu has nowadays.
>>> Does it mean we don't need to provide the '-drive' and '-device scsi-hd'
>>> options on the qemu command line? So the guest can get its own LUNs
>>> through the FC switch, right?
>> No, you still would need that (at least initially).
>> But with the modifications above we can add tooling around qemu to
>> establish the correct (host) device mappings.
>> Without it we
>> a) have no idea from the host side which devices should be attached to
>> any given guest
>> b) have no idea from the guest side what the initiator and target IDs
>> are; which will get _really_ tricky if someone decides to use persistent
>> reservations from within the guest...
>> For handling NPIV properly we would need to update qemu to
>> a) locate the NPIV host based on the initiator ID from the guest
> 1) How would the initiator ID (8 bytes) relate to the WWNN/WWPN (2*8
> bytes) on the host?  Likewise for the target ID which, as I understand
> it, matches the rport's WWNN/WWPN in Linux's FC transport.
Actually, there's no need to keep WWNN and WWPN separate. The original
idea was to have a WWNN (world-wide node name) to refer to the system,
and the WWPN (world-wide port name) to refer to the FC port.
But as most FC cards are standalone each card will have a unique WWNN,
and a unique WWPN per port.
So if the card only has one port, it'll have one WWNN and one WWPN.
And in basically all instances the one is derived from the other.
Plus SAM only knows about a single initiator identifier; it's an FC
peculiarity that it has _two_.

So indeed, it might be better to define the identifier in a broader,
transport-neutral sense.

Maybe a union with an overall size of 256 bytes (to hold the iSCSI iqn
string), which for FC carries the WWPN and the WWNN?

> 2) If the initiator ID is the moral equivalent of a MAC address,
> shouldn't it be the host that provides the initiator ID to the guest in
> the virtio-scsi config space, rather than the guest providing it in the
> request?  (From your proposal, I'd guess it's the latter, but maybe I
> am not reading correctly.)
That would depend on the emulation. For emulated SCSI disks I guess
we need to specify it on the command line somewhere, but for SCSI
passthrough we could grab it from the underlying device.

> 3) An initiator ID in virtio-scsi config space is orthogonal to an
> initiator IDs in the request.  The former is host->guest, the latter is
> guest->host and can be useful to support virtual (nested) NPIV.
I don't think so. My idea is to have the initiator ID tied to the virtio
queue, so it wouldn't really matter _who_ sets the ID.
On the host we would use the (guest) initiator ID to establish the
connection between the virtio queue and the underlying device, be it a
qemu block device or a 'real' host block device.

>> b) stop exposing the devices attached to that NPIV host to the guest
> What do you mean exactly?
That's one of the longer term plans I have.
When doing NPIV currently all devices from the NPIV host appear on the
host. Including all partitions, LVM devices and what not.
This can lead to unwanted side-effects (systemd helpfully enables the
swap device found on a partition on the host, when the actual block
device is being passed through to a guest ...). So ideally I would _not_
parse any partition table or metadata on those devices. But that
requires a priori knowledge like "that device whose number I don't know
and whose identity is unknown, but which I'm sure will appear shortly,
is going to be forwarded to a guest."
If we make the (guest) initiator ID identical to the NPIV WWPN we can
tag the _host_ to not expose any partitions on any LUNs, making the
above quite easy.
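A very rough sketch of what such host-side tagging could look like,
assuming a udev-based implementation; the `ID_FC_PORT_NAME` property
name and the WWPN value are illustrative assumptions, not something
qemu or the kernel guarantees today:

```
# Illustrative only: if a block device is reached through an NPIV vport
# whose WWPN matches the one handed to the guest, mark it as not ready
# so systemd never mounts, fscks, or swap-activates it on the host.
ACTION=="add|change", SUBSYSTEM=="block", \
  ENV{ID_FC_PORT_NAME}=="0x5001438001234567", \
  ENV{SYSTEMD_READY}="0", OPTIONS+="nowatch"
```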

>> c) establish a 'rescan' routine to capture any state changes (LUN
>> remapping etc) of the NPIV host.
> You'd also need "target add" and "target removed" events.  At this
> point, this looks a lot less virtio-scsi and a lot more like virtio-fc
> (with a 'cooked' FCP-based format of its own).
Yeah, that's a long shot indeed.

> At this point, I can think of several ways to do this, one being SG_IO
> in QEMU while the others are more esoteric.
> 1) use virtio-scsi with userspace passthrough (current solution).
> Advantages:
> - guests can be stopped/restarted across hosts with different HBAs
> - completely oblivious to host HBA driver
> - no new guest drivers are needed (well, almost due to above issues)
> - out-of-the-box support for live migration, albeit with hacks required
> such as Hyper-V's two WWNN/WWPN pairs
> Disadvantages:
> - no full FCP support
> - guest devices exposed as /dev nodes to the host
> 2) the exact opposite: use the recently added "mediated device
> passthrough" (mdev) framework to present a "fake" PCI device to the
> guest.  mdev is currently used for vGPU and will also be used by s390
> for CCW passthrough.  It lets the host driver take care of device
> emulation, and the result is similar to an SR-IOV virtual function but
> without requiring SR-IOV in the host.  The PCI device would presumably
> reuse in the guest the same driver as the host.
> Advantages:
> - no new guest drivers are needed
> - solution confined entirely within the host driver
> - each driver can use its own native 'cooked' format for FC frames
> Disadvantages:
> - specific to each HBA driver
> - guests cannot be stopped/restarted across hosts with different HBAs
> - it's still device passthrough, so live migration is a mess (and would
> require guest-specific code in QEMU)
> 3) handle passthrough with a kernel driver.  Under this model, the guest
> uses the virtio device, but the passthrough of commands and TMFs is
> performed by the host driver.  The host driver grows the option to
> present an NPIV vport through a vhost interface (*not* the same thing as
> LIO's vhost-scsi target, but a similar API with a different /dev node or
> even one node per scsi_host).
> We can then choose whether to do it with virtio-scsi or with a new
> virtio-fc.
> Advantages:
> - guests can be stopped/restarted across hosts with different HBAs
> - no need to do the "two WWNN/WWPN pairs" hack for live migration,
> unlike e.g. Hyper-V
> - a bit Rube Goldberg, but the vhost interface can be consumed by any
> userspace program, not just by virtual machines
> Disadvantages:
> - requires a new generalized vhost-scsi (or vhost-fc) layer
> - not sure about support for live migration (what to do about in-flight
> commands?)
> I don't know the Linux code well enough to know if it would require code
> specific to each HBA driver.  Maybe just some refactoring.
> 4) same as (3), but in userspace with a "macvtap" like layer (e.g.,
> socket+bind creates an NPIV vport).  This layer can work on some kind of
> FCP encapsulation, not the raw thing, and virtio-fc could be designed
> according to a similar format for simplicity.
> Advantages:
> - less dependencies on kernel code
> - simplest for live migration
> - most flexible for userspace usage
> Disadvantages:
> - possibly two packs of cats to herd (SCSI + networking)?
> - haven't thought much about it, so I'm not sure about the feasibility
> Again, I don't know the Linux code well enough to know if it would
> require code specific to each HBA driver.
> If we can get the hardware manufacturers (and the SCSI maintainers...)
> on board, (3) would probably be pretty easy to achieve, even accounting
> for the extra complication of writing a virtio-fc specification.  Really
> just one hardware manufacturer, the others would follow suit.
With option (1) and the target/initiator ID extensions we should be able
to get basic NPIV support to work, and would even be able to handle
reservations in a sane manner.

(4) would require raw FCP frame access, which is one thing we do _not_
have. Each card (except for the pure FCoE ones like bnx2fc, fnic, and
fcoe) only allows access to pre-formatted I/O commands, and has its own
mechanism for generating sequence IDs etc. So anything requiring raw FCP
access is basically out of the game.

(3) would be feasible, as it would effectively mean 'just' to update the
current NPIV mechanism. However, this would essentially lock us in for
FC; any other types (think NVMe) will require yet another solution.

(2) sounds interesting, but I'd have to have a look into the code to
figure out if it could easily be done.

> (2) would probably be what the manufacturers like best, but it would be
> worse for lock in.  Or... they would like it best *because* it would be
> worse for lock in.
> The main disadvantage of (2)/(3) against (1) is more complex testing.  I
> guess we can add a vhost-fc target for testing to LIO, so as not to
> require an FC card for guest development.  And if it is still a problem
> 'cause configfs requires root, we can add a fake FC target in QEMU.
Overall, I would vote to specify a new virtio scsi format _first_,
keeping in mind all of these options.
(1), (3), and (4) all require an update anyway :-)

The big advantage I see with (1) is that it can be added with just some
code changes to qemu and virtio-scsi. Every other option requires some
vendor buy-in, which inevitably leads to more discussions, delays, and
more complex interaction (changes to qemu, virtio, _and_ the affected
HBAs).

While we're at it: we also need a 'timeout' field in the virtio request
structure. I even posted an RFC for it :-)


Dr. Hannes Reinecke                Teamlead Storage & Networking
address@hidden                                 +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
