From: Rakesh Ranjan
Subject: Re: [Qemu-devel] [PATCH v7 RFC] block/vxhs: Initial commit to add Veritas HyperScale VxHS block device support
Date: Wed, 30 Nov 2016 04:20:03 +0000
User-agent: Microsoft-MacOutlook/

Hello Stefan,

>>>>> Why does the client have to know about failover if it's connected to
>>>>>a server process on the same host?  I thought the server process
>>>>>manages networking issues (like the actual protocol to speak to other
>>>>>VxHS nodes and for failover).

Just to comment on this: the model followed within HyperScale is to
preserve application I/O continuity (resiliency) in the cases described
below. This adds real value for customers and avoids single points of
failure.

1. HyperScale storage service failure (QNIO server)
        - The daemon manages local storage for VMs and runs on each compute
          node.
        - The daemon can run as a service on the hypervisor itself or within
          a VSA (Virtual Storage Appliance, i.e. a virtual machine running on
          the hypervisor), depending on the ecosystem in which HyperScale is
          supported.
        - A daemon or storage service that is down, crashes, or crashes in a
          loop shouldn't have a huge impact on all the VMs running on that
          hypervisor or compute node, so service-level resiliency is very
          useful for application I/O continuity in such cases.

        - Service failure can only be handled on the client side, not on the
          server side, since the service acting as the server is itself down.
        - The client detects an I/O error and, depending on its logic, fails
          application I/O over to another available/active QNIO server or
          HyperScale storage service running on a different compute node
          (the reflection/replication node).
        - Once the original server comes back online, the client receives a
          negotiated error (not a real application error) telling it to fail
          application I/O back to the original server or local HyperScale
          storage service for better I/O performance.
2. Local physical storage or media failure
        - Once the server or HyperScale storage service detects a media or
          local disk failure, then depending on the vDisk (guest disk)
          configuration, if another storage copy is available on a different
          compute node, it handles the local fault internally and continues
          to serve application read and write requests; otherwise the
          application or client gets the fault.
        - The client doesn't see any I/O failure, since the server or storage
          service handles the fault tolerance.
        - In such a case, to gain some I/O performance, once the client gets
          a negotiated error (not an application error) from the local server
          or storage service, it can initiate I/O failover and send
          application I/O directly to another compute node where a storage
          copy is available, instead of sending it locally where media is
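
As a rough illustration of the client-side behaviour described in cases 1
and 2 above, here is a minimal sketch. It is not the libqnio
implementation: the class, the result codes, and the send callback are all
hypothetical names made up for this example, and the retry bound is an
arbitrary assumption.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Result(Enum):
    """Hypothetical outcome codes for one I/O request."""
    OK = auto()
    IO_ERROR = auto()          # server unreachable, crashed, or faulted
    NEGOTIATED_ERROR = auto()  # server asks the client to redirect its I/O


@dataclass
class FailoverClient:
    """Tracks which server currently receives application I/O."""
    local: str                       # local QNIO server / storage service
    replicas: list                   # servers on other compute nodes
    active: str = field(init=False)  # current I/O target

    def __post_init__(self):
        self.active = self.local

    def _next_target(self, result):
        on_local = self.active == self.local
        if result is Result.NEGOTIATED_ERROR and not on_local:
            # The original server is back online: fail back to it.
            return self.local
        # Otherwise fail over to a replica other than the current target.
        candidates = [r for r in self.replicas if r != self.active]
        if not candidates:
            raise IOError("no alternate server available for failover")
        return candidates[0]

    def submit(self, send, request, max_attempts=4):
        """Send one request, switching servers on errors.

        `send(target, request)` is a hypothetical transport callback that
        returns a Result. Returns the name of the server that serviced
        the request.
        """
        for _ in range(max_attempts):
            result = send(self.active, request)
            if result is Result.OK:
                return self.active
            self.active = self._next_target(result)
        raise IOError("request failed on every attempted server")
```

The point of the sketch is only that the decision lives in the client: an
I/O error or a negotiated error from the local server moves I/O to a
replica, and a negotiated error from a replica moves it back.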


On 11/29/16, 4:45 PM, "ashish mittal" <address@hidden> wrote:

>+ Rakesh from Veritas
>On Mon, Nov 28, 2016 at 6:17 AM, Stefan Hajnoczi <address@hidden>
>> On Mon, Nov 28, 2016 at 10:23:41AM +0000, Ketan Nilangekar wrote:
>>> On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <address@hidden> wrote:
>>>     On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote:
>>>     > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <address@hidden>
>>>     >     On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar
>>>     >     > On 11/24/16, 4:41 PM, "Stefan Hajnoczi"
>>><address@hidden> wrote:
>>>     >     >     On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan
>>>Nilangekar wrote:
>>>     >     >     > On 11/24/16, 4:07 AM, "Paolo Bonzini"
>>><address@hidden> wrote:
>>>     >     >     > >On 23/11/2016 23:09, ashish mittal wrote:
>>>     >     >     > >> On the topic of protocol security -
>>>     >     >     > >>
>>>     >     >     > >> Would it be enough for the first patch to
>>>implement only
>>>     >     >     > >> authentication and not encryption?
>>>     >     >     > >
>>>     >     >     > >Yes, of course.  However, as we introduce more and
>>>more QEMU-specific
>>>     >     >     > >characteristics to a protocol that is already
>>>QEMU-specific (it doesn't
>>>     >     >     > >do failover, etc.), I am still not sure of the
>>>actual benefit of using
>>>     >     >     > >libqnio versus having an NBD server or FUSE driver.
>>>     >     >     > >
>>>     >     >     > >You have already mentioned performance, but the
>>>design has changed so
>>>     >     >     > >much that I think one of the two things has to
>>>change: either failover
>>>     >     >     > >moves back to QEMU and there is no (closed source)
>>>translator running on
>>>     >     >     > >the node, or the translator needs to speak a
>>>well-known and
>>>     >     >     > >already-supported protocol.
>>>     >     >     >
>>>     >     >     > IMO design has not changed. Implementation has
>>>changed significantly. I would propose that we keep resiliency/failover
>>>code out of QEMU driver and implement it entirely in libqnio as planned
>>>in a subsequent revision. The VxHS server does not need to
>>>understand/handle failover at all.
>>>     >     >     >
>>>     >     >     > Today libqnio gives us significantly better
>>>performance than any NBD/FUSE implementation. We know because we have
>>>prototyped with both. Significant improvements to libqnio are also in
>>>the pipeline which will use cross memory attach calls to further boost
>>>performance. Of course a big reason for the performance is also the
>>>HyperScale storage backend but we believe this method of IO
>>>tapping/redirecting can be leveraged by other solutions as well.
>>>     >     >
>>>     >     >     By "cross memory attach" do you mean
>>>     >     >     process_vm_readv(2)/process_vm_writev(2)?
>>>     >     >
>>>     >     > Ketan> Yes.
>>>     >     >
>>>     >     >     That puts us back to square one in terms of security.
>>>You have
>>>     >     >     (untrusted) QEMU + (untrusted) libqnio directly
>>>accessing the memory of
>>>     >     >     another process on the same machine.  That process is
>>>therefore also
>>>     >     >     untrusted and may only process data for one guest so
>>>that guests stay
>>>     >     >     isolated from each other.
>>>     >     >
>>>     >     > Ketan> Understood but this will be no worse than the
>>>current network based communication between qnio and vxhs server. And
>>>although we have questions around QEMU trust/vulnerability issues, we
>>>are looking to implement basic authentication scheme between libqnio
>>>and vxhs server.
>>>     >
>>>     >     This is incorrect.
>>>     >
>>>     >     Cross memory attach is equivalent to ptrace(2) (i.e.
>>>debugger) access.
>>>     >     It means process A reads/writes directly from/to process B
>>>memory.  Both
>>>     >     processes must have the same uid/gid.  There is no trust
>>>     >     between them.
>>>     >
>>>     > Ketan> Not if vxhs server is running as root and initiating the
>>>cross mem attach. Which is also why we are proposing a basic
>>>authentication mechanism between qemu-vxhs. But anyway the cross memory
>>>attach is for a near future implementation.
>>>     >
>>>     >     Network communication does not require both processes to
>>>have the same
>>>     >     uid/gid.  If you want multiple QEMU processes talking to a
>>>single server
>>>     >     there must be a trust boundary between client and server.
>>>The server
>>>     >     can validate the input from the client and reject undesired
>>>     >
>>>     > Ketan> This is what we are trying to propose. With the addition
>>>of authentication between qemu-vxhs server, we should be able to
>>>achieve this. Question is, would that be acceptable?
>>>     >
>>>     >     Hope this makes sense now.
>>>     >
>>>     >     Two architectures that implement the QEMU trust model
>>>correctly are:
>>>     >
>>>     >     1. Cross memory attach: each QEMU process has a dedicated
>>>vxhs server
>>>     >        process to prevent guests from attacking each other.
>>>This is where I
>>>     >        said you might as well put the code inside QEMU since
>>>there is no
>>>     >        isolation anyway.  From what you've said it sounds like
>>>the vxhs
>>>     >        server needs a host-wide view and is responsible for all
>>>     >        running on the host, so I guess we have to rule out this
>>>     >        architecture.
>>>     >
>>>     >     2. Network communication: one vxhs server process and
>>>multiple guests.
>>>     >        Here you might as well use NBD or iSCSI because it
>>>already exists and
>>>     >        the vxhs driver doesn't add any unique functionality over
>>>     >        protocols.
>>>     >
>>>     > Ketan> NBD does not give us the performance we are trying to
>>>achieve. Besides NBD does not have any authentication support.
>>>     NBD over TCP supports TLS with X.509 certificate authentication.  I
>>>     think Daniel Berrange mentioned that.
>>> Ketan> I saw the patch to nbd that was merged in 2015. Before that NBD
>>>did not have any auth as Daniel Berrange mentioned.
>>>     NBD over AF_UNIX does not need authentication because it relies on
>>>     permissions for access control.  Each guest should have its own
>>>     domain socket that it connects to.  That socket can only see
>>>     that have been assigned to the guest.
>>>     > There is a hybrid 2.a approach which uses both 1 & 2 but I'd
>>>keep that for a later discussion.
>>>     Please discuss it now so everyone gets on the same page.  I think
>>>     is a big gap and we need to communicate so that progress can be
>>> Ketan> The approach was to use cross mem attach for IO path and a
>>>simplified network IO lib for resiliency/failover. Did not want to
>>>derail the current discussion hence the suggestion to take it up later.
>> Why does the client have to know about failover if it's connected to a
>> server process on the same host?  I thought the server process manages
>> networking issues (like the actual protocol to speak to other VxHS nodes
>> and for failover).
>>>     >     >     There's an easier way to get even better performance:
>>>get rid of libqnio
>>>     >     >     and the external process.  Move the code from the
>>>external process into
>>>     >     >     QEMU to eliminate the
>>>process_vm_readv(2)/process_vm_writev(2) and
>>>     >     >     context switching.
>>>     >     >
>>>     >     >     Can you remind me why there needs to be an external
>>>     >     >
>>>     >     > Ketan>  Apart from virtualizing the available direct
>>>attached storage on the compute, vxhs storage backend (the external
>>>process) provides features such as storage QoS, resiliency, efficient
>>>use of direct attached storage, automatic storage recovery points
>>>(snapshots) etc. Implementing this in QEMU is not practical and not the
>>>purpose of proposing this driver.
>>>     >
>>>     >     This sounds similar to what QEMU and Linux (file systems,
>>>     >     etc) already do.  It brings to mind a third architecture:
>>>     >
>>>     >     3. A Linux driver or file system.  Then QEMU opens a raw
>>>block device.
>>>     >        This is what the Ceph rbd block driver in Linux does.
>>>     >        architecture has a kernel-userspace boundary so vxhs does
>>>not have to
>>>     >        trust QEMU.
>>>     >
>>>     >     I suggest Architecture #2.  You'll be able to deploy on
>>>existing systems
>>>     >     because QEMU already supports NBD or iSCSI.  Use the time
>>>you gain from
>>>     >     switching to this architecture on benchmarking and
>>>optimizing NBD or
>>>     >     iSCSI so performance is closer to your goal.
>>>     >
>>>     > Ketan> We have made a choice to go with QEMU driver approach
>>>after serious evaluation of most if not all standard IO tapping
>>>mechanisms including NFS, NBD and FUSE. None of these has been able to
>>>deliver the performance that we have set ourselves to achieve. Hence
>>>the effort to propose this new IO tap which we believe will provide an
>>>alternate to the existing mechanisms and hopefully benefit the
>>>     I thought the VxHS block driver was another network block driver
>>>     GlusterFS or Sheepdog but you are actually proposing a new local
>>>I/O tap
>>>     with the goal of better performance.
>>> Ketan> The VxHS block driver is a new local IO tap with the goal of
>>>better performance specifically when used with the VxHS server. This
>>>coupled with shared mem IPC (like cross mem attach) could be a much
>>>better IO tap option for qemu users. This will also avoid context
>>>switch between qemu/network stack to service which happens today in NBD.
>>>     Please share fio(1) or other standard benchmark configuration
>>>files and
>>>     performance results.
>>> Ketan> We have fio results with the VxHS storage backend which I am
>>>not sure I can share in a public forum.
>>>     NBD and libqnio wire protocols have comparable performance
>>>     characteristics.  There is no magic that should give either one a
>>>     fundamental edge over the other.  Am I missing something?
>>> Ketan> I have not seen the NBD code but few things which we considered
>>>and are part of libqnio (though not exclusively) are low protocol
>>>overhead, threading model, queueing, latencies, memory pools, zero data
>>>copies in user-land, scatter-gather write/read etc. Again these are not
>>>exclusive to libqnio but could give one protocol the edge over the
>>>other. Also part of the "magic" is also in the VxHS storage backend
>>>which is able to ingest the IOs with lower latencies.
>>>     The main performance difference is probably that libqnio opens 8
>>>     simultaneous connections but that's not unique to the wire
>>>     What happens when you run 8 NBD simultaneous TCP connections?
>>> Ketan> Possibly. We have not benchmarked this.
>> There must be benchmark data if you want to add a new feature or modify
>> existing code for performance reasons.  This rule is followed in QEMU so
>> that performance changes are justified.
>> I'm afraid that when you look into the performance you'll find that any
>> performance difference between NBD and this VxHS patch series is due to
>> implementation differences that can be ported across to QEMU NBD, rather
>> than wire protocol differences.
>> If that's the case then it would save a lot of time to use NBD over
>> AF_UNIX for now.  You could focus efforts on achieving the final
>> architecture you've explained with cross memory attach.
>> Please take a look at vhost-user-scsi, which folks from Nutanix are
>> currently working on.  See "[PATCH v2 0/3] Introduce vhost-user-scsi and
>> sample application" on qemu-devel.  It is a true zero-copy local I/O tap
>> because it shares guest RAM.  This is more efficient than cross memory
>> attach's single memory copy.  It does not require running the server as
>> root.  This is the #1 thing you should evaluate for your final
>> architecture.
>> vhost-user-scsi works on the virtio-scsi emulation level.  That means
>> the server must implement the virtio-scsi vring and device emulation.
>> It is not a block driver.  By hooking in at this level you can achieve
>> the best performance but you lose all QEMU block layer functionality and
>> need to implement your own SCSI target.  You also need to consider live
>> migration.
>> Stefan
