qemu-block

Re: [PATCH 0/1] introduce nvmf block driver


From: Stefan Hajnoczi
Subject: Re: [PATCH 0/1] introduce nvmf block driver
Date: Tue, 8 Jun 2021 14:36:51 +0100

On Tue, Jun 08, 2021 at 09:03:21PM +0800, zhenwei pi wrote:
> On 6/8/21 8:59 PM, Stefan Hajnoczi wrote:
> > On Tue, Jun 08, 2021 at 08:19:20PM +0800, zhenwei pi wrote:
> > > On 6/8/21 4:07 PM, Stefan Hajnoczi wrote:
> > > > On Tue, Jun 08, 2021 at 10:52:05AM +0800, zhenwei pi wrote:
> > > > > On 6/7/21 11:08 PM, Stefan Hajnoczi wrote:
> > > > > > On Mon, Jun 07, 2021 at 09:32:52PM +0800, zhenwei pi wrote:
> > > > > > > Since 2020, I have been developing a userspace NVMF initiator
> > > > > > > library:
> > > > > > > https://github.com/bytedance/libnvmf
> > > > > > > and recently released v0.1.
> > > > > > > 
> > > > > > > I also developed a block driver for the QEMU side:
> > > > > > > https://github.com/pizhenwei/qemu/tree/block-nvmf
> > > > > > > 
> > > > > > > Testing against the Linux kernel NVMF target (TCP), QEMU gets
> > > > > > > about 220K IOPS, which seems good.
> > > > > > 
> > > > > > How does the performance compare to the Linux kernel NVMeoF 
> > > > > > initiator?
> > > > > > 
> > > > > > In case you're interested, some Red Hat developers have started
> > > > > > working on a new library called libblkio. For now it supports
> > > > > > io_uring, but PCI NVMe and virtio-blk are on the roadmap. The
> > > > > > library supports blocking, event-driven, and polling modes. There
> > > > > > isn't a direct overlap with libnvmf, but maybe they can learn from
> > > > > > each other.
> > > > > > https://gitlab.com/libblkio/libblkio/-/blob/main/docs/blkio.rst
> > > > > > 
> > > > > > Stefan
> > > > > > 
> > > > > 
> > > > > I'm sorry that I didn't provide enough information about the QEMU
> > > > > block nvmf driver and libnvmf.
> > > > > 
> > > > > Kernel initiator & userspace initiator
> > > > > Rather than an io_uring/libaio + kernel initiator solution (read 500K+
> > > > > IOPS & write 200K+ IOPS), I prefer QEMU block nvmf + libnvmf (RW 200K+
> > > > > IOPS):
> > > > > 1, I don't have to upgrade the host kernel, and I can run it on an
> > > > > older kernel version.
> > > > > 2, During re-connection, if the target side hits a panic, the initiator
> > > > > side does not end up in 'D' state (uninterruptible sleep in the kernel);
> > > > > QEMU can always be killed.
> > > > > 3, It's easier to troubleshoot a userspace application.
> > > > 
> > > > I see, thanks for sharing.
> > > > 
> > > > > Default NVMe-oF IO queues
> > > > > The mechanism of QEMU + libnvmf:
> > > > > 1, The QEMU iothread creates a request and dispatches it to an
> > > > > NVMe-oF IO queue thread through a lockless list.
> > > > > 2, The QEMU iothread then kicks the NVMe-oF IO queue thread.
> > > > > 3, The NVMe-oF IO queue thread processes the request and returns the
> > > > > response to the QEMU iothread.
> > > > >
> > > > > Once the QEMU iothread hits its limit, 4 NVMe-oF IO queues give
> > > > > better performance.
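A minimal sketch of steps 1 and 2 in that list, purely for illustration: a
lock-free push onto a singly linked pending list from the QEMU iothread plus an
eventfd kick that the NVMe-oF IO queue thread sleeps on. This is not the actual
libnvmf code; the nvmf_request/nvmf_queue names, the nvmf_queue_init and
nvmf_queue_submit helpers, and the eventfd-based kick are assumptions made for
the example.

#include <stdatomic.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct nvmf_request {
    struct nvmf_request *next;               /* link for the lockless list */
    /* ... NVMe command, data buffers, completion callback ... */
};

struct nvmf_queue {
    _Atomic(struct nvmf_request *) pending;  /* head of the lockless list */
    int kick_fd;                             /* eventfd the queue thread sleeps on */
};

/* One-time setup: the IO queue thread will block in read() on kick_fd. */
static int nvmf_queue_init(struct nvmf_queue *q)
{
    atomic_init(&q->pending, NULL);
    q->kick_fd = eventfd(0, EFD_CLOEXEC);
    return q->kick_fd < 0 ? -1 : 0;
}

/* Called from the QEMU iothread: push one request lock-free, then kick. */
static void nvmf_queue_submit(struct nvmf_queue *q, struct nvmf_request *req)
{
    struct nvmf_request *old = atomic_load_explicit(&q->pending,
                                                    memory_order_relaxed);
    do {
        req->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&q->pending, &old, req,
                                                    memory_order_release,
                                                    memory_order_relaxed));

    /* Wake the NVMe-oF IO queue thread; a redundant kick is harmless. */
    uint64_t one = 1;
    if (write(q->kick_fd, &one, sizeof(one)) < 0) {
        /* only fails if the eventfd counter would overflow; ignore here */
    }
}

The submit path never blocks: the compare-and-swap loop only retries if another
submission raced in, and the eventfd write is a cheap wakeup that does no harm
if the queue thread is already running.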
> > > > 
> > > > Can you explain this bottleneck? Even with 4 NVMe-oF IO queues there is
> > > > still just 1 IOThread submitting requests, so why are 4 IO queues faster
> > > > than 1?
> > > > 
> > > > Stefan
> > > > 
> > > 
> > > The QEMU + libiscsi solution uses the iothread to send/recv TCP and
> > > process iSCSI PDUs directly; it gets about 60K IOPS. Let's look at the
> > > perf report of the iothread:
> > > +   35.06%      [k] entry_SYSCALL_64_after_hwframe
> > > +   33.13%      [k] do_syscall_64
> > > +   19.70%      [.] 0x0000000100000000
> > > +   18.31%      [.] __libc_send
> > > +   18.02%      [.] iscsi_tcp_service
> > > +   16.30%      [k] __x64_sys_sendto
> > > +   16.24%      [k] __sys_sendto
> > > +   15.69%      [k] sock_sendmsg
> > > +   15.56%      [k] tcp_sendmsg
> > > +   14.25%      [k] __tcp_transmit_skb
> > > +   13.94%      [k] 0x0000000000001000
> > > +   13.78%      [k] tcp_sendmsg_locked
> > > +   13.67%      [k] __ip_queue_xmit
> > > +   13.00%      [k] tcp_write_xmit
> > > +   12.07%      [k] __tcp_push_pending_frames
> > > +   11.91%      [k] inet_recvmsg
> > > +   11.78%      [k] tcp_recvmsg
> > > +   11.73%      [k] ip_output
> > > 
> > > The bottleneck in this case is TCP, so libnvmf dispatches requests to
> > > other threads through a lockless list to take the TCP overhead off the
> > > iothread. This makes it more effective at processing requests from the
> > > guest.
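Continuing the earlier sketch (same illustrative nvmf_queue/nvmf_request types
and eventfd kick, still not the real libnvmf internals), the consuming side of
that lockless list could look like this: the IO queue thread blocks on the
eventfd, grabs the whole pending list with one atomic exchange, and does the
NVMe/TCP send on its own socket, which keeps the kernel TCP work shown in the
perf report above off the QEMU iothread.

/* Run in the NVMe-oF IO queue thread, e.g. started with pthread_create()
 * and passed the struct nvmf_queue from the earlier sketch. */
static void *nvmf_queue_thread(void *opaque)
{
    struct nvmf_queue *q = opaque;

    for (;;) {
        uint64_t kicks;

        /* Block until the iothread kicks us; the read resets the counter. */
        if (read(q->kick_fd, &kicks, sizeof(kicks)) < 0) {
            continue;   /* interrupted by a signal, just retry */
        }

        /* Take everything submitted so far in a single atomic exchange. */
        struct nvmf_request *req =
            atomic_exchange_explicit(&q->pending, NULL, memory_order_acquire);

        /* The list comes back newest-first; a real implementation would
         * reverse it to preserve submission order before sending. */
        while (req) {
            struct nvmf_request *next = req->next;
            /* Encapsulate into an NVMe/TCP PDU, send it on this thread's
             * socket, then complete back to the iothread, e.g. via a
             * hypothetical nvmf_queue_send_pdu(q, req) helper. */
            req = next;
        }
    }
    return NULL;
}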
> > 
> > Are IOThread %usr and %sys CPU utilization close to 100%?
> > 
> > Stefan
> > 
> 
> Yes.

I'm surprised it's so low.

This will become more complicated when the QEMU block layer gains
multi-queue support. The virtio-blk virtqueues can then have dedicated
IOThreads, but the data you posted suggests they will still be limited by
host CPU. So maybe extra NVMe IO queues will still be desirable on top
of the multi-queue QEMU block layer.

Stefan


