qemu-devel

Re: [RFC PATCH 0/6] eBPF RSS support for virtio-net


From: Yuri Benditovich
Subject: Re: [RFC PATCH 0/6] eBPF RSS support for virtio-net
Date: Tue, 10 Nov 2020 10:00:35 +0200



On Tue, Nov 10, 2020 at 4:23 AM Jason Wang <jasowang@redhat.com> wrote:

On 2020/11/9 9:33 PM, Yuri Benditovich wrote:
>
>
> On Mon, Nov 9, 2020 at 4:14 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
>     On 2020/11/5 11:13 PM, Yuri Benditovich wrote:
>     > First of all, thank you for all your feedback
>     >
>     > Please help me to summarize and let us understand better what we
>     do in v2:
>     > Major questions are:
>     > 1. Building eBPF from source during qemu build vs. regenerating
>     it on
>     > demand and keeping in the repository
>     > Solution 1a (~ as in v1): keep instructions or ELF in H file,
>     generate
>     > it out of qemu build. In general we'll need to have BE and LE
>     binaries.
>     > Solution 1b: build ELF or instructions during QEMU build if llvm +
>     > clang exist. Then we will have only one (BE or LE, depending on
>     > current QEMU build)
>     > We agree with any solution - I believe you know the requirements
>     better.
>
>
>     I think we can go with 1a. (See Daniel's comment)
>
>
>     >
>     > 2. Use libbpf or not
>     > In general we do not see any advantage of using libbpf. It works
>     with
>     > object files (does ELF parsing at time of loading), but it does
>     not do
>     > any magic.
>     > Solution 2a. Switch to libbpf, generate object files (LE and BE)
>     from
>     > source, keep them inside QEMU (~8k each) or aside
>
>
>     Can we simply use dynamic linking here?
>
>
> Can you please explain, where exactly you suggest to use dynamic linking?


Yes. If I understand your 2a properly, you meant static linking of
libbpf. So what I want to ask is the possibility of dynamic linking of
libbpf here.


As Daniel explained above, QEMU is always linked dynamically against libraries.
Also, I see the libbpf package does not even contain the static library.
If the build environment contains libbpf, libbpf.so becomes a runtime dependency, just as with other libs.
 

>
>     > Solution 2b. (as in v1) Use python script to parse object ->
>     > instructions (~2k each)
>     > We'd prefer not to use libbpf at the moment.
>     > If due to some unknown reason we'll find it useful in future, we
>     can
>     > switch to it, this does not create any incompatibility. Then
>     this will
>     > create a dependency on libbpf.so
>
>
>     I think we need to care about compatibility. E.g. we need to enable
>     BTF, and I don't know how hard it would be to add BTF support to the
>     current design. It would probably be OK if it's not a lot of effort.
>
>
> As far as we understand BTF helps in BPF debugging and libbpf supports
> it as is.
> Without libbpf we in v1 load the BPF instructions only.
> If you think the BTF is mandatory (BTW, why?) I think it is better to
> switch to libbpf and keep the entire ELF in the qemu data.


It is used to make sure the BPF program can be compiled once and run everywhere.

This is explained in detail here:
https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html.


Thank you, then there is no question, we need to use libbpf.
 
Thanks


>
>
>     >
>     > 3. Keep instructions or ELF inside QEMU or as separate external file
>     > Solution 3a (~as in v1): Built-in array of instructions or ELF.
>     If we
>     > generate them out of QEMU build - keep 2 arrays or instructions
>     or ELF
>     > (BE and LE),
>     > Solution 3b: Install them as separate files (/usr/share/qemu).
>     > We'd prefer 3a:
>     >  Then there is a guarantee that the eBPF is built with exactly the
>     > same config structures as QEMU (qemu creates a mapping of its
>     > structures, eBPF uses them).
>     >  No need to take care of scenarios like 'file not found', 'file
>     is not
>     > suitable' etc
>
>
>     Yes, let's go 3a for upstream.
>
>
>     >
>     > 4. Is there some real request to have the eBPF for big-endian?
>     > If no, we can enable eBPF only for LE builds
>
>
>     We can go with LE first.
>
>     Thanks
>
>
>     >
>     > Jason, Daniel, Michael
>     > Can you please let us know what you think and why?
>     >
>     > On Thu, Nov 5, 2020 at 3:19 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>     >
>     >     On Thu, Nov 05, 2020 at 10:01:09AM +0000, Daniel P. Berrangé
>     wrote:
>     >     > On Thu, Nov 05, 2020 at 11:46:18AM +0800, Jason Wang wrote:
>     >     > >
>     >     > > On 2020/11/4 5:31 PM, Daniel P. Berrangé wrote:
>     >     > > > On Wed, Nov 04, 2020 at 10:07:52AM +0800, Jason Wang
>     wrote:
>     >     > > > > On 2020/11/3 6:32 PM, Yuri Benditovich wrote:
>     >     > > > > >
>     >     > > > > > On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com> wrote:
>     >     > > > > >
>     >     > > > > >
>     >     > > > > >      On 2020/11/3 2:51 AM, Andrew Melnychenko wrote:
>     >     > > > > >      > Basic idea is to use eBPF to calculate and
>     steer
>     >     packets in TAP.
>     >     > > > > >      > RSS(Receive Side Scaling) is used to distribute
>     >     network packets
>     >     > > > > >      to guest virtqueues
>     >     > > > > >      > by calculating packet hash.
>     >     > > > > >      > eBPF RSS allows us to use RSS with vhost TAP.
>     >     > > > > >      >
>     >     > > > > >      > This set of patches introduces the usage of
>     eBPF
>     >     for packet steering
>     >     > > > > >      > and RSS hash calculation:
>     >     > > > > >      > * RSS(Receive Side Scaling) is used to
>     distribute
>     >     network packets to
>     >     > > > > >      > guest virtqueues by calculating packet hash
>     >     > > > > >      > * eBPF RSS is supposed to be faster than the already
>     >     > > > > >      > existing 'software' implementation in QEMU
>     >     > > > > >      > * Additionally adding support for the usage of
>     >     RSS with vhost
>     >     > > > > >      >
>     >     > > > > >      > Supported kernels: 5.8+
>     >     > > > > >      >
>     >     > > > > >      > Implementation notes:
>     >     > > > > >      > Linux TAP TUNSETSTEERINGEBPF ioctl was used to
>     >     set the eBPF program.
>     >     > > > > >      > Added eBPF support to qemu directly through a
>     >     system call, see the
>     >     > > > > >      > bpf(2) for details.
>     >     > > > > >      > The eBPF program is part of the qemu and
>     >     presented as an array
>     >     > > > > >      of bpf
>     >     > > > > >      > instructions.
>     >     > > > > >      > The program can be recompiled by provided
>     >     Makefile.ebpf(need to
>     >     > > > > >      adjust
>     >     > > > > >      > 'linuxhdrs'),
>     >     > > > > >      > although it's not required to build QEMU with
>     >     eBPF support.
>     >     > > > > >      > Added changes to virtio-net and vhost, primarily
>     >     eBPF RSS is used.
>     >     > > > > >      > 'Software' RSS used in the case of hash
>     >     population and as a
>     >     > > > > >      fallback option.
>     >     > > > > >      > For vhost, the hash population feature is not
>     >     reported to the guest.
>     >     > > > > >      >
>     >     > > > > >      > Please also see the documentation in PATCH 6/6.
>     >     > > > > >      >
>     >     > > > > >      > I am sending those patches as RFC to
>     initiate the
>     >     discussions
>     >     > > > > >      and get
>     >     > > > > >      > feedback on the following points:
>     >     > > > > >      > * Fallback when eBPF is not supported by
>     the kernel
>     >     > > > > >
>     >     > > > > >
>     >     > > > > >      Yes, and it could also be a lack of CAP_BPF.
>     >     > > > > >
>     >     > > > > >
>     >     > > > > >      > * Live migration to the kernel that doesn't
>     have
>     >     eBPF support
>     >     > > > > >
>     >     > > > > >
>     >     > > > > >      Is there anything that needs special treatment here?
>     >     > > > > >
>     >     > > > > > Possible case: rss=on, vhost=on, source system with
>     >     kernel 5.8
>     >     > > > > > (everything works) -> dest. system 5.6 (bpf does not
>     >     work), the adapter
>     >     > > > > > functions, but all the steering does not use
>     proper queues.
>     >     > > > >
>     >     > > > > Right, I think we need to disable vhost on dest.
>     >     > > > >
>     >     > > > >
>     >     > > > > >
>     >     > > > > >
>     >     > > > > >      > * Integration with current QEMU build
>     >     > > > > >
>     >     > > > > >
>     >     > > > > >      Yes, a question here:
>     >     > > > > >
>     >     > > > > >      1) Any reason for not using libbpf, e.g it
>     has been
>     >     shipped with some
>     >     > > > > >      distros
>     >     > > > > >
>     >     > > > > >
>     >     > > > > > We intentionally do not use libbpf, as it is present only
>     >     on some distros.
>     >     > > > > > We can switch to libbpf, but this will disable bpf if
>     >     libbpf is not
>     >     > > > > > installed
>     >     > > > >
>     >     > > > > That's better I think.
>     >     > > > >
>     >     > > > >
>     >     > > > > >      2) It would be better if we can avoid shipping
>     >     bytecodes
>     >     > > > > >
>     >     > > > > >
>     >     > > > > >
>     >     > > > > > This creates new dependencies: llvm + clang + ...
>     >     > > > > > We would prefer byte code and ability to generate
>     it if
>     >     prerequisites
>     >     > > > > > are installed.
>     >     > > > >
>     >     > > > > It's probably ok if we treat the bytecode as a kind of
>     >     firmware.
>     >     > > > That is explicitly *not* OK for inclusion in Fedora. They
>     >     require that
>     >     > > > BPF is compiled from source, and rejected my
>     suggestion that
>     >     it could
>     >     > > > be considered a kind of firmware and thus have an
>     exception
>     >     from building
>     >     > > > from source.
>     >     > >
>     >     > >
>     >     > > Please refer to what was done in DPDK:
>     >     > >
>     >     > > http://git.dpdk.org/dpdk/tree/doc/guides/nics/tap.rst#n235
>     >     > >
>     >     > > I don't think what is proposed here is any different.
>     >     >
>     >     > I'm not convinced that what DPDK does is acceptable to
>     Fedora either
>     >     > based on the responses I've received when asking about BPF
>     handling
>     >     > during build.  It wouldn't surprise me, however, if this was
>     simply
>     >     > missed by reviewers when accepting DPDK into Fedora,
>     because it is
>     >     > not entirely obvious unless you are looking closely.
>     >
>     >     FWIW, I'm pushing back against the idea that we have to
>     compile the
>     >     BPF code from master source, as I think it is reasonable to
>     have the
>     >     program embedded as a static array in the source code
>     similar to what
>     >     DPDK does.  It doesn't feel much different from other places
>     where
>     >     apps
>     >     use generated sources, and don't build them from the
>     original source
>     >     every time. eg "configure" is never re-generated from
>     >     "configure.ac" by Fedora packagers, they just use the generated
>     >     "configure" script as-is.
>     >
>     >     Regards,
>     >     Daniel
>     >     --
>     >     |: https://berrange.com     -o-
>     > https://www.flickr.com/photos/dberrange :|
>     >     |: https://libvirt.org        -o-
>     https://fstop138.berrange.com :|
>     >     |: https://entangle-photo.org   -o-
>     > https://www.instagram.com/dberrange :|
>     >
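For readers following solutions 2b and 3a above: a minimal Python sketch of what "parse object -> instructions" could look like once the raw bytes of the program section are already in hand. This is not the script from the v1 series. The function names and the array name rss_bpf_insns are made up for illustration, the input is assumed to be little-endian eBPF bytecode, and ELF parsing and map relocations are deliberately left out.

```python
import struct


def decode_insns(raw: bytes):
    """Split raw little-endian eBPF bytecode into 8-byte instruction fields."""
    if len(raw) % 8:
        raise ValueError("eBPF bytecode length must be a multiple of 8")
    insns = []
    # Layout of struct bpf_insn: u8 code, u8 regs (dst low nibble,
    # src high nibble on little-endian), s16 off, s32 imm.
    for code, regs, off, imm in struct.iter_unpack("<BBhi", raw):
        insns.append({
            "code": code,
            "dst_reg": regs & 0x0F,
            "src_reg": regs >> 4,
            "off": off,
            "imm": imm,
        })
    return insns


def to_c_array(name: str, insns) -> str:
    """Render decoded instructions as a C initializer for a header file."""
    lines = [f"struct bpf_insn {name}[] = {{"]
    for insn in insns:
        lines.append(
            "    {{0x{code:02x}, {dst_reg}, {src_reg}, {off}, {imm}}},".format(**insn)
        )
    lines.append("};")
    return "\n".join(lines)


# Hypothetical two-instruction program: "r0 = 0; exit".
raw = bytes.fromhex("b700000000000000" "9500000000000000")
header_text = to_c_array("rss_bpf_insns", decode_insns(raw))
```

An array emitted this way is only valid for the byte order it was decoded with, which is why the BE/LE question in point 4 matters for a built-in array.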
>
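The cover letter quoted above describes RSS as distributing packets to guest virtqueues by calculating a packet hash. A minimal Python sketch of that idea follows, assuming the Toeplitz hash commonly used for RSS and an indirection table whose size is a power of two. The function names are illustrative only, and this is not the eBPF program from the series.

```python
def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: XOR the 32-bit key window that starts at
    each set bit of the input, scanning the input MSB first."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least len(data) + 4 bytes long")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        byte_index, bit_index = divmod(i, 8)
        if data[byte_index] & (0x80 >> bit_index):
            # Take the 32 key bits starting at bit offset i.
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def select_queue(hash_value: int, indirection_table) -> int:
    """Map a hash to a queue index; table size is assumed to be a power of two."""
    return indirection_table[hash_value & (len(indirection_table) - 1)]


# Illustrative values only: a 40-byte key and a 12-byte stand-in for the
# concatenated source/destination addresses and ports of a packet.
key = bytes(range(1, 41))
flow = bytes([10, 0, 0, 1, 10, 0, 0, 2, 0x04, 0xD2, 0x00, 0x50])
queue = select_queue(toeplitz_hash(key, flow), [0, 1, 2, 3])
```

Because the hash depends only on the flow fields, every packet of one flow lands on the same queue, whether the calculation runs in QEMU's software path or in an eBPF program attached to the TAP device.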

