From: Jason Wang
Subject: Re: [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
Date: Wed, 30 Nov 2016 19:12:44 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0

On 2016-11-30 16:55, Wei Xu wrote:
On 2016-11-24 12:17, Jason Wang wrote:

On 2016-11-01 01:41, address@hidden wrote:
From: Wei Xu <address@hidden>

All the data packets of a TCP connection are cached in a single
buffer during each receive interval and are sent out when a timer
fires. 'virtio_net_rsc_timeout' controls the interval; this value
may affect both the performance and the response time of a TCP
connection. 50000 (50us) is an empirical value that gives a
performance improvement. Since the WHQL test sends packets every
100us, '300000' (300us) passes the test case, and it is also the
default value. Tune it via the 'rsc_interval' command line parameter
of the 'virtio-net-pci' device; for example, to launch a guest with
the interval set to '500000':


The timer is triggered only if the packet pool is not empty, and it
drains all the cached packets.

'NetRscChain' is used to save the segments of IPv4/6 in a
VirtIONet device.

A new segment becomes a 'Candidate' as soon as it passes the sanity
check. The main TCP handler covers TCP window updates, duplicated
ACK checks and the actual data coalescing.

A 'Candidate' segment means:
1. Segment is within current window and the sequence is the expected one.
2. 'ACK' of the segment is in the valid window.

The sanity check covers:
1. Incorrect version in the IP header
2. IP options or an IP fragment
3. Not a TCP packet
4. Size sanity check to prevent a buffer overflow attack
5. An ECN packet
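
The checks listed above could look roughly like the following sketch,
which works on the raw bytes of an IPv4 header. It is an illustration
only, not the patch's implementation: ip4_check_result and
rsc_ip4_sanity_check() are hypothetical names, "ECN" is assumed to
mean the two ECN bits of the IP TOS byte, and the size check assumes
the packet must hold at least a 20-byte IP header plus a 20-byte TCP
header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    IP4_OK,
    IP4_BAD_VERSION,
    IP4_HAS_OPTIONS,
    IP4_FRAGMENT,
    IP4_NOT_TCP,
    IP4_BAD_SIZE,
    IP4_ECN
} ip4_check_result;

/* 'ip' points at the IPv4 header, 'len' is the number of bytes
 * available from there to the end of the packet. */
static ip4_check_result rsc_ip4_sanity_check(const uint8_t *ip, size_t len)
{
    assert(ip != NULL);

    if (len < 20) {
        return IP4_BAD_SIZE;          /* too short to hold an IP header */
    }
    if ((ip[0] >> 4) != 4) {
        return IP4_BAD_VERSION;
    }
    if ((ip[0] & 0x0F) != 5) {
        return IP4_HAS_OPTIONS;       /* IHL != 5 words: options or junk */
    }
    uint16_t frag = (uint16_t)((ip[6] << 8) | ip[7]);
    if (frag & 0x3FFF) {
        return IP4_FRAGMENT;          /* MF set or non-zero offset */
    }
    if (ip[9] != 6) {
        return IP4_NOT_TCP;           /* protocol 6 is TCP */
    }
    uint16_t total = (uint16_t)((ip[2] << 8) | ip[3]);
    if (total < 40 || total > len) {
        return IP4_BAD_SIZE;          /* header claims more than we have */
    }
    if (ip[1] & 0x03) {
        return IP4_ECN;               /* ECN bits of the TOS byte */
    }
    return IP4_OK;
}
```

Returning a distinct code per failure makes it easy to bump a matching
statistics counter in the caller.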

More cases might still need to be considered, such as the IP
identification field and other flags, but checking them breaks the
test because Windows sets the identification to the same value even
when the packet is not a fragment.

There are normally two ways to handle a TCP control flag: 'bypass'
and 'finalize'. 'bypass' means the packet should be sent out
directly. 'finalize' means the packet should also be bypassed, but
only after searching the pool for packets of the same connection and
draining all of them; this avoids out-of-order segments.

All 'SYN' packets are bypassed, since a 'SYN' always begins a new
connection. Other flags such as 'URG/FIN/RST/CWR/ECE' trigger a
finalization, because they normally appear when a connection is
about to be closed; an 'URG' packet also finalizes the current
coalescing unit.
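
The flag handling described above can be summarized in a small
classifier. This is a sketch under assumptions, not the patch's
handler: rsc_flag_action and rsc_classify_tcp_flags() are hypothetical
names, and the TH_* values are the standard TCP flag bits (TH_URG,
TH_ECE and TH_CWR match the values in include/net/eth.h in this
patch).

```c
#include <assert.h>
#include <stdint.h>

#define TH_FIN 0x01
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_URG 0x20
#define TH_ECE 0x40
#define TH_CWR 0x80

typedef enum {
    FLAG_COALESCE,  /* no special flag: segment may be coalesced */
    FLAG_BYPASS,    /* send to the guest directly */
    FLAG_FINALIZE   /* drain cached packets of the flow, then bypass */
} rsc_flag_action;

static rsc_flag_action rsc_classify_tcp_flags(uint8_t flags)
{
    /* A SYN starts a new connection, so nothing can be cached yet. */
    if (flags & TH_SYN) {
        return FLAG_BYPASS;
    }
    /* These flags end the current coalescing unit. */
    if (flags & (TH_FIN | TH_RST | TH_URG | TH_ECE | TH_CWR)) {
        return FLAG_FINALIZE;
    }
    return FLAG_COALESCE;
}
```

Checking SYN first means a SYN combined with any other flag is still
bypassed rather than finalized.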

Statistics can be used to monitor the basic coalescing status. The
'out of order' and 'out of window' counters show how many packets
were retransmitted, and thus describe the performance intuitively.

Signed-off-by: Wei Xu <address@hidden>
  hw/net/virtio-net.c                         | 602
  include/hw/virtio/virtio-net.h              |   5 +-
  include/hw/virtio/virtio.h                  |  76 ++++
  include/net/eth.h                           |   2 +
  include/standard-headers/linux/virtio_net.h |  14 +
  net/tap.c                                   |   3 +-
  6 files changed, 670 insertions(+), 32 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 06bfe4b..d1824d9 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@
  #include "qemu/iov.h"
  #include "hw/virtio/virtio.h"
  #include "net/net.h"
+#include "net/eth.h"
  #include "net/checksum.h"
  #include "net/tap.h"
  #include "qemu/error-report.h"
  #include "qemu/timer.h"
+#include "qemu/sockets.h"
  #include "hw/virtio/virtio-net.h"
  #include "net/vhost_net.h"
  #include "hw/virtio/virtio-bus.h"
@@ -43,6 +45,24 @@
  #define endof(container, field) \
      (offsetof(container, field) + sizeof(((container *)0)->field))
+#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */

This is only used once in the code, so I don't see much value in this macro.

Just to keep it a bit readable.

Then you may want to replace this with sizeof(struct ...).

+#define VIRTIO_NET_TCP_FLAG         0x3F
+/* IPv4 max payload, 16 bits in the header */
+#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
+/* header length value in ip header without option */
+/* Purge coalesced packets timer interval, This value affects the
+   a lot, and should be tuned carefully, '300000'(300us) is the
+   value to pass the WHQL test, '50000' can gain 2x netperf
throughput with
+   tso/gso/gro 'off'. */
+#define VIRTIO_NET_RSC_INTERVAL  300000

This should be a property of virtio-net, and the above comment can be
the description of the property.

This is a value for a property, actually I hadn't found a place to put

There's a description field in PropertyInfo, but virtio properties may need more work. So we can leave this as is for now.

  typedef struct VirtIOFeature {
      uint32_t flags;
      size_t end;
@@ -589,7 +609,12 @@ static uint64_t
virtio_net_guest_offloads_by_features(uint32_t features)
          (1ULL << VIRTIO_NET_F_GUEST_ECN)  |
          (1ULL << VIRTIO_NET_F_GUEST_UFO);
-    return guest_offloads_mask & features;
+    if (features & VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) {
+        return (guest_offloads_mask & features) |
+               (1ULL << VIRTIO_NET_F_GUEST_RSC4);

Why do we need to care about this? I believe RSC has nothing to do
with the peer's offload setting.

There is some misunderstanding about how the feature works, which
also affects a few subsequent comments, so let me clarify it first.

Currently the RSC feature is bundled with
'VIRTIO_NET_F_CTRL_GUEST_OFFLOADS', which means that once the guest
driver reports support for this feature during driver initialization,
qemu initializes the RSC feature and uses the new header with RSC
fields to communicate with the guest.

Does it mean RSC depends on CTRL_GUEST_OFFLOADS? Are there any advantages?

However, RSC won't coalesce packets before the guest driver notifies
the host to enable it. The motivation is to support turning the
feature on/off dynamically from the guest side without needing a new
feature bit for this.

So from the guest's point of view, it can see the new header but all
packets are still unchanged; once it enables the feature via the
control queue, coalesced packets will arrive on the queue.

I believe disabling it by default should be the job of the driver, not qemu. When RSC is enabled and negotiated, it should start to coalesce packets like other offload features.

+    } else {
+        return guest_offloads_mask & features;
+    }
  static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet
@@ -600,6 +625,7 @@ static inline uint64_t
virtio_net_supported_guest_offloads(VirtIONet *n)
  static void virtio_net_set_features(VirtIODevice *vdev, uint64_t
+    NetClientState *nc;
      VirtIONet *n = VIRTIO_NET(vdev);
      int i;
@@ -612,6 +638,22 @@ static void virtio_net_set_features(VirtIODevice
*vdev, uint64_t features)
+    if (virtio_has_feature(features,
+        n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc);

I'm confused, and don't see the connection here. You check
CTRL_GUEST_OFFLOADS but set RSC header here, I don't think

+        n->host_hdr_len = n->guest_hdr_len;
+        n->has_rsc_hdr = 1;

Why do we need this extra flag? Can't we just check the RSC feature instead?


+        for (i = 0; i < n->max_queues; i++) {
+            nc = qemu_get_subqueue(n->nic, i);
+            if (peer_has_vnet_hdr(n) &&
+                qemu_has_vnet_hdr_len(nc->peer, n->guest_hdr_len)) {
+                qemu_set_vnet_hdr_len(nc->peer, n->guest_hdr_len);
+                n->host_hdr_len = n->guest_hdr_len;
+            }
+        }
+    }

The hdr len setting needs to move to another helper, otherwise it may
be set twice: once for mrg_rxbuf and once for RSC.

Do you know where I should put it?

Introduce a new helper and put all the vnet header check and set logic there instead of doing this twice.
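
A rough sketch of what such a single place for the header length logic
might look like is below. It is purely illustrative:
vnet_guest_hdr_len(), vnet_update_hdr_len() and vnet_hdr_lens are
hypothetical names, the peer query is reduced to a boolean, and the
sizes are derived from the struct layouts (10 bytes for
virtio_net_hdr, 12 for virtio_net_hdr_mrg_rxbuf, 16 for the proposed
virtio_net_hdr_rsc).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    HDR_LEN_BASIC = 10, /* struct virtio_net_hdr */
    HDR_LEN_MRG   = 12, /* struct virtio_net_hdr_mrg_rxbuf / _v1 */
    HDR_LEN_RSC   = 16  /* v1 header + rsc_pkts + rsc_dup_acks */
};

/* Hypothetical stand-in for the two length fields in VirtIONet. */
typedef struct {
    size_t guest_hdr_len;
    size_t host_hdr_len;
} vnet_hdr_lens;

/* Pick the guest-visible header length from the negotiated features. */
static size_t vnet_guest_hdr_len(bool rsc, bool mrg_rxbuf, bool version_1)
{
    if (rsc) {
        return HDR_LEN_RSC;
    }
    if (mrg_rxbuf || version_1) {
        return HDR_LEN_MRG;
    }
    return HDR_LEN_BASIC;
}

/* Set both lengths in one place. 'peer_accepts' models the result of
 * asking the backend whether it supports the chosen length. */
static void vnet_update_hdr_len(vnet_hdr_lens *l, bool rsc, bool mrg_rxbuf,
                                bool version_1, bool peer_accepts)
{
    assert(l != NULL);
    l->guest_hdr_len = vnet_guest_hdr_len(rsc, mrg_rxbuf, version_1);
    if (peer_accepts) {
        l->host_hdr_len = l->guest_hdr_len;
    }
}
```

With one helper, the mrg_rxbuf path and the RSC path cannot disagree
about the length that was last programmed into the peer.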

      if (n->has_vnet_hdr) {
          n->curr_guest_offloads =
@@ -1097,7 +1139,8 @@ static int receive_filter(VirtIONet *n, const
uint8_t *buf, int size)
      return 0;
-static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t
*buf, size_t size)
+static ssize_t virtio_net_do_receive(NetClientState *nc,
+                                     const uint8_t *buf, size_t size)
      VirtIONet *n = qemu_get_nic_opaque(nc);
      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
@@ -1161,6 +1204,12 @@ static ssize_t
virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
              receive_header(n, sg, elem->in_num, buf, size);
+            if (n->has_rsc_hdr) {
+                offset = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+                iov_from_buf(sg, elem->in_num, offset, \
+                             buf + offset, 4);

I don't get why this is needed.

This is to put the RSC fields.

OK, it looks like I can't find the code that stores the RSC fields. You may also want to unify the logic with the mrg rxbuf header copy.

+            }
              offset = n->host_hdr_len;
              total += n->guest_hdr_len;
              guest_offset = n->guest_hdr_len;
@@ -1239,7 +1288,7 @@ static int32_t
virtio_net_flush_tx(VirtIONetQueue *q)
          ssize_t ret;
          unsigned int out_num;
          struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE
+ 1], *out_sg;
-        struct virtio_net_hdr_mrg_rxbuf mhdr;
+        struct virtio_net_hdr_rsc rsc_hdr;
          elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
          if (!elem) {
@@ -1256,26 +1305,27 @@ static int32_t
virtio_net_flush_tx(VirtIONetQueue *q)
          if (n->has_vnet_hdr) {
-            if (iov_to_buf(out_sg, out_num, 0, &mhdr,
n->guest_hdr_len) <
+            if (iov_to_buf(out_sg, out_num, 0, &rsc_hdr,
n->guest_hdr_len) <
                  n->guest_hdr_len) {
                  virtio_error(vdev, "virtio-net header incorrect");
                  virtqueue_detach_element(q->tx_vq, elem, 0);
                  return -EINVAL;

Unnecessary newline.

Forgive my typo, it may have been caused by the indent settings in my vi profile, thanks.

              if (n->needs_vnet_hdr_swap) {
-                virtio_net_hdr_swap(vdev, (void *) &mhdr);
-                sg2[0].iov_base = &mhdr;
+                virtio_net_hdr_swap(vdev, (void *) &rsc_hdr);
+                sg2[0].iov_base = &rsc_hdr;
                  sg2[0].iov_len = n->guest_hdr_len;
                  out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
                                     out_sg, out_num,
                                     n->guest_hdr_len, -1);
                  if (out_num == VIRTQUEUE_MAX_SIZE) {
                      goto drop;
-        }
+                }

Unnecessary change.


                  out_num += 1;
                  out_sg = sg2;
-        }
+            }

Here too.



+    DEFINE_PROP_BIT64("guest_rsc4", VirtIONet, host_features,
+                    VIRTIO_NET_F_GUEST_RSC4, true),

I don't get why DEFINE_XXX_BIT64 is needed; I believe we still have bits left.

+    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
+                      VIRTIO_NET_RSC_INTERVAL),
diff --git a/include/hw/virtio/virtio-net.h
index 0ced975..56a8ce2 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -60,12 +60,15 @@ typedef struct VirtIONet {
      VirtIONetQueue *vqs;
      VirtQueue *ctrl_vq;
      NICState *nic;
+    QTAILQ_HEAD(, NetRscChain) rsc_chains;
+    uint32_t rsc_timeout;
      uint32_t tx_timeout;
      int32_t tx_burst;
      uint32_t has_vnet_hdr;
+    uint32_t has_rsc_hdr;
      size_t host_hdr_len;
      size_t guest_hdr_len;
-    uint32_t host_features;
+    uint64_t host_features;

Have we run out of host features? If so, this needs an independent patch.
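
For context on the width question, the sketch below shows why a
feature bit numbered 41 (the value proposed for
VIRTIO_NET_F_GUEST_RSC4 in this patch) cannot be stored in a 32-bit
host_features field, while a bit such as 24 can. The helper names
feature_fits_u32(), feature_mask() and has_feature() are hypothetical
and only serve this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the feature bit can live in a uint32_t bitmap. */
static bool feature_fits_u32(unsigned bit)
{
    return bit < 32;
}

/* Build a 64-bit mask; 1ULL avoids shifting a 32-bit int too far. */
static uint64_t feature_mask(unsigned bit)
{
    assert(bit < 64);
    return 1ULL << bit;
}

static bool has_feature(uint64_t features, unsigned bit)
{
    return (features & feature_mask(bit)) != 0;
}
```

Truncating a mask with bit 41 set to uint32_t silently drops the
feature, which is why the field width and the bit number have to be
changed together.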


      uint8_t has_ufo;
      int mergeable_rx_bufs;
      uint8_t promisc;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b913aac..0006ce1 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -30,6 +30,8 @@
                                  (0x1ULL << VIRTIO_F_ANY_LAYOUT))
  struct VirtQueue;
+struct VirtIONet;
+typedef struct VirtIONet VirtIONet;
  static inline hwaddr vring_align(hwaddr addr,
                                               unsigned long align)
@@ -129,6 +131,80 @@ typedef struct VirtioDeviceClass {
      int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
  } VirtioDeviceClass;
+/* Coalesced packets type & status */
+typedef enum {
+    RSC_COALESCE,           /* Data been coalesced */

"Data has been" ?


+    RSC_FINAL,              /* Will terminate current connection */
+    RSC_NO_MATCH,           /* No matched in the buffer pool */
+    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp
ctrl, etc */

"to be bypassed" ?


+    RSC_CANDIDATE                /* Data want to be coalesced */
+typedef struct NetRscStat {
+    uint32_t received;
+    uint32_t coalesced;
+    uint32_t over_size;
+    uint32_t cache;
+    uint32_t empty_cache;
+    uint32_t no_match_cache;
+    uint32_t win_update;
+    uint32_t no_match;
+    uint32_t tcp_syn;
+    uint32_t tcp_ctrl_drain;
+    uint32_t dup_ack;
+    uint32_t dup_ack1;
+    uint32_t dup_ack2;
+    uint32_t pure_ack;
+    uint32_t ack_out_of_win;
+    uint32_t data_out_of_win;
+    uint32_t data_out_of_order;
+    uint32_t data_after_pure_ack;
+    uint32_t bypass_not_tcp;
+    uint32_t tcp_option;
+    uint32_t tcp_all_opt;
+    uint32_t ip_frag;
+    uint32_t ip_ecn;
+    uint32_t ip_hacked;
+    uint32_t ip_option;
+    uint32_t purge_failed;
+    uint32_t drain_failed;
+    uint32_t final_failed;
+    int64_t  timer;
+} NetRscStat;
+/* Rsc unit general info used to checking if can coalescing */
+typedef struct NetRscUnit {
+    void *ip;   /* ip header */
+    uint16_t *ip_plen;      /* data len pointer in ip header field */
+    struct tcp_header *tcp; /* tcp header */
+    uint16_t tcp_hdrlen;    /* tcp header len */
+ uint16_t payload; /* pure payload without virtio/eth/ip/tcp */
+} NetRscUnit;
+/* Coalesced segmant */
+typedef struct NetRscSeg {
+    QTAILQ_ENTRY(NetRscSeg) next;
+    void *buf;
+    size_t size;
+    uint16_t packets;
+    uint16_t dup_ack;
+    bool is_coalesced;      /* need recal ipv4 header checksum, mark
here */
+    NetRscUnit unit;
+    NetClientState *nc;
+} NetRscSeg;
+/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
+typedef struct NetRscChain {
+    QTAILQ_ENTRY(NetRscChain) next;
+    VirtIONet *n;                            /* VirtIONet */
+    uint16_t proto;
+    uint8_t  gso_type;
+    uint16_t max_payload;
+    QEMUTimer *drain_timer;
+    QTAILQ_HEAD(, NetRscSeg) buffers;
+    NetRscStat stat;
+} NetRscChain;

Why put the above in virtio.h? If they will not be used by other
files, why do they need to be in a header file?

OK, I will move them to virtio-net.h.

Looks like virtio-net.c is better, no other file needs those.

  void virtio_instance_init_common(Object *proxy_obj, void *data,
                                   size_t vdev_size, const char
diff --git a/include/net/eth.h b/include/net/eth.h
index 2013175..5952ef2 100644
--- a/include/net/eth.h
+++ b/include/net/eth.h
@@ -177,6 +177,8 @@ struct tcp_hdr {
  #define TH_PUSH 0x08
  #define TH_ACK  0x10
  #define TH_URG  0x20
+#define TH_ECE  0x40
+#define TH_CWR  0x80

Let's put this in another patch.


      u_short th_win;      /* window */
      u_short th_sum;      /* checksum */
      u_short th_urp;      /* urgent pointer */
diff --git a/include/standard-headers/linux/virtio_net.h
index 30ff249..e67b36e 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,9 @@
                       * Steering */
  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
+/* Guest can handle coalesced ipv4-tcp packets */
+#define VIRTIO_NET_F_GUEST_RSC4    41

Why not use 24?

  #define VIRTIO_NET_F_GSO    6    /* Host handles pkts w/ any GSO
type */
  #endif /* VIRTIO_NET_NO_LEGACY */
@@ -94,6 +97,9 @@ struct virtio_net_hdr_v1 {
  #define VIRTIO_NET_HDR_GSO_UDP        3    /* GSO frame, IPv4 UDP
(UFO) */
  #define VIRTIO_NET_HDR_GSO_TCPV6    4    /* GSO frame, IPv6 TCP */
  #define VIRTIO_NET_HDR_GSO_ECN        0x80    /* TCP has ECN set */
+#define VIRTIO_NET_HDR_RSC_NONE     5   /* No packets coalesced */

Not sure this is really needed. Can we just use GSO_NONE?

Of course we can, but it is better to keep this feature distinguished.

Are there any advantages to doing this? I believe the guest does not care about this.

And I believe we should not try to coalesce GSO packets since we're
lacking sufficient information for a correct rsc_pkts or rsc_dup_acks
from the backend.

+#define VIRTIO_NET_HDR_RSC_TCPV4    6 /* IPv4 TCP coalesced */
+#define VIRTIO_NET_HDR_RSC_TCPV6    7   /* IPv6 TCP coalesced */
      uint8_t gso_type;
      __virtio16 hdr_len;    /* Ethernet + IP + tcp/udp hdrs */
__virtio16 gso_size; /* Bytes to append to hdr_len per frame */
@@ -124,6 +130,14 @@ struct virtio_net_hdr_mrg_rxbuf {
      struct virtio_net_hdr hdr;
      __virtio16 num_buffers;    /* Number of merged rx buffers */
+/* This is the header to use when either one or both of GUEST_RSC4/6
+ * features have been negotiated. */
+struct virtio_net_hdr_rsc {
+    struct virtio_net_hdr_v1 hdr;

If RSC depends on VERSION_1, need to forbid creating RSC device without

How to do it?

Fail early on device_plugged.

Also a question here: which header will be used if a device is not virtio 1.0 compliant?

Mergeable header is mandatory for 1.0 but selectable for 0.9x.

+    __virtio16 rsc_pkts;        /* Number of coalesced packets */
+    __virtio16 rsc_dup_acks;    /* Duplicated ack packets */
  #endif /* ...VIRTIO_NET_NO_LEGACY */
diff --git a/net/tap.c b/net/tap.c
index b6896a7..4557aa5 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -251,7 +251,8 @@ static void tap_set_vnet_hdr_len(NetClientState
*nc, int len)
      TAPState *s = DO_UPCAST(TAPState, nc, nc);
      assert(nc->info->type == NET_CLIENT_DRIVER_TAP);
-    assert(len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
+    assert(len == sizeof(struct virtio_net_hdr_rsc) ||
+           len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
             len == sizeof(struct virtio_net_hdr));
      tap_fd_set_vnet_hdr_len(s->fd, len);

