qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path


From: Philippe Mathieu-Daudé
Subject: Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
Date: Thu, 5 Jun 2025 18:16:58 +0200
User-agent: Mozilla Thunderbird

On 5/6/25 14:50, Stefan Hajnoczi wrote:
On Thu, Jun 05, 2025 at 01:28:49PM +0200, Philippe Mathieu-Daudé wrote:
On 5/6/25 10:34, Daniel P. Berrangé wrote:
On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
used to zero local variables. While this reduces security risks
associated with uninitialized stack data, it introduced a measurable
bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
functions.

These virtqueue functions are in the hot path. They are called for each
element (request) that is popped from a VIRTIO device's virtqueue. Using
__attribute__((uninitialized)) on large stack variables in these
functions improves fio randread bs=4k iodepth=64 performance from 304k
to 332k IOPS (+9%).

IIUC, the 'hwaddr addr' variable is 8k in size, and the 'struct iovec iov'
array is 16k in size, so we have 24k on the stack that we're clearing and
then later writing the real value. Makes sense that this would have a
perf impact in a hotpath.

This issue was found using perf-top(1). virtqueue_split_pop() was one of
the top CPU consumers and the "annotate" feature showed that the memory
zeroing instructions at the beginning of the functions were hot.

When you say you found it with 'perf-top' was that just discovered by
accident, or was this usage of perf-top in response to users reporting
a performance degradation vs earlier QEMU ?

Would it make sense to move these to VirtQueue (since the structure
definition is local anyway)?

-- >8 --
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 85110bce374..b96c6ec603c 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -153,6 +153,12 @@ struct VirtQueue
      EventNotifier host_notifier;
      bool host_notifier_enabled;
      QLIST_ENTRY(VirtQueue) node;
+
+    /* Only used by virtqueue_pop() */
+    struct {
+        hwaddr addr[VIRTQUEUE_MAX_SIZE];
+        struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    } pop;

This is an alternative. Using g_alloca() is another alternative.

Not a lot of these:

$ git grep -w g_alloca
backends/tpm/tpm_emulator.c:136:        buf = g_alloca(n);
tests/unit/test-char.c:1012:        be = g_alloca(sizeof(CharBackend));

The tpm_emulator.c use could be replaced by g_autofree g_malloc.

I chose __attribute__((uninitialized)) because it clearly documents the
reason why these variables need special treatment. In your patch the
"Only used by virtqueue_pop()" comment isn't enough to explain why these
variables should be located here. Someone might accidentally move them
back into virtqueue_pop() functions in the future if they are unaware of
the reason.

The only safe-net is a better comment.

I'm happy to change approaches based on the pros/cons. Why do you prefer
moving the local variables into VirtQueue?

I don't have a particular preference, I'm just wondering why these
vars have to be handled differently than the rest, by introducing
QEMU_UNINITIALIZED.

Anyway, no objection to this patch :)

Regards,

Phil.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]