Re: [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O

From: Artem Pisarenko
Subject: Re: [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O
Date: Mon, 10 Sep 2018 21:06:55 +0600

It looks like things are even worse. The guest demonstrates strange timings
even without access to anything external to the machine. I've added Paolo
Bonzini to CC, because the issue looks related to the cpu/tcg/memory stuff.

I've written a simple test script that runs parallel 'dd' processes
operating on files located in RAM, on a QEMU machine with multiple vCPUs.
Moreover, the machine has a separate NUMA node for each vCPU.
The script in brief: it accepts the desired process count as an argument;
for each process it mounts a tmpfs, bound to that node's memory, and runs
'dd', bound to both the node's CPU and memory, which copies files located
on that tmpfs.
The expectation is that the overall execution time of N parallel processes
(or the copying speed) stays the same regardless of N (provided, of course,
that N <= nodes_count and 'dd' is single-threaded), because each process is
just a simple loop of instructions loading and storing values in memory
local to its CPU. No shared resources should be involved, neither software
(such as a target OS lock/mutex) nor hardware (such as the memory bus). It
should be almost ideal parallelization.
But performance not only degrades as N increases, it degrades
proportionally to N! The same test run on the host machine (just multicore,
no NUMA) shows the expected result: there is degradation (because of the
shared memory bus), but with a non-linear dependency on N.

Script ("test.sh"):
    #!/bin/bash
    N=$1

    # Use NUMA binding only when numactl is available
    USE_NUMA_BIND=0
    if command -v numactl >/dev/null; then USE_NUMA_BIND=1; fi

    # Preparation...
    for i in $(seq 0 $((N - 1))); do
      mkdir -p /mnt/testmnt_$i
      TMPFS_EXTRA_OPT=""
      if [[ "$USE_NUMA_BIND" == 1 ]]; then TMPFS_EXTRA_OPT=",mpol=bind:$i"; fi
      mount -t tmpfs -o size=25M,noatime,nodiratime,norelatime$TMPFS_EXTRA_OPT tmpfs /mnt/testmnt_$i
      dd if=/dev/zero of=/mnt/testmnt_$i/testfile_r bs=10M count=1 >/dev/null 2>&1
    done

    # Running...
    for i in $(seq 0 $((N - 1))); do
      PREFIX_RUN=""
      if [[ "$USE_NUMA_BIND" == 1 ]]; then PREFIX_RUN="numactl --cpunodebind=$i --membind=$i"; fi
      $PREFIX_RUN dd if=/mnt/testmnt_$i/testfile_r of=/mnt/testmnt_$i/testfile_w bs=100 count=100000 2>&1 | sed -n 's/^.*, \(.*\)$/\1/p' &
    done
    wait

    # Cleanup...
    for i in $(seq 0 $((N - 1))); do umount /mnt/testmnt_$i; done
    rm -rf /mnt/testmnt_*
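The sed filter in the running stage just extracts the final throughput field from dd's summary line. A quick illustration (the sample line below mimics dd's output format):

```shell
# Greedy '.*, ' matches up to the last comma, so \1 captures only the trailing
# "X MB/s" field of dd's status line.
echo "10485760 bytes (10 MB) copied, 0.5 s, 20.0 MB/s" | sed -n 's/^.*, \(.*\)$/\1/p'
# prints: 20.0 MB/s
```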

Corresponding QEMU command line fragment:
    "-machine accel=tcg -m 2048 -icount 1,sleep=off -rtc clock=vm -smp 10
-cpu qemu64 -numa node -numa node -numa node -numa node -numa node -numa
node -numa node -numa node -numa node -numa node"
(Removing -icount or the NUMA nodes doesn't change the results.)
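For completeness, the same ten-node topology can also be spelled with explicit CPU assignment, which makes the one-vCPU-per-node intent visible (assuming a QEMU version that accepts the cpus= suboption of -numa; this is just an equivalent way to write the fragment above, not a behavior change):

```shell
# One NUMA node per vCPU, written out explicitly (nodes 2..8 elided the same way)
qemu-system-x86_64 -machine accel=tcg -m 2048 -icount 1,sleep=off -rtc clock=vm \
    -smp 10 -cpu qemu64 \
    -numa node,cpus=0 -numa node,cpus=1 ... -numa node,cpus=9
```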

Example runs on my Intel Core i7-7700 host (adequate results):
  address@hidden:~$ sudo ./test.sh 1
  117 MB/s
  address@hidden:~$ sudo ./test.sh 10
  91.1 MB/s
  89.3 MB/s
  90.4 MB/s
  85.0 MB/s
  68.7 MB/s
  63.1 MB/s
  62.0 MB/s
  55.9 MB/s
  54.1 MB/s
  56.0 MB/s

Example runs on my tiny linux x86_64 guest (strange results):
  address@hidden:~# ./test.sh 1
  17.5 MB/s
  address@hidden:~# ./test.sh 10
  3.2 MB/s
  2.7 MB/s
  2.6 MB/s
  2.0 MB/s
  2.0 MB/s
  1.9 MB/s
  1.8 MB/s
  1.8 MB/s
  1.8 MB/s
  1.8 MB/s
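A back-of-the-envelope check of the numbers above makes the contrast clear: on the host the aggregate throughput grows with N (117 MB/s for one process vs. roughly 715 MB/s total for ten), whereas in the guest the aggregate stays essentially flat (17.5 MB/s vs. roughly 21.6 MB/s total), i.e. per-process speed falls almost exactly by a factor of N:

```shell
# Sum the ten per-process rates reported in the runs above (host, then guest).
host_total=$(echo "91.1 89.3 90.4 85.0 68.7 63.1 62.0 55.9 54.1 56.0" | tr ' ' '\n' | awk '{s += $1} END {print s}')
guest_total=$(echo "3.2 2.7 2.6 2.0 2.0 1.9 1.8 1.8 1.8 1.8" | tr ' ' '\n' | awk '{s += $1} END {print s}')
echo "host:  N=1 -> 117 MB/s,  N=10 -> $host_total MB/s aggregate"
echo "guest: N=1 -> 17.5 MB/s, N=10 -> $guest_total MB/s aggregate"
```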

Please explain these results. Or maybe I'm wrong and this is normal?

On Thu, Sep 6, 2018 at 16:24, Artem Pisarenko <address@hidden> wrote:

> Hi all,
> I'm developing a paravirtualized target linux system which runs multiple
> linux containers (LXC) inside itself. (For those unfamiliar with LXC:
> simply put, it's an isolated group of userspace processes with their own
> rootfs.) Each container should be given access to its rootfs located on
> the host, and execution of the container should be deterministic. In
> particular, this means that container I/O operations must be synchronized
> within some predefined quantum of guest _virtual_ time, i.e. its I/O
> activity shouldn't be delayed by host performance or by activity on the
> host and in other containers. In other words, the guest should see either
> infinite throughput and zero latency, or some predefined
> throughput/latency characteristics guaranteed per rootfs.
> While other sources of non-determinism seem to be eliminated (using TCG,
> -icount, etc.), asynchronous I/O still introduces it.
> What is the scope of the "(asynchronous) I/O" term within qemu? Is it
> something related to the block device layer only, or a generic term
> covering the whole datapath between vCPU and backend?
> If it relates to block devices only, does usage of VirtFS guarantee
> deterministic access, or does it still involve some asynchrony relative
> to the guest virtual clock?
> Is it possible to force asynchronous I/O within qemu to be blocking by
> some external means (host OS configuration, hooks, etc.)? I know it may
> greatly slow down guest performance, but it's still better than nothing.
> Maybe some trivial patch could be made to the qemu code at the virtio,
> block backend or platform syscalls level?
> Maybe I/O automatically (and guaranteedly) falls back to synchronous mode
> in some particular configurations, such as using a block device with the
> image located on tmpfs in RAM (either directly or via an overlay fs)? If
> so, that's great!
> Or maybe some other solution exists?
> The main problem is to organize access from the guest linux to some file
> system on the host (directory, mount point, image file... doesn't matter)
> in a deterministic manner.
> The secondary problem is to optimize performance as much as possible by:
> - avoiding unnecessary overheads (e.g. using the virtio infrastructure,
> preferring virtfs over a blk device, etc.);
> - allowing some asynchrony within the defined quantum of time (e.g.
> 10ms), i.e. I/O order and speed are free to float within each quantum's
> borders, while the result seen by the guest at the end of the quantum is
> always the same.
> Actually, what I'm trying to achieve is the direct opposite of what most
> people want, because synchronous I/O degrades performance in the vast
> majority of usage scenarios.
> Does anyone have any thoughts on this?
> Best regards,
>   Artem Pisarenko

Best regards,
  Artem Pisarenko
