Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroe

From:	Denis V. Lunev
Subject:	Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes()
Date:	Tue, 26 Apr 2016 00:00:06 +0300
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1

On 04/25/2016 10:35 PM, Eric Blake wrote:

On 04/25/2016 04:20 AM, Denis V. Lunev wrote:

On 04/25/2016 12:05 PM, Kevin Wolf wrote:

Am 23.04.2016 um 14:05 hat Denis V. Lunev geschrieben:

Unfortunately Linux kernel could send non-aligned requests to qemu-nbd
if the caller is using O_DIRECT and does not align in-memory data to
page. Thus qemu-nbd will call block layer with non-aligned requests.

At first glance, I'd argue that any caller using O_DIRECT without
obeying memory alignment restrictions is broken; why should qemu have to
work around such broken callers?

Memory requirements are followed. Minimal requirement which
is fixed in kernel API is 512 bytes. For here the broken part is kernel
as we have agreed on year ago, which splits requests in a
wrong way.

Unfortunately 3.10 RHEL7 based kernel misbehaves this
way.

You could remember the situation a year ago. The QEMU behaves
exactly in the same way. Usually people start probing memory
alignment for O_DIRECT writes from 512 bytes and kernel
accepts that. This situation is VERY common :(

The program is 100% reproducible. The following sequence
is performed:

#define _GNU_SOURCE

#include <stdio.h>
#include <malloc.h>
#include <string.h>
#include <sys/fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
     char *buf;
     int fd;

     if (argc != 2) {
         return -1;
     }

     fd = open(argv[1], O_WRONLY | O_DIRECT);

     do {
         buf = memalign(512, 1024 * 1024);
     } while (!(unsigned long)buf & (4096 - 1));

In other words, you are INTENTIONALLY grabbing an unaligned buffer for
use with an O_DIRECT fd, when O_DIRECT demands that you should be using
at least page alignment (4096 or greater).  Arguably, the bug is in your
program, not qemu.

Yes, as THIS IS THE TEST to reproduce the problem
observed in the 3rd party software. I have spent
some time narrowing it down to this.

That said, teaching qemu to split up a write_zeroes request into head,
tail, and aligned body, so at least the aligned part might benefit from
optimizations, seems like it might be worthwhile, particularly since my
pending NBD series changed from blk_write_zeroes (cluster-aligned) to
blk_pwrite_zeroes (byte-aligned), and it is conceivable that we  can
encounter a situation without O_DIRECT in the picture at all, where our
NBD server is connected to a client that specifically asks for the new
NBD_CMD_WRITE_ZEROES on any arbitrary byte alignment.

actually NBD server is just an innocent party which transparently
processes requests coming to it.

What will happen if we will write 4k into the middle of the zeroed
block with old code? The system will allocate 64k and write it
down. This code will skip this write at all. Thus the optimization
has some sense on its own.

     memset(buf, 0, 1024 * 1024);
     write(fd, buf, 1024 * 1024);
     return 0;
}

This program is compiled as a.out.

Before the patch:
hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G
Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off
cluster_size=65536 lazy_refcounts=off refcount_bits=16
hades ~/src/qemu $ sudo ./qemu-nbd  --connect=/dev/nbd3 test.qcow2
--detect-zeroes=on --aio=native --cache=none
hades ~/src/qemu $ sudo ./a.out /dev/nbd3

Here, I'd argue that the kernel NBD module is buggy, for allowing a
user-space app to pass an unaligned buffer to an O_DIRECT fd.  But
that's outside the realm of qemu code.

no. The victim is GENERIC block layer code in the kernel.
You could look. Original report in the commit 459b4e661

"The patch changes default bounce buffer optimal alignment to
    MAX(page size, 4k). 4k is chosen as maximal known sector size on real
    HDD.

    The justification of the performance improve is quite interesting.
    From the kernel point of view each request to the disk was split
    by two. This could be seen by blktrace like this:
      9,0   11  1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
      9,0   11  2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
      9,0   11  3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
      9,0   11  4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
      9,0   11  5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
      9,0   11  6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
      9,0   11  7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
      9,0    5  1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
    After the patch the pattern becomes normal:
      9,0    6  1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
      9,0    6  2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
      9,0    6  3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
      9,0    6  4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]"

was made against ext4 and xfs filesystem. Once again, the problem
is in the block layer of the kernel itself.

But again, per my argument that you don't have to involve the kernel nor
O_DIRECT to be able to write a client that can attempt to cause an NBD
server to do unaligned operations, we can use this kernel bug as an
easier way to test any proposed fix to the qemu side of things, whether
or not the kernel module gets tightened in behavior down the road.

Den

[Prev in Thread]

Current Thread

[Next in Thread]

[Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Denis V. Lunev, 2016/04/23
- Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Kevin Wolf, 2016/04/25
  - Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Denis V. Lunev, 2016/04/25
    - Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Eric Blake, 2016/04/25
    - Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Denis V. Lunev <=
    - Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Kevin Wolf, 2016/04/26
    - Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Denis V. Lunev, 2016/04/26
    - Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Kevin Wolf, 2016/04/26
    - Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Denis V. Lunev, 2016/04/27
    - Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Kevin Wolf, 2016/04/27
    - Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes(), Denis V. Lunev, 2016/04/27

Prev by Date: Re: [Qemu-devel] [PATCH v3 36/44] nbd: Improve handling of shutdown requests
Next by Date: Re: [Qemu-devel] [PATCH v3 16/44] atapi: Switch to byte-based block access
Previous by thread: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes()
Next by thread: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes()
Index(es):
- Date
- Thread