[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] RFC: some problems with bdrv_get_block_status

From: Eric Blake
Subject: Re: [Qemu-devel] RFC: some problems with bdrv_get_block_status
Date: Fri, 28 Apr 2017 13:31:32 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.0

On 04/28/2017 11:17 AM, Denis V. Lunev wrote:
> Hello, All!
> Recently we have experienced problems with very slow
> bdrv_get_block_status call, which is massive called f.e.
> from mirror_run().
> The problem was narrowed down to slow lseek64() system
> call, which can take 1-2-3 seconds.

I'm guessing you meant one-to-three (the range), not one-two-three
(three separate digits), and just had an unfortunate abbreviation of
'to' turning into the phonetically-similar '2'.

> address@hidden ~]# strace -f -p 77048 -T  -e lseek
> Process 77048 attached with 14 threads
> [pid 77053] lseek(23, 6359613440, SEEK_DATA) = 6360276992 <1.250323>

That sounds like a bug in your choice of filesystem.  It's been
mentioned before that lseek has some pathetically poor behavior (I think
tmpfs was one of the culprits), but I maintain that it is better to
hammer on the kernel folks to fix the poor behavior than it is to have
to implement user-space workarounds in every single affected program.

That said, a workaround of being able to request the avoidance of lseek
has been brought up on this list before.

> The problem comes from this branch of the code
> bdrv_co_get_block_status
>     .......
>     if (bs->file &&
>         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
>         (ret & BDRV_BLOCK_OFFSET_VALID)) {
>         int file_pnum;
>         ret2 = bdrv_co_get_block_status(bs->file, ret >> BDRV_SECTOR_BITS,
>                                         *pnum, &file_pnum);
>         if (ret2 >= 0) {
>             /* Ignore errors.  This is just providing extra information, it
>              * is useful but not necessary.
>              */
>             if (!file_pnum) {
>                 /* !file_pnum indicates an offset at or beyond the EOF;
> it is
>                  * perfectly valid for the format block driver to point
> to such
>                  * offsets, so catch it and mark everything as zero */
>                 ret |= BDRV_BLOCK_ZERO;
>             } else {
>                 /* Limit request to the range reported by the protocol
> driver */
>                 *pnum = file_pnum;
>                 ret |= (ret2 & BDRV_BLOCK_ZERO);
>             }
>         }
>     }

I don't see anything wrong with this code. It's nice to know that we
have data (qcow2 says we have allocated bytes due to this layer, so
don't refer to the backing files), but when the underlying file can ALSO
tell us that the underlying protocol is sparse and we are on a hole,
then we know that we have BDRV_BLOCK_ZERO condition which can in turn
enable other optimizations in quite a bit of the stack.  It IS valid to
return BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO (we don't have to limit
ourselves to BDRV_BLOCK_DATA), and we should probably do so any time
that such computation is cheap.

I agree with your analysis that on a poor filesystem where lseek()
behavior sucks that it is no longer cheap to determine where the holes are.

But the proper workaround for those filesystems should NOT be made by
undoing this part of bdrv_co_get_block_status(), but should rather be
fixed in file-posix.c by the addition of a user-controllable knob on
whether to skip lseek().  In other words, if we're going to work around
the poor filesystem performance, the workaround should live in
file-posix.c, not in the generic block/io.c.  Once the workaround is
enabled, the 'ret2 = bdrv_co_get_block_status()' nested call should be
lightning fast (because file-posix.c would no longer be using lseek when
you've requested the workaround).

> Frankly speaking, this optimization should not give much.
> If upper layer format (QCOW2) says that we have data
> here, then nowadays in 99.9% we do have the data.

You are correct that we have BDRV_BLOCK_DATA; but it IS useful to ALSO
know if that data reads back as all zeros (so we can skip reading it).
Just because qcow2 reports something as data does NOT rule out whether
the data still reads as zeros.

> Meanwhile this branch can cause problems. We would
> need block cleared entirely to get the benefit for most
> cases in mirror and backup operations.
> At my opinion it worth to drop this at all.
> Guys, do you have any opinion?

Again, my opinion is to not change this part of block/io.c.  Rather,
work should be expended on individual protocol drivers to quit wasting
unnecessary time on reporting BDRV_BLOCK_ZERO if the time spent to
determine that is not worth the effort (as it is when using lseek() on
inefficient filesystems).  It is always safe for a protocol driver to
report BDRV_BLOCK_DATA instead of BDRV_BLOCK_ZERO, if asking the
question is too expensive (treating holes as data may pessimize other
operations, but that's okay if the penalty for asking is worse than what
optimizations are able to regain).

> Den
> P.S. The kernel is one based on RedHat 3.10.0-514. The same
>       problem was observed in 3.10.0-327 too.

Red Hat is always two words (I'm allowed to point that out, as they
employ me ;).  But if it really is a Red Hat kernel problem, be sure to
use your support contract to point out the kernel developers that they
really need to fix lseek() - and you'll need to give details on which
filesystem you're using (as not all filesystems have that abysmal

Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]