[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [

From: Paolo Bonzini
Subject: Re: [Qemu-block] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
Date: Thu, 21 Jul 2016 10:23:48 -0400 (EDT)

> > 1) avoid copying zero data, to keep the copy process efficient.  For this,
> > SEEK_HOLE/SEEK_DATA are enough.
> > 
> > 2) copy file contents while preserving the allocation state of the file's
> > extents.
> Which is /very difficult/ to do safely and reliably.
> i.e. the use of fiemap to duplicate the exact layout of a file
> from userspace is only posisble if you can /guarantee/ the source
> file has not changed in any way during the copy operation at the
> pointin time you finalise the destination data copy.

We don't do exactly that, exactly because it's messy when you have
concurrent accesses (which shouldn't be done but you never know).  When
doing a copy, we use(d to use) FIEMAP the same way as you'd use lseek,
querying one extent at a time.  If you proceed this way, all of these
can cause the same races:

- pread(ofs=10MB, len=10MB) returns all zeroes, so the 10MB..20MB is
not copied

- pread(ofs=10MB, len=10MB) returns non-zero data, so the 10MB..20MB is

- lseek(SEEK_DATA, 10MB) returns 20MB, so the 10MB..20MB area is not

- lseek(SEEK_HOLE, 10MB) returns 20MB, so the 10MB..20MB area is

- ioctl(FIEMAP at 10MB) returns an extent starting at 20MB, so
the 10MB..20MB area is not copied

- ioctl(FIEMAP at 10MB) returns an unwritten extent at 10MB..20MB,
so the 10MB..20MB area is FALLOC_FL_ZERO_RANGE-ed on the destination

- ioctl(FIEMAP at 10MB) returns an extent at 10MB..20MB, so
the 10MB..20MB area is copied

But unfortunately in the FIEMAP case you'd need FIEMAP_FLAG_SYNC.  It
would be nice to have a way to do this kind of query without syncing
and with expressiveness that is at least close to FIEMAP's results.
We don't need need the full power of FIEMAP---as mentioned above we
only query one extent---so having a new ioctl would be fine too!
I'm putting a sample program after my signature for reference.

> IOWs, I think you should be looking to optimise file copies by using
> copy_file_range() and getting filesystems to do exactly what you
> need. Using FIEMAP, fallocate and moving data through userspace
> won't ever be reliable without special filesystem help (that only
> exists for XFS right now), nor will it enable the application to
> transparently use smart storage protocols and hardware when it is
> present on user systems....

We want to do that.  We also want to present SCSI extended copy to the
guest, and map that to copy_file_range() on the host if possible.  However,
there are going to be cases where we need to do low-level queries.  For
example, it is on my todo list to implement lseek(SEEK_HOLE/SEEK_DATA)
on /dev/sd* devices as a GET LBA STATUS SCSI command.  Dually, QEMU would
convert the GET LBA STATUS command to lseek or FIEMAP.  QEMU only does a
simple conversion of commands and results, it's up to the guest to avoid

It is of course the best if the guest can use SCSI extended copy, but we
cannot guarantee that.


 * Skeleton that walks the extents of a file (e.g. to copy
 * each part while preserving the allocation structure).
 * Usage: ./fiemap FILENAME

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <errno.h>

// #define UNSAFE

#define MIN(a, b) ((a) < (b) ? (a) : (b))

enum {

/* Probe the area at [start, start+length) and return whether it begins with a
 * hole (BDRV_BLOCK_ZERO), an unwritten extent (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)
 * or allocated data (BDRV_BLOCK_DATA).  The return value applies to the
 * [start, *next) range.  *next should be used as the "start" value in a
 * subsequent call to get_next_fiemap.  
 * If successful, return the kind of data at offset start in the file.
 * If start is beyond the end of file, return -ENXIO.
 * Otherwise, return -errno.
int get_next_fiemap(int fd, off_t start, off_t length, off_t *next)
    struct {
        struct fiemap fm;
        struct fiemap_extent fe;
    } f;

    f.fm.fm_flags = 0; 
    f.fm.fm_flags = FIEMAP_FLAG_SYNC;
    f.fm.fm_start = start;
    f.fm.fm_length = length;
    f.fm.fm_extent_count = 1;
    f.fm.fm_reserved = 0;
    if (ioctl(fd, FS_IOC_FIEMAP, &f) == -1) {
        return -errno;

    if (f.fm.fm_mapped_extents == 0) {
         * No extents found, any data must be at start + length or later.
         * Check if we're at end of file, and if not, return a hole as large
         * as the size we probed; splitting a hole across multiple calls is 
        off_t filelen = lseek(fd, 0, SEEK_END);
        if (start >= filelen) {
            return -ENXIO;
        /* Pay attention to overflows. */
        *next = start + MIN(length, filelen - start);
        return BDRV_BLOCK_ZERO;

    if (f.fe.fe_logical > f.fm.fm_start) {
         * Found an extent, but not at the probed point.  We're in a hole,
         * return it and we'll get to that extent on the next call.
        *next = f.fe.fe_logical;
        return BDRV_BLOCK_ZERO;

    /* Found an extent, and we're inside it.  */
    *next = f.fe.fe_logical + f.fe.fe_length;
    if (f.fe.fe_flags & FIEMAP_EXTENT_UNWRITTEN) {
    } else {
        return BDRV_BLOCK_DATA;

int main(int argc, char **argv)
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {

    off_t start = 0, next;
    off_t length = 8 * 1024 * 1024;
    int ret;
    while ((ret = get_next_fiemap(fd, start, length, &next)) >= 0) {
        printf("%-15ld %-15ld%s%s\n",
               start, next - start,
               (ret & BDRV_BLOCK_DATA) ? " data" : "",
               (ret & BDRV_BLOCK_ZERO) ? " zero" : "");
        /* ... copy or fallocate the destination here ... */
        start = next;

    if (ret != -ENXIO) {

reply via email to

[Prev in Thread] Current Thread [Next in Thread]