qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC PATCH 3/4] block: VHDX block driver support


From: Kevin Wolf
Subject: Re: [Qemu-devel] [RFC PATCH 3/4] block: VHDX block driver support
Date: Tue, 19 Feb 2013 15:57:53 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

On Mon, Feb 18, 2013 at 06:03:31PM -0500, Jeff Cody wrote:
> This is preliminary and beta support for the VHDX image format,
> as specified in:
>    "VHDX Format Specification v0.95", published 4/12/2012
>     https://www.microsoft.com/en-us/download/details.aspx?id=29681
> 
> This RFC patch has the following limitations
>     - read-only (no write support yet)
>     - does not support differencing files
>     - does not make use of iovec or coroutines
>     - may likely be broken in novel and interesting ways
> 
> I've doen read testing from a dynamic, 127GB VHDX image created
> using Hyper-V.  The image was partitioned with a F18 install,
> and the partitions could be mounted and files read.
> 
> As this is still under active development, and an RFC, you can
> assume that any debug print statements will be removed prior
> to formal patch submission.
> 
> Signed-off-by: Jeff Cody <address@hidden>

One general remark: I think you'd better not copy the structure of the
vpc block driver. It's not our worst driver, but far away from being the
best. I'm afraid that we don't have the perfect driver, but qcow2 is
probably closest to what this should look like in the end.

> ---
>  block/vhdx.c | 814 
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 814 insertions(+)
>  create mode 100644 block/vhdx.c
> 
> diff --git a/block/vhdx.c b/block/vhdx.c
> new file mode 100644
> index 0000000..96f8fc7
> --- /dev/null
> +++ b/block/vhdx.c
> @@ -0,0 +1,814 @@
> +/*
> + * Block driver for Hyper-V VHDX Images
> + *
> + * Copyright (c) 2013 Red Hat, Inc.,
> + *
> + * Authors:
> + *  Jeff Cody <address@hidden>
> + *
> + *  This is based on the "VHDX Format Specification v0.95", published 
> 4/12/2012
> + *  by Microsoft:
> + *      https://www.microsoft.com/en-us/download/details.aspx?id=29681
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + *
> + */

Would you consider a LGPL or MIT license? So far the block drivers are
mostly (except vdi and sheepdog) LGPL compatible, and for a future
libqblock that could become important.

> +/* opens the specified header block from the VHDX file header section */
> +static int vhdx_parse_header(BlockDriverState *bs, BDRVVHDXState *s)
> +{
> +    int ret = 0;
> +    vhdx_header *header1;
> +    vhdx_header *header2;
> +    uint64_t h1_seq = 0;
> +    uint64_t h2_seq = 0;
> +    uint8_t *buffer;
> +
> +    header1 = g_malloc(sizeof(vhdx_header));
> +    header2 = g_malloc(sizeof(vhdx_header));
> +
> +    buffer = g_malloc(VHDX_HEADER_SIZE);
> +
> +    s->headers[0] = header1;
> +    s->headers[1] = header2;
> +
> +    ret = bdrv_pread(bs->file, VHDX_HEADER1_OFFSET, buffer, 
> VHDX_HEADER_SIZE);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +    vhdx_fill_header(header1, buffer);
> +
> +    if (vhdx_validate_checksum(buffer, VHDX_HEADER_SIZE, 4) == 0 &&
> +        header1->signature == VHDX_HDR_MAGIC) {
> +        printf("header1 is valid!\n");
> +        h1_seq = header1->sequence_number;
> +    }
> +
> +    ret = bdrv_pread(bs->file, VHDX_HEADER2_OFFSET, buffer, 
> VHDX_HEADER_SIZE);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +    vhdx_fill_header(header2, buffer);
> +
> +    if (vhdx_validate_checksum(buffer, VHDX_HEADER_SIZE, 4) == 0 &&
> +        header2->signature == VHDX_HDR_MAGIC) {
> +        printf("header2 is valid!\n");
> +        h2_seq = header2->sequence_number;
> +    }
> +
> +    if (h1_seq > h2_seq) {
> +        s->curr_header = 0;
> +    } else if (h2_seq > h1_seq) {
> +        s->curr_header = 1;
> +    } else {
> +        printf("NO VALID HEADER\n");
> +        ret = -1;

I know this is only a debug message, but it could become a real
qerror_report().

Also -1 is not a valid -errno return code.

> +    }
> +
> +    ret = 0;
> +
> +    printf("current header is %d\n", s->curr_header);
> +    goto exit;
> +
> +fail:
> +    g_free(header1);
> +    g_free(header2);
> +    s->headers[0] = NULL;
> +    s->headers[1] = NULL;
> +exit:
> +    g_free(buffer);
> +    return ret;
> +}
> +
> +static int vhdx_open_region_tables(BlockDriverState *bs, BDRVVHDXState *s)
> +{
> +    int ret = 0;
> +    uint8_t *buffer;
> +    int offset = 0;
> +    vhdx_region_table_entry rt_entry;
> +    int i;
> +
> +    /* We have to read the whole 64KB block, because the crc32 is over the
> +     * whole block */
> +    buffer = g_malloc(VHDX_HEADER_BLOCK_SIZE);
> +
> +    printf("reading region tables...\n");
> +    ret = bdrv_pread(bs->file, VHDX_REGION_TABLE_OFFSET, buffer,
> +                    VHDX_HEADER_BLOCK_SIZE);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +
> +    hdr_copy32(&s->rt.signature,   buffer, offset);
> +    hdr_copy32(&s->rt.checksum,    buffer, offset);
> +    hdr_copy32(&s->rt.entry_count, buffer, offset);
> +    hdr_copy32(&s->rt.reserved,    buffer, offset);
> +
> +    if (vhdx_validate_checksum(buffer, VHDX_HEADER_BLOCK_SIZE, 4) ||
> +        s->rt.signature != VHDX_RT_MAGIC) {
> +        ret = -1;

Again, -errno. (More instances follow, won't comment on each)

> +        printf("region table checksum and/or magic failure\n");
> +        goto fail;
> +    }
> +
> +    printf("Found %" PRId32 " region table entries\n", s->rt.entry_count);
> +
> +    for (i = 0; i < s->rt.entry_count; i++) {
> +        hdr_copy_guid(rt_entry.guid, buffer, offset);
> +
> +        hdr_copy64(&rt_entry.file_offset,   buffer, offset);
> +        hdr_copy32(&rt_entry.length,        buffer, offset);
> +        hdr_copy32(&rt_entry.bitfield.data, buffer, offset);
> +
> +        /* see if we recognize the entry */
> +        if (guid_cmp(rt_entry.guid, bat_guid)) {
> +            s->bat_rt = rt_entry;
> +            continue;
> +        }
> +
> +        if (guid_cmp(rt_entry.guid, metadata_guid)) {
> +            s->metadata_rt = rt_entry;
> +            continue;
> +        }
> +
> +        if (rt_entry.bitfield.bits.required) {
> +            /* cannot read vhdx file - required region table entry that
> +             * we do not understand.  per spec, we must fail to open */
> +            printf("Found unknown region table entry that is REQUIRED!\n");
> +            ret = -1;
> +            goto fail;
> +        }
> +    }
> +    ret = 0;
> +
> +fail:
> +    g_free(buffer);
> +    return ret;
> +}
> +
> +
> +static int vhdx_parse_metadata(BlockDriverState *bs, BDRVVHDXState *s)
> +{
> +    int ret = 0;
> +    uint8_t *buffer;
> +    int offset = 0;
> +    int i = 0;
> +    uint32_t block_size, sectors_per_block, logical_sector_size;
> +    uint64_t chunk_ratio;
> +    vhdx_metadata_table_entry md_entry;
> +
> +    buffer = g_malloc(VHDX_METADATA_TABLE_MAX_SIZE);
> +
> +    printf("reading metadata at offset 0x%" PRIx64 "\n",
> +            s->metadata_rt.file_offset);
> +    ret = bdrv_pread(bs->file, s->metadata_rt.file_offset, buffer,
> +                     VHDX_METADATA_TABLE_MAX_SIZE);
> +    if (ret < 0) {
> +        goto fail_no_free;
> +    }
> +    hdr_copy64(&s->metadata_hdr.signature,   buffer, offset);
> +    hdr_copy16(&s->metadata_hdr.reserved,    buffer, offset);
> +    hdr_copy16(&s->metadata_hdr.entry_count, buffer, offset);
> +    hdr_copy(&s->metadata_hdr.reserved2, buffer,
> +             sizeof(s->metadata_hdr.reserved2), offset);
> +
> +    if (s->metadata_hdr.signature != VHDX_METADATA_MAGIC) {
> +        ret = -1;
> +        printf("metadata header signature did not match: 0x%" PRIx64 "\n",
> +                s->metadata_hdr.signature);
> +        goto fail_no_free;
> +    }
> +
> +    printf("metadata section has %" PRId16 " entries\n",
> +            s->metadata_hdr.entry_count);
> +
> +    s->metadata_entries.present = 0;
> +
> +    for (i = 0; i < s->metadata_hdr.entry_count; i++) {
> +        hdr_copy_guid(md_entry.item_id,     buffer, offset);
> +        hdr_copy32(&md_entry.offset,        buffer, offset);
> +        hdr_copy32(&md_entry.length,        buffer, offset);
> +        hdr_copy32(&md_entry.bitfield.data, buffer, offset);
> +        hdr_copy32(&md_entry.reserved2,     buffer, offset);
> +
> +        if (guid_cmp(md_entry.item_id, file_param_guid)) {
> +            s->metadata_entries.file_parameters_entry = md_entry;
> +            s->metadata_entries.present |= META_FILE_PARAMETER_PRESENT;
> +            continue;
> +        }
> +
> +        if (guid_cmp(md_entry.item_id, virtual_size_guid)) {
> +            s->metadata_entries.virtual_disk_size_entry = md_entry;
> +            s->metadata_entries.present |= META_VIRTUAL_DISK_SIZE_PRESENT;
> +            continue;
> +        }
> +
> +        if (guid_cmp(md_entry.item_id, page83_guid)) {
> +            s->metadata_entries.page83_data_entry = md_entry;
> +            s->metadata_entries.present |= META_PAGE_83_PRESENT;
> +            continue;
> +        }
> +
> +        if (guid_cmp(md_entry.item_id, logical_sector_guid)) {
> +            s->metadata_entries.logical_sector_size_entry = md_entry;
> +            s->metadata_entries.present |= META_LOGICAL_SECTOR_SIZE_PRESENT;
> +            continue;
> +        }
> +
> +        if (guid_cmp(md_entry.item_id, phys_sector_guid)) {
> +            s->metadata_entries.phys_sector_size_entry = md_entry;
> +            s->metadata_entries.present |= META_PHYS_SECTOR_SIZE_PRESENT;
> +            continue;
> +        }
> +
> +        if (guid_cmp(md_entry.item_id, parent_locator_guid)) {
> +            s->metadata_entries.parent_locator_entry = md_entry;
> +            s->metadata_entries.present |= META_PARENT_LOCATOR_PRESENT;
> +            continue;
> +        }
> +
> +        if (md_entry.bitfield.bits.is_required) {
> +            /* cannot read vhdx file - required region table entry that
> +             * we do not understand.  per spec, we must fail to open */
> +            printf("Found unknown metadata table entry that is REQUIRED!\n");
> +            ret = -1;
> +            goto exit;
> +        }
> +    }
> +
> +    if (s->metadata_entries.present != META_ALL_PRESENT) {
> +        printf("Did not find all required metadata entry fields\n");
> +        ret = -1;
> +        goto exit;
> +    }
> +
> +    offset = 0;
> +    buffer = g_realloc(buffer,
> +                       s->metadata_entries.file_parameters_entry.length);
> +    ret = bdrv_pread(bs->file,
> +                     s->metadata_entries.file_parameters_entry.offset
> +                                         + s->metadata_rt.file_offset,
> +                     buffer,
> +                     s->metadata_entries.file_parameters_entry.length);
> +
> +    hdr_copy32(&s->params.block_size,    buffer, offset);
> +    hdr_copy32(&s->params.bitfield.data, buffer, offset);
> +
> +
> +    /* We now have the file parameters, so we can tell if this is a
> +     * differencing file (i.e.. has_parent), is dynamic or fixed
> +     * sized (leave_blocks_allocated), and the block size */
> +
> +    /* The parent locator required iff the file parameters has_parent set */
> +    if (s->params.bitfield.bits.has_parent) {
> +        if (s->metadata_entries.present & ~META_PARENT_LOCATOR_PRESENT) {
> +            offset = 0;
> +            buffer = g_realloc(buffer,
> +                               
> s->metadata_entries.parent_locator_entry.length);
> +            ret = bdrv_pread(bs->file,
> +                             s->metadata_entries.parent_locator_entry.offset,
> +                             buffer,
> +                             
> s->metadata_entries.parent_locator_entry.length);
> +
> +            ret = -1; /* temp, until differencing files are supported */
> +            /* TODO: parse  parent locator fields */
> +
> +        } else {
> +            printf("Did not find all required metadata entry fields\n");
> +            ret = -1;
> +            goto exit;
> +        }
> +    }
> +    /* determine virtual disk size, logical sector size,
> +     * and phys sector size */
> +
> +    ret = bdrv_pread(bs->file,
> +                     s->metadata_entries.virtual_disk_size_entry.offset
> +                                           + s->metadata_rt.file_offset,
> +                     &s->virtual_disk_size,
> +                     sizeof(uint64_t));

Error handling? The next two calls as well.

> +    ret = bdrv_pread(bs->file,
> +                     s->metadata_entries.logical_sector_size_entry.offset
> +                                             + s->metadata_rt.file_offset,
> +                     &s->logical_sector_size,
> +                     sizeof(uint32_t));
> +    ret = bdrv_pread(bs->file,
> +                     s->metadata_entries.phys_sector_size_entry.offset
> +                                          + s->metadata_rt.file_offset,
> +                     &s->physical_sector_size,
> +                     sizeof(uint32_t));
> +
> +    le64_to_cpus(&s->virtual_disk_size);
> +    le32_to_cpus(&s->logical_sector_size);
> +    le32_to_cpus(&s->physical_sector_size);
> +
> +    /* both block_size and sector_size are guaranteed powers of 2 */
> +    s->sectors_per_block = s->params.block_size / s->logical_sector_size;
> +    s->chunk_ratio = (VHDX_MAX_SECTORS_PER_BLOCK) *
> +                     (uint64_t)s->logical_sector_size /
> +                     (uint64_t)s->params.block_size;
> +
> +
> +
> +    /* These values are ones we will want to division / multiplication
> +     * with later on, and they are all guaranteed (per the spec) to be
> +     * powers of 2, so we can take advantage of that */
> +    logical_sector_size = s->logical_sector_size;
> +    while (logical_sector_size >>= 1) {
> +        s->logical_sector_size_bits++;
> +    }
> +    sectors_per_block = s->sectors_per_block;
> +    while (sectors_per_block >>= 1) {
> +        s->sectors_per_block_bits++;
> +    }
> +    chunk_ratio = s->chunk_ratio;
> +    while (chunk_ratio >>= 1) {
> +        s->chunk_ratio_bits++;
> +    }
> +    block_size = s->params.block_size;
> +    while (block_size >>= 1) {
> +        s->block_size_bits++;
> +    }
> +
> +    printf("chunk ratio is %" PRId64 "\n", s->chunk_ratio);
> +    printf("block size is %" PRId32 " MB\n", 
> s->params.block_size/(1024*1024));
> +    printf("virtual disk size is %" PRId64 " MB\n",
> +            s->virtual_disk_size/(1024*1024));
> +    printf("logical sector size is %" PRId32 " bytes\n",
> +            s->logical_sector_size);
> +    printf("sectors per block is %" PRId32 "\n", s->sectors_per_block);
> +
> +    if (s->logical_sector_size != BDRV_SECTOR_SIZE) {
> +        printf("VHDX error - QEMU only supports 512 byte sector sizes\n");
> +        ret = -1;
> +        goto exit;
> +    }
> +
> +    ret = 0;
> +
> +exit:
> +    g_free(buffer);
> +fail_no_free:
> +    return ret;
> +}
> +
> +static int vhdx_open(BlockDriverState *bs, int flags)
> +{
> +    BDRVVHDXState *s = bs->opaque;
> +    int ret = 0;
> +    int i;
> +
> +    s->bat = NULL;
> +
> +    ret = vhdx_parse_header(bs, s);
> +    if (ret) {
> +        goto fail;
> +    }
> +
> +    ret = vhdx_open_region_tables(bs, s);
> +    if (ret) {
> +        goto fail;
> +    }
> +
> +    ret = vhdx_parse_metadata(bs, s);
> +    if (ret) {
> +        goto fail;
> +    }
> +
> +    /* the VHDX spec dictates that virtual_disk_size is always a multiple of
> +     * logical_sector_size */
> +    bs->total_sectors = s->virtual_disk_size / s->logical_sector_size;
> +
> +    s->bat_entries = s->bat_rt.length / VHDX_BAT_ENTRY_SIZE;
> +    s->bat = g_malloc(s->bat_rt.length);
> +
> +    ret = bdrv_pread(bs->file, s->bat_rt.file_offset, s->bat, 
> s->bat_rt.length);
> +
> +    /* as it is a uint64_t bitfield, we need to have the right endianness */
> +    for (i = 0; i < s->bat_entries; i++) {
> +        le64_to_cpus(&s->bat[i].bitfield.data);
> +    }
> +
> +    /* TODO: differencing files, r/w */
> +    /* the below is obviously temporary */
> +    if (flags & BDRV_O_RDWR) {
> +        ret = -1;
> +        goto fail;
> +    }
> +fail:
> +    ret = 0;

Don't you think the ret = 0; would be better before the fail: label? :-)

In error cases s->bat may be leaked.

> +    return ret;
> +}
> +
> +static int vhdx_reopen_prepare(BDRVReopenState *state,
> +                               BlockReopenQueue *queue, Error **errp)
> +{
> +    return 0;
> +}
> +
> +
> +
> +static int vhdx_read(BlockDriverState *bs, int64_t sector_num,
> +                    uint8_t *buf, int nb_sectors)

For new block drivers there's really no excuse for not implementing
bdrv_co_readv instead.

> +{
> +    BDRVVHDXState *s = bs->opaque;
> +    int ret = 0;
> +    uint32_t sectors_avail;
> +    uint32_t block_offset;
> +    uint64_t offset;
> +    uint32_t bat_idx;
> +    uint32_t bytes_to_read;
> +
> +    while (nb_sectors > 0) {
> +        /* We are a differencing file, so we need to inspect the sector 
> bitmap

This function has a few lines > 80 characters.

> +         * to see if we have the data or not */
> +        if (s->params.bitfield.bits.has_parent) {
> +            /* not supported yet */
> +            ret = -1;
> +            goto exit;
> +        } else {
> +            bat_idx = sector_num >> s->sectors_per_block_bits;
> +            /* effectively a modulo - this gives us the offset into the block
> +             * (in sector sizes) for our sector number */
> +            block_offset = sector_num - (bat_idx << 
> s->sectors_per_block_bits);
> +            /* the chunk ratio gives us the interleaving of the sector
> +             * bitmaps, so we need to advance our page block index by the
> +             * sector bitmaps entry number */
> +            bat_idx += bat_idx >> s->chunk_ratio_bits;
> +
> +            /* the number of sectors we can read in this cycle */
> +            sectors_avail = s->sectors_per_block - block_offset;
> +
> +            if (sectors_avail  > nb_sectors) {
> +                sectors_avail = nb_sectors;
> +            }
> +
> +            bytes_to_read = sectors_avail << s->logical_sector_size_bits;
> +
> +            /*
> +            printf("bat_idx: %d\n", bat_idx);
> +            printf("sectors_avail: %d\n", bat_idx);
> +            printf("bytes_to_read: %d\n", bat_idx);
> +            */
> +            /* check the payload block state */
> +            switch (s->bat[bat_idx].bitfield.bits.state) {
> +            case PAYLOAD_BLOCK_NOT_PRESENT: /* fall through */
> +            case PAYLOAD_BLOCK_UNMAPPED:    /* fall through */
> +            case PAYLOAD_BLOCK_ZERO:
> +                /* return zero */
> +                memset(buf, 0, bytes_to_read);
> +                break;
> +            case PAYLOAD_BLOCK_UNDEFINED:   /* fall through */
> +            case PAYLOAD_BLOCK_FULL_PRESENT:
> +                offset = (block_offset << s->logical_sector_size_bits) +
> +                         (s->bat[bat_idx].bitfield.bits.file_offset_mb
> +                          * (1024 * 1024));
> +                ret = bdrv_pread(bs->file, offset, buf, bytes_to_read);
> +                if (ret != bytes_to_read) {
> +                    ret = -1;
> +                    goto exit;
> +                }
> +                break;
> +            case PAYLOAD_BLOCK_PARTIALLY_PRESENT:
> +                /* we don't yet support difference files, fall through
> +                 * to error */
> +            default:
> +                ret = -1;
> +                goto exit;
> +                break;
> +            }
> +            nb_sectors -= sectors_avail;
> +            sector_num += sectors_avail;
> +            buf += bytes_to_read;
> +        }

This is a rather long else branch, and the then branch will probably get
even longer when you implement. Probably they should become separate
functions, but you'll see how it turns out when you implement it.

> +    }
> +    ret = 0;
> +exit:
> +    return ret;
> +}
> +
> +static int vhdx_write(BlockDriverState *bs, int64_t sector_num,
> +    const uint8_t *buf, int nb_sectors)
> +{
> +   /* BDRVVHDXState *s = bs->opaque; */
> +
> +    /* TODO */
> +
> +    return 0;
> +}

Right, commenting on a stub that won't survive in the real series feels
rather silly, but I wouldn't let a stub return success in any case...

> +static coroutine_fn int vhdx_co_write(BlockDriverState *bs, int64_t 
> sector_num,
> +                                     const uint8_t *buf, int nb_sectors)
> +{
> +    int ret;
> +    BDRVVHDXState *s = bs->opaque;
> +    qemu_co_mutex_lock(&s->lock);
> +    ret = vhdx_write(bs, sector_num, buf, nb_sectors);
> +    qemu_co_mutex_unlock(&s->lock);
> +
> +    /* TODO */
> +
> +    return ret;
> +}

I see that you already removed the same pattern from reads, so havin
separate vhdx_co_write/vhdx_write doesn't make sense. Also holding the
mutex all the time doesn't appear to be a good idea. All of this is
copied from drivers that were never properly converted.

> +static int vhdx_create(const char *filename, QEMUOptionParameter *options)
> +{
> +
> +    /* TODO */
> +
> +   return 0;
> +}

You don't have to have a .bdrv_create function in the BlockDriver if you
don't implement it. The block layer usually has a sane default
implementation for everything that's missing (saner than doing nothing
an returning success in any case).

Kevin



reply via email to

[Prev in Thread] Current Thread [Next in Thread]