From: Greg Kurz
Subject: Re: [Qemu-devel] [PATCH v2 3/4] 9p: persistency of QID path beyond reboots / suspensions
Date: Tue, 7 May 2019 11:41:58 +0200

On Fri, 03 May 2019 18:23:03 +0200
Christian Schoenebeck <address@hidden> wrote:

> This patch aims to keep QID paths identical beyond the scope of reboots and
> guest suspensions. With the 1st patch alone the QID path of the same files
> might change after reboots / suspensions, since 9p would restart with an
> empty qpp_table and the resulting QID path depends on the precise sequence
> of files being accessed on the guest.
> 
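To restate the mechanics for context: the access-sequence dependency comes
from qp_prefix values being handed out sequentially as new device /
inode-prefix combos are first seen, with the prefix ending up in the top 16
bits of the QID path. A minimal sketch of the composition, simplified from
qid_path_prefixmap() further down and assuming QPATH_INO_MASK covers the low
48 bits of the inode:

    /* Sketch only: how a QID path is composed once a 16-bit prefix has been
     * allocated in qpp_table for a (device id, inode high bits) combo.
     */
    #define QPATH_INO_MASK ((1ULL << 48) - 1) /* assumed definition */

    static uint64_t compose_qid_path(uint16_t qp_prefix, uint64_t st_ino)
    {
        return ((uint64_t)qp_prefix << 48) | (st_ino & QPATH_INO_MASK);
    }

So two boots that touch the same files in a different order hand out
different prefixes for the same combos, which is what this patch addresses.
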
> The first patch should already avoid the vast majority of potential QID
> path collisions. However, network services running on the guest in
> particular would still be prone to QID path issues when just using the 1st
> patch. For instance, Samba is exporting file IDs to clients in the network,
> and SMB clients in the network will use those IDs to access and request
> changes on the file server. If the guest is now interrupted in between, as
> commonly happens during maintenance, e.g. software updates on the host, then
> SMB clients in the network will continue working with old file IDs, which
> in turn leads to data corruption and data loss on the file server.
> Furthermore, on the SMB client side I also encountered severe misbehaviour
> in this case: for instance, Macs accessing the file server would either
> start to hang or die with a kernel panic within seconds, since the smbx
> implementation on macOS relies heavily on file IDs being unique (within
> the context of a connection, that is).
> 
> So this patch here mitigates the remaining problem described above by
> storing the qpp_table persistently as extended attribute(s) on the
> exported root of the file system, and automatically tries to restore the
> qpp_table e.g. after reboots / resumptions.
> 
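For anyone trying this out: the table ends up in one or more extended
attributes named user.virtfs.system.qidp.<n> on the export root, split into
chunks of at most ATTR_MAX_SIZE bytes. A purely illustrative sketch of how
the persisted chunks could be inspected from the host side, assuming Linux's
lgetxattr(2) and the naming scheme used in this patch:

    #include <stdio.h>
    #include <sys/xattr.h>

    /* Illustrative only: print the size of each persisted qpp_table chunk. */
    static void dump_qidp_chunks(const char *export_root)
    {
        char name[64];
        ssize_t size;
        int i;

        for (i = 0; ; ++i) {
            snprintf(name, sizeof(name), "user.virtfs.system.qidp.%d", i);
            /* a NULL buffer makes lgetxattr() report just the value size */
            size = lgetxattr(export_root, name, NULL, 0);
            if (size <= 0) {
                break; /* no further chunk stored */
            }
            printf("%s: %zd bytes\n", name, size);
        }
    }
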
> This patch is aimed at real world scenarios, in which qpp_table will only
> ever get a few dozen entries (and none ever in qpf_table). So it is e.g.
> intentionally limited to storing only qpp_table, not qpf_table; and so far
> I have not made any optimizations, since in practice the qpp_table is
> really just tiny.
> 
> Since there is currently no callback in qemu yet that would reliably be
> called on guest shutdowns, the table is stored on every new insertion for
> now.
> 
> Signed-off-by: Christian Schoenebeck <address@hidden>
> ---
>  hw/9pfs/9p.c | 315 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/9pfs/9p.h |  33 +++++++
>  2 files changed, 343 insertions(+), 5 deletions(-)
> 

Yikes... both the changelog and the diffstat have an impressive size. Since
I'm likely the only QEMU developer who will review this, please expect some
delay before I get down to it... :-\ Of course, if you can split this into
smaller patches, it will probably help :)

Same remark for patch 4.

> diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> index 2b893e25a1..29c6dfc68a 100644
> --- a/hw/9pfs/9p.c
> +++ b/hw/9pfs/9p.c
> @@ -26,6 +26,19 @@
>  #include "migration/blocker.h"
>  #include "sysemu/qtest.h"
>  #include "qemu/xxhash.h"
> +#include "qemu/crc32c.h"
> +#if defined(__linux__) /* TODO: This should probably go into osdep.h instead */
> +# include <linux/limits.h> /* for XATTR_SIZE_MAX */
> +#endif
> +
> +/*
> + * How many bytes may we store to fs per extended attribute value?
> + */
> +#ifdef XATTR_SIZE_MAX
> +# define ATTR_MAX_SIZE XATTR_SIZE_MAX /* Linux only: 64kB limit in kernel */
> +#else
> +# define ATTR_MAX_SIZE 65536 /* Most systems allow a bit more, so we take this as basis. */
> +#endif
>  
>  int open_fd_hw;
>  int total_open_fd;
> @@ -642,6 +655,285 @@ static int qid_path_fullmap(V9fsPDU *pdu, const struct stat *stbuf,
>      return 0;
>  }
>  
> +static inline bool is_ro_export(FsContext *ctx)
> +{
> +    return ctx->export_flags & V9FS_RDONLY;
> +}
> +
> +/*
> + * Once qpp_table size exceeds this value, we no longer save
> + * the table persistently. See comment in v9fs_store_qpp_table()
> + */
> +#define QPP_TABLE_PERSISTENCY_LIMIT 32768
> +
> +/* Remove all user.virtfs.system.qidp.* xattrs from export root. */
> +static void remove_qidp_xattr(FsContext *ctx)
> +{
> +    V9fsString name;
> +    int i;
> +
> +    /* just a paranoid sanity check against an endless loop */
> +    const ssize_t max_size =
> +        sizeof(QppSrlzHeader) +
> +        QPP_TABLE_PERSISTENCY_LIMIT * sizeof(QppEntryS);
> +
> +    v9fs_string_init(&name);
> +    for (i = 0; i * ATTR_MAX_SIZE < max_size; ++i) {
> +        v9fs_string_sprintf(&name, "user.virtfs.system.qidp.%d", i);
> +        if (lremovexattr(ctx->fs_root, name.data) < 0)
> +            break;
> +    }
> +    v9fs_string_free(&name);
> +}
> +
> +/* Used to convert qpp hash table into a contiguous stream. */
> +static void qpp_table_serialize(void *p, uint32_t h, void *up)
> +{
> +    const QppEntry *entry = (const QppEntry*) p;
> +    QppSerialize *ser = (QppSerialize*) up;
> +
> +    if (ser->error)
> +        return;
> +
> +    /* safety first */
> +    if (entry->qp_prefix - 1 >= ser->count) {
> +        ser->error = -1;
> +        return;
> +    }
> +
> +    ser->elements[entry->qp_prefix - 1] = (QppEntryS) {
> +        .dev = entry->dev,
> +        .ino_prefix = entry->ino_prefix
> +    };
> +    ser->done++;
> +}
> +
> +/*
> + * Tries to store the current qpp_table as extended attribute(s) on the
> + * exported file system root with the goal to preserve identical qids
> + * beyond the scope of reboots.
> + */
> +static void v9fs_store_qpp_table(V9fsState *s)
> +{
> +    FsContext *ctx = &s->ctx;
> +    V9fsString name;
> +    int i, res;
> +    size_t size;
> +    QppSrlzStream* stream;
> +    QppSerialize ser;
> +
> +    if (is_ro_export(ctx))
> +        return;
> +
> +    /*
> +     * Whenever we exceed a certain (arbitrary) high qpp_table size, we
> +     * delete the stored table from the file system to get rid of old device
> +     * ids / inodes that might no longer exist, with the goal of potentially
> +     * yielding a smaller table size after the next reboot.
> +     */
> +    if (!s->qp_prefix_next || s->qp_prefix_next >= QPP_TABLE_PERSISTENCY_LIMIT) {
> +        if (s->qp_prefix_next == QPP_TABLE_PERSISTENCY_LIMIT) {
> +            remove_qidp_xattr(ctx);
> +        }
> +        return;
> +    }
> +
> +    /* Convert qpp hash table into a contiguous array. */
> +    size = sizeof(QppSrlzHeader) +
> +           ((s->qp_prefix_next - 1) /* qpp_table entry count */ * sizeof(QppEntryS));
> +    stream = g_malloc0(size);
> +    ser = (QppSerialize) {
> +        .elements = &stream->elements[0],
> +        .count = s->qp_prefix_next - 1,
> +        .done  = 0,
> +        .error = 0,
> +    };
> +    qht_iter(&s->qpp_table, qpp_table_serialize, &ser);
> +    if (ser.error || ser.done != ser.count)
> +        goto out;
> +
> +    /* initialize header and calculate CRC32 checksum */
> +    stream->header = (QppSrlzHeader) {
> +        .version = 1,
> +        .reserved = 0,
> +        .crc32 = crc32c(
> +            0xffffffff,
> +            (const uint8_t*) &stream->elements[0],
> +            (ser.count * sizeof(QppEntryS))
> +        ),
> +    };
> +
> +    /*
> +     * Actually just required if the qpp_table size decreased, or if the
> +     * previous xattr size limit increased on OS (kernel/fs) level.
> +     */
> +    remove_qidp_xattr(ctx);
> +
> +    /*
> +     * Subdivide (if required) the data stream into individual xattrs
> +     * to cope with the system's max. supported xattr value size.
> +     */
> +    v9fs_string_init(&name);
> +    for (i = 0; size > (i * ATTR_MAX_SIZE); ++i) {
> +        v9fs_string_sprintf(&name, "user.virtfs.system.qidp.%d", i);
> +        res = lsetxattr(
> +            ctx->fs_root,
> +            name.data,
> +            ((const uint8_t*)stream) + i * ATTR_MAX_SIZE,
> +            MIN(ATTR_MAX_SIZE, size - i * ATTR_MAX_SIZE),
> +            0/*flags*/
> +        );
> +        if (res < 0) {
> +            if (i > 0)
> +                remove_qidp_xattr(ctx);
> +            break;
> +        }
> +    }
> +    v9fs_string_free(&name);
> +out:
> +    g_free(stream);
> +}
> +
> +/* Frees the entire chain of passed nodes from memory. */
> +static void destroy_xattr_nodes(XAttrNode **first)
> +{
> +    XAttrNode *prev;
> +    if (!first)
> +        return;
> +    while (*first) {
> +        if ((*first)->value)
> +            g_free((*first)->value);
> +        prev = *first;
> +        *first = (*first)->next;
> +        g_free(prev);
> +    }
> +}
> +
> +/*
> + * Loads all user.virtfs.system.qidp.* xattrs from exported fs root and
> + * returns a linked list with one node per xattr.
> + */
> +static XAttrNode* v9fs_load_qidp_xattr_nodes(V9fsState *s)
> +{
> +    FsContext *ctx = &s->ctx;
> +    XAttrNode *first = NULL, *current = NULL;
> +    V9fsString name;
> +    ssize_t size;
> +    int i;
> +
> +    const ssize_t max_size =
> +        sizeof(QppSrlzHeader) +
> +        QPP_TABLE_PERSISTENCY_LIMIT * sizeof(QppEntryS);
> +
> +    v9fs_string_init(&name);
> +
> +    for (i = 0; i * ATTR_MAX_SIZE < max_size; ++i) {
> +        v9fs_string_sprintf(&name, "user.virtfs.system.qidp.%d", i);
> +        size = lgetxattr(ctx->fs_root, name.data, NULL, 0);
> +        if (size <= 0)
> +            break;
> +        if (!first) {
> +            first = current = g_malloc0(sizeof(XAttrNode));
> +        } else {
> +            current = current->next = g_malloc0(sizeof(XAttrNode));
> +        }
> +        current->value = g_malloc0(size);
> +        current->length = lgetxattr(
> +            ctx->fs_root, name.data, current->value, size
> +        );
> +        if (current->length <= 0) {
> +            goto out_w_err;
> +        }
> +    }
> +    goto out;
> +
> +out_w_err:
> +    destroy_xattr_nodes(&first);
> +out:
> +    v9fs_string_free(&name);
> +    return first;
> +}
> +
> +/*
> + * Try to load a previously stored qpp_table from the file system. Calling
> + * this function assumes that qpp_table is still empty.
> + *
> + * @see v9fs_store_qpp_table()
> + */
> +static void v9fs_load_qpp_table(V9fsState *s)
> +{
> +    ssize_t size, count;
> +    XAttrNode *current, *first;
> +    QppSrlzStream* stream = NULL;
> +    uint32_t crc32;
> +    int i;
> +    QppEntry *val;
> +    uint32_t hash;
> +
> +    if (s->qp_prefix_next != 1)
> +        return;
> +
> +    first = v9fs_load_qidp_xattr_nodes(s);
> +    if (!first)
> +        return;
> +
> +    /* convert nodes into a contiguous stream */
> +    size = 0;
> +    for (current = first; current; current = current->next) {
> +        size += current->length;
> +    }
> +    if (size <= 0) {
> +        goto out;
> +    }
> +    stream = g_malloc0(size);
> +    size = 0;
> +    for (current = first; current; current = current->next) {
> +        memcpy(((uint8_t*)stream) + size, current->value, current->length);
> +        size += current->length;
> +    }
> +
> +    if (stream->header.version != 1) {
> +        goto out;
> +    }
> +
> +    count = (size - sizeof(QppSrlzHeader)) / sizeof(QppEntryS);
> +    if (count <= 0) {
> +        goto out;
> +    }
> +
> +    /* verify CRC32 checksum of stream */
> +    crc32 = crc32c(
> +        0xffffffff,
> +        (const uint8_t*) &stream->elements[0],
> +        (count * sizeof(QppEntryS))
> +    );
> +    if (crc32 != stream->header.crc32) {
> +        goto out;
> +    }
> +
> +    /* fill qpp_table with the retrieved elements */
> +    for (i = 0; i < count; ++i) {
> +        val = g_malloc0(sizeof(QppEntry));
> +        *val = (QppEntry) {
> +            .dev = stream->elements[i].dev,
> +            .ino_prefix = stream->elements[i].ino_prefix,
> +        };
> +        hash = qpp_hash(*val);
> +        if (qht_lookup(&s->qpp_table, val, hash)) {
> +            /* should never happen: duplicate entry detected */
> +            g_free(val);
> +            goto out;
> +        }
> +        val->qp_prefix = s->qp_prefix_next++;
> +        qht_insert(&s->qpp_table, val, hash, NULL);
> +    }
> +
> +out:
> +    destroy_xattr_nodes(&first);
> +    if (stream)
> +        g_free(stream);
> +}
> +
>  /* stat_to_qid needs to map inode number (64 bits) and device id (32 bits)
>   * to a unique QID path (64 bits). To avoid having to map and keep track
>   * of up to 2^64 objects, we map only the 16 highest bits of the inode plus
> @@ -675,6 +967,14 @@ static int qid_path_prefixmap(V9fsPDU *pdu, const struct stat *stbuf,
>          /* new unique inode prefix and device combo */
>          val->qp_prefix = pdu->s->qp_prefix_next++;
>          qht_insert(&pdu->s->qpp_table, val, hash, NULL);
> +
> +        /*
> +         * Store qpp_table as extended attribute(s) to file system.
> +         *
> +         * TODO: This should better only be called from a guest shutdown and
> +         * suspend handler.
> +         */
> +        v9fs_store_qpp_table(pdu->s);
>      }
>  
> +    *path = ((uint64_t)val->qp_prefix << 48) | (stbuf->st_ino & QPATH_INO_MASK);
> @@ -1064,11 +1364,6 @@ static void v9fs_fix_path(V9fsPath *dst, V9fsPath *src, int len)
>      v9fs_path_free(&str);
>  }
>  
> -static inline bool is_ro_export(FsContext *ctx)
> -{
> -    return ctx->export_flags & V9FS_RDONLY;
> -}
> -
>  static void coroutine_fn v9fs_version(void *opaque)
>  {
>      ssize_t err;
> @@ -3784,6 +4079,8 @@ int v9fs_device_realize_common(V9fsState *s, const V9fsTransport *t,
>      qht_init(&s->qpp_table, qpp_cmp_func, 1, QHT_MODE_AUTO_RESIZE);
>      s->qp_prefix_next = 1; /* reserve 0 to detect overflow */
>      s->qp_fullpath_next = 1;
> +    /* try to load and restore previous qpp_table */
> +    v9fs_load_qpp_table(s);
>  
>      s->ctx.fst = &fse->fst;
>      fsdev_throttle_init(s->ctx.fst);
> @@ -3807,6 +4104,14 @@ out:
>  
>  void v9fs_device_unrealize_common(V9fsState *s, Error **errp)
>  {
> +    /*
> +     * Store qpp_table as extended attribute(s) to file system.
> +     *
> +     * This was actually plan A, but unfortunately unrealize is not called
> +     * reliably on guest shutdowns and suspensions.
> +     */
> +    v9fs_store_qpp_table(s);
> +
>      if (s->ops->cleanup) {
>          s->ops->cleanup(&s->ctx);
>      }
> diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
> index 44112ea97f..54ce039969 100644
> --- a/hw/9pfs/9p.h
> +++ b/hw/9pfs/9p.h
> @@ -245,6 +245,13 @@ typedef struct {
>      uint16_t qp_prefix;
>  } QppEntry;
>  
> +/* Small version of QppEntry for serialization as xattr. */
> +struct QppEntryS {
> +    dev_t dev;
> +    uint16_t ino_prefix;
> +} __attribute__((packed));
> +typedef struct QppEntryS QppEntryS;
> +
>  /* QID path full entry, as above */
>  typedef struct {
>      dev_t dev;
> @@ -252,6 +259,32 @@ typedef struct {
>      uint64_t path;
>  } QpfEntry;
>  
> +typedef struct {
> +    QppEntryS *elements;
> +    uint count; /* In: QppEntryS count in @a elements */
> +    uint done; /* Out: how many QppEntryS did we actually fill in @a elements */
> +    int error; /* Out: zero on success */
> +} QppSerialize;
> +
> +struct QppSrlzHeader {
> +    uint16_t version;
> +    uint16_t reserved; /* might be used e.g. for flags in future */
> +    uint32_t crc32;
> +} __attribute__((packed));
> +typedef struct QppSrlzHeader QppSrlzHeader;
> +
> +struct QppSrlzStream {
> +    QppSrlzHeader header;
> +    QppEntryS elements[0];
> +} __attribute__((packed));
> +typedef struct QppSrlzStream QppSrlzStream;
> +
> +typedef struct XAttrNode {
> +    uint8_t* value;
> +    ssize_t length;
> +    struct XAttrNode* next;
> +} XAttrNode;
> +
>  struct V9fsState
>  {
>      QLIST_HEAD(, V9fsPDU) free_list;
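
One more note on the serialization format, for the record: the stream is a
packed 8-byte header followed by packed QppEntryS records, so with Linux's
64-bit dev_t each record takes 10 bytes, and even the full
QPP_TABLE_PERSISTENCY_LIMIT of 32768 entries comes to roughly 320 kB, i.e. a
handful of 64 kB xattr chunks. A compile-time sanity check of those layout
assumptions (a sketch, assuming C11 and an 8-byte dev_t as on Linux):

    #include <assert.h>

    /* Sketch only: verify the packed on-disk layout assumptions. */
    static_assert(sizeof(QppSrlzHeader) == 8,
                  "2 + 2 + 4 bytes, no padding due to attribute packed");
    static_assert(sizeof(QppEntryS) == sizeof(dev_t) + sizeof(uint16_t),
                  "records are stored back to back in the xattr stream");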



