qemu-devel

Re: [Qemu-devel] [PATCH 2/3] barriers: block-raw-posix barrier support


From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH 2/3] barriers: block-raw-posix barrier support
Date: Tue, 5 May 2009 13:33:11 +0100
User-agent: Mutt/1.5.13 (2006-08-11)

Christoph Hellwig wrote:
> Add support for write barriers to the posix raw file / block device code.
> The guts of this is in the aio emulation as that's where we handle our queue
> of outstanding requests.

It's nice to see this :-)

IDE and SCSI's cache flush commands should map to it nicely too.

> The highlevel design is the following:
> 
>  - As soon as a barrier request is submitted via qemu_paio_submit we
>    increment the barrier_inprogress count to signal we now have to
>    deal with barriers.
>  - From that point on every new request that is queued up by
>    qemu_paio_submit does not get onto the normal request list but a
>    secondary post-barrier queue
>
>  - Once the barrier request is dequeued by an aio_thread, that thread waits
>    for all other outstanding requests to finish, issues an fdatasync, then the
>    actual barrier request, then another fdatasync to prevent reordering in the
>    pagecache.

You don't need two fdatasyncs if the barrier request is just a
barrier with no data write, used only to flush previously written data
for a guest's fsync/fdatasync implementation.
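As a sketch of what I mean (a hypothetical helper, not the patch's actual code), the completion sequence only needs the trailing flush when the barrier actually carries data:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Hypothetical completion sequence for a dequeued barrier request.
 * A data-carrying barrier is fenced on both sides; an empty barrier
 * (e.g. backing a guest fdatasync) needs only the one flush. */
static int execute_barrier(int fd, const void *buf, size_t len, off_t off)
{
    if (fdatasync(fd) < 0)       /* commit everything written before us */
        return -1;
    if (len == 0)                /* pure flush: no second sync needed */
        return 0;
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return fdatasync(fd);        /* fence the barrier write itself */
}
```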

>    After the request is finished the barrier_inprogress counter is decremented,
>    the post-barrier list is spliced back onto the main request list up to and
>    including the next barrier request if there is one, and normal operation
>    is resumed.
> 
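For reference, the submit/complete bookkeeping you describe could be sketched along these lines (the names are illustrative, not the patch's actual code):

```c
#include <assert.h>
#include <stddef.h>

enum { REQ_NORMAL, REQ_BARRIER };

struct req {
    int type;
    struct req *next;
};

static struct req *request_list;      /* requests visible to aio threads */
static struct req *post_barrier_list; /* held back behind a barrier */
static int barrier_inprogress;

static void list_append(struct req **head, struct req *r)
{
    r->next = NULL;
    while (*head)
        head = &(*head)->next;
    *head = r;
}

/* Submit path: once a barrier is pending, later requests are diverted
 * to the post-barrier list so they cannot overtake it. */
static void submit(struct req *r)
{
    if (barrier_inprogress) {
        if (r->type == REQ_BARRIER)
            barrier_inprogress++;
        list_append(&post_barrier_list, r);
        return;
    }
    if (r->type == REQ_BARRIER)
        barrier_inprogress++;
    list_append(&request_list, r);
}

/* Completion path: after the barrier request finishes, splice held-back
 * requests onto the main list, up to and including the next barrier. */
static void barrier_done(void)
{
    struct req *r;

    barrier_inprogress--;
    while ((r = post_barrier_list) != NULL) {
        post_barrier_list = r->next;
        list_append(&request_list, r);
        if (r->type == REQ_BARRIER)
            break;
    }
}
```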
> That means barriers cause quite massive serialization of the I/O submission
> path, which unfortunately is not avoidable given their semantics.

This is the best argument yet for having distinct "barrier" and "sync"
operations.  "Barrier" is for ordering I/O, such as journalling
filesystems.

"Sync" is to be sent after guest fsync/fdatasync, to commit data
written so far to storage.  It waits for the data to be committed, and
also asks the data to be written sooner.

"Sync" operations don't need to serialise I/O as much: it's ok to
initiate later writes in parallel, and this is enough to keep the
storage busy when there's a steady stream of guest fsyncs.

Both together, "Barrier|Sync" would do what you've implemented: force
ordering, write data quickly, and wait until it's committed to hard
storage.

Although Linux doesn't separate these two concepts (yet), because of
I/O serialisation it might make a measurable difference to fsync-heavy
workloads for virtio to have two separate bits, one for each concept,
and then add the necessary tweaks to guest kernels to use only one or
both bits as needed.
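To make the distinction concrete, a split like the following is what I have in mind (the flag names are made up; no existing virtio bit is implied):

```c
#include <assert.h>

/* Hypothetical request flags; the names are illustrative only. */
#define REQ_FLAG_BARRIER (1u << 0)  /* order against earlier writes */
#define REQ_FLAG_SYNC    (1u << 1)  /* wait for commit to stable storage */

/* Only a barrier forces the submission path to serialise. */
static int must_serialise(unsigned flags)
{
    return (flags & REQ_FLAG_BARRIER) != 0;
}

/* A sync flushes, but later writes may still be issued in parallel. */
static int must_flush(unsigned flags)
{
    return (flags & REQ_FLAG_SYNC) != 0;
}
```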

-- Jamie



