
Re: [Qemu-devel] [PATCH] bdrv_aio_flush


From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH] bdrv_aio_flush
Date: Tue, 2 Sep 2008 19:01:50 +0100
User-agent: Mutt/1.5.13 (2006-08-11)

Ian Jackson wrote:
> > The Open Group text for aio_fsync says: "shall asynchronously force
> > all I/O operations [...]  queued at the time of the call to aio_fsync
> > [...]".
> 
> We discussed the meaning of the unix specs before.  I did a close
> textual analysis of it here:
> 
>   http://lists.nongnu.org/archive/html/qemu-devel/2008-04/msg00046.html
> 
> > Since then, I've read the Open Group specifications more closely, and
> > some other OS man pages, and they are consistent that _writes_ always
> > occur in the order they are submitted to aio_write.
> 
> Can you give me a chapter and verse quote for that ?  I'm looking at
> SuSv3 eg
>   http://www.opengroup.org/onlinepubs/009695399/nfindex.html
> 
> All I can see is this:
>   If O_APPEND is set for the file descriptor, write operations append
>   to the file in the same order as the calls were made.

Ack, you are right (again).  The SuS language is unclear, and ordering
cannot be assumed.  I'd forgotten about the O_APPEND constraint.  My
brain was addled from reading more things (below).

> If things are as you say and aio writes are always executed in the
> order queued, there would be no need to specify that because it would
> be implied by the behaviour of write(2) and O_APPEND.

You're right.  That's a subtlety I missed.

The following document on sun.com says:

    "I/O Issues (Multi-threaded programming guide)"
    [http://docs.sun.com/app/docs/doc/816-5137/gen-99376?a=view]

    "In most situations, asynchronous I/O is not required because its
    effects can be achieved with the use of threads, with each thread
    execution of synchronous I/O. However, in a few situations, threads
    cannot achieve what asynchronous I/O can.

    "The most straightforward example is writing to a tape drive to make
    the tape drive stream. Streaming prevents the tape drive from
    stopping while the drive is being written to. The tape moves forward
    at high speed while supplying a constant stream of data that is
    written to tape.

    "To support streaming, the tape driver in the kernel should use
    threads. The tape driver in the kernel must issue a queued write
    request when the tape driver responds to an interrupt. The interrupt
    indicates that the previous tape-write operation has completed.

    "Threads cannot guarantee that asynchronous writes are ordered
    because the order in which threads execute is indeterminate. You
    cannot, for example, specify the order of a write to a tape."

Yet I didn't find anything in the Solaris man pages which implies
anything different from the SuSv2/3 description, so I don't see how
the above strategy can work.  It does sound like a useful strategy
when streaming data to any device.

Quite a lot of things use aio_read and aio_write with sockets,
although I don't see anything in SuS which permits that; it's
notable, though, that ESPIPE is not listed as an error.  Socket AIO
must preserve request ordering for both aio_read and aio_write (on
stream sockets), otherwise it's limited to one request at a time.

So I'm back to agreeing with you that nothing in SuS that I've seen
says you can rely on the order of requests submitted by AIO (nor can
you expect AIO on sockets to work).

Ugh.

It would be interesting to see what, say, Oracle does with aio_fsync.

> > What _can_ be queued is a WRITE FUA command: meaning write some data
> > and flush _this_ data to non-volatile storage.
> 
> In order to implement that interface without flushing other data
> unnecessarily, we need to be able to
> 
>    submit other IO requests
>    submit aio_write request for WRITE FUA
>    asynchronously await completion of the aio_write for WRITE FUA
>    submit and perhaps collect completion of other IO requests
>    collect completion of aio_write for WRITE FUA
>    submit and perhaps collect completion of other IO requests
>    submit aio_fsync (for WRITE FUA)
>    submit and perhaps collect completion of other IO requests
>    collect aio_fsync (for WRITE FUA)
> 
> This is still not perfect because we unnecessarily flush some data,
> thus delaying reporting completion of the WRITE FUA.  But there is
> at least no need to wait for _other_ writes to complete.
> 
> So the semantics of bdrv_aio_flush should be `flush (at least) writes
> which have already completed'.

That's an interesting reason for bdrv_aio_flush() to have that
specification.

It would also make fsync() a correct implementation of
bdrv_aio_flush() - aio_fsync() would not be required, though it
would be better.

Btw, on Linux you can use sync_file_range() to flush specific regions
of a file.  It's messy though, and the documentation doesn't really
explain how to use it properly or what it does.  There is no AIO
equivalent to it.  And it doesn't invoke host disk barriers.

-- Jamie



