Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool

From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Date: Fri, 12 Dec 2008 11:25:55 -0600
User-agent: Thunderbird (X11/20080925)

Andrea Arcangeli wrote:
On Fri, Dec 12, 2008 at 10:49:45AM -0600, Anthony Liguori wrote:
I meant, if you wanted to pass a file descriptor as a raw device.  So:

qemu -hda raw:fd=4

Or something like that.  We don't support this today.

ah ok.

I think bouncing the iov and just using pread/pwrite may be our best bet. It means memory allocation but we can cap it. Since we're using threads,

It's already capped. However, currently it generates an iovec, but
we simply have to check whether iovcnt is 1; if it is, we pread from
iov.iov_base, iov.iov_len. The dma api will take care to enforce
iovcnt to be 1 for the iovec if preadv/pwritev isn't detected at
compile time.

Hrm, that's more complex than I was expecting. I was thinking the bdrv aio infrastructure would always take an iovec. Any details about the underlying host's ability to handle the iovec would be insulated.

we can just force a thread to sleep until memory becomes available so it's actually pretty straightforward.

There's no way to detect that and wait for memory,

If we artificially cap at say 50MB, then you do something like:

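while (buffer == NULL) {
  buffer = try_to_bounce(offset, iov, iovcnt, &size);
  if (buffer == NULL && errno == ENOMEM) {
     /* more_memory and bounce_lock are placeholder names; the caller holds bounce_lock */
     pthread_cond_wait(&more_memory, &bounce_lock);
  }
}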

try_to_bounce allocs with malloc() but if you exceed 50MB, then you fail with an error of ENOMEM. In your bounce_free() function, you do a pthread_cond_broadcast() to wake up any threads potentially waiting to allocate memory.
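
A minimal sketch of the capped allocator described above, assuming a single global budget. The names bounce_try_alloc, bounce_alloc, bounce_lock, bounce_cond and BOUNCE_CAP are made up for illustration and are not existing QEMU symbols. The blocking variant checks the budget while holding the mutex, so a wakeup from bounce_free() cannot slip in between the failed attempt and the wait. The cap only limits what this allocator hands out; malloc() itself can still fail or overcommit.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define BOUNCE_CAP ((size_t)50 * 1024 * 1024)   /* the 50MB example cap */

static pthread_mutex_t bounce_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t bounce_cond = PTHREAD_COND_INITIALIZER;
static size_t bounce_used;                      /* bytes currently handed out */

/* Non-blocking: NULL with errno = ENOMEM if the cap would be exceeded. */
static void *bounce_try_alloc(size_t size)
{
    void *buf = NULL;

    pthread_mutex_lock(&bounce_lock);
    if (size > 0 && size <= BOUNCE_CAP - bounce_used) {
        buf = malloc(size);
        if (buf != NULL)
            bounce_used += size;
    }
    pthread_mutex_unlock(&bounce_lock);

    if (buf == NULL)
        errno = ENOMEM;
    return buf;
}

/* Blocking: sleeps until enough budget has been released. */
static void *bounce_alloc(size_t size)
{
    void *buf;

    if (size == 0 || size > BOUNCE_CAP) {       /* could never be satisfied */
        errno = EINVAL;
        return NULL;
    }

    pthread_mutex_lock(&bounce_lock);
    while (size > BOUNCE_CAP - bounce_used)
        pthread_cond_wait(&bounce_cond, &bounce_lock);
    buf = malloc(size);
    if (buf != NULL)
        bounce_used += size;
    pthread_mutex_unlock(&bounce_lock);
    return buf;
}

/* Return the buffer and wake every thread waiting for budget. */
static void bounce_free(void *buf, size_t size)
{
    free(buf);
    pthread_mutex_lock(&bounce_lock);
    assert(bounce_used >= size);
    bounce_used -= size;
    pthread_cond_broadcast(&bounce_cond);
    pthread_mutex_unlock(&bounce_lock);
}
```

The broadcast (rather than a signal) matters because waiters may want different sizes; each one rechecks the budget after waking.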

This lets us expose a preadv/pwritev function that actually works. The expectation is that bouncing will outperform just doing pread/pwrite of each vector. Of course, you could get smart and, if try_to_bounce fails, fall back to pread/pwrite of each vector. Likewise, you can fast-path the case of a single iovec to avoid bouncing entirely.
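
Roughly, the read side of that could look like the sketch below. emulated_preadv is a hypothetical name, and a plain malloc() stands in for the capped bounce allocation to keep the example short. It takes the single-iovec fast path, bounces through one linear buffer otherwise, and falls back to one pread per vector if the bounce buffer can't be allocated.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: emulate preadv() on hosts that only have pread(). */
static ssize_t emulated_preadv(int fd, const struct iovec *iov, int iovcnt,
                               off_t offset)
{
    size_t total = 0;
    ssize_t ret;
    char *bounce;
    int i;

    if (iovcnt <= 0) {
        errno = EINVAL;
        return -1;
    }

    /* Fast path: a single vector needs no bouncing at all. */
    if (iovcnt == 1)
        return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);

    for (i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;

    bounce = malloc(total);    /* stand-in for the capped bounce allocation */
    if (bounce == NULL) {
        /* Fallback: one pread per vector, stopping at the first short read. */
        ssize_t done = 0;
        for (i = 0; i < iovcnt; i++) {
            ret = pread(fd, iov[i].iov_base, iov[i].iov_len, offset + done);
            if (ret < 0)
                return done ? done : -1;
            done += ret;
            if ((size_t)ret < iov[i].iov_len)
                break;
        }
        return done;
    }

    /* One linear read, then scatter the bytes into the caller's vectors. */
    ret = pread(fd, bounce, total, offset);
    if (ret > 0) {
        size_t left = (size_t)ret;
        const char *p = bounce;
        for (i = 0; i < iovcnt && left > 0; i++) {
            size_t n = iov[i].iov_len < left ? iov[i].iov_len : left;
            memcpy(iov[i].iov_base, p, n);
            p += n;
            left -= n;
        }
        assert(left == 0);     /* ret can never exceed the sum of iov_len */
    }
    free(bounce);
    return ret;
}
```

The write side would mirror this: gather the vectors into the bounce buffer first, then issue a single pwrite.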


Anthony Liguori

 it'd sigkill before
you can check... at least with the default overcommit. The way the dma
api works is that it doesn't send a mega large writev, but sends it in
pieces capped by the max buffer size, with many iovecs with iovcnt = 1.
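
To illustrate the "pieces capped by the max buffer size" behaviour, here is a rough sketch, not the actual dma api code. write_in_pieces and max_piece are hypothetical names. Each piece is described by a single iovec (iovcnt = 1), so a plain pwrite is enough and no preadv/pwritev support is needed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: write a linear buffer in pieces of at most max_piece. */
static ssize_t write_in_pieces(int fd, const char *buf, size_t len,
                               off_t offset, size_t max_piece)
{
    size_t done = 0;

    assert(buf != NULL || len == 0);

    while (done < len) {
        size_t n = len - done < max_piece ? len - done : max_piece;
        /* One iovec per piece, i.e. iovcnt = 1. */
        struct iovec iov = { (void *)(buf + done), n };
        ssize_t ret = pwrite(fd, iov.iov_base, iov.iov_len, offset + done);

        if (ret < 0)
            return done ? (ssize_t)done : -1;
        if (ret == 0)
            break;             /* no progress, give up instead of spinning */
        done += (size_t)ret;
    }
    return (ssize_t)done;
}
```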

We can use libaio on older Linux kernels to simulate preadv/pwritev. Use the proper syscalls on newer kernels and on BSDs, and bounce everything else.

Given READV/WRITEV aren't available in anything but very recent kernels, and
given that without O_DIRECT each iocb will become synchronous, we
can't use libaio. Also, once they fix linux-aio, if we do that, the
iocb logic would need to be largely refactored. So I'm not sure it's
worth it, as it can't handle 2.6.16-18 when O_DIRECT is disabled (when
O_DIRECT is enabled we could just build an array of linear iocbs).
