From: Alex Bligh
Subject: Re: [Qemu-devel] Adding a persistent writeback cache to qemu
Date: Thu, 20 Jun 2013 15:25:09 +0100

Stefan,

--On 20 June 2013 11:46:18 +0200 Stefan Hajnoczi <address@hidden> wrote:

>> The concrete problem here is that flashcache/dm-cache/bcache don't
>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
>> cache access to block devices (in the host layer), and with rbd
>> (for instance) there is no access to a block device at all. block/rbd.c
>> simply calls librbd which calls librados etc.
>>
>> So the context switches etc. I am avoiding are the ones that would
>> be introduced by using kernel rbd devices rather than librbd.

> I understand the limitations with kernel block devices - their
> setup/teardown is an extra step outside QEMU and privileges need to be
> managed.  That basically means you need to use a management tool like
> libvirt to make it usable.

It's not just the management tool (we have one of those). Kernel
devices are a pain. As a trivial example, duplication of UUIDs, LVM IDs
etc. by hostile guests can cause issues.
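
To illustrate what the quoted paragraph means by "no access to a block
device at all": roughly speaking (a sketch only, with made-up pool and
image names and error handling omitted), the path block/rbd.c drives
amounts to something like the following, and at no point is there a
/dev/rbd* node for flashcache/dm-cache/bcache to attach to:

#include <rados/librados.h>
#include <rbd/librbd.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rbd_image_t image;
    char buf[4096];

    /* connect to the cluster using the default ceph.conf */
    rados_create(&cluster, NULL);
    rados_conf_read_file(cluster, NULL);
    rados_connect(cluster);

    /* open an image in a (hypothetical) "rbd" pool and read from it;
     * everything happens in userspace via librbd/librados */
    rados_ioctx_create(cluster, "rbd", &io);
    rbd_open(io, "test-image", &image, NULL);
    rbd_read(image, 0, sizeof(buf), buf);

    rbd_close(image);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}

Contrast that with the kernel rbd route, where the image is first mapped
to a block device and the existing block-level caches could sit there.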

> But I don't understand the performance angle here.  Do you have profiles
> that show kernel rbd is a bottleneck due to context switching?

I don't have test figures - perhaps this is just received wisdom, but I'd
understood that avoiding those context switches is why librbd was faster.

> We use the kernel page cache for -drive file=test.img,cache=writeback
> and no one has suggested reimplementing the page cache inside QEMU for
> better performance.

That's true, but I'd argue that is a little different because nothing
blocks on the page cache (it being in RAM). You don't get the situation
where the task sleeps awaiting data (from the page cache), the data
arrives, and the task then needs to be scheduled back in. I will admit
to a degree of handwaving here, as I hadn't realised that the claim that
qemu+rbd is more efficient than qemu+blockdevice+kernelrbd was
controversial.

> Also, how do you want to manage QEMU page cache with multiple guests
> running?  They are independent and know nothing about each other.  Their
> process memory consumption will be bloated and the kernel memory
> management will end up having to sort out who gets to stay in physical
> memory.

I don't think that one's an issue. Currently QEMU processes with
cache=writeback contend for physical memory via the page cache. I'm
not changing that bit. I'm proposing allocating SSD (rather than
RAM) for cache, so if anything that should reduce RAM use as it
will be quicker to flush the cache to 'disk' (the second layer
of caching). I was proposing allocating each task a fixed amount
of SSD space.

In terms of how this is done, one way would be to mmap a large
file on SSD, which would mean the page cache used would be
whatever page cache is used for the SSD. You've got more control
over this (with madvise etc) than you have with aio I think.
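
As a very rough sketch of the mechanism (not QEMU code; the path, size
and names are invented for illustration and error handling is omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_SIZE (1ULL << 30)   /* e.g. 1GB of SSD per guest */

int main(void)
{
    /* one fixed-size cache file per guest, living on the SSD */
    int fd = open("/ssd/guest0.cache", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, CACHE_SIZE);

    /* the mapping is backed by the SSD file, so the kernel's page
     * cache for the SSD does the in-RAM caching and the writeback */
    unsigned char *cache = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);

    /* hint the kernel about our access pattern */
    madvise(cache, CACHE_SIZE, MADV_WILLNEED);

    /* guest writes land in the mapping ... */
    memset(cache, 0, 4096);

    /* ... and a guest flush becomes an msync of the relevant range,
     * pushing the dirty pages out to the SSD */
    msync(cache, 4096, MS_SYNC);

    munmap(cache, CACHE_SIZE);
    close(fd);
    return 0;
}

The point being that the RAM side of the caching stays in the kernel's
hands, much as it does today with cache=writeback.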

> You can see I'm skeptical of this

Which is no bad thing!

> and think it's premature optimization,

... and I'm only too keen to avoid work if it brings no gain.

> but if there's really a case for it with performance profiles then I
> guess it would be necessary.  But we should definitely get feedback from
> the Ceph folks too.

The specific problem we are trying to solve (in case that's not
obvious) is the non-locality of data read/written by Ceph. Whilst
you can use placement to localise data to the rack level, even if
one of your OSDs is in the same machine, you end up waiting on
network traffic. That is apparently hard to solve inside Ceph.

However, this would be applicable to sheepdog, gluster, nfs,
the internal iscsi initiator, etc. etc. rather than just to Ceph.

I'm also keen to hear from the Ceph guys: if they have a way of
keeping lots of reads and writes inside the box rather than crossing
the network, I'd be only too keen to use it.

--
Alex Bligh


