[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
From: |
Rusty Russell |
Subject: |
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH |
Date: |
Thu, 6 May 2010 15:35:19 +0930 |
User-agent: |
KMail/1.13.2 (Linux/2.6.32-21-generic; KDE/4.4.2; i686; ; ) |
On Wed, 5 May 2010 03:33:43 pm Neil Brown wrote:
> On Wed, 5 May 2010 14:28:41 +0930
> Rusty Russell <address@hidden> wrote:
>
> > On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > > Jens Axboe wrote:
> > > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing
> > > > > the
> > > > > usual I/O suspects...
> > > >
> > > > It would be nice to have a more fuller API for this, but the reality is
> > > > that only the flush approach is really workable. Even just strict
> > > > ordering of requests could only be supported on SCSI, and even there the
> > > > kernel still lacks proper guarantees on error handling to prevent
> > > > reordering there.
> > >
> > > There's a few I/O scheduling differences that might be useful:
> > >
> > > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > > before a BARRIER. That might be useful for time-critical WRITEs,
> > > and those issued by high I/O priority.
> >
> > This is only because noone actually wants flushes or barriers, though
> > I/O people seem to only offer that. We really want "<these writes> must
> > occur before <this write>". That offers maximum choice to the I/O subsystem
> > and potentially to smart (virtual?) disks.
> >
> > > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > > only for data belonging to a particular file (e.g. fdatasync with
> > > no file size change, even on btrfs if O_DIRECT was used for the
> > > writes being committed). That would entail tagging FLUSHes and
> > > WRITEs with a fs-specific identifier (such as inode number), opaque
> > > to the scheduler which only checks equality.
> >
> > This is closer. In userspace I'd be happy with a "all prior writes to this
> > struct file before all future writes". Even if the original guarantees were
> > stronger (ie. inode basis). We currently implement transactions using 4
> > fsync
> > /msync pairs.
> >
> > write_recovery_data(fd);
> > fsync(fd);
> > msync(mmap);
> > write_recovery_header(fd);
> > fsync(fd);
> > msync(mmap);
> > overwrite_with_new_data(fd);
> > fsync(fd);
> > msync(mmap);
> > remove_recovery_header(fd);
> > fsync(fd);
> > msync(mmap);
>
> Seems over-zealous.
> If the recovery_header held a strong checksum of the recovery_data you would
> not need the first fsync, and as long as you have two places to write recovery
> data, you don't need the 3rd and 4th syncs.
> Just:
> write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> fsync / msync
> overwrite_with_new_data()
>
> To recovery you choose the most recent log_space and replay the content.
> That may be a redundant operation, but that is no loss.
I think you missed a checksum for the new data? Otherwise we can't tell if
the new data is completely written. But yes, I will steal this scheme for
TDB2, thanks!
In practice, it's the first sync which is glacial, the rest are pretty cheap.
> Also cannot see the point of msync if you have already performed an fsync,
> and if there is a point, I would expect you to call msync before
> fsync... Maybe there is some subtlety there that I am not aware of.
I assume it's this from the msync man page:
msync() flushes changes made to the in-core copy of a file that was
mapped into memory using mmap(2) back to disk. Without use of this
call there is no guarantee that changes are written back before mun‐
map(2) is called.
> > It's an implementation detail; barrier has less flexibility because it has
> > less information about what is required. I'm saying I want to give you as
> > much information as I can, even if you don't use it yet.
>
> Only we know that approach doesn't work.
> People will learn that they don't need to give the extra information to still
> achieve the same result - just like they did with ext3 and fsync.
> Then when we improve the implementation to only provide the guarantees that
> you asked for, people will complain that they are getting empty files that
> they didn't expect.
I think that's an oversimplification: IIUC that occurred to people *not*
using fsync(). They weren't using it because it was too slow. Providing
a primitive which is as fast or faster and more specific doesn't have the
same magnitude of social issues.
And we can't write userspace interfaces for idiots only.
> The abstraction I would like to see is a simple 'barrier' that contains no
> data and has a filesystem-wide effect.
I think you lack ambition ;)
Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
suboptimal for md? Since you have to hand your barrier to every device
whereas a file-wide primitive may theoretically only go to some.
Cheers,
Rusty.
[Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH, Christoph Hellwig, 2010/05/04
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH, Jamie Lokier, 2010/05/04