bug#9500: [PATCH]: use posix_fallocate where supported

From: Goswin von Brederlow
Subject: bug#9500: [PATCH]: use posix_fallocate where supported
Date: Wed, 30 Nov 2011 18:54:15 +0100
User-agent: Gnus/5.110009 (No Gnus v0.9) XEmacs/21.4.22 (linux, no MULE)

Paul Eggert <address@hidden> writes:

> My read of the situation is that the filesystem guys have
> spent a lot of time optimizing ordinary write but they
> haven't gotten around to optimizing fallocate because it's
> so rarely used -- which means that if one uses fallocate
> one gets lousy performance.
> It's a chicken and egg problem.
> If coreutils started using fallocate now, one can be pretty
> sure they'd tune their filesystems over the next few years,
> to make fallocate compatible with delayed-write optimizations.
> On the other hand if nobody uses fallocate, there will be little
> incentive on their part to make it go fast.
> It's a question of whether we want to inflict temporary pain
> on users for a long-term benefit (early warning of file system
> full, which is something I'd dearly love to have).

I totally agree.

I also don't buy Dave's analysis that fallocate() will hurt the

Sure, it will place blocks on the disk disjunct from the write
pattern. So if you have a lot of files being written in parallel the
fallocate() will make the disk seek more. But the data for each file
will end up sequentially on disk. Without fallocate() it will be laid
out in order of the write pattern, i.e. 4MB of this file, 4MB of that
file, 4MB of the next file, 4MB of the first file and so on. That
leaves lots of fragments, each of whatever size the system's cache and
delayed allocation allowed.

So fallocate() might hurt write speed with many parallel writes but it
will keep the fragmentation down and speed up future reads. A one-time
penalty for a recurring advantage in the future.
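
To make the layout argument concrete, here is a toy model in Python. It
is only a sketch: the disk is a flat list of chunk slots,
write_order_layout() assumes chunks from parallel writers land strictly
round-robin, and preallocated_layout() assumes every file gets one
contiguous extent. All function names are made up for illustration, and
no real filesystem allocator is this simple.

```python
def write_order_layout(num_files, chunks_per_file):
    """Toy layout without preallocation: chunks are placed on disk in the
    order they are flushed, assumed here to be round-robin across files."""
    return [f for _ in range(chunks_per_file) for f in range(num_files)]


def preallocated_layout(num_files, chunks_per_file):
    """Toy layout with preallocation: each file is assumed to get one
    contiguous run of chunks reserved up front."""
    return [f for f in range(num_files) for _ in range(chunks_per_file)]


def count_fragments(layout, file_id):
    """Count the contiguous runs of chunks owned by file_id."""
    runs = 0
    prev = None
    for owner in layout:
        if owner == file_id and prev != file_id:
            runs += 1
        prev = owner
    return runs
```

With 3 files of 4 chunks each, file 0 ends up in 4 fragments under
write_order_layout() and in 1 fragment under preallocated_layout().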

As for filesystems aligning all fallocate() chunks and creating
fragmentation in their free space: Too bad for them. FIX THE FILESYSTEM.
If I tell the FS that I'm only going to write 32k then it should not
force alignment to a 1MB chunk of free space. Instead it should find
some nice little 32k fragment of free space left over somewhere else.
Put all the 32k files together to fill up the 1MB stripe of a RAID.

As for fallocate() causing more IOPS I don't buy that either. Either the
data is too big for the cache so that it is forced out with fallocate()
or delayed allocation or it is so small that it remains in cache in both
cases and the elevator code should write it out sequentially. I mean we
are not talking about opening 1000 files, fallocate()ing them to 100GB
each and then writing 4k chunks to each in a round-robin way. Cp is
writing ONE sequential stream as fast as possible.

As for his assertion that three major Linux filesystems (XFS, BTRFS and
ext4) don't need fallocate() because they use delayed allocation: that
is plainly not true for large files.

As a compromise, cp could start by using fallocate() only for largish
files, say 16MB and above. Anything smaller can probably be handled by
delayed allocation, or goes out so fast it stays in one chunk anyway.
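
A rough sketch of that compromise, in Python rather than in cp's C, and
not taken from any coreutils patch. os.posix_fallocate() is Python's
wrapper around the POSIX call; copy_with_prealloc(),
should_preallocate() and PREALLOC_THRESHOLD are hypothetical names, and
the 16MB default is just the figure floated above. It assumes ENOSPC
should be reported to the caller (that is the early warning) while any
other preallocation failure is ignored and the copy proceeds normally.

```python
import errno
import os
import shutil

PREALLOC_THRESHOLD = 16 * 1024 * 1024  # hypothetical cutoff: 16MB


def should_preallocate(size, threshold=PREALLOC_THRESHOLD):
    """Only largish, non-empty files are worth preallocating."""
    return size > 0 and size >= threshold


def copy_with_prealloc(src, dst, threshold=PREALLOC_THRESHOLD):
    """Copy src to dst, reserving space up front for large files."""
    size = os.stat(src).st_size
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        if should_preallocate(size, threshold) and hasattr(os, "posix_fallocate"):
            try:
                os.posix_fallocate(fout.fileno(), 0, size)
            except OSError as e:
                if e.errno == errno.ENOSPC:
                    raise  # early warning: the file system is full
                # Assumption: any other failure just means no preallocation.
        shutil.copyfileobj(fin, fout)
```

Small files skip the call entirely and are left to delayed allocation.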

