From: Chris Mason
Subject: Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
Date: Mon, 11 Jul 2016 11:00:55 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.1.1

On 07/11/2016 10:41 AM, David Sterba wrote:
On Sat, Jul 02, 2016 at 09:18:07AM +0200, Pavel Raiskup wrote:
There are optimizations in archivers (tar, rsync, ...) that rely on up-to-date
st_blocks info.  For example, GNU tar has an optimization check [1] for
whether 'st_size' reports more data than 'st_blocks' can hold; if so, tar
considers the file sparse (and takes additional steps).

It looks like btrfs doesn't report the correct value in 'st_blocks' until the
data are synced.  At the moment, the following can happen:

    a) some "tool" creates sparse file
    b) that tool does not sync explicitly and exits ..
    c) tar is called immediately after that to archive the sparse file
    d) tar considers [2] the file completely sparse (because st_blocks is
       zero) and archives no data.  The result is data loss.
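
The check described above, and the zero-blocks case in step d), can be
sketched roughly as follows. This is only an illustration of the heuristic,
not GNU tar's actual code: the helper names looks_sparse() and
looks_fully_sparse() are made up, and BLOCK_UNIT assumes that st_blocks
counts 512-byte units, as it does on Linux.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Assumption: st_blocks is expressed in 512-byte units (true on Linux). */
#define BLOCK_UNIT 512

/* Hypothetical helper: true if st_size claims more data than st_blocks
 * can hold, i.e. the file appears to contain at least one hole. */
static bool looks_sparse(const struct stat *st)
{
    off_t needed = st->st_size / BLOCK_UNIT
                 + (st->st_size % BLOCK_UNIT != 0);
    return (off_t)st->st_blocks < needed;
}

/* Hypothetical helper: true if the file has a size but no blocks at all,
 * which an archiver may read as "nothing but holes, no data to store". */
static bool looks_fully_sparse(const struct stat *st)
{
    return st->st_size > 0 && st->st_blocks == 0;
}
```

If st_blocks is still zero for a file whose data has not been written back
yet, looks_fully_sparse() returns true even though the file holds real
data, which is the failure described in d).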

Because we fixed btrfs to report a non-zero 'st_blocks' when the file data is
small and inlined (no real data blocks), I consider this, too, a bug in
btrfs worth fixing.


Tested on kernel:

Originally reported here, reproducer available there:

The reproducer works for me here. So far I found:

* the btrfs implementation of stat.st_blocks (btrfs_getattr) includes
  the 'delayed allocated' bytes, so there is not a problem in principle

* calling fsync on the sparsefile will produce the expected result

* a short delay between ./binary and 'stat' also produces the correct
  result; 0.5 seconds worked for me, so IMO it shows there is a race
  between writing and reporting the data

* I'm not yet sure where the delay between write and synced
  'inode->delalloc_bytes' comes from

* I think that st_blocks accounting can be wrong anyway if the file is
  mmap-ed and not msync-ed; I'm writing a reproducer for this case
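
The fsync observation in the list above could be checked with something
like the following. It is a minimal sketch, not the reproducer from the
report: blocks_after_write() is a made-up helper that writes non-zero data
to an already open file descriptor, optionally calls fsync(), and returns
what fstat() then reports in st_blocks.

```c
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append `len` bytes of non-zero data to `fd`,
 * optionally fsync(), then return st_blocks as seen by fstat().
 * Returns -1 on any I/O error. */
static long long blocks_after_write(int fd, size_t len, int do_fsync)
{
    char buf[4096];
    struct stat st;

    memset(buf, 'x', sizeof buf);
    while (len > 0) {
        size_t chunk = len < sizeof buf ? len : sizeof buf;
        ssize_t n = write(fd, buf, chunk);
        if (n <= 0)
            return -1;
        len -= (size_t)n;
    }

    /* With do_fsync set, the data is forced out before we look. */
    if (do_fsync && fsync(fd) != 0)
        return -1;
    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_blocks;
}
```

Calling it once with do_fsync set to 0 and once with 1 on fresh files shows
whether the filesystem already accounts for the unwritten data before the
sync.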

On my test box running current linux git, things work fine if I run the reproducer once. But if I leave it running in a loop long enough for writeback to kick in, I trigger it.

The reproducer has a loop in there where it is adding delalloc writes and truncating them away. What is probably happening is that we're leaving some delalloc bits set past EOF, which makes us skip bumping inode->delalloc_bytes during the new write.

I can kind of confirm this by changing the reproducer to stat directly after the write call. Normally st_blocks is never zero. But if I leave it running in a loop for 30 seconds or so, I eventually get an st_blocks of zero directly after the write().
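
A rough sketch of that kind of loop, for anyone who wants to try it. This is not the original reproducer: count_zero_block_reports() is a made-up helper, and the write size and iteration count are whatever the caller picks. It writes, stats immediately, counts how often st_blocks came back as zero, and then truncates the data away again.

```c
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical loop, not the original reproducer: write `len` bytes at
 * offset 0, fstat() right away, then truncate the data away again.
 * Returns how many iterations saw st_blocks == 0 directly after the
 * write, or -1 on an I/O error. */
static int count_zero_block_reports(int fd, int iterations, size_t len)
{
    char buf[4096];
    int zero_reports = 0;

    memset(buf, 'x', sizeof buf);
    for (int i = 0; i < iterations; i++) {
        size_t left = len;
        struct stat st;

        if (lseek(fd, 0, SEEK_SET) < 0)
            return -1;
        while (left > 0) {
            size_t chunk = left < sizeof buf ? left : sizeof buf;
            ssize_t n = write(fd, buf, chunk);
            if (n <= 0)
                return -1;
            left -= (size_t)n;
        }

        /* Stat directly after the write, before any sync. */
        if (fstat(fd, &st) != 0)
            return -1;
        if (st.st_size > 0 && st.st_blocks == 0)
            zero_reports++;

        /* Throw the freshly written (delalloc) data away again. */
        if (ftruncate(fd, 0) != 0)
            return -1;
    }
    return zero_reports;
}
```

A non-zero return value means stat reported no blocks for a file that had just been written.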

If I change the C program to unlink the file on exit, running the binary over and over again works every time.

So, the real bug is that we're letting some delalloc state hang around after the truncate, probably related to IO in progress. We do already account for delalloc in what we return to stat, but there's a corner case involving truncate where we screw it up.

