small ascii files can be sparse

From: Martin Carroll
Subject: small ascii files can be sparse
Date: Fri, 27 Jul 2012 10:36:38 -0400
User-agent: Mozilla/5.0 (X11; Linux i686; rv:13.0) Gecko/20120717 Thunderbird/13.0

Hi all,

The recent "sparse files are now considered binary" patch
produces false positives with files stored on certain NFS servers.  The
patch, to refresh your memory, treats any file with holes as containing
binary data.  However, some NFS (version 3) servers report holes for
all files with fewer than 65 bytes.  As a result, grep misreports small
text files as binary, for example:

    % echo 'x' >foo ; grep x foo
    Binary file foo matches

More on NFS: Certain NFS V3 servers, when sent a GETATTR request
for a regular file of length less than 65 bytes, return a GETATTR
reply whose "size" field is the actual length of the file and "used"
field is *zero* bytes.  In other words, the server reports that small
files occupy no space on the server.    

Here is what the NFS V3 spec (https://tools.ietf.org/rfc/rfc1813.txt)
says for the "used" field (page 22):

    Used is the number of bytes of disk space that the file actually
    uses (which can be smaller than the size because the file may have
    holes or it may be larger due to fragmentation).

Now, we all know that a value of

    used < size

is intended for binary files containing lots of nuls; it was *not*
intended for small ascii files.  However, regardless of intention,
a "used" value of 0 for small ascii files is technically within spec,
as far as I can tell.  We might not like this behavior, and we might
consider it an NFS bug, but the fact is that there are NFS servers
out there that behave this way (rightly or wrongly), and grep needs
to interoperate with them.
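
For illustration, here is a minimal sketch of the kind of check that
goes wrong.  It assumes that an NFS client surfaces the "used" value
through stat(2) as st_blocks, counted in 512-byte units (true on Linux,
but an assumption elsewhere), and the function names are made up for
this example; it is not grep's actual code.

```python
import os

# Assumption: st_blocks is counted in 512-byte units, as on Linux.
BLOCK_UNIT = 512


def looks_sparse(size: int, blocks: int, block_unit: int = BLOCK_UNIT) -> bool:
    """Return True when the allocated space is smaller than the logical size."""
    return blocks * block_unit < size


def file_looks_sparse(path: str) -> bool:
    """Apply the same comparison to a real file via stat(2)."""
    st = os.stat(path)
    return looks_sparse(st.st_size, st.st_blocks)
```

With a server that reports used = 0, a 2-byte file gives
looks_sparse(2, 0), which is true, so the file appears to have holes
even though it contains only "x" and a newline.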

I am in the process of discussing this issue with the vendor of the
NFS servers to understand the precise conditions under which an NFS
server might report holes for ascii files.  That process might take
some time.  Meanwhile, the latest grep does not work on my system.
I propose that we fix this problem by adding a patch that ignores
holes in small files.  I am happy to send this patch, if you agree
that it is the right thing to do.
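
The logic I have in mind could be sketched roughly as follows.  This is
only an outline under assumptions: the 65-byte cutoff comes from what I
have observed so far and may change once the vendor replies, the
512-byte block unit is the Linux convention, and the names are
hypothetical rather than taken from the grep source.

```python
# Hypothetical cutoff, taken from the "fewer than 65 bytes" observation.
SMALL_FILE_LIMIT = 65


def has_trusted_holes(size: int, blocks: int,
                      block_unit: int = 512,
                      small_limit: int = SMALL_FILE_LIMIT) -> bool:
    """Report holes only when the file is large enough to trust the server."""
    if size < small_limit:
        # Small files: ignore the allocation count entirely.
        return False
    # Larger files: keep the usual "allocated space < logical size" test.
    return blocks * block_unit < size
```

A small file would then fall through to the normal content-based check
instead of being declared binary on the strength of its block count.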

Oh, and one more issue: On the same files that grep reports as binary,
file(1) reports as ascii:

    % echo 'x' >foo ; file foo
    foo: ASCII text

Now, a system in which one tool (grep) reports a file as binary and
another tool (file(1)) reports a file as ascii is inconsistent and leads
to issues (and bugs) when writing shell scripts. 

To make the system consistent, the "is it binary" logic should be factored
out and put into a separate package, which all tools use. 
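
As a rough sketch of what such a shared predicate might look like, the
example below treats data as binary when a leading sample contains a NUL
byte.  Both the criterion and the 32 KiB sample size are assumptions
chosen for illustration; the real tools use their own, more elaborate
rules, and the function names are hypothetical.

```python
def is_binary_data(sample: bytes) -> bool:
    """Illustrative criterion: any NUL byte marks the sample as binary."""
    return b"\0" in sample


def is_binary_file(path: str, sample_size: int = 32768) -> bool:
    """Classify a file by inspecting only its first sample_size bytes."""
    with open(path, "rb") as f:
        return is_binary_data(f.read(sample_size))
```

The point is less the specific test than having a single place where it
lives, so that every tool gives the same answer for the same file.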

The one objection I can see to doing such a refactoring of the upstream
packages is that some tools might really have different criteria for
when a file should be considered binary.  Offhand, however, I am unable
to think of a good example, where "good" means "good enough to justify
the pain that it would cause when writing shell scripts."  Note that
grep and file(1) are *not* a good example, because those two tools
ought to agree IMHO.

Martin Carroll
