
bug#30326: grep not searching through a text file (thinking it binary)


From: Eric Blake
Subject: bug#30326: grep not searching through a text file (thinking it binary)
Date: Fri, 2 Feb 2018 13:55:00 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.2

tag 30326 notabug
thanks

On 02/02/2018 01:30 PM, L. A. Walsh wrote:
> I've used grep to search through my mbox-format emails for decades, but
> I've run into a case where it seems to be ignoring a text mailbox
> because, I guess, it thinks it is "binary"

Yes, that's correct.

> If I used "-Par" it finds it.

Yes, that's also correct.

> 
> It seems that grep believes the file to be binary and ignores it, though
> "file" calls it "text".

The file is conditionally text.  The POSIX definition of a text file is
one whose lines consist of valid characters in the current locale - but
note this definition is locale-dependent!  So a file that is text under
one locale may be binary under another.  When you grep a file that is
encoded correctly for the current locale, you get the output you want.
When you grep a file that contains encoding errors for the current
locale, POSIX says the behavior is undefined, so GNU grep warns you that
the file is binary (in the current locale); your use of -a tells grep to
process it as text anyway.  Since 'file' reported that your file uses
non-ISO extended-ASCII, it was probably encoded for an 8-bit single-byte
locale.  My guess is that you were running grep under a UTF-8 locale,
where a lone byte with the high bit set is generally not a valid
multibyte sequence and therefore counts as an encoding error.  Hence the
warning that your file is binary under the current locale.
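
As a rough illustration of the above, here is a minimal sketch that
assumes GNU grep and a POSIX shell.  The file name and its contents are
made up for the example: one line holding the byte 0xE9 ("é" in
ISO-8859-1), which is not valid UTF-8 on its own.  Under a UTF-8 locale
a plain 'grep needle' on such a file would be expected to report a
binary file rather than print the line, while -a searches it as text:

```shell
# Hypothetical sample file: a single line with a lone 0xE9 byte.
sample_dir=$(mktemp -d)
printf 'caf\351 needle\n' > "$sample_dir/latin1.mbox"

# -a (--text) tells grep to treat the file as text even when it
# contains encoding errors for the current locale.
count_with_a=$(grep -a -c 'needle' "$sample_dir/latin1.mbox")
echo "matching lines with -a: $count_with_a"

rm -rf "$sample_dir"
```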

You can also use 'LC_ALL=C grep' to force a locale where EVERY byte is a
valid character, and thus where you will never encounter encoding errors
(you may encounter OTHER things that make your file binary, such as
embedded NULs, but that's a different matter).
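
A small sketch of both points, again assuming GNU grep and using
made-up file names: in the C locale the high byte is no longer an
encoding error, but an embedded NUL still makes the file binary unless
-a is given.

```shell
# Hypothetical sample files for illustration only.
demo_dir=$(mktemp -d)
printf 'caf\351 needle\n' > "$demo_dir/latin1.txt"   # lone 0xE9 byte
printf 'needle\000tail\n' > "$demo_dir/nul.txt"      # embedded NUL

# In the C locale every byte is a valid character, so the 0xE9 byte
# does not count as an encoding error.
c_locale_count=$(LC_ALL=C grep -c 'needle' "$demo_dir/latin1.txt")
echo "C locale matches: $c_locale_count"

# An embedded NUL is a separate reason for grep to call a file binary,
# even under LC_ALL=C; the matching line itself is not printed here.
LC_ALL=C grep 'needle' "$demo_dir/nul.txt"

# With -a the line is printed; tr strips the NUL so the shell can
# store the result in a variable.
nul_line=$(LC_ALL=C grep -a 'needle' "$demo_dir/nul.txt" | tr -d '\000')
echo "with -a: $nul_line"

rm -rf "$demo_dir"
```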

This behavior is documented and intentional, so I'm closing this as not
a bug in the tracker.  However, feel free to add further comments or
questions to the thread.

And perhaps we could tweak the grep diagnostics to clarify whether a
file is binary because NUL bytes were encountered, vs. a file is binary
because encoding errors were encountered.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
