
bug#19242: latest grep considers text files as binary

From: Paul Eggert
Subject: bug#19242: latest grep considers text files as binary
Date: Sun, 22 Mar 2015 17:42:25 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

Thomas Wolff wrote:
Hi Paul and Jim,

Thanks for your previous quick responses on this matter and excuse my very late
additional statement.

However, the arguments are not convincing.
The new behavior violates the principle of least astonishment which is well
established in software design.

That cuts both ways. Older versions of grep could dump core when given improperly encoded text, which is even more astonishing. The new version is an improvement in that particular area. It is not clear how grep could be modified to avoid the core dumps while still preserving the old behavior in question.

It is not convincing that a text file is not considered a text file for a few
bytes that are not properly encoded in the current locale. Also the quoted POSIX
clause does not support that claim.

Not by itself, but from the chain of definitions it's clear that a text file must contain properly encoded text. The quoted POSIX clause (3.397) says that a text file contains "characters", and an earlier clause (3.87) defines "character" to be "A sequence of one or more bytes representing a single graphic symbol or control code. Note: This term corresponds to the ISO C standard term multi-byte character".


Because encoding errors are not characters, they are not text.
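To make the chain of definitions concrete, here is a minimal sketch in Python of the "every byte sequence must decode to a character" test. It is only an illustration, not grep's actual implementation: the helper name is_properly_encoded is hypothetical, the locale's encoding is assumed to be UTF-8, and the other POSIX requirements on text files (no NUL bytes, bounded line length) are deliberately ignored.

```python
def is_properly_encoded(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if every byte sequence in data decodes to a character.

    Hypothetical helper for illustration only; it checks encoding validity
    and nothing else (no NUL check, no line-length check).
    """
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        # At least one byte sequence is an encoding error, so by the
        # definitions quoted above it is not a character.
        return False
    return True


utf8_line = "caf\u00e9\n".encode("utf-8")      # e-acute as two bytes: C3 A9
latin1_line = "caf\u00e9\n".encode("latin-1")  # e-acute as one byte: E9

utf8_ok = is_properly_encoded(utf8_line)
latin1_ok = is_properly_encoded(latin1_line)
```

Under the UTF-8 assumption, the first line consists entirely of characters, while the lone E9 byte in the second is an encoding error. That is the "few bytes" situation under discussion: the same file would be fine in a Latin-1 locale.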

And, considering the "pipe security" argument, shall all classic Unix tools now
get additional options -a, so that something like
     grep 'bla' | sed -e 'expr' | tr '' '' | grep -v 'argl'
would in future look like
     grep -a 'bla' | sed -a -e 'expr' | tr -a '' '' | grep -a -v 'argl'

It shouldn't be needed for tr, as tr's input is not required to be a text file.

GNU sed doesn't worry about whether files are text or binary. I expect this is because the problem of spitting out random binary data tends to be less of an issue for 'sed' in practice. However, portable scripts should not assume that 'sed' will work on arbitrary binary data.

What about backwards compatibility of scripts then?
This is breaking decades of Unix tradition of modular tools for the mere
dogmatics of some peculiar and strict locale theory.

UTF-8 does tend to have that effect, yes. From the traditional Unix point of view, patterns like 'a.b' are "broken" with modern grep in UTF-8 locales, since the "." no longer matches only single bytes. This has been true for decades, not just for 'grep' but also for 'sed' etc. These days, though, users tend to be more interested in dealing with multibyte characters than in insisting on circa-1977 semantics in all cases.
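As a rough analogy for the 'a.b' point, Python's re module can match the same pattern against characters or against raw bytes. This is only a sketch of the difference between the two semantics; it uses Python's regex engine rather than grep's, and the sample string is an assumption chosen for illustration.

```python
import re

text = "a\u00e9b"            # 'a', e-acute, 'b': three characters
raw = text.encode("utf-8")   # b'a\xc3\xa9b': four bytes

# Character semantics: '.' matches the single multibyte character.
char_match = re.fullmatch(r"a.b", text) is not None

# Byte semantics: '.' matches exactly one byte, so 'a.b' cannot span
# the two-byte encoding of e-acute.
byte_match = re.fullmatch(rb"a.b", raw) is not None

# With byte semantics the pattern needs two dots to cover the character.
two_byte_match = re.fullmatch(rb"a..b", raw) is not None
```

Here the character-oriented match succeeds while the byte-oriented one fails unless the pattern is written with two dots, which is the sense in which a script relying on single-byte "." is already "broken" in a UTF-8 locale.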

If you insist on this priority of locale strategy over Unix tradition,
please offer at least a compatibility option that does not break scripts,
i.e. an environment setting that enforces compatible behaviour (like other tools
have, e.g. LS_COLORS etc).

Instead of an environment variable I suggest using a script.  Please see:


As a last remark, I wonder why my report does not show up in
and apparently I cannot submit anything there myself. Please get the issue
documented there.

I unarchived that bug report and am quoting the entire new part of your message, which should do the trick.

Kind regards,
