--- Begin Message ---
Subject: New 'Binary file' detection considered harmful
Date: Sun, 28 Feb 2016 12:17:07 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.6.0
The new heuristic for detecting 'Binary files' should be reverted to the
old one (from before 2.20), as the new one has too much potential to make
important tasks fail silently.
One of the most important use cases of grep is processing file lists,
e.g. in the pipeline: find | grep | tar. This is often done by backup
software, e.g. in the Debian package 'backup2l'.
The new behaviour of grep -- to output 'Binary file matches' after
output started -- has silently broken the 'backup2l' script and has the
potential of silently breaking many other backup scripts as well.
Test case:
$ find /etc/ssl/certs/ | LANG= grep pem
Outcome:
grep will stop with 'Binary file (standard input) matches' after
outputting a small percentage of the existing .pem files.
Expected behaviour:
grep should list all .pem files.
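A self-contained variant of the test case, as a rough sketch: the file list
is synthetic (the names a.pem, b\377.pem and c.pem are made up for
illustration) and contains one name with a byte that is not valid UTF-8.
How many names survive depends on the installed grep version and on the
locale settings, so the script only reports the two counts and does not
assume a particular result.

```shell
# Synthetic file list (hypothetical names); the second name contains the
# raw byte 0xFF, which is not valid UTF-8.
list() { printf 'a.pem\nb\377.pem\nc.pem\n'; }

# Every name in the list contains "pem", so a complete result has this
# many lines.
total=$(list | wc -l | tr -d ' ')

# Names that survive the plain grep. The second stage runs in the C locale
# and only counts lines ending in ".pem", so a 'Binary file ... matches'
# message is not counted as a name.
survived=$(list | LANG= grep pem | LC_ALL=C awk '/\.pem$/ { n++ } END { print n + 0 }')

echo "names in list: $total, names printed by grep: $survived"
```

If the second number is smaller than the first, the installed grep dropped
names from the list.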
This behaviour is particularly insidious because users may not notice
that their backup archives are a bit smaller than before or that their
backups complete a bit faster, while many thousand files may be missing.
Q: Why do you use LANG= ?
A: To illustrate the problem and because 'backup2l' does that.
Q: Why don't people use the -a switch?
A: People may not notice anything wrong with their backups until they
need them.
Q: Why don't you file a bug against 'backup2l'?
A: I will. But this is such a common use case that I suspect that many
of the backup scripts that people wrote just for themselves are now broken.
Q: Why don't you just set the correct locale?
A: Even then it suffices to have one bogus-encoded filename somewhere to
break your whole backup. It is easy to catch such a file from the
internet or from song or picture metadata.
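For scripts that filter file lists, the -a switch mentioned above takes the
binary-file heuristic out of the picture. A minimal sketch, assuming GNU
grep and a synthetic list of names (all of them hypothetical), one of which
contains a byte that is invalid in UTF-8:

```shell
# Synthetic file list (hypothetical names); the second name contains the
# raw byte 0xFF.
list() { printf 'a.pem\nb\377.pem\nc.pem\nnotes.txt\n'; }

# -a (--binary-files=text) makes grep treat its input as text, so every
# matching name is printed and no 'Binary file ... matches' line replaces
# part of the output. LC_ALL=C makes the matching byte-oriented.
matched=$(list | LC_ALL=C grep -a pem | wc -l | tr -d ' ')

echo "matching names: $matched"
```

This only addresses the binary-file heuristic; whether it is the right fix
for a given backup script is a separate judgment.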
Regards
--
Marcello Perathoner
--- End Message ---
--- Begin Message ---
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Thu, 8 Sep 2016 18:43:43 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0
Paul Eggert wrote:
> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>> 2) If you just output
>>   binary line 42 in file x matches
>> and continue regular output after the next newline, the breakage would be much
>> more confined.
> This sounds like a good suggestion. That is, grep could keep going if its only
> problem is an attempt to output encoding errors (as opposed to reading null
> bytes, which are a more-reliable indication of binary data). It would probably
> be better to output just one "Binary file matches" line per file, at the end of
> the other matches, so that it's more likely to be noticed.
I finally got around to implementing this, which turned out to be considerably
easier than I thought it would be. I installed the attached patch into the grep
Savannah master. I am boldly closing this old bug report; we can always start a
new report if further problems turn up.
0001-grep-encoding-errors-suppress-just-their-line.patch
Description: Text Data
--- End Message ---