[debbugs-tracker] bug#23234: closed (unexpected results with charset handling in GNU grep 2.23)


From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#23234: closed (unexpected results with charset handling in GNU grep 2.23)
Date: Sun, 10 Apr 2016 08:44:01 +0000

Your message dated Sun, 10 Apr 2016 01:43:10 -0700
with message-id <address@hidden>
and subject line Re: bug#23234: unexpected results with charset handling in GNU grep 2.23
has caused the debbugs.gnu.org bug report #23234,
regarding unexpected results with charset handling in GNU grep 2.23
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
23234: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=23234
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: unexpected results with charset handling in GNU grep 2.23
Date: Wed, 6 Apr 2016 21:25:21 +0200

Hi,

this change in GNU grep 2.23 has severe consequences:

> Binary files are now less likely to generate diagnostics and more
> likely to yield text matches.  grep now reports "Binary file FOO
> matches" and suppresses further output instead of outputting a line
> containing an encoding error; hence grep can now report matching text
> before a later binary match.  Formerly, grep reported FOO to be
> binary when it found an encoding error in FOO before generating
> output for FOO, which meant it never reported both matching text and
> matching binary data; this was less useful for searching text
> containing encoding errors in non-matching lines.

I got a report that the build of the German spellcheck dictionary was broken.
It turned out that this happened after the update to GNU grep 2.23:

https://bugzilla.redhat.com/show_bug.cgi?id=1316359

Actually, the mentioned change leaves no reliable way to grep lines out of any
text file that contains non-ASCII characters.

Until now it was quite safe to use grep in the C locale, even for non-ASCII
text. After that change, the locale charmap has to match the encoding of the
input file. Unfortunately, the only locale that is guaranteed to exist is the
C locale. We cannot assume that any other locale definitions exist on an
unknown system. For a script that wants to use grep, this is a big problem
now.

Let's take this example using grep 2.23:

# echo -e "test\ntäst\ntest" | iconv -f utf8 -t latin1 | LC_ALL=C grep "st" ; echo $?
test
Binary file (standard input) matches
0

There are several problems here. Someone might want to assume that the locale
definitions for en_US.ISO-8859-1 exist. Unfortunately, such an assumption cannot
be made. Whatever locale is requested, its definition might not be there, and
we then fall back to the C locale anyway.

The result is that we get the first matching line in the example. For the
second matching line, which contains a non-ASCII character, grep prints the
text "Binary file (standard input) matches" on stdout (which might even be a
valid matching line of the input file!) and the following matches are skipped.
(Finally, the return code is 0. As the grepping stopped early, a return code >1
might be desirable, but I don't want to dive into that point right now.)
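
The behaviour described above can be compared against grep's -a (--text) option, which asks grep to treat its input as text. The following is only a minimal sketch, not the fix discussed in this report: it assumes a grep that supports -a, and the helper name "sample" is hypothetical. The input is built with printf instead of iconv, where \344 is the ISO-8859-1 byte for "ä".

```shell
# Hypothetical helper: emits the three-line Latin-1 sample from the report.
# \344 is the octal escape for the ISO-8859-1 byte of "ä".
sample() { printf 'test\nt\344st\ntest\n'; }

# -a (--text) tells grep to process the input as text even if it looks
# binary, so matching lines are printed instead of a "Binary file" notice.
matches=$(sample | LC_ALL=C grep -a 'st')

# -c counts the matching lines instead of printing them.
count=$(sample | LC_ALL=C grep -a -c 'st')
printf '%s matching lines\n' "$count"
```

A script that cannot rely on any locale besides C could pass -a explicitly, at the cost of possibly writing raw non-ASCII bytes to its output.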


Let me draw a bigger picture: have a look at what a POSIX-compliant grep is
expected to do:
http://pubs.opengroup.org/onlinepubs/009604499/utilities/grep.html

Read the description section, especially:

--snip--
By default, an input line shall be selected if any pattern, treated as an
entire basic regular expression (BRE) as described in the Base Definitions
volume of IEEE Std 1003.1-2001, Section 9.3, Basic Regular Expressions, matches
any part of the line excluding the terminating <newline>;
--snap--

That means a POSIX-compliant grep should not try to be too smart and tell the
user that a binary file matches the search pattern (people can use "strings" if
they want). It should just output the line. From that perspective GNU grep was
not POSIX-compliant before either, but that was obviously not a big problem for
most people. With the recent change and the issues described above, though, I
think a lot of scripts using (GNU) grep will break.
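
To illustrate the plain line selection that the quoted POSIX text describes, here is a rough sketch in shell. It is an illustration only and not how grep is implemented: it matches a fixed string rather than a BRE, and the function name "select_lines" is hypothetical. Every line containing the string is written out byte for byte, with no binary detection.

```shell
# Hypothetical helper: print every input line that contains the given
# fixed string (a simplification; real grep matches a BRE).
select_lines() {
  needle=$1
  # IFS= and -r keep each line's bytes intact (no trimming, no
  # backslash processing); read strips the terminating newline.
  while IFS= read -r line; do
    case $line in
      *"$needle"*) printf '%s\n' "$line" ;;
    esac
  done
}

# Latin-1 input: \344 is the ISO-8859-1 byte for "ä"; "none" must not match.
selected=$(printf 'test\nt\344st\nnone\ntest\n' | ( LC_ALL=C; select_lines st ))
printf '%s\n' "$selected"
```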

I really hope this change will be reverted as soon as possible. I would
prefer GNU grep to become POSIX-compliant and not do any binary detection by
default.

Cheers
Björn



--- End Message ---
--- Begin Message ---
Subject: Re: bug#23234: unexpected results with charset handling in GNU grep 2.23
Date: Sun, 10 Apr 2016 01:43:10 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0

Paul Eggert wrote:
> I plan to change GNU grep to use this new facility, and to add some grep test
> cases for this issue.

I did that by installing the attached patches into the grep master. This fixes the bug for me, so I'm closing the bug report.

These patches mostly just report the fix and add test cases. The actual fix was in gnulib, here:

http://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=b7bc3c1a4e78add4cbad39ae1a0c4fb0747b483f

This gnulib fix works around the underlying glibc facility which caused the problem, for which I've filed a bug report here:

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

It's not clear when the glibc bug will be fixed. Until it is, one should expect similar problems to crop up in applications other than 'grep'.

Attachment: 0001-build-update-gnulib-submodule-to-latest.txt
Description: Text document

Attachment: 0002-grep-in-C-locale-all-bytes-are-valid-characters.txt
Description: Text document


--- End Message ---
