bug-gnu-utils

grep, invalid characters in UTF-8 input


From: Philipp Rumpf
Subject: grep, invalid characters in UTF-8 input
Date: Tue, 19 Sep 2006 00:15:38 +0000

The way GNU grep treats non-UTF8 input when LANG indicates that UTF-8
is to be used strikes me as extremely odd;  I don't know whether this
is a problem of an underlying library or one in grep.

In short, a line that matches neither '.' nor '^$' appears to me to be
an extremely dangerous proposition, and breaking that dichotomy might
easily have security implications;  furthermore, if grep is NOT to
accept non-UTF8 input in UTF8 mode, why is the data passed through
unchanged when the match is trivial?

If this is not a bug, might I suggest better documentation?  The
dependence on $LANG (or the relevant $LC_* variables) needs to be
spelled out clearly, and a word of warning is in order about non-empty
lines that contain no character matched by '.'.

Philipp Rumpf

$ echo $LANG
en_GB.UTF-8
$ grep --version
grep (GNU grep) 2.5.1

Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ od -tx1 testcase
0000000 e0 0a
0000002

$ grep '.' testcase | wc -c
0
$ grep '^$' testcase | wc -c
0
$ grep '^' testcase | wc -c
2
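
For anyone wanting to reproduce this, the following is a rough sketch
rather than part of the original report.  It recreates the two-byte
file shown by od above and repeats the three greps under both a UTF-8
locale and the C locale.  Assumptions: the en_GB.UTF-8 locale is
installed (otherwise the setting is likely to fall back silently to C),
and -a is added so that newer grep versions print the line itself
instead of a binary-file notice; the commands above did not use it.

```shell
# Recreate the test file: byte 0xE0 (octal 340), a UTF-8 lead byte
# with no continuation bytes after it, followed by a newline.
printf '\340\n' > testcase

# Confirm the contents byte for byte.
od -tx1 testcase

# For each locale and pattern, print how many bytes grep outputs.
# -a (treat input as text) is an addition not present in the report.
for loc in en_GB.UTF-8 C; do
    for re in '.' '^$' '^'; do
        n=$(LC_ALL=$loc grep -a -e "$re" testcase | wc -c)
        printf '%s\t%s\t%s\n' "$loc" "$re" "$n"
    done
done
```

If the counts for '.' differ between the two locales, the behaviour
depends on the locale's character encoding rather than on the file
contents alone.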



