grep, invalid characters in UTF-8 input

bug-gnu-utils

From:	Philipp Rumpf
Subject:	grep, invalid characters in UTF-8 input
Date:	Tue, 19 Sep 2006 00:15:38 +0000

The way GNU grep treats non-UTF8 input when LANG indicates that UTF-8
is to be used strikes me as extremely odd;  I don't know whether this
is a problem of an underlying library or one in grep.

In short, a line that matches neither '.' nor '^$' appears to me to be
an extremely dangerous proposition, and breaking that dichotomy might
easily have security implications;  furthermore, if grep is NOT to
accept non-UTF8 input in UTF8 mode, why is the data passed through
unchanged when the match is trivial?

If this is not a bug, might I suggest better documentation?  The
dependence on $LANG (or the relevant $LC_* variables, such as $LC_ALL
and $LC_CTYPE) needs to be spelled out clearly, and a word of warning
is in order about non-empty lines that contain no character matched
by '.'.

Philipp Rumpf

$ echo $LANG
en_GB.UTF-8
$ grep --version
grep (GNU grep) 2.5.1

Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ od -tx1 testcase
0000000 e0 0a
0000002

$ grep '.' testcase | wc -c
0
$ grep '^$' testcase | wc -c
0
$ grep '^' testcase | wc -c
2
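
For anyone wanting to reproduce this, here is a minimal sketch of how the
test file might be recreated and checked. It assumes a POSIX printf (for the
octal escape) and overrides the locale with LC_ALL=C for a single command;
neither step appears in the session above, and the result may differ between
grep versions.

```shell
# Recreate the two-byte test file: a lone 0xE0 byte (the start of a
# three-byte UTF-8 sequence, with no continuation bytes) and a newline.
printf '\340\n' > testcase

# Confirm the contents byte for byte.
od -tx1 testcase

# Assumption: in the C locale every byte counts as one character, so '.'
# should match the line. -c prints the number of matching lines.
LC_ALL=C grep -c '.' testcase
```

If the count printed under LC_ALL=C is 1 while the same pattern matches
nothing under a UTF-8 locale, that suggests the behaviour depends on the
locale's character handling rather than on the file contents alone.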
