grep, invalid characters in UTF-8 input
From: Philipp Rumpf
Subject: grep, invalid characters in UTF-8 input
Date: Tue, 19 Sep 2006 00:15:38 +0000
The way GNU grep treats non-UTF-8 input when LANG indicates that UTF-8
is to be used strikes me as extremely odd; I don't know whether this
is a problem in an underlying library or in grep itself.
In short, a line that matches neither '.' nor '^$' appears to me to be
an extremely dangerous proposition, since a line is normally either
empty or contains at least one character; breaking that dichotomy might
easily have security implications. Furthermore, if grep is NOT to
accept non-UTF-8 input in UTF-8 mode, why is the data passed through
unchanged when the match is trivial (as with '^')?
If this is not a bug, might I suggest better documentation? The
dependence on $LANG (or the relevant $LC_* variables) needs to be
spelled out clearly, and a word of warning is in order about non-empty
strings that contain no character matched by '.'.
Philipp Rumpf
$ echo $LANG
en_GB.UTF-8
$ grep --version
grep (GNU grep) 2.5.1
Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ od -tx1 testcase
0000000 e0 0a
0000002
$ grep '.' testcase | wc -c
0
$ grep '^$' testcase | wc -c
0
$ grep '^' testcase | wc -c
2
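
For comparison, the locale dependence described above can be sketched by
repeating the checks with LC_ALL=C, where grep treats each byte as a
single character. This is only an illustration under stated assumptions:
the temporary directory and the variable names dot_count and empty_count
are hypothetical and not part of the original report, and the file is
recreated from the od dump shown above.

```shell
# Recreate the two-byte test file from the od dump: 0xe0 followed by a newline.
# (0xe0 is octal 340; on its own it is an incomplete UTF-8 sequence.)
tmpdir=$(mktemp -d)
printf '\340\n' > "$tmpdir/testcase"

# LC_ALL overrides LANG and every other LC_* variable, so these two
# invocations run in the C locale regardless of the caller's environment.
# grep -c prints the number of matching lines; it exits non-zero when the
# count is 0, hence the "|| true" to keep the script going.
dot_count=$(LC_ALL=C grep -c '.' "$tmpdir/testcase" || true)
empty_count=$(LC_ALL=C grep -c '^$' "$tmpdir/testcase" || true)

echo "lines matching '.':  $dot_count"
echo "lines matching '^\$': $empty_count"

# Remove the scratch directory again.
rm -rf "$tmpdir"
```

In the C locale the line should be counted by '.' and not by '^$', so the
dichotomy holds there; the difference from the transcript above comes
only from the locale setting.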