Re: grep dfa bug
From: Charles Levert
Subject: Re: grep dfa bug
Date: Mon, 1 Aug 2005 02:50:53 -0400
User-agent: Mutt/1.4.1i
* On Monday 2005-08-01 at 09:12:03 +0900, KIMURA Koichi wrote:
>
> I think I found bug of dfa of gawk.
You mean grep? (Both use a dfa.)
> Situation:
> In Japanese ShiftJIS locale, half-width katakana in character class
> does not match appropriately.
>
> Reproduce:
> set LANG=ja_JP.SJIS
> export LANG
> echo ABCDE | grep '/[A-E]\+/p'
>
> Actually, A B C D E are half-width katakana characters.
> (data to reproduce is appended at the end of this mail (uuencoded SJIS data))
>
> Result:
> nothing printed.
> begin 644 testkana.sh
> M<V5T($Q!3D<]:F%?2E`N4TI)4PIE>'!O<address@hidden;F]T('!R:6YT"F5C!
> <:&address@hidden;address@hidden"!G<F5P("<O6[$MM5U<*R\G"@``(
> ``
> end
$ hexdump -C testkana.sh
00000000  73 65 74 20 4c 41 4e 47  3d 6a 61 5f 4a 50 2e 53  |set LANG=ja_JP.S|
00000010  4a 49 53 0a 65 78 70 6f  72 74 20 4c 41 4e 47 0a  |JIS.export LANG.|
00000020  23 6e 6f 74 20 70 72 69  6e 74 0a 65 63 68 6f 20  |#not print.echo |
00000030  b1 b2 b3 b4 b5 20 7c 20  67 72 65 70 20 27 2f 5b  |..... | grep '/[|
00000040  b1 2d b5 5d 5c 2b 2f 27  0a                       |.-.]\+/'.|
This shell script has several problems:
-- it shouldn't be "set LANG=ja_JP.SJIS"
but just "LANG=ja_JP.SJIS" (better yet,
use LC_ALL instead to be sure to override
any other environment variable);
-- there shouldn't be slashes around the
regular expression (that being awk or
sed syntax).
Fixing those two problems, I do get a match
using current CVS grep.
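For reference, here is a minimal sketch of the script with both fixes
applied. It assumes a ja_JP.SJIS locale is installed under that name
(the name is taken from the report and may differ on your system). The
variable names "kana" and "pattern" are mine, and the bytes are written
as octal escapes so that the script itself stays plain ASCII.

```shell
# Bytes 0xB1..0xB5 are the five Shift_JIS half-width katakana seen in
# the hexdump; octal escapes keep this script free of non-ASCII bytes.
kana=$(printf '\261\262\263\264\265')

# Bracket range from 0xB1 to 0xB5 followed by \+, with no surrounding
# slashes (those belong to awk or sed syntax, not grep).
pattern=$(printf '[\261-\265]\\+')

# Set LC_ALL on the grep command itself instead of using "set LANG=...".
# Assumption: the locale name ja_JP.SJIS exists on this system.
status=0
printf '%s\n' "$kana" | LC_ALL=ja_JP.SJIS grep "$pattern" || status=$?

# 0 = matched, 1 = no match, 2 = error (e.g. a bad collation character).
echo "grep exit status: $status"
```

If the locale is not installed, grep will most likely fall back to the
C locale, so the result then says nothing about SJIS handling.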
However, using a more recent version of
regex.c et al. (as recently discussed on the
mailing list), I get a "grep: Invalid collation
character" error with an exit code of 2.
Repeating an equivalent experiment with UTF-8, it
works fine no matter what version of grep I use:
$ echo 'アイウエオ' | LC_ALL=ja_JP.utf8 grep '[ア-オ]\+'
アイウエオ
Strangely, this
$ echo 'アイウエオ' | LC_ALL=en_US.utf8 grep '[ア-オ]\+'
only works with the recent regex.c; without it,
it produces the same error as above.
(I.e., just the opposite of what happens with ja_JP.SJIS.)
Is any UTF-8 locale supposed to know about the
collation order of languages other than its
main one (here en_US about ja_JP)?
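To compare the two UTF-8 locales side by side, something like the
following sketch may help. The function name "try_range" is
hypothetical, and the locale names are assumed to be installed as
spelled here. It prints grep's exit status for each locale; any error
message from grep still goes to stderr.

```shell
# try_range LOCALE INPUT PATTERN
# Runs grep under the given locale and prints "LOCALE: STATUS", where
# STATUS is grep's exit code: 0 = match, 1 = no match, 2 = error.
try_range() {
    status=0
    printf '%s\n' "$2" | LC_ALL="$1" grep -q -- "$3" || status=$?
    printf '%s: %s\n' "$1" "$status"
}

# Same input and bracket range as in the commands above, tried under
# each locale in turn (assumption: both locale names are installed).
for loc in ja_JP.utf8 en_US.utf8; do
    try_range "$loc" 'アイウエオ' '[ア-オ]\+'
done
```

Running this once with each grep build would show directly which
locale and regex.c combinations report an error rather than a match.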