bug#16867: [bug #37600] grep -w cuts words on non-ascii
From: Stephane Chazelas
Subject: bug#16867: [bug #37600] grep -w cuts words on non-ascii
Date: Mon, 24 Feb 2014 21:38:11 +0000
User-agent: Mutt/1.5.21 (2010-09-15)
2014-02-24 08:53:17 -0800, Jim Meyering:
[...]
> This is pretty serious:
>
> $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
> père
It gets more complicated with combining characters:
$ printf 'pe\314\200re\n' | grep -w pe
père
You can't expect \w to match U+0300 alone. You can't expect \w to
match two characters (e with U+0300) either.
It feels wrong that grep finds a word boundary inside a single
grapheme though (between e and its grave accent).
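
The same boundary can be reproduced outside grep. The sketch below
uses Python's re module purely as a stand-in (an assumption: it is
not grep's matcher, it just happens to classify U+0300 as a non-word
character too). The variable names are only for illustration.

```python
import re
import unicodedata

# 'e' followed by U+0300 COMBINING GRAVE ACCENT (two code points).
decomposed = "pe\u0300re"
# NFC normalization folds the pair into the single code point U+00E8.
composed = unicodedata.normalize("NFC", decomposed)

# U+0300 is a combining mark (category Mn), not a letter, so \w rejects it.
mark_is_word_char = re.fullmatch(r"\w", "\u0300") is not None

# Composed form: 'p' is followed by a letter, so there is no boundary after it.
match_composed = re.search(r"\bp\b", composed) is not None

# Decomposed form: \b sees a boundary between 'e' and its accent.
match_decomposed = re.search(r"\bpe\b", decomposed) is not None

print(mark_is_word_char, match_composed, match_decomposed)
# prints: False False True
```

So a matcher whose \w is Unicode-aware handles the composed form
correctly but still splits the decomposed one.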
I suppose one way to address the problem would be an option that
turns anything that matches a single character (., [xy], \w,
\s...) into something that matches a grapheme, or failing that, a
"combining character sequence". See
http://www.unicode.org/faq/char_combmark.html for more details.
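
A rough sketch of what matching on combining character sequences
could look like, again in Python and not a proposal for grep's
implementation. The helper names (combining_sequences,
is_word_cluster, contains_word) are hypothetical, and the splitting
is deliberately simplified: a cluster is a character plus any
following characters with a non-zero canonical combining class, which
ignores ZWJ/ZWNJ and full grapheme cluster rules.

```python
import unicodedata


def combining_sequences(text):
    """Split text into base characters, each with its trailing combining marks."""
    clusters = []
    for ch in text:
        # combining() returns 0 for base characters, non-zero for marks like U+0300.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def is_word_cluster(cluster):
    """Assumption: a cluster is a word constituent if its base character is."""
    base = cluster[0]
    return base.isalnum() or base == "_"


def contains_word(text, word):
    """Like 'grep -w', but comparing whole clusters instead of code points."""
    hay = combining_sequences(text)
    needle = combining_sequences(word)
    if not needle:
        return False
    n = len(needle)
    for i in range(len(hay) - n + 1):
        if hay[i:i + n] != needle:
            continue
        before_ok = i == 0 or not is_word_cluster(hay[i - 1])
        after_ok = i + n == len(hay) or not is_word_cluster(hay[i + n])
        if before_ok and after_ok:
            return True
    return False


print(contains_word("pe\u0300re", "pe"))             # False
print(contains_word("le pe\u0300re", "pe\u0300re"))  # True
```

With clusters as the unit, "pe" no longer matches inside the
decomposed "père", because the second cluster is e plus its accent
rather than a bare e.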
That's not a grep only problem though.
I suppose it gets even more complicated with non-latin alphabets
or non-alphabetic languages.
\w, -w, \b, \<, \> are not "standard" features, so GNU may
decide what they want to do with them. Restricting them to ASCII
a-zA-Z0-9_ (which are not even the word constituents of English, but
appear to match C identifiers, which is probably what they were
designed for in the first place) is as good a choice as any I
would say.
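
For comparison, here is what an ASCII-only definition of \w does to
the composed form. This uses Python's re.ASCII flag as an
illustration of the idea, not of anything grep provides:

```python
import re

text = "p\u00e8re"  # composed form, 'è' is the single code point U+00E8

# Default for str patterns: \w follows Unicode letter/digit properties.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so 'è' becomes a word separator.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['père']
print(ascii_words)    # ['p', 're']
```

Under that definition "p" is a word of its own in "père", by design
rather than by accident.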
Changing it might break things. Adding other ways to match
Unicode character properties (like PCRE's \p{...}) may be a
better approach.
--
Stephane