Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
From: Norihiro Tanaka
Subject: Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
Date: Fri, 20 Jan 2017 09:10:02 +0900
On Tue, 10 Jan 2017 19:59:02 -0500
Assaf Gordon <address@hidden> wrote:
> (sorry for cross posting, I hope the discussion is relevant for all)
>
> Hello,
>
> I'd like to suggest (or discuss) a minor addition to grep/awk/sed:
> adding support for '\u' and '\U' for unicode characters, with
> the same rules as coreutils' printf:
> \uHHHH Unicode (ISO/IEC 10646) character with hex value HHHH (4 digits)
> \UHHHHHHHH Unicode character with hex value HHHHHHHH (8 digits)
>
> For 'awk' and 'grep', I believe these sequences are currently
> undefined and unused. For sed, it uses '\U' and '\u' in limited
> capacity (upper case replacement in s///).
> As for POSIX, I believe the behavior is unspecified and thus can be
> implemented.
>
> I think that supporting the exact same syntax with the same semantics
> across multiple GNU tools is a good long-term behavior,
> and multibyte/unicode support is becoming more important and
> more useful as time goes by.
>
> For now I'm not asking about implementation issues (which I'm sure will be
> numerous, including interplay with gnulib and glibc, locales,
> and sed's backwards incompatibility).
>
> I'm more interested in discussing whether such long-term behavior is something
> that you'd consider for each respective project (perhaps even mentally
> reserve '\u' and '\U' sequences for it, or accept patches in that direction).
>
>
> As for sed,
> I'm quite new here, but my thinking is that \u and \U
> are used in a limited way
> (https://www.gnu.org/software/sed/manual/sed.html#The-_0022s_0022-Command),
> and perhaps it can be argued that breaking compatibility will cause limited
> trouble for very specialized scripts, and is worth the long term improvement
> (of course the functionality will remain, just with a different letter).
>
>
> Thanks for reading,
> and for any suggestions or comments,
> regards,
> - assaf
Hi Assaf,
I have two questions.

1. How should a \uHHHH expression be parsed inside a bracket expression?
   $ echo b | grep '[\u0041]'
   I think a \uHHHH expression should not work inside brackets.

2. How should the following expression be parsed: as [a-c] or as \[a-c\]?
   $ echo b | grep '\u005Ba-c\u005D'
   I think a \uHHHH expression should not act as a metacharacter,
   i.e. I think many users will prefer \[a-c\] to [a-c].
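
For comparison only, here is a minimal sketch of the two possible readings, using Python's `re` module as a stand-in. It says nothing about how GNU grep, awk or sed behave or should behave. U+005B is `[`, U+005D is `]` and U+0041 is `A`; the variable names are made up for illustration. In the first reading the escape is expanded before the regex is parsed, so the result is the bracket expression [a-c]. In the second reading the regex engine handles \uHHHH itself and treats the character as a literal, which is what Python's `re` happens to do.

```python
import re

# Reading A: the escape is expanded *before* the regex parser runs.
# A normal (non-raw) Python string literal does this expansion, so the
# regex engine receives the bracket expression "[a-c]".
expanded_first = "\u005Ba-c\u005D"
pattern_seen_by_engine = expanded_first  # "[a-c]"
b_matches_when_expanded = re.search(expanded_first, "b") is not None

# Reading B: the regex engine receives the backslash sequence itself
# (raw string) and treats \uHHHH as a literal character, never as a
# metacharacter. The pattern then means \[a-c\].
literal_pattern = r"\u005Ba-c\u005D"
b_matches_when_literal = re.search(literal_pattern, "b") is not None
text_matches_when_literal = re.search(literal_pattern, "x[a-c]y") is not None

# Question 1: inside a bracket expression, Python's re accepts \uHHHH
# and treats it as the literal character U+0041 ("A").
bracket_matches_A = re.search(r"[\u0041]", "A") is not None
bracket_matches_b = re.search(r"[\u0041]", "b") is not None

print(pattern_seen_by_engine, b_matches_when_expanded,
      b_matches_when_literal, text_matches_when_literal,
      bracket_matches_A, bracket_matches_b)
```

Under reading A the input "b" matches; under reading B it does not, and only the literal text "[a-c]" matches.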
Thanks,
Norihiro