bug-gawk

Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode


From: David Niklas
Subject: Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
Date: Mon, 23 Jan 2017 23:42:24 -0500

On Fri, 20 Jan 2017 09:10:02 +0900
Norihiro Tanaka <address@hidden> wrote:
> 
> On Tue, 10 Jan 2017 19:59:02 -0500
> Assaf Gordon <address@hidden> wrote:
> 
> > (sorry for cross posting, I hope the discussion is relevant for all)
> > 
> > Hello,
> > 
> > I'd like to suggest (or discuss) a minor addition to grep/awk/sed:
> > adding support for '\u' and '\U' for unicode characters, with
> > the same rules as coreutils' printf:
> >   \uHHHH      Unicode (ISO/IEC 10646) character with hex value HHHH (4 digits)
> >   \UHHHHHHHH  Unicode character with hex value HHHHHHHH (8 digits)
> > 
> > For 'awk' and 'grep', I believe these sequences are currently
> > undefined and unused. For sed, it uses '\U' and '\u' in limited
> > capacity (upper case replacement in s///).
> > As for POSIX, I believe the behavior is unspecified and thus can be
> > implemented.
> > 
> > I think that supporting the exact same syntax with the same semantics
> > across multiple GNU tools is a good long-term behavior,
> > and multibyte/unicode support is becoming more important and
> > more useful as time goes by.
> > 
> > For now I'm not asking about implementation issues (which I'm sure
> > will be numerous, including interplay with gnulib and glibc, locales,
> > and sed's backwards incompatibility).
> > 
> > I'm more interested in discussing whether such long-term behavior is
> > something that you'd consider for each respective project (perhaps
> > even mentally reserve '\u' and '\U' sequences for it, or accept
> > patches in that direction).
> > 
> > 
> > As for sed,
> > I'm quite new here, but my thinking is that \u and \U
> > are used in a limited way 
> > (https://www.gnu.org/software/sed/manual/sed.html#The-_0022s_0022-Command),
> > and perhaps it can be argued that breaking compatibility will cause
> > limited troubles for very specialized scripts, and is worth the long
> > term improvement (of course the functionality will remain, just with
> > a different letter).
> > 
> > 
> > Thanks for reading,
> > and for any suggestions or comments,
> > regards,
> >  - assaf  
> 
> Hi Assaf,
> 
> I have two questions.
> 
>  1. How should a \uHHHH expression be parsed inside a bracket expression?
> 
>     $ echo b | grep '[\U0041]'
> 
>     I think \uXXXX expressions should not work inside brackets.

If it does not, then how would we write a Unicode code point inside a
bracket expression?

I was thinking that maybe it should be done the way other languages do it.
I am not familiar with those languages, but I looked the matter up online
(www.regular-expressions.info/unicode.html): Perl and PCRE use \x{FFFF}
rather than \uFFFF, while Java and JavaScript use \uFFFF.
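
As one more data point on the bracket question, here is a small sketch using
Python's re module, which is not one of the tools discussed in this thread
but accepts both \uHHHH (4 hex digits) and \UHHHHHHHH (8 hex digits) in its
patterns, including inside brackets. The helper name `matches` is made up
for illustration. It only shows how one existing engine answers the
question; it says nothing about what grep, awk, or sed should do.

```python
import re

# Raw strings keep the backslash, so the regex engine itself (not the
# Python string literal) interprets the \u and \U escapes.
bracket_4 = re.compile(r'[\u0041]')      # \uHHHH form, 4 hex digits
bracket_8 = re.compile(r'[\U00000041]')  # \UHHHHHHHH form, 8 hex digits


def matches(pattern, text):
    """Hypothetical helper: True if the compiled pattern matches anywhere."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    print(matches(bracket_4, "A"))  # True: U+0041 is 'A'
    print(matches(bracket_4, "b"))  # False
    print(matches(bracket_8, "A"))  # True
```

Note that the 8-digit form needs all eight digits here, which is the same
rule as the coreutils printf syntax quoted above.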

>  2. How should the following expression be parsed, as [a-c] or \[a-c\]?
> 
>     $ echo b | grep '\U005Ba-c\U005C'
> 
>     I think that a \uHHHH expression should not act as a metacharacter,
>     i.e. I think that many users will prefer \[a-c\] to [a-c].
> 
> Thanks,
> Norihiro
> 

I'd argue that treating a code point as an escape sequence would be more
complicated than just recommending that the user write \[. OTOH, writing a
code point like \\U005Ba-c\\U005C would lead to grep matching the literal
text \U005Ba-c\U005C instead of \[a-c\], which would be worse...
Perhaps the simplest (and most backward compatible) approach would be to
require that a literal \U be placed before a Unicode code point, like this:
\U005Ba-c\U005C == U005Ba-cU005C (Issue a warning ?)
\\U005Ba-c\\U005C == [a-c]
\\\U005Ba-c\\\U005C == \U005Ba-c\U005C
....

You could also issue a warning that \\U005Ba-c\\U005C is ambiguous for a
few releases, and then make \\U005Ba-c\\U005C translate to \[a-c\] after
people have updated their scripts (which may take years). But then, when
you finally make the switch, existing \\UFFFF escapes will break.

If you used \x{FFFF}, such a sequence might allow immediate implementation,
since it is unlikely to appear in most existing regexes (even the
extraordinarily long one for email should work, not that I can find the
email regex when I want it ;) ). Then you could do:
[\x{0041}] == [A] or \x{005B}a-c\x{005C} == [a-c] ...
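
To make the \x{FFFF} idea concrete, here is a rough sketch of a
preprocessing pass, written in Python only because it is easy to test. The
function name `expand_braced_hex` is hypothetical and nothing like it exists
in grep, gawk, or sed. It assumes Norihiro's preference that a code point
always stands for a literal character and never for a metacharacter, and it
leaves an escaped backslash (\\) alone so that \\x{...} keeps its current
meaning. Note that U+005C is the backslash; the closing bracket is U+005D,
so the sketch uses 005D.

```python
import re

# Either an escaped backslash (left untouched) or a \x{H...} code point.
# Hypothetical syntax: 1 to 6 hex digits between the braces.
_BRACED_HEX = re.compile(r'\\\\|\\x\{([0-9A-Fa-f]{1,6})\}')


def expand_braced_hex(pattern):
    # Replace each \x{H...} with its character, escaped so that it is
    # matched literally rather than acting as a metacharacter.
    def replace(match):
        digits = match.group(1)
        if digits is None:
            # This was an escaped backslash; keep it exactly as written.
            return match.group(0)
        # chr() raises ValueError for values above 0x10FFFF.
        return re.escape(chr(int(digits, 16)))

    return _BRACED_HEX.sub(replace, pattern)


if __name__ == "__main__":
    print(expand_braced_hex(r'[\x{0041}]'))           # [A]
    print(expand_braced_hex(r'\x{005B}a-c\x{005D}'))  # \[a-c\]
```

Under that assumption the second pattern matches the five-character text
"[a-c]" and does not match "b". Making the code point behave as a
metacharacter instead would only mean dropping the re.escape() call.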

BTW: I've been looking forward to grep supporting Unicode for some time,
keep up the good work!

Sincerely,
David


