Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode

From: Assaf Gordon
Subject: Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
Date: Thu, 19 Jan 2017 19:33:53 -0500


> On Jan 19, 2017, at 19:10, Norihiro Tanaka <address@hidden> wrote:
> 1. How should \uHHHH expression be parsed in bracket?
>    $ echo b | grep '[\U0041]'
>    I think \uXXXX expression should not work in bracket.

> 2. Which should following expression be parsed, [a-c] or \[a-c\] ?
>    $ echo b | grep '\U005Ba-c\U005D'
>    I think that \uHHHH expression should not work to meta character.
>    i.e. I think that many users will prefer \[a-c\] to [a-c].

Thank you for raising these good points.

Currently, escape sequences are parsed and converted before
being sent to re/dfa.
Thus, '[\u0041]' is equivalent to '[A]',
and   '\u005Ba-c\u005D' is equivalent to '[a-c]'.
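
To make the "convert first, then compile" order concrete, here is a minimal
sketch in Python. It is not the actual patch: the function name unescape_ucn
and the regex are made up for illustration, and it ignores corner cases such
as an escaped backslash directly in front of the 'u'.

```python
import re

# Matches \uHHHH (4 hex digits) or \UHHHHHHHH (8 hex digits).
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_ucn(pattern: str) -> str:
    """Replace every \\uHHHH / \\UHHHHHHHH with the character it names.

    Hypothetical helper: it runs over the whole pattern before any regex
    compilation, so it cannot know whether it is inside a bracket.
    """
    def repl(m):
        return chr(int(m.group(1) or m.group(2), 16))
    return _UCN.sub(repl, pattern)


print(unescape_ucn(r"[\u0041]"))         # [A]
print(unescape_ucn(r"\u005Ba-c\u005D"))  # [a-c]
```

Because the conversion happens on the raw pattern text, the regex engine only
ever sees the resulting '[' and ']' and treats them as ordinary meta
characters.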

Note that my current implementation is missing a key detail:
coreutils' printf rejects sequences in certain ranges, so the
examples above would not be accepted in practice. The rule is:

 "A universal character name shall not specify a character short
  identifier in the range 00000000 through 00000020, 0000007F through
  0000009F, or 0000D800 through 0000DFFF inclusive. A universal
  character name shall not designate a character in the required
  character set."
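
A rough sketch of such a check, in Python for brevity. The three numeric
ranges are taken from the quoted text. The "required character set" test is
an assumption on my part: it is approximated as printable ASCII other than
'$', '@' and '`'. The function name is hypothetical.

```python
def ucn_is_rejected(cp: int) -> bool:
    """Return True if code point cp should be refused as a \\u/\\U value."""
    if cp <= 0x20:                # 00000000 through 00000020
        return True
    if 0x7F <= cp <= 0x9F:        # 0000007F through 0000009F
        return True
    if 0xD800 <= cp <= 0xDFFF:    # 0000D800 through 0000DFFF
        return True
    # ASSUMPTION: "required character set" is approximated as printable
    # ASCII except '$' (0x24), '@' (0x40) and '`' (0x60).
    return 0x21 <= cp <= 0x7E and cp not in (0x24, 0x40, 0x60)


for cp in (0x41, 0x5B, 0x3A8):
    print(f"U+{cp:04X}", "rejected" if ucn_is_rejected(cp) else "accepted")
```

Under these assumptions U+0041 and U+005B are rejected, while U+03A8 is
accepted.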

However, other sequences are un-escaped:
  '[\u03a8]' means '[Ψ]'
and not the character set u/0/3/8/a/\\ .

It will take a bit more work (perhaps even touching re/dfa) to avoid
un-escaping sequences inside brackets. Worth considering and discussing.
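
For discussion, one possible reading of "avoid un-escaping inside brackets"
is sketched below in Python. This is only an illustration, not a proposal for
the real code: the function name is hypothetical, and the bracket tracking is
deliberately simplistic (it handles a leading '^' and a leading literal ']',
but does not understand [:class:], [=equiv=] or [.coll.] elements).

```python
import re

# Matches \uHHHH (4 hex digits) or \UHHHHHHHH (8 hex digits).
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_outside_brackets(pattern: str) -> str:
    """Convert \\u/\\U sequences, but copy bracket expressions verbatim."""
    out = []
    i, n = 0, len(pattern)
    in_bracket = False
    while i < n:
        ch = pattern[i]
        if in_bracket:
            # Inside [...]: copy as-is until the closing ']'.
            out.append(ch)
            i += 1
            if ch == "]":
                in_bracket = False
        elif ch == "\\":
            m = _UCN.match(pattern, i)
            if m:
                out.append(chr(int(m.group(1) or m.group(2), 16)))
                i = m.end()
            else:
                # Some other escape such as \[ : copy backslash + next char.
                out.append(pattern[i:i + 2])
                i += 2
        elif ch == "[":
            in_bracket = True
            out.append(ch)
            i += 1
            # A leading '^', and a ']' right after the opening, are literal.
            if i < n and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < n and pattern[i] == "]":
                out.append("]")
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(unescape_outside_brackets(r"[\u0041]"))      # [\u0041]
print(unescape_outside_brackets(r"\[\u0041\]"))    # \[A\]
```

Note that doing this as a pre-pass means duplicating part of the bracket
parsing that re/dfa already performs, which is one reason the real fix may
belong there instead.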

Thanks again,
 - assaf
