Re: [Grep-devel] [bug-gawk] GNU grep, awk, sed: support \u and \U for un

From: Paul Eggert
Subject: Re: [Grep-devel] [bug-gawk] GNU grep, awk, sed: support \u and \U for unicode
Date: Thu, 19 Jan 2017 18:48:59 -0800
Assaf Gordon wrote:
> Currently, escape sequences are parsed and converted before
> being sent to re/dfa.
> Thus, '[\u0041]' is equivalent to '[A]'

POSIX requires [\u0041] to be equivalent to [u0041\], that is, it matches any of the characters '\', 'u', '0', '4', and '1'. This is true for grep, sed, and most other utilities that use regular expressions. (awk is an exception.) So except for awk, we can't simply translate \u escapes everywhere. At best we could translate them only when POSIXLY_CORRECT is not set.
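To make the difference concrete, here is a minimal Python sketch that models the two readings of '[\u0041]'. It is only an illustration, not code from grep, sed, or gawk: the helper names (translate_u_escapes, posix_bracket_members, maybe_translate) are hypothetical, the bracket model handles only plain character lists (no ranges, classes, or negation), and treating "POSIXLY_CORRECT is present in the environment" as the switch is an assumption about how such a gate might look.

```python
import re

def translate_u_escapes(pattern):
    """Replace each \\uXXXX (exactly four hex digits) with the character it names."""
    return re.sub(r'\\u([0-9A-Fa-f]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  pattern)

def posix_bracket_members(bracket):
    """Return the characters matched by a simple bracket expression.

    Hypothetical model: inside a POSIX bracket expression a backslash is an
    ordinary character, so every character between '[' and ']' is a member.
    Ranges, classes, and negation are deliberately not handled.
    """
    if not (bracket.startswith('[') and bracket.endswith(']')):
        raise ValueError('expected a bracket expression like [abc]')
    return set(bracket[1:-1])

def maybe_translate(pattern, environ):
    """Translate \\u escapes only when POSIXLY_CORRECT is absent (assumed gate)."""
    if 'POSIXLY_CORRECT' in environ:
        return pattern
    return translate_u_escapes(pattern)

# POSIX reading: the backslash and the following characters are all members.
posix_set = posix_bracket_members(r'[\u0041]')

# Pre-translation reading: the escape is converted first, leaving '[A]'.
translated_set = posix_bracket_members(translate_u_escapes(r'[\u0041]'))
```

Under this model posix_set contains the five characters '\', 'u', '0', '4', and '1', the same set as for '[u0041\]', while translated_set contains only 'A'. That is why translating before the pattern reaches the regex matcher changes what a conforming bracket expression matches.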

On another topic, if we can't implement \N escapes in general then I wouldn't bother with implementing only \N{U+nnnn}.

