GNU grep,awk,sed: support \u and \U for unicode

From: Assaf Gordon
Subject: GNU grep,awk,sed: support \u and \U for unicode
Date: Tue, 10 Jan 2017 19:59:02 -0500

(sorry for cross posting, I hope the discussion is relevant for all)


I'd like to suggest (or discuss) a minor addition to grep/awk/sed:
adding support for '\u' and '\U' for unicode characters, with
the same rules as coreutils' printf:
  \uHHHH  Unicode (ISO/IEC 10646) character with hex value HHHH (4 digits)
  \UHHHHHHHH  Unicode character with hex value HHHHHHHH (8 digits)
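To make the two forms concrete, here is a minimal sketch in Python of how
such escapes could be expanded. It is only an illustration of the printf-style
rules quoted above, not code from any of the tools; the function name
expand_unicode_escapes is hypothetical, and leaving malformed or out-of-range
sequences untouched is an assumption rather than settled behavior.

```python
import re

# Matches \uHHHH (exactly 4 hex digits) or \UHHHHHHHH (exactly 8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_unicode_escapes(text: str) -> str:
    """Replace \\uHHHH and \\UHHHHHHHH sequences with the named character.

    Hypothetical helper: sequences with too few digits, or with a value
    above U+10FFFF, are left as-is (an assumption made for this sketch).
    """
    def _replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        code_point = int(hex_digits, 16)
        if code_point > 0x10FFFF:
            return match.group(0)  # not a valid code point, keep literal text
        return chr(code_point)

    return _ESCAPE.sub(_replace, text)


if __name__ == "__main__":
    print(expand_unicode_escapes(r"caf\u00e9"))      # 4-digit form
    print(expand_unicode_escapes(r"\U0001F600 hi"))  # 8-digit form
```

A real implementation would additionally have to convert the code point to
the current locale's encoding, which this sketch does not attempt.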

For 'awk' and 'grep', I believe these sequences are currently
undefined and unused. sed already uses '\U' and '\u' in a limited
capacity (upper-case conversion in the replacement part of s///).
As for POSIX, I believe the behavior is unspecified, so an implementation
is free to define it.

I think that supporting the exact same syntax with the same semantics
across multiple GNU tools is a good long-term goal,
and multibyte/Unicode support is becoming more important and
more useful as time goes by.

For now I'm not asking about implementation issues (which I'm sure will be
numerous, including interplay with gnulib and glibc, locales,
and backward compatibility in sed).

I'm more interested in discussing whether such long-term behavior is
something that you'd consider for each respective project (perhaps even
mentally reserving the '\u' and '\U' sequences for it, or accepting patches
in that direction).

As for sed,
I'm quite new here, but my thinking is that \u and \U
are used in a limited way,
and perhaps it can be argued that breaking compatibility will cause only
limited breakage, in very specialized scripts, and is worth the long-term
improvement (of course the functionality will remain, just with a
different letter).

Thanks for reading,
and for any suggestions or comments,
 - assaf
