Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug


From: Michael Klement
Subject: Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
Date: Sun, 7 Feb 2016 10:54:43 -0500

Hi,

Thanks, Hermann, for confirming that using actual characters as the range endpoints works.

> There is no intent to support things like \x10f7ff

Perhaps doing what grep -P does is an option: \x{…}

Generally, it sounds like the right thing to do is:

  • in a UTF-8 locale, *always* deal with *characters* (Unicode codepoints), not bytes
  • specifically, when encountering \xhh, compare it to the *Unicode codepoint* of the character at hand

Always dealing with characters makes sense to me, especially given that you can mix Unicode characters and \xhh escapes in a single bracket expression.
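
To make the byte-versus-character distinction concrete, here is a minimal sketch in Python. Python's re module is only a stand-in here and is not gawk's regex engine; the helper names char_matches and byte_matches are made up for illustration. The same bracket expression is applied once to decoded text (endpoints compared as codepoints) and once to the raw UTF-8 bytes (endpoints compared as byte values):

```python
import re

# The same range, compiled once for text (codepoints) and once for bytes.
char_range = re.compile("[\x80-\xff]")   # endpoints are codepoints U+0080..U+00FF
byte_range = re.compile(b"[\x80-\xff]")  # endpoints are byte values 0x80..0xFF

def char_matches(text):
    """Return the characters of text whose codepoint falls in the range."""
    return char_range.findall(text)

def byte_matches(text):
    """Return the UTF-8 bytes of text whose value falls in the range."""
    return byte_range.findall(text.encode("utf-8"))

sample = "a\u00e9\u20ac"  # 'a', e-acute (U+00E9), euro sign (U+20AC)

# Codepoint comparison: only U+00E9 lies inside U+0080..U+00FF.
print(char_matches(sample))

# Byte comparison: every byte of both multibyte sequences is >= 0x80,
# so the two non-ASCII characters yield five separate one-byte matches.
print(byte_matches(sample))
```

Under codepoint comparison, a range ending at \xff cannot reach U+20AC, while under byte comparison the range matches fragments of characters rather than whole characters.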

Thus, given that \xff is the maximum codepoint value that can currently be expressed, which does not cover the full range of Unicode characters, I suggest the following:

Best,

Michael


On Jan 31, 2016, at 2:21 PM, Aharon Robbins <address@hidden> wrote:

Hi.

Thanks for the notes.  The current code base should not dump core,
although I do see a core dump with stock 4.1.3.

The questions raised are messy.  I don't have good answers.  I think
that if you use [...] with real UTF-8 encoded characters as the
start and end point of the ranges, things will work OK.  But I'm not
sure.

There is no intent to support things like \x10f7ff. If such a thing
works it's by accident and it won't last; the master branch was changed
to accept no more than two hex digits after \x.
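
The effect of a two-digit limit can be sketched as follows, in Python. This is an illustration of the rule only, not gawk's actual parser; the function expand_hex_escape and its max_digits parameter are hypothetical names.

```python
import string

def expand_hex_escape(body, max_digits=2):
    """Split the text following a \\x into (numeric value, leftover literal text).

    At most max_digits hex digits are consumed; anything after them is
    ordinary text, not part of the escape.
    """
    digits = ""
    for ch in body[:max_digits]:
        if ch not in string.hexdigits:
            break
        digits += ch
    if not digits:
        raise ValueError("no hex digits after \\x")
    return int(digits, 16), body[len(digits):]

# With a two-digit limit, \x10f7ff is the character 0x10 followed by "f7ff".
print(expand_hex_escape("10f7ff"))

# An unlimited reading would instead consume all six digits as one value.
print(expand_hex_escape("10f7ff", max_digits=6))
```

With the limit in place, a range such as [\x80-\x10f7ff] no longer has 0x10f7ff as its upper endpoint.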

I am not in a rush to add things like \uXXXX to gawk.

For now, you are probably best off avoiding things like [\x80-\xFF] in
Unicode locales.  Or using LC_ALL=C.

Thanks,

Arnold

P.S. I'm curious what current GNU grep does with such things?  Thanks

