Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug

Hi,

Thanks, Hermann, for confirming that using actual characters as the range endpoints works.

> There is no intent to support things like \x10f7ff

Perhaps following what grep -P does is an option, namely the \x{…} form.

Generally, it sounds like the right thing to do is:

- in a UTF-8 locale, *always* deal with *characters* (Unicode codepoints), not bytes
- specifically, when encountering \xhh, compare it to the *Unicode codepoint* of the character at hand

Always dealing with characters makes sense to me, especially given that you can mix Unicode characters and \xhh escapes in a single bracket expression.
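To make the character-versus-byte distinction concrete, here is a minimal sketch. It uses Python's re module purely as an analogy, not gawk itself, and the sample string is made up for illustration. It applies the same \x80-\xff range once to codepoints and once to UTF-8 bytes:

```python
import re

text = "aé€b"  # é is U+00E9, € is U+20AC

# Character semantics: the range \x80-\xff is compared against each codepoint,
# so é (0xE9) is inside the range and € (0x20AC) is outside it.
by_codepoint = re.sub("[\x80-\xff]", "?", text)

# Byte semantics: the same range is compared against each UTF-8 byte,
# so every byte of both multibyte characters is inside the range.
by_byte = re.sub(rb"[\x80-\xff]", b"?", text.encode("utf-8"))

print(by_codepoint)  # a?€b
print(by_byte)       # b'a?????b'
```

With character semantics, a two-digit escape can never reach a character such as €, which is the limitation at issue here.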

Thus, given that \xff is the largest codepoint value that can currently be expressed with an escape, which rules out matching the full range of Unicode characters, I suggest the following:

At https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html#Bracket-Expressions:

- document this limitation
- recommend the workaround of using actual characters rather than codepoint escapes as the range endpoints.
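
As a rough illustration of that workaround, here is a sketch with literal characters as the range endpoints. Again this is Python's re module standing in for a regex engine with character semantics, not gawk, and the Greek range is only an example I picked:

```python
import re

text = "aλb€"  # λ is U+03BB, € is U+20AC

# Range endpoints written as literal characters: α (U+03B1) through ω (U+03C9).
# Both endpoints lie above \xff, so a two-digit hex escape could not express them.
masked = re.sub("[α-ω]", "?", text)

print(masked)  # a?b€
```

Whether gawk handles such a range identically in every UTF-8 locale is exactly what the documentation note would need to confirm.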

Best,

Michael

On Jan 31, 2016, at 2:21 PM, Aharon Robbins <address@hidden> wrote:

Hi.

Thanks for the notes. The current code base should not dump core,
although I do see a core dump with stock 4.1.3.

The questions raised are messy. I don't have good answers. I think
that if you use [...] with real UTF-8 encoded characters as the
start and end point of the ranges, things will work OK. But I'm not
sure.

There is no intent to support things like \x10f7ff. If such a thing
works it's by accident and it won't last; the master branch was changed
to accept no more than two hex digits after \x.

I am not in a rush to add things like \uXXXX to gawk.

For now, you are probably best off avoiding things like [\x80-\xFF] in
Unicode locales. Or using LC_ALL=C.

Thanks,

Arnold

P.S. I'm curious what current GNU grep does with such things. Thanks.

From:	Michael Klement
Subject:	Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
Date:	Sun, 7 Feb 2016 10:54:43 -0500