Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
From: Aharon Robbins
Subject: Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
Date: Fri, 19 Feb 2016 16:00:04 +0200
User-agent: Heirloom mailx 12.5 6/20/10
Hi.
> > Generally, it sounds like the right thing to do is:
> >
> > - in a UTF-8 locale, *always* deal with *characters* (Unicode
> > codepoints), not bytes
> > - specifically, when encountering \xhh, compare it to the *Unicode
> > codepoint* of the character at hand
> >
> >
> > Always dealing with characters makes sense to me, especially given that
> > you can *mix* Unicode characters and \x*hh* escapes in a single bracket
> > expression.
> >
> > Thus, given that \xff is the max. codepoint value that can currently be
> > expressed, which doesn't allow matching the full range of Unicode
> > characters, I suggest the following:
> >
> > - At https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html#Bracket-Expressions:
> > - document this limitation
> > - recommend the workaround of using actual characters rather than
> > codepoint escapes as the range endpoints.
This is what I've done. The changes will eventually propagate to the
repo.
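To illustrate the limitation discussed above, here is a minimal sketch in
Python (not gawk; the variable names are made up for this example, and
nothing here reflects gawk's internals). It shows that a character's code
point differs from its UTF-8 bytes, that a two-digit \xhh escape cannot
name a code point above U+00FF, and that a bracket range whose endpoints
are literal characters has no such restriction.

```python
import re

# A character's code point versus its UTF-8 encoding.
ch = "é"                          # U+00E9
codepoint = ord(ch)               # 0xE9: one code point
utf8_bytes = ch.encode("utf-8")   # two bytes in UTF-8: 0xC3 0xA9

# A two-hex-digit escape can express values only up to 0xFF.
max_escape = 0xFF
euro = "€"                        # U+20AC, well above 0xFF
fits_in_escape = ord(euro) <= max_escape

# Workaround idea: use literal characters as the range endpoints.
# Python's re module compares code points in a str pattern, so this
# range covers the Greek lowercase letters alpha through omega.
greek_lower = re.compile("[α-ω]")
matches_lambda = greek_lower.fullmatch("λ") is not None

print(hex(codepoint), utf8_bytes, fits_in_escape, matches_lambda)
```

The same idea carries over to an awk bracket expression: writing the
endpoint characters themselves avoids having to spell them as \xhh.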
I will talk to other GNU maintainers about how we want to deal with
this issue; I don't want to invent something on my own and have it
be different from other GNU utilities.
Thanks,
Arnold