bug-gawk

Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug


From: Hermann Peifer
Subject: Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
Date: Fri, 29 Jan 2016 20:42:09 +0100
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.5.1

On 2016-01-29 17:02, Michael Klement wrote:
> Thanks, Hermann.
> 
> LC_ALL=C is an effective workaround for the case at hand, though it
> precludes working with Unicode characters as such in the rest of the
> script (which may never be needed).
> 
> Another workaround, though not fully equivalent, is:
> 
>     echo 'hät' | gawk '{ gsub(/[^\x00-\x7F]/, ""); print }'
> 
> 
> This works without LC_ALL=C, but excludes ALL non-ASCII characters, not
> just those in the range 128 - 255.
> 
> Which brings me to a question (couldn't figure it out from the docs): 
> 
> Are the \x.. escapes inside bracket expressions *supposed* to work with
> *all Unicode* codepoints?
> In other words: in a UTF-8 locale, *can you specify Unicode code-point
> ranges* (that go way beyond 0xFF) rather than just individual-byte ranges?
> 
> The following does appear to work in locale "en_US.UTF-8", but it may be
> accidental:
> 
>     # Exclude all non-ASCII chars (exclude the entire non-ASCII Unicode codepoint range).
> 
>     echo 'hät' | gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'
> 
> 
>     The crash prevents me from testing the complement.
> 
> 
>     Obviously, without a construct to delimit the hex digits ({…}
>     doesn't work), there's ambiguity.
> 
> 
> Either way, I suggest clarifying the behavior
> at 
> https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html#Bracket-Expressions
> 
> 

In my understanding, bracket expressions like [\x80-\xff] are about byte
ranges, not code point ranges. I might be wrong though and hope that
others on the list can help.
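
To illustrate why the byte vs. code point distinction matters for this example: in UTF-8, 'ä' (U+00E4) is stored as the two bytes 0xC3 0xA4, and both fall inside 0x80-0xFF. The sketch below only inspects the bytes with standard POSIX tools; it does not call gawk. The variable names (a_umlaut, bytes, nbytes) are made up for illustration.

```shell
# 'ä' (U+00E4) written as its two UTF-8 bytes in octal, so that this
# script itself stays pure ASCII.
a_umlaut=$(printf '\303\244')

# Dump the bytes in hex and strip od's padding: prints "c3a4".
bytes=$(printf '%s' "$a_umlaut" | od -An -tx1 | tr -d ' \n')
echo "$bytes"

# Count the bytes: one character on screen, two bytes on the wire.
nbytes=$(printf '%s' "$a_umlaut" | wc -c | tr -d ' ')
echo "$nbytes"
```

So a matcher that works byte by byte sees two separate values in the range 0x80-0xFF here, while a matcher that works on code points sees the single value 0xE4.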

About your last example: see below what I am getting here.

Hermann

$ # gawk 4.1.3: *seems* to work, by accident?
$ echo 'hät' | /opt/local/bin/gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'
ä

$ # gawk 4.1.3: why are those chars in the middle gone?
$ echo ÄÖÜŚŜŹŻŽÅØÆ | /opt/local/bin/gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'
ÄÖÜÅØÆ

$ # gawk/master doesn't like the range in the first place
$ echo 'hät' | gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'
gawk: cmd. line:1: error: Invalid collation character: /[^�-f7ff]/
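
For comparison, here is a minimal sketch of the LC_ALL=C workaround mentioned in the quoted message, using a plain byte range instead of the range above. It assumes gawk is on PATH, and the exact command is an illustration, not necessarily the one from the original report. In the C locale every byte is its own character, so this should print "ht".

```shell
# Assumption: gawk is installed. The input is 'hät', with 'ä' given as
# its UTF-8 bytes in octal so that this script stays pure ASCII.
if command -v gawk >/dev/null 2>&1; then
    # Under LC_ALL=C the range 0x80-0xFF is a byte range; it matches
    # both bytes of 'ä', and gsub() deletes them.
    stripped=$(printf 'h\303\244t\n' |
        LC_ALL=C gawk '{ gsub(/[\x80-\xff]/, ""); print }')
    echo "$stripped"
else
    echo "gawk not found; skipping" >&2
fi
```

As noted in the quoted message, this comes at the cost of treating the whole input as bytes for the rest of the script.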




