From: arnold
Subject: Re: odd behavior of length(), match() and field splitting with multi-byte characters
Date: Mon, 01 Jul 2024 06:15:31 -0600
User-agent: Heirloom mailx 12.5 7/5/10

Hi Ed.

I cannot reproduce any of these on my (work) Ubuntu 22.04 system.
This would appear to be a Cygwin issue.

Corinna?

Thanks,

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:
> Configuration Information [Automatically generated, do not change]:
> Machine: x86_64
> OS: cygwin
> Compiler: gcc
> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
>   -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
>   --param=ssp-buffer-size=4
>   -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1
>   -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1
>   -DNDEBUG
> uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64 2024-04-03 17:25 UTC x86_64 Cygwin
> Machine Type: x86_64-pc-cygwin
>
> Gawk Version: 5.3.0
>
> Attestation 1:
> I have read
> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
> Yes
>
> Attestation 2:
> I have not modified the sources before building gawk.
> True
>
> Description:
> gawk is reporting odd lengths and matches of strings
> when multi-byte characters are involved.
>
> Repeat-By:
> Someone on Stack Overflow asked about a couple of issues that,
> so far at least, no one there can explain and that seem to just be
> bugs.
>
> 1)
> https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138715434_78676444
>
> and
> https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138720207_78676444:
>
> If we write 4 characters (two of them 4-byte UTF-8 sequences) as 10 bytes using:
>
> $ echo '61F09F948DF09F948E62' | xxd -r -p > file1
> $
>
> and run the following gawk command on it we get the output shown:
>
> $ LC_ALL=en_US.utf8 gawk '{print(length($0))}' file1
> 6
> $
>
> i.e. 6 instead of 4. If we run
>
> $ printf 'F0989A9F' | xxd -r -p |
>     LC_ALL=en_US.utf8 awk -F '' '{print NF, length(); for (i=1; i<=NF; i++) print $i}' |
>     cat -A
> 2 2$
> M-pM-^XM-^Z$
> M-^_$
> $
>
> it shows that what is intended to be a single 4-byte character
> is being treated as 2 characters, one of 3 bytes and the other of 1 byte.
>
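
For anyone comparing platforms, the counts in item 1 can be checked without gawk or a UTF-8 locale. The sketch below rebuilds the 10 bytes of file1 with printf octal escapes (so xxd is not needed) and counts bytes, code points, and UTF-16 code units using only tr and wc. The helper name `sample` is made up for this example. That the reported 6 happens to equal the UTF-16 code-unit count is only an observation; nothing in this message confirms it as the cause.

```shell
#!/bin/sh
# Same 10 bytes as file1 (hex 61 F09F948D F09F948E 62), written with octal
# escapes so that xxd is not required. "sample" is a hypothetical helper.
sample() { printf 'a\360\237\224\215\360\237\224\216b'; }

# Total number of bytes.
bytes=$(sample | LC_ALL=C wc -c)

# Code points: every byte that is NOT a UTF-8 continuation byte (0x80-0xBF)
# starts a new character, so delete the continuation bytes and count the rest.
chars=$(sample | LC_ALL=C tr -d '\200-\277' | wc -c)

# Lead bytes 0xF0-0xF7 start 4-byte sequences, i.e. code points above U+FFFF.
# Each of those needs two UTF-16 code units (a surrogate pair).
astral=$(sample | LC_ALL=C tr -cd '\360-\367' | wc -c)

echo "bytes=$((bytes)) chars=$((chars)) utf16_units=$((chars + astral))"
```

On this input the script prints `bytes=10 chars=4 utf16_units=6`. The 4 is the length the report expects; the 6 matches what the Cygwin build printed.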
> 2)
> https://stackoverflow.com/questions/78690533/why-does-the-match-function-not-work-in-this-particular-situation
>
> If we create some input using:
>
> $ echo '3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A' |
>     xxd -r -p > file2
>
> and then run the following on it, we get the expected output:
>
> $ LC_ALL=en_US.utf8 gawk '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
> abcdef
> $
>
> but if we set the `IGNORECASE` variable, we get a blank line as output:
>
> $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
>
> $
>
> unless we also remove the end-of-string anchor, `$`, from
> the end of the regexp:
>
> $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*/,a); print a[1]}' file2
> abcdef
> $
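
To compare the three match() variants from item 2 side by side on different systems, something like the following sketch may help. It rebuilds the 57 bytes of file2 with printf (the one non-ASCII character, bytes F0 9F 93 85, is given as octal escapes) and runs both regexps with IGNORECASE set to 0 and to 1. It assumes gawk is on PATH and that an `en_US.utf8` locale is installed; the variable names and the bracketed output format are choices made for this example, not part of the report.

```shell
#!/bin/sh
# Recreate file2 from the report in a temporary file, without xxd.
f=$(mktemp)
printf '<div><div>_</div>_<h1>abcdef_</h1>_</div><div>\360\237\223\205</div>\n' > "$f"
size=$(wc -c < "$f")
echo "input is $((size)) bytes"

# The two regexps from the report: with and without the trailing $ anchor.
# The capture is printed in brackets so an empty result is visible.
anchored='{ match($0, /^.*_<h1>(.*)_<\/h1>.*$/, a); print "[" a[1] "]" }'
unanchored='{ match($0, /^.*_<h1>(.*)_<\/h1>.*/, a); print "[" a[1] "]" }'

if command -v gawk >/dev/null 2>&1; then
    for ic in 0 1; do
        printf 'IGNORECASE=%s with $:    ' "$ic"
        LC_ALL=en_US.utf8 gawk -v IGNORECASE="$ic" "$anchored" "$f"
        printf 'IGNORECASE=%s without $: ' "$ic"
        LC_ALL=en_US.utf8 gawk -v IGNORECASE="$ic" "$unanchored" "$f"
    done
else
    # The three-argument match() is a gawk extension, so other awks are skipped.
    echo 'gawk not found; skipping the comparison' >&2
fi

rm -f "$f"
```

Going by the report, a build showing the problem would print `[]` only for the `IGNORECASE=1 with $` line and `[abcdef]` for the other three.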