Re: [bug-gawk] binary mode for strings?

I just found that even sprintf() in GAWK is affected by locale configuration.

$ mawk -f x3.awk | hexdump -C
00000000 80 20 80 20 31 0a |. . 1.|
00000006

$ gawk -f x3.awk | hexdump -C
00000000 80 20 c2 80 20 31 0a |. .. 1.|
00000007

$ gawk -b -f x3.awk | hexdump -C
00000000 80 20 80 20 31 0a |. . 1.|
00000006

$ cat x3.awk
BEGIN {
    BINMODE="rw";
    X="";
    X= X sprintf("%c", 128);
    L= split(X, XX, "");
    printf "\x80 %s %d\n", X, L;
}

I give up, I am going to learn Python...

On Sun, May 6, 2018 at 8:52 PM, Petr Slansky <address@hidden> wrote:

Thank you for the recommendation.

I found this problem when I decided to rewrite an old script of mine from Perl to AWK. I like AWK and use it a lot because I process text files. The rewritten AWK script worked well on several test files until it hit files containing several extended ASCII characters. The source of the problem was difficult to find. Maybe the real source of trouble is that the converter (used internally by substr(), length(), etc.) silently discards characters that cannot be mapped to UTF-8 code points. I would prefer a fatal error to silently ignored data, which results in corrupted reports. I found the problem only because I was comparing the report generated by my old Perl script with the one from the new AWK script, so I noticed the difference (and I feel I was lucky that I tested on a computer with GAWK).

Another question on my mind: is split() working correctly in GAWK? I use it to compute the result in my blength() function, but maybe it works only because of a bug in GAWK, and once that is fixed (so that split() handles strings according to the locale settings) it will stop working.
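For reference, a byte count that does not depend on how split() treats an empty separator: gawk documents that with -b (--characters-as-bytes), or when run in the C locale, it treats every byte as one character, so length() then counts bytes. The following is only a minimal sketch under that assumption. The helper name byte_lengths is hypothetical and does not come from this thread, and awk implementations other than gawk and mawk may behave differently.

```shell
# Hypothetical helper: print the byte length of each input line.
# LC_ALL=C asks awk to treat every byte as a single character,
# so length() reports bytes instead of locale-dependent characters.
byte_lengths() {
    LC_ALL=C awk '{ print length($0) }'
}
```

With gawk or mawk, `printf '\200 \200\n' | byte_lengths` should report 3 for the three bytes 80 20 80.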

On Sun, May 6, 2018 at 6:11 PM, <address@hidden> wrote:
Petr Slansky <address@hidden> wrote:

> From my point of view, GAWK has to add some way to switch string processing
> functions to the "old way" and to do it from the script. Just add another magic
> variable... ;-)

Thank you for your opinion.

I disagree with you; gawk's job is to process text, not mess with
character encodings. I cannot solve all the world's problems with
one program, and I have learned not to try. Gawk already has too many
magic variables; they have been part of my painful learning curve.

I suggest that the main solution is to use scripting: real shell scripting
on Linux, and batch files (or Cygwin or Windows Subsystem for Linux)
on Windows, to process your files, using iconv or whatever other tool
is necessary.
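As an illustration of the iconv approach, here is a minimal sketch. It assumes the input files are ISO-8859-1, which is only an assumption, since the thread does not say which encoding the files use. The helper name awk_latin1 is made up for this example. The input is converted to UTF-8, processed by awk, and converted back, so awk only ever sees valid UTF-8. For the string functions to count characters, awk would have to run in a UTF-8 locale.

```shell
# Hypothetical helper: run an awk program over ISO-8859-1 input.
# $1 is the awk program text; data is read from stdin and written to stdout.
# Assumption: the data really is ISO-8859-1; substitute the actual encoding.
awk_latin1() {
    iconv -f ISO-8859-1 -t UTF-8 |      # legacy bytes -> valid UTF-8
        awk "$1" |                      # awk processes well-formed text
        iconv -f UTF-8 -t ISO-8859-1    # convert the result back
}
```

Usage might look like `awk_latin1 '{ print length($0) }' < report.txt`, where report.txt is a placeholder file name. If the awk program emits characters that ISO-8859-1 cannot represent, the final iconv step will report an error rather than drop them silently.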

Of course, since gawk is Free Software, you are free to change it as you
like (or hire someone to do it for you if you are not a C programmer),
and this is another possible option available to you.

You are also free to use mawk or Brian Kernighan's awk everywhere, and
that represents a third option.

Best wishes,

Arnold

From:	Petr Slansky
Subject:	Re: [bug-gawk] binary mode for strings?
Date:	Sun, 6 May 2018 22:50:36 +0200