bug-gawk

Re: [bug-gawk] binary mode for strings?


From: arnold
Subject: Re: [bug-gawk] binary mode for strings?
Date: Tue, 01 May 2018 16:17:58 -0600
User-agent: Heirloom mailx 12.4 7/29/08

Hi.

Thank you for the report.

If you are using gawk 4.2.0 or later, you can cycle through PROCINFO["argv"]
to see if the -b option is present, and if not issue an error message.
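
A minimal sketch of that check, assuming gawk 4.2.0 or later. The file name
require_b.awk and the function name has_b_option are hypothetical. The check
only recognizes -b or --characters-as-bytes given as a separate argument; it
does not handle bundled short options such as -bf, abbreviated long options,
or byte mode obtained through LC_ALL=C.

```shell
# require_b.awk (hypothetical name): abort unless gawk was started with -b.
cat > require_b.awk <<'EOF'
# Return 1 if -b or --characters-as-bytes appears among gawk's own arguments.
function has_b_option(    i, arg) {
    for (i = 1; i in PROCINFO["argv"]; i++) {
        arg = PROCINFO["argv"][i]
        if (arg == "--")        # end of options; the rest are operands
            break
        if (arg == "-b" || arg == "--characters-as-bytes")
            return 1
    }
    return 0
}

BEGIN {
    if (!has_b_option()) {
        print "error: this script must be run with gawk -b" > "/dev/stderr"
        exit 2
    }
}
EOF

# Load the check ahead of the real program text.
gawk -b -f require_b.awk -e 'BEGIN { print "byte mode requested" }'
```

Loading the check with its own -f ahead of the main script keeps it reusable;
without -b the run stops in the BEGIN rule with exit status 2.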

There is no other way to tell gawk to treat bytes as characters except with
the -b option.

You may wish to consider using iconv along with gawk, and letting iconv
convert the data from CP1250 to UTF-8 or something else more suitable
for gawk on Linux.
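
A rough sketch of that pipeline, assuming an iconv that knows the CP1250
charset name (glibc's does) and a UTF-8 locale for gawk. The sample line is
recreated from the hexdump in the report.

```shell
# Sample input from the report: byte 0x80 (the euro sign in CP1250) at position 15.
printf 'Price is 35.12\200 (EUR).\n' > test.txt

# Convert to UTF-8 first, so that gawk sees a valid multibyte character
# instead of a stray byte.
iconv -f CP1250 -t UTF-8 test.txt |
    gawk '{ print length($0), substr($0, 15, 1) }'
```

If the result has to be CP1250 again, the output of gawk can be piped through
iconv -f UTF-8 -t CP1250.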

Thanks,

Arnold

Petr Slansky <address@hidden> wrote:

> Hello,
>
> I have trouble with the locale. I run gawk on Linux, where the locale is set
> to "utf8", but I have to process text files in CP1250 encoding. I have no way
> to disable the locale from within an AWK script, and I miss such an option.
>
> This is sentence from GAWK documentation:
>
> > Gawk is multibyte aware. This means that index(), length(), substr()
> > and match() all work in terms of characters, not bytes.
>
> I already learned that I can start gawk with the "-b" switch or with LC_ALL,
> like 'LC_ALL=C gawk -f script.awk data',
> but there is no way to verify from the AWK script that the -b switch was
> used (I can get the value of LC_ALL from ENVIRON["LC_ALL"]). The problem is
> that when the user forgets to activate "binary mode" with the -b switch, the
> result of parsing is wrong because AWK removes "extended" ASCII characters
> from the results returned by substr(), length() returns the wrong value, etc.
>
> I assume LANG=en_US.UTF-8, LC_ALL is empty.
>
> It is confusing: when I use 'printf "%s", $0;', I see all extended
> characters in the output, but when I run 'for (I=1; I<=length(); I++) printf
> "%c", substr($0,I,1);' I see that characters are missing (those with code >
> 127; I guess these are mapped to invalid UTF-8 code points).
>
> I already tried to use BINMODE="r" but it doesn't affect substr().
>
> So, I miss a way to force gawk to handle strings in terms of bytes, not
> characters, and to activate such an option from the script.
>
> I have a simple demo to show the difference when a byte with code 0x80
> (could be the EUR symbol) is in the data file. I am interested in the first
> case; I think there is no way to configure AWK (from the script) to process
> the file correctly, to get the EUR symbol at position 15, or to detect that
> something is wrong...
>
> $ awk -f demo1.awk test.txt
> Price is 35.12� (EUR).
> Price is 35.12 (EUR).
>                  U
> $ awk -b -f demo1.awk test.txt
> Price is 35.12� (EUR).
> Price is 35.12� (EUR).
>                  E
> $ hexdump -C test.txt
> 00000000  50 72 69 63 65 20 69 73  20 33 35 2e 31 32 80 20  |Price is 35.12. |
> 00000010  28 45 55 52 29 2e 0a                              |(EUR)..|
> 00000017
>
> $ cat demo1.awk
> {
>   printf "%s\n", $0;
>   for (I=1; I<=length(); I++) printf "%c", substr($0,I,1); printf "\n";
>   printf "%18s\n", substr($0,18,1);
> }


