Re: [bug-gawk] binary mode for strings?
From: arnold
Subject: Re: [bug-gawk] binary mode for strings?
Date: Tue, 01 May 2018 16:17:58 -0600
User-agent: Heirloom mailx 12.4 7/29/08
Hi.
Thank you for the report.
If you are using gawk 4.2.0 or later, you can cycle through PROCINFO["argv"]
to see if the -b option is present, and if not issue an error message.
There is no other way to tell gawk to treat bytes as characters except with
the -b option.
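As a rough illustration, a check over PROCINFO["argv"] might look like the
sketch below. It assumes gawk 4.2.0 or later and only recognizes the exact
spellings -b and --characters-as-bytes; bundled short options (such as -bf)
or abbreviated long options would need extra handling. The error text and
variable names are made up for the example.

```awk
# Sketch: refuse to run unless gawk was started in bytes-as-characters mode.
# Assumption: PROCINFO["argv"] is available (gawk 4.2.0 or later).
BEGIN {
    found = 0
    if ("argv" in PROCINFO) {
        for (i in PROCINFO["argv"]) {
            arg = PROCINFO["argv"][i]
            # Only the exact option spellings are matched here.
            if (arg == "-b" || arg == "--characters-as-bytes")
                found = 1
        }
    }
    if (!found) {
        print "error: run this script with gawk -b" > "/dev/stderr"
        exit 1
    }
}
```

Placed at the top of the script, this stops before any input is read when
the option is missing.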
You may wish to consider using iconv along with gawk, and letting iconv
convert the data from CP1250 to UTF-8 or something else more suitable
for gawk on Linux.
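If the conversion should be driven from the script itself, one possibility is
to read the file through an iconv command pipe, roughly as sketched below.
This assumes iconv is on PATH, that the data file is given as the first
argument, and that its name contains no spaces or shell metacharacters; the
variable names are only for illustration.

```awk
# Sketch: read a CP1250 file through iconv so gawk sees valid UTF-8.
# Run as: gawk -f thisfile.awk test.txt
# With only a BEGIN rule, gawk does not read the file named in ARGV itself.
BEGIN {
    file = ARGV[1]
    cmd = "iconv -f CP1250 -t UTF-8 " file
    while ((cmd | getline line) > 0) {
        # In a UTF-8 locale, length() and substr() now count characters,
        # so the byte 0x80 (euro sign in CP1250) becomes one character.
        printf "%d: %s\n", length(line), substr(line, 15, 1)
    }
    close(cmd)
}
```

The same effect can be had outside the script by piping the output of iconv
into gawk on the command line.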
Thanks,
Arnold
Petr Slansky <address@hidden> wrote:
> Hello,
>
> I have trouble with locale. I run gawk at Linux, locale is set to "utf8"
> but I have to process text files in CP1250 encoding. I have no way to
> disable "locale" from AWK script and I miss such option.
>
> This is sentence from GAWK documentation:
>
> > Gawk is multibyte aware. This means that index(), length(), substr()
> and match() all work in terms of characters, not bytes.
>
> I already learned that I can start gawk with switch "-b" or with LC_ALL,
> like 'LC_ALL=C gawk -f script.awk data'
> but there is no way to verify from AWK script that switch -b was used (I
> can get value of LC_ALL, ENVIRON["LC_ALL"]). Problem is that when user
> forgets to activate "binary mode" with -b switch, result of parsing is
> wrong because AWK removes "extended" ASCII characters from results returned
> by substr() and length() returns wrong value, etc.
>
> I assume LANG=en_US.UTF-8, LC_ALL is empty.
>
> It is confusing, when I use 'printf "%s", $0;', I see all extended
> characters in the output but when I run 'for (I=1; I<=length(); I++) printf
> "%c", substr($0,I,1);' I see that characters are missing (for ASCII code >
> 127, I guess these are mapped to invalid utf8 codepoints).
>
> I already tried to use BINMODE="r" but it doesn't affect substr().
>
> So, I miss a way to force gawk to use strings in terms of bytes, not
> characters. To activate such option from the script.
>
> I have simple demo, to show difference when ASCII with code 0x80 (could be
> EUR symbol) is in data file. I am interested in the first case, I think
> there is no way to configure AWK (from the script) to process file
> correctly, to get EUR symbol at position 15 or to detect that something is
> wrong...
>
> $ awk -f demo1.awk test.txt
> Price is 35.12� (EUR).
> Price is 35.12 (EUR).
> U
> $ awk -b -f demo1.awk test.txt
> Price is 35.12� (EUR).
> Price is 35.12� (EUR).
> E
> $ hexdump -C test.txt
> 00000000 50 72 69 63 65 20 69 73 20 33 35 2e 31 32 80 20 |Price is 35.12. |
> 00000010 28 45 55 52 29 2e 0a |(EUR)..|
> 00000017
>
> $ cat demo.awk
> {
> printf "%s\n", $0;
> for (I=1; I<=length(); I++) printf "%c", substr($0,I,1); printf "\n";
> printf "%18s\n", substr($0,18,1);
> }