
[bug-gawk] binary mode for strings?

From: Petr Slansky
Subject: [bug-gawk] binary mode for strings?
Date: Tue, 1 May 2018 19:07:43 +0200


I am having trouble with locales. I run gawk on Linux, where the locale is set to UTF-8, but I have to process text files in CP1250 encoding. There is no way to disable the locale from within an AWK script, and I miss such an option.

This is a sentence from the gawk documentation:

> Gawk is multibyte aware. This means that index(),  length(),  substr() and match() all work in terms of characters, not bytes.

I have already learned that I can start gawk with the "-b" switch or with LC_ALL set, as in 'LC_ALL=C gawk -f script.awk data', but there is no way to verify from the AWK script that -b was used (I can only read the value of LC_ALL via ENVIRON["LC_ALL"]). The problem is that when the user forgets to activate "binary mode" with -b, the parsing result is wrong: AWK removes "extended" ASCII characters from the results returned by substr(), length() returns a wrong value, and so on.
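One possible way to detect the problem from inside the script, offered as a sketch and not as a documented gawk feature: compare length() of a known two-byte string against 2. The bytes "\303\251" form a single UTF-8 character, so an awk that counts characters in a UTF-8 locale should report 1, while an awk that counts bytes reports 2. The name byte_guard is hypothetical, and the error message is only an illustration.

```shell
# Hypothetical guard, meant to be placed at the top of the AWK script.
# "\303\251" is two bytes that form one UTF-8 character.
# length() gives 2 when strings are handled as bytes,
# and is expected to give 1 when gawk counts characters in a UTF-8 locale.
byte_guard='BEGIN {
    if (length("\303\251") != 2) {
        print "error: strings are not byte-oriented; use gawk -b or LC_ALL=C" > "/dev/stderr"
        exit 1
    }
}'

# With byte semantics forced through the environment, the guard passes
# and the rest of the program runs.
LC_ALL=C awk "$byte_guard"' BEGIN { print "byte mode OK" }'
```

If the guard fires, the script stops with exit status 1 before any input is read, so a forgotten -b shows up as an error instead of as silently wrong output.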

I assume LANG=en_US.UTF-8, LC_ALL is empty.

It is confusing: when I use 'printf "%s", $0;', I see all extended characters in the output, but when I run 'for (I=1; I<=length(); I++) printf "%c", substr($0,I,1);', some characters are missing (those with code > 127; I guess they are mapped to invalid UTF-8 code points).

I have already tried setting BINMODE="r", but it does not affect substr().

So I am missing a way to force gawk to treat strings as bytes rather than characters, and to activate that option from within the script.

I have a simple demo that shows the difference when a byte with code 0x80 (could be the EUR symbol) is in the data file. I am interested in the first case: I think there is no way to configure AWK (from the script) to process the file correctly, that is, to get the EUR symbol at position 15, or to detect that something is wrong...

$ awk -f demo1.awk test.txt 
Price is 35.12� (EUR).
Price is 35.12 (EUR).
$ awk -b -f demo1.awk test.txt 
Price is 35.12� (EUR).
Price is 35.12� (EUR).
$ hexdump -C test.txt 
00000000  50 72 69 63 65 20 69 73  20 33 35 2e 31 32 80 20  |Price is 35.12. |
00000010  28 45 55 52 29 2e 0a                              |(EUR)..|

$ cat demo.awk
{
  printf "%s\n", $0;
  for (I=1; I<=length(); I++) printf "%c", substr($0,I,1); printf "\n";
  printf "%18s\n", substr($0,18,1);
}
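
For anyone who wants to reproduce this, the data file can be recreated from the hexdump above. This is a minimal sketch, assuming a POSIX shell whose printf accepts octal escapes (\200 is the byte 0x80):

```shell
# Recreate test.txt byte for byte: 22 bytes of text plus a trailing newline.
printf 'Price is 35.12\200 (EUR).\n' > test.txt

# Show the bytes to compare against the hexdump in this message.
od -An -tx1 test.txt
```

The od output should list the same 23 bytes as the hexdump, with 80 at offset 0x0e, which is character position 15 in the line.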
