Re: [bug-gawk] [External] Re: Invalid Characters Causing Problems in awk

From: Wolfgang Laun

Subject: Re: [bug-gawk] [External] Re: Invalid Characters Causing Problems in awk 4.0.2

Date: Fri, 24 Aug 2018 08:28:07 +0200

File diacrit.txt contains all 20 non-ASCII characters you need for Spanish, on one line (terminated by \n) and encoded in UTF-8:

¡¿ªºÁáÉéÍíÑñÓóÚúÜüÇç

$ wc -c diacrit.txt
41 diacrit.txt
$ wc -m diacrit.txt
21 diacrit.txt
$ od -tx1 diacrit.txt
0000000 c2 a1 c2 bf c2 aa c2 ba c3 81 c3 a1 c3 89 c3 a9
0000020 c3 8d c3 ad c3 91 c3 b1 c3 93 c3 b3 c3 9a c3 ba
0000040 c3 9c c3 bc c3 87 c3 a7 0a

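The character count (wc -m) is one less than the byte count (wc -c) for each character with a codepoint beyond 0x7F here, i.e., all of the above require 2 bytes for their UTF-8 encoding: 20 characters at 2 bytes each plus the newline give 41 bytes for 21 characters.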
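If wc -m is not trustworthy because the locale is in doubt, the same two numbers can be obtained in a locale-independent way. The following is only a sketch, assuming a tr that handles octal byte ranges under LC_ALL=C (GNU tr does) and assuming the script itself is saved as UTF-8. It counts characters as "bytes that are not UTF-8 continuation bytes (0x80-0xBF)", which is correct only for valid UTF-8 input.

```shell
# Recreate the test file: 20 two-byte characters plus a newline.
printf '%s\n' '¡¿ªºÁáÉéÍíÑñÓóÚúÜüÇç' > diacrit.txt

# Byte count, as reported by wc -c (strip padding some wc versions add).
bytes=$(wc -c < diacrit.txt | tr -d ' ')

# Character count without relying on the locale: delete every
# continuation byte (octal 200-277 = 0x80-0xBF) and count what is left.
chars=$(LC_ALL=C tr -d '\200-\277' < diacrit.txt | wc -c | tr -d ' ')

echo "bytes=$bytes chars=$chars"    # bytes=41 chars=21
```

The difference between the two numbers is the number of continuation bytes, so a non-zero difference is a hint that multibyte characters are present.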

The trademark sign (™) is codepoint U+2122, which requires 3 bytes in the UTF-8 encoding. Here it is as one character on a single line:

™

$ wc -c diacrit.txt
4 diacrit.txt
$ wc -m diacrit.txt
2 diacrit.txt

$ od -tx1 diacrit.txt
0000000 e2 84 a2 0a
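The number of bytes in a UTF-8 sequence can be read off its first byte. As a rough sketch (utf8_len is a hypothetical helper, not part of any tool, and it deliberately ignores the few lead-byte values that UTF-8 forbids, such as c0, c1 and f5-f7):

```shell
# Hypothetical helper: given a lead byte in hex (e.g. e2), print the
# length of the UTF-8 sequence it starts, or 0 if it cannot start one.
utf8_len() {
    b=$((0x$1))
    if   [ "$b" -lt 128 ]; then echo 1   # 00-7f: plain ASCII
    elif [ "$b" -lt 192 ]; then echo 0   # 80-bf: continuation byte
    elif [ "$b" -lt 224 ]; then echo 2   # c0-df: two-byte sequence
    elif [ "$b" -lt 240 ]; then echo 3   # e0-ef: three-byte sequence
    elif [ "$b" -lt 248 ]; then echo 4   # f0-f7: four-byte sequence
    else echo 0
    fi
}

# The trademark sign followed by a newline: e2 84 a2 0a
printf '\342\204\242\n' > tm.txt

# Take the first byte of the file in hex and look up its sequence length.
lead=$(od -An -tx1 -N1 tm.txt | tr -d ' \n')
echo "lead byte $lead starts a $(utf8_len "$lead")-byte sequence"
```

For the Spanish characters above the lead bytes are c2 and c3, which fall in the two-byte range.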

Regards

Wolfgang

On 23 August 2018 at 22:35, Gilbert, Brandon (Synchrony) <address@hidden> wrote:

Thank you.

I have noticed that doing a wc -c on a record with a special character, the character count is 2 bytes less than a record that does not have a special character in it. Would this indicate the multibyte encoding?

…Brandon

From: Wolfgang Laun <address@hidden>
Sent: Thursday, August 23, 2018 11:10 AM
To: Gilbert, Brandon (Synchrony) <address@hidden>
Cc: address@hidden
Subject: Re: [External] Re: [bug-gawk] Invalid Characters Causing Problems in awk 4.0.2

Hi Gilbert,

Programs running on a system with the setting en_US.UTF-8, and acting accordingly, will process Ñ ñ encoded as \xc3\x91 \xc3\xb1 correctly and without any complaint. If the program is led to believe that the data is encoded according to ISO-8859-1, not much would happen except that a single Ñ or ñ would be treated as two characters. If, however, Ñ ñ are encoded according to ISO-8859-1 as \xd1 and \xf1, a program following en_US.UTF-8 will have to indicate an error: in UTF-8 (a multibyte encoding) each of these bytes can only start a multibyte sequence and must be followed by continuation bytes in the range \x80-\xbf, which ISO-8859-1 text will normally not supply.
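One way to check which of the two cases applies to a given file is to ask iconv to read it as UTF-8. This is a sketch only: is_utf8 is a made-up helper name, the file names are placeholders, and it assumes an iconv that exits with a non-zero status on an invalid input sequence (GNU iconv does).

```shell
# Ñ ñ in UTF-8 (c3 91, c3 b1) and in ISO-8859-1 (d1, f1).
printf '\303\221 \303\261\n' > utf8.txt
printf '\321 \361\n'         > latin1.txt

# Hypothetical helper: succeeds only if the file decodes as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

for f in utf8.txt latin1.txt; do
    if is_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8"
    fi
done

# If the data really is ISO-8859-1, convert it before handing it to
# a program that runs under a UTF-8 locale.
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > fixed.txt
cmp fixed.txt utf8.txt && echo "converted file matches the UTF-8 version"
```

Whether the input actually is ISO-8859-1 (and not, say, Windows-1252) is something iconv cannot tell you; that has to come from whoever produced the data.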

Using /usr/bin/od to look at the "raw" data is a useful first step to see what is going on.

-W