Re: [bug-gawk] [External] Re: Invalid Characters Causing Problems in awk

Hi,

It is mostly special Spanish characters in personal names and trademark characters in business names. Due to the confidentiality of the data, I am unable to send examples. I can say that when I pulled the records into Ultra-Edit and highlighted characters on the line, it showed the byte size as doubled (1 character showed a byte length of 2, 2 characters showed 4, etc.).

Doing some online research since sending the 1st e-mail to you, I found a message board where someone noted the following:

For a given awk implementation to work properly with non-ASCII characters (foreign letters), it must respect the active locale's character encoding, as reflected in the (effective) LC_CTYPE setting (run locale to see it).

These days, most locales use UTF-8 encoding, a multi-byte-on-demand encoding that is single-byte in the ASCII range, and uses 2 to 4 bytes to represent all other Unicode characters.

Thus, for a given awk implementation to recognize non-ASCII (accented, foreign) letters, it must be able to recognize multiple bytes as a single character.
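
To make the quoted point concrete, here is a minimal Python sketch of how UTF-8 uses one byte for ASCII and several bytes for other characters. The sample characters and the name `Muñoz™` are illustrative stand-ins chosen for this sketch, not values from the confidential data, and `utf8_width` is a hypothetical helper name.

```python
# Illustrative sample characters (assumed for this sketch, not from the real data).
samples = ["a", "ñ", "é", "™"]


def utf8_width(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    # ASCII stays at 1 byte; accented Latin letters take 2; the trademark sign takes 3.
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_width(ch)} byte(s) in UTF-8")

# A hypothetical business name: the character count and byte count differ.
name = "Muñoz™"
print(len(name), "characters,", len(name.encode("utf-8")), "bytes")
```

A tool that counts bytes rather than characters would see this 6-character name as 9 bytes, which is why a multibyte-aware awk has to group those bytes back into single characters.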

So I compared the locale command output on each system. The older system, which does not have problems with the characters, has LC_COLLATE=C, and the new system, which does have problems, has LC_COLLATE="en_US.UTF-8". All other settings match and are set to en_US.UTF-8. Could this be a cause?
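
The comparison described above can be sketched in Python roughly as follows. The two sample strings are abbreviated, hypothetical reconstructions of the `locale` output based only on the description here (the real output lists more variables), and `parse_locale` and `diff_locale` are names invented for this sketch.

```python
def parse_locale(text: str) -> dict:
    """Parse `locale` output (NAME=value lines) into a dict, dropping quotes."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip().strip('"')
    return settings


def diff_locale(old: str, new: str) -> dict:
    """Return {variable: (old_value, new_value)} for settings that differ."""
    a, b = parse_locale(old), parse_locale(new)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Abbreviated, assumed sample output; only LC_COLLATE differs, as described.
old_system = 'LANG=en_US.UTF-8\nLC_CTYPE="en_US.UTF-8"\nLC_COLLATE=C\n'
new_system = 'LANG=en_US.UTF-8\nLC_CTYPE="en_US.UTF-8"\nLC_COLLATE="en_US.UTF-8"\n'

print(diff_locale(old_system, new_system))
```

With these sample inputs, the only reported difference is LC_COLLATE, matching what was observed on the two systems.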

Thank you for your help!

…Brandon

From: Wolfgang Laun <address@hidden>
Sent: Wednesday, August 22, 2018 11:47 PM
To: Gilbert, Brandon (Synchrony) <address@hidden>
Cc: address@hidden
Subject: [External] Re: [bug-gawk] Invalid Characters Causing Problems in awk 4.0.2

What is a "non-standard character"? ISO 10646 is quite comprehensive. - Bug notices without examples aren't likely to cause a stir.

-W

On 22 August 2018 at 22:48, Gilbert, Brandon (Synchrony) <address@hidden> wrote:

Hi,

We are converting from one Linux system to another Linux system.

The old system has awk version 3.1.3 and the new system has awk 4.0.2.

In version 3.1.3, text records with non-standard characters are processed by awk with no problem.

In version 4.0.2, text records with non-standard characters are ignored and not processed.

Is there a way to fix this issue, or to be able to ignore non-standard characters with this newer version of awk? Or is there a newer version than 4.0.2 that will resolve this issue?

Thank you.

Brandon Gilbert
IT Analyst

Canton Video Committee Lead

Synchrony

T: 330-433-5042
E: address@hidden

4500 Munson St NW

Canton, OH 44718, U.S.

From:	Gilbert, Brandon (Synchrony)
Subject:	Re: [bug-gawk] [External] Re: Invalid Characters Causing Problems in awk 4.0.2
Date:	Thu, 23 Aug 2018 14:00:58 +0000