From: Gilbert, Brandon (Synchrony)
Subject: Re: [bug-gawk] [External] Re: Invalid Characters Causing Problems in awk 4.0.2
Date: Thu, 23 Aug 2018 14:00:58 +0000
It is mostly with special Spanish characters in names and trademark characters in business names. Due to the confidentiality of the data, I am unable to send examples. I can say that when I pulled the records into UltraEdit and highlighted characters on a line, the reported byte size was double the character count (1 character showed a byte length of 2, 2 characters showed 4, etc.).
Doing some online research since sending the first e-mail to you, I found a message board post where someone noted the following:
For a given awk implementation to work properly with non-ASCII characters (foreign letters), it must respect the active locale's character encoding, as reflected in the (effective) LC_CTYPE setting (run locale to see it).
These days, most locales use UTF-8 encoding, a multi-byte-on-demand encoding that is single-byte in the ASCII range, and uses 2 to 4 bytes to represent all other Unicode characters.
Thus, for a given awk implementation to recognize non-ASCII (accented, foreign) letters, it must be able to recognize multiple bytes as a single character.
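As a rough illustration of the point in that post, here is a small Python sketch (not awk itself) that shows how the character count and the UTF-8 byte count diverge for non-ASCII characters. The sample characters and the describe() helper are hypothetical and chosen only for illustration; they are not taken from the real data.

```python
# Illustration only: character count vs. UTF-8 byte count.
# The samples are hypothetical stand-ins, not the confidential records.
samples = [
    "n",        # plain ASCII letter
    "\u00f1",   # n with tilde (Spanish)
    "\u00e9",   # e with acute accent
    "\u2122",   # trademark sign
]


def describe(ch):
    """Return (number of characters, number of bytes when encoded as UTF-8)."""
    return len(ch), len(ch.encode("utf-8"))


for ch in samples:
    chars, nbytes = describe(ch)
    # Print the code point rather than the glyph so the output is ASCII-safe.
    print(f"U+{ord(ch):04X}: {chars} character(s), {nbytes} byte(s) in UTF-8")
```

A tool that counts bytes rather than characters would report the larger number for each non-ASCII sample, which is the kind of mismatch the quoted post describes.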
So I compared the output of the locale command on each system. The older system, which does not have problems with the characters, has LC_COLLATE=C; the new system, which does have problems, has LC_COLLATE="en_US.UTF-8". All other settings match and are set to en_US.UTF-8. Could this be the cause?
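For anyone wanting to repeat that comparison, here is a minimal Python sketch that diffs two saved copies of the locale output. The parse_locale() and diff_settings() helpers are hypothetical names, and the two sample outputs are abbreviated to three settings each; only the LC_COLLATE values and the en_US.UTF-8 setting come from this message.

```python
def parse_locale(text):
    """Parse `locale` output (KEY=value lines) into a dict, dropping quotes."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip blank or malformed lines
            settings[key] = value.strip('"')
    return settings


def diff_settings(old, new):
    """Return {key: (old_value, new_value)} for every setting that differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Abbreviated, assumed sample outputs; a real `locale` run prints more lines.
old_system = 'LANG=en_US.UTF-8\nLC_CTYPE="en_US.UTF-8"\nLC_COLLATE=C\n'
new_system = 'LANG=en_US.UTF-8\nLC_CTYPE="en_US.UTF-8"\nLC_COLLATE="en_US.UTF-8"\n'

differences = diff_settings(parse_locale(old_system), parse_locale(new_system))
print(differences)
```

With the sample inputs above, the only entry reported is LC_COLLATE, with C on the older system and en_US.UTF-8 on the new one.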
Thank you for your help!
From: Wolfgang Laun <address@hidden>
What is a "non-standard character"? ISO 10646 is quite comprehensive. - Bug notices without examples aren't likely to cause a stir.
On 22 August 2018 at 22:48, Gilbert, Brandon (Synchrony) <address@hidden> wrote: