bug-gnu-utils

Re: gawk ignores case with LANG=en_US


From: Bob Proulx
Subject: Re: gawk ignores case with LANG=en_US
Date: Wed, 13 May 2009 14:29:02 -0600
User-agent: Mutt/1.5.18 (2008-05-17)

Jim Keniston wrote:
> /^[a-z]/ { print }
> Assuming that environment variables LC_ALL and LC_CTYPE are
> undefined, if I run the above with the LANG environment variable
> set to "en_US.utf8" or "en_US", "A" matches "^[a-z]" and the
> output is as in output_buggy.  Setting IGNORECASE=0 in the
> command line or the script doesn't help.

Unfortunately what you are seeing is expected behavior.  It isn't a
bug in gawk.  Gawk is doing the correct thing there.

You don't like it and I don't like it but the-powers-that-be (not the
gawk maintainer but above him in libc and the standards committees)
have confused working with data on a computer with talking about
working with data on a computer.  The P.T.B. have decided that the
collation ordering (sort ordering) for data should be dictionary
ordering.  In dictionary ordering, case is folded together and
punctuation is ignored.  Setting LANG to any of the "en" locales
instructs the system to use dictionary sort ordering.  This
affects almost everything on the system that sorts or collates.

This means that [a-z] means different things depending upon the active
locale.  If the locale is C (aka POSIX) then it means the lower case
letters a through z.  But if the locale is en_US or another en_* locale
then it means [aAbBcC...yYz].  Note that 'Z' is left out of the set
because it collates after 'z' and therefore falls outside the
specified range a-z in the en locale.
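
To see the C-locale meaning of the range for yourself, something like
the following should do.  It is only a sketch: the function name
match_lower_c is made up for illustration, and it forces the C locale
for a single awk invocation by prefixing the command with LC_ALL=C.

```shell
# Hypothetical helper (name is illustrative): print the input lines
# that match ^[a-z] with awk forced into the C locale, where the range
# covers only the lower case ASCII letters a through z.
match_lower_c() {
    LC_ALL=C awk '/^[a-z]/ { print }'
}

# Feed it a mix of upper and lower case lines.
printf 'A\na\nZ\nz\n' | match_lower_c
# prints:
# a
# z
```

Running the same awk program without the LC_ALL=C prefix may or may
not also print the upper case lines, depending on the active locale
and on the awk and libc versions installed.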

This affects almost every command on the system that sorts, not
restricted to awk.  It affects sed, grep, ls, * file globbing, etc.
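
As a small illustration of the sorting side, here is a sketch that
runs sort in the C locale.  The wrapper name sort_c is an assumption
made up for the example.  In the C locale the comparison is by byte
value, so all upper case ASCII letters come before all lower case ones.

```shell
# Hypothetical wrapper (name is illustrative): sort stdin with the
# C locale forced for this one sort invocation only.
sort_c() {
    LC_ALL=C sort
}

printf 'b\nA\na\nB\n' | sort_c
# prints:
# A
# B
# a
# b
```

In an en_* locale the same input would typically come out with the
upper and lower case forms of each letter next to each other instead.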

Because of this, in scripts where it matters you will now usually
find that the script author has set LC_ALL to C.

  export LC_ALL=C

However, setting the standard locale (the C/POSIX/none locale is the
standard locale, all others are non-standard) in your login
environment will have other effects.  The setting is usually used to
control whether graphics terminals support Unicode/UTF-8 characters
and other i18n behavior.  Turning it off will probably prevent you
from using non-ASCII characters.  That is often not acceptable.

What I do to compromise is to set LANG=en_US.UTF-8 but also set
LC_COLLATE=C to force a standard sort order regardless.  I put this in
my $HOME/.bashrc file.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C
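
One caveat worth checking: if LC_ALL is set anywhere in the
environment it overrides LC_COLLATE, so the compromise only works
while LC_ALL is unset.  The following sketch checks the resulting sort
order.  The function name sort_with_compromise is hypothetical, and
the subshell is there only so that the unset does not leak into the
calling shell.

```shell
# Hypothetical helper (name is illustrative): sort stdin using
# LANG=en_US.UTF-8 together with LC_COLLATE=C.
sort_with_compromise() {
    (
        # LC_ALL, if set, would override LC_COLLATE, so clear it first.
        unset LC_ALL
        LANG=en_US.UTF-8 LC_COLLATE=C sort
    )
}

printf 'b\nA\na\nB\n' | sort_with_compromise
# prints:
# A
# B
# a
# b
```

The output is in C collation order even though LANG still names an
en_US locale.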

Hope this helps,
Bob



