bug-coreutils

Re: sort order changed in "sort" and "ls".


From: Bob Proulx
Subject: Re: sort order changed in "sort" and "ls".
Date: Fri, 13 Mar 2009 12:22:05 -0600
User-agent: Mutt/1.5.13 (2006-08-11)

Hello Rogier,

Rogier Wolff wrote:
> Guys, I've been using Unix systems for over twenty years. I see that
> other people manage to get their Unix systems to "talk" to them in
> Dutch: "Bestand is niet gevonden". Besides that this sentence is wrong
> in a lot of contexts, I'm used to "file not found". 

As a side note to this discussion, if you find translation errors or
possible improvements and can contribute fixes, that would be great.
Please send those to the translation team.  The address is usually
found in the .po file which contains the translation.  They are the
ones who will know best what to do to correct it.

> When I install a modern system, (possibly through debootstrap, a
> chrooted or nfs-mounted root setup), perl complains loudly about some
> LC_ variable not being set. The way I've found to get it to shut up,
> and get a sane, working apt-setup is to install "locales" whatever
> that may mean. I then have to select somthing that starts with "en" to
> get the system to speak english to me.

Perl complains because you apparently have LANG set without the
corresponding locale installed.  It doesn't complain if LANG is unset,
so you must have LANG set to some locale.  This was apparently set for
you without your knowledge.  Once it is set, the rest of the problem
follows.  I see the same thing here.  You can prevent perl from
complaining by unsetting LANG and any LC_* environment variables that
are set.
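
As a rough sketch of what that unsetting could look like in a POSIX
shell (the function name clear_locale_env is made up for illustration,
and it only catches variables that have been exported):

```shell
# Hypothetical helper (not part of any package): remove LANG and every
# exported LC_* variable from the current shell's environment.
clear_locale_env() {
    unset LANG
    # env lists exported variables as NAME=value; keep only LC_* names.
    for name in $(env | sed -n 's/^\(LC_[A-Z_]*\)=.*/\1/p'); do
        unset "$name"
    done
}
```

Run it in the shell that is about to start perl or apt.  It does not
change whatever startup file set the variables in the first place.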

With neither LANG nor any LC_* variable set, the default locale is the
traditional C locale.  This is also standardized by POSIX and is also
known as the POSIX locale.  The strings "C" and "POSIX" are equivalent,
but we traditionalists typically use "C" to emphasize that it is the
traditional behavior that we are setting.  Setting LANG=C is the same
as not setting it at all.

HOWEVER!  The LC_ALL variable overrides all other variables.  If you
have LC_ALL set then it doesn't matter what you have the other
variables set to.  LC_ALL is the highest priority override.  Similarly,
LC_COLLATE and the other LC_* variables override LANG unless LC_ALL
is set.  Therefore we *must* talk about the LC_* variables when
talking about LANG.  But I hate doing so since it makes the
conversation so messy.  It is much easier just to say that setting
LC_ALL=C is the biggest available override.  This is also often seen
in scripts to force standard behavior.
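
The precedence described above can be written out as a small shell
function.  This is only a sketch: effective_collate is a hypothetical
name, and it assumes the usual POSIX rule that a variable set to the
empty string counts as unset, with C as the fallback when nothing is
set.

```shell
# Hypothetical helper: print the locale that governs sort order,
# following the precedence LC_ALL > LC_COLLATE > LANG > "C".
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}
```

With LANG=en_US.UTF-8 and LC_COLLATE=C set it prints C.  Add
LC_ALL=en_US.UTF-8 and it prints en_US.UTF-8, because LC_ALL wins.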

> Apparently one of these steps has the side effect of changing the sort
> order.

You don't like it and I don't like it but the-powers-that-be have
confused working with data on a computer with talking about working
with data on a computer.  They have decided that the collation
ordering (sort ordering) for data should be dictionary ordering.  In
dictionary ordering, case is folded together and punctuation is
ignored.  Setting LANG to any of the "en" locales instructs the system
to use dictionary sort ordering.  This affects almost everything on
the system that sorts.
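
The byte-order side of this is easy to see with nothing more than
printf and sort.  The following is a small sketch.  The dictionary
order result is deliberately not shown, since it depends on which
locales are installed and on the C library version.

```shell
# Force the C locale for this one command only; sort then compares
# raw byte values, so all uppercase ASCII letters precede lowercase.
printf '%s\n' b A a B | LC_ALL=C sort
# prints A, B, a, b, one per line

# For comparison, if the en_US.UTF-8 locale happens to be installed:
#   printf '%s\n' b A a B | LC_ALL=en_US.UTF-8 sort
```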

> I just want the system to talk english to me, and simply sort 
> my directories in the "normal" order. I don't even know where the LANG
> variable is set. I don't want to have to find out. It is not mentioned
> in your FAQ. 

The FAQ entry was originally written when this problem first hit
people, before anyone understood the details of the problem, and it
needs to be rewritten.

Since you mention APT, what you want to do is to reconfigure the
locales on your system with 'dpkg-reconfigure locales'.  When it asks
you for the "Default locale for the system environment:" select
"None".  This will remove the setting of LANG from /etc/environment
and remember that it shouldn't be set.  I am sure that by default it
is placing a locale with dictionary sort order there.  This is a
distribution-specific configuration and every operating system does it
differently.

However, setting the standard locale (the C/POSIX/none locale is the
standard locale, all others are non-standard) will have other effects.
The setting is usually used to control whether graphics terminals
support Unicode/UTF-8 characters and other i18n behavior.  Turning it
off will probably prevent you from using non-ASCII characters.  That
is often not acceptable.

What I do to compromise is to set LANG=en_US.UTF-8 but also set
LC_COLLATE=C to force a standard sort order regardless.  I put this in
my $HOME/.bashrc file.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

> Your explanation is fine. But it should be in the FAQ. The FAQ tells
> me that if it sorts weird, I have an LC_... variable set.

Well, in actuality it says: "You or your vendor have probably set
environment variables like LANG, LC_ALL, or LANG to en_US."  It
doesn't say that you have LC_ALL set.  It says that you have one of
the variables listed set.  I believe that is true.  Setting
'export LC_ALL=C' will definitely force a standard sort ordering.  The
shell reads this at start time only, so you would need to start a new
shell to have it take effect for shell sort operations such as "*"
file globbing.

> I didn't have an LC_... variable set, and still it sorted wrong.

Then you *must* have had LANG or LC_COLLATE set.  You can print your
locale settings with the 'locale' command.

  $ locale
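
To narrow that output down to the variables that matter for sorting,
something like the following would do.  show_sort_locale is a made-up
name; the only commands used are locale and grep.  In the output, a
value printed inside double quotes was not set directly but is implied
by another variable.

```shell
# Hypothetical helper: show only the settings that decide sort order.
show_sort_locale() {
    locale | grep -E '^(LANG|LC_COLLATE|LC_ALL)='
}
```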

> I'm smart enough to finally figure out it was a LANG variable, and
> you're intimate enough with the workings of all this to explain the
> order in which the different setting variables are tried. However,
> as it stands I had no chance to find accurate information in the
> FAQ.

You are right that the FAQ entry needs an update.  If you have
suggestions for improvements there, that would be great.  I will queue
up some time to work on improving it.

Bob



