coreutils

Sort option for Posix-locale simple comparisons


From: Ray Dillinger
Subject: Sort option for Posix-locale simple comparisons
Date: Mon, 08 Apr 2013 00:27:18 -0700
User-agent: Mozilla/5.0 (X11; Linux i686; rv:10.0.12) Gecko/20130116 Icedove/10.0.12

I recently discovered that a script on our system had broken in a subtle way
at some unknown point since I coded it a few years ago. This bothered me a
lot because I was sure that I'd tested it very thoroughly and even stepped
through and verified the logic, and in the several years it has been running
with this undetected bug, it has produced a lot of ever-so-slightly wrong
data which have, apparently, caused other, even more subtle bugs in a data
mining application that has been working on that data (insert sound of teeth
grinding here).

I investigated further and discovered that it was broken because 'sort' was
doing an unexpected thing which it had not been doing back when I wrote
the script. At first I thought it might be a bug in 'sort', so I pulled up the
source code and had a look.

It turns out that 'sort' now reads the locale information and does a
locale-aware sort. In my case, at least, that means it fails to treat
different lengths of whitespace differently and fails to treat any
punctuation characters as significant. This is inappropriate for our data,
because things our locale is insensitive to are in fact significant in the
output of this program. This is, hmmm, not quite a bug -- I mean, I see the
utility of it for stuff directly related to the user interface, like 'ls' and
so forth, and for sorting actual language-sensitive text files as opposed to
handling program output or script output.
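
To make the difference concrete, here is a small sketch. The sample lines
and the 'sample' helper are made up for illustration; they are not our real
data. It sorts lines that differ only in spacing or punctuation, once by byte
value and once under whatever locale the environment happens to select.

```shell
# Hypothetical sample data: lines that differ only in spacing or punctuation.
sample() {
    printf '%s\n' 'ab' 'a-b' 'a b' 'a  b'
}

# Byte-value ordering: space (0x20) sorts before hyphen (0x2D), which sorts
# before the letters, so this prints 'a  b', 'a b', 'a-b', 'ab', one per line.
sample | LC_ALL=C sort

# Ordering under the locale currently in effect; the result depends on that
# locale's collation rules and may or may not match the output above.
sample | sort
```

If the two outputs differ on a given machine, a script that assumes
byte-value ordering is exposed to the breakage described above.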

But it is a cause of breakage in old scripts, complicates writing portable
scripts (because you now have to know what locale data will be sorted in
on other machines for some purposes) and causes very slow performance
in 'sort' itself.

There is a workaround: one can set the locale to 'C' or 'POSIX' directly in
a script (or at the shell prompt) and then set it back after calling 'sort'.
But I dislike that workaround, first because it complicates the writing of
scripts, adding boilerplate to many scripts that could instead be handled
once in 'sort' itself; second because I don't want to be mucking around with
the locale from the command line; third because it means people with other
locales can't get error messages etc. in their own languages if they're
using a simplified sort; and fourth because there are too many ways it can
fail.
People can carelessly forget to set it back - or it can fail to get set back
when a script aborts, or someone making an unrelated change in a shell
script may divert the path of control away from the part that resets it
and never notice s/he has done so, or any number of other things.
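
For reference, the set-and-restore workaround looks roughly like the sketch
below. The function name 'sort_bytewise' and its bookkeeping variables are
hypothetical, and this is only one way to write it. It shows how much
boilerplate is involved, and that the restore step only happens if control
actually reaches it.

```shell
# Hypothetical helper: sort stdin by byte value, then restore the caller's
# LC_ALL. If the script aborts between the two halves, nothing restores it.
sort_bytewise() {
    # Remember whether LC_ALL was set at all, and its value if so.
    if [ "${LC_ALL+set}" = set ]; then
        saved_lc_all=$LC_ALL
        had_lc_all=1
    else
        had_lc_all=0
    fi

    LC_ALL=C
    export LC_ALL
    sort "$@"
    status=$?

    # Put the previous setting back (note: it stays exported afterwards).
    if [ "$had_lc_all" -eq 1 ]; then
        LC_ALL=$saved_lc_all
    else
        unset LC_ALL
    fi
    return "$status"
}
```

A call such as 'sort_bytewise < data.txt' (with a made-up file name) then
sorts by byte value, but while 'sort' runs every message it prints is also
forced into the 'C' locale.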

So I decided it would be cleaner to hack a new command line option into
'sort' itself to explicitly invoke the simple traditional sorting behavior.
Since 'c' and 'C' are already taken, I used the 'POSIX' locale instead of the
'C' locale, and gave it short option '-P' and long option '--posix-simple',
with help string 'use POSIX locale (simple byte-value) comparisons.'

I was careful to not break other locale-related behavior.  For example, it
still gets month names and error messages appropriate for the locale.  The
extent of its effect is to switch to the POSIX locale while doing the sort
itself and to use the period and comma respectively as decimal and
thousands separator.
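
The scope described above can be roughly approximated from the shell with the
existing locale categories. This is an illustration only, not how the patch
works: the wrapper name 'sort_posix_collate' is hypothetical, and the 'C'
locale defines no thousands separator, so the comma handling mentioned above
is not reproduced. It assumes the user's language comes from LANG or from
individual LC_* variables, since LC_ALL would override the per-category
settings and is therefore unset inside the subshell.

```shell
# Hypothetical wrapper: byte-value collation and '.' as the decimal point,
# while messages and month names still follow the user's other settings.
sort_posix_collate() {
    (
        # LC_ALL overrides every individual category, so drop it here.
        # The subshell keeps this change from leaking to the caller.
        unset LC_ALL
        LC_COLLATE=C LC_NUMERIC=C sort "$@"
    )
}

# Example: numeric sort that always reads '.' as the decimal point.
printf '1.5\n1.25\n' | sort_posix_collate -n
```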

I changed the 'help' usage message -- which when help2man is invoked
will have the effect of updating the man page as well.

The diff is against the Debian distribution's coreutils-8.13 source code,
which appears to be the same as the current GNU source on git unless
I've missed something.

I have attached the diff file. Just in case there is a problem applying it to
your current source I have attached the updated source as well.

I know that I'm new here and I've probably missed a couple of procedural
steps, but I think that this functionality should be part of a locale-aware
'sort' going forward. I hope that others agree with me about that, and I
am willing to learn how to do the procedures correctly if I have gotten
them wrong.

                Ray Dillinger


---
"It's been my experience that hallucinations give bad advice.  I mean,
every so often the voices tell you to do something and you just have
to tell them to f**k off, amiright?"

Attachment: sort.c
Description: Text Data

Attachment: sortdiff
Description: Text document

