Re: How to sort unicode properly?

coreutils

From:	Lion Yang
Subject:	Re: How to sort unicode properly?
Date:	Thu, 26 Sep 2019 04:49:56 +0800

libicu works in that way. There is ucol_strcoll.

http://userguide.icu-project.org/collation/api

https://github.com/unicode-org/icu

But think twice if you want to add libicu as a mandatory dependency of
coreutils. It does work at the C level and is widely used, but it's also
quite heavy.


On 2019-09-26 03:46, Peng Yu wrote:

If Python can have pyuca, which works across platforms, why can't such a
thing exist at the C level?

On Wed, Sep 25, 2019 at 12:24 PM Eric Blake <address@hidden> wrote:

On 9/25/19 10:56 AM, Peng Yu wrote:
> I want to make my `sort` to be machine-independent and always use the
> correct Unicode sort order. Is there a way to do so?

Those two goals are somewhat at odds.  The only truly portable
machine-independent sorting is the one guaranteed by POSIX when you use
LC_ALL=C (fun fact: even on an EBCDIC machine, that is required by POSIX
to collate in ASCII order, rather than native byte order).  The moment
you use any other locale, you are not only left to the mercies of
whoever wrote that locale, but also stuck with the fact that there is no
portable way to transfer locale definitions from one vendor's libc to
another.
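
As a minimal sketch of the LC_ALL=C behaviour described above: the C locale
compares raw bytes, so a UTF-8 "é" (bytes 0xC3 0xA9) sorts after every ASCII
letter. The variable name sorted_c is only illustrative, and the input is
assumed to be UTF-8 encoded.

```shell
# Byte-order sort: LC_ALL=C is set for the sort process only.
# In UTF-8, "é" is the two bytes 0xC3 0xA9, both greater than any ASCII letter.
sorted_c=$(printf '%s\n' cafe caff café | LC_ALL=C sort)
printf '%s\n' "$sorted_c"
```

The result is the same on any POSIX system (cafe, caff, café), which is what
makes it portable, but it is not a linguistically correct order.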

>
> I don't know how to check where en_US.UTF-8 comes from. Do you know
> how to check it? (I use Mac OS X.)

All other locales are somewhat vendor-dependent; as you've discovered,
your vendor (Apple) has a rather gaping hole in their locale support.

But because Apple is a closed-source shop, it will have to be Apple that
fixes their bug, unless you want to take on the gargantuan task of
writing a gnulib module that provides locale tables to mirror glibc for
use on non-glibc machines.
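
One rough way to see what a given system provides, sketched under the
assumption that the standard `locale` utility is installed: list the en_US
locales, then compare the locale's sort order against plain byte order. The
variable names c_order and loc_order are illustrative only. If en_US.UTF-8
is not installed, sort typically falls back to C-locale behaviour, so the
two results will match in that case too.

```shell
# List installed locales that mention en_US (skipped if `locale` is absent).
if command -v locale >/dev/null 2>&1; then
  locale -a | grep -i 'en_US' || echo "no en_US locale installed"
fi

# Sort the same words under byte order and under en_US.UTF-8.
c_order=$(printf '%s\n' cafe caff café | LC_ALL=C sort)
loc_order=$(printf '%s\n' cafe caff café | LC_ALL=en_US.UTF-8 sort)

# Identical results suggest the locale has no real collation rules here.
if [ "$c_order" = "$loc_order" ]; then
  echo "en_US.UTF-8 collates like plain byte order on this system"
else
  echo "en_US.UTF-8 applies its own collation rules on this system"
fi
```

This only tells you whether the two orders differ for these three words; it
does not show where the locale definition comes from.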

Note that glibc doesn't have that problem, at least on my system:

$ cat /etc/fedora-release
Fedora release 30 (Thirty)
$ rpm -q glibc
glibc-2.29-22.fc30.x86_64
$ printf '%s\n' cafe caff café | LC_ALL=en_US.UTF-8  sort --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
cafe
____
café
____
caff
____

So one option you could pursue is switching to an operating system that
does not curtail your freedoms.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Current Thread

How to sort unicode properly?, Peng Yu, 2019/09/25
- Re: How to sort unicode properly?, Eric Blake, 2019/09/25
  - Re: How to sort unicode properly?, Peng Yu, 2019/09/25
    - Re: How to sort unicode properly?, Eric Fischer, 2019/09/25
    - Re: How to sort unicode properly?, Eric Blake, 2019/09/25
    - Re: How to sort unicode properly?, Peng Yu, 2019/09/25
    - Re: How to sort unicode properly?, Eric Blake, 2019/09/25
    - Re: How to sort unicode properly?, Lion Yang <=
