Re: How to sort unicode properly?
From: Eric Fischer
Subject: Re: How to sort unicode properly?
Date: Wed, 25 Sep 2019 09:37:08 -0700
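Unfortunately, multibyte collation is simply unimplemented in Mac OS X, so
no alternate locale definition will fix it. The fallback compares strings
by code point, which would explain why "café" (é is U+00E9) sorts after
"caff" (f is U+0066). As far as I can tell, this is documented only in the
BUGS section of `man wcscoll`: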
BUGS
The current implementation of wcscoll() only works in single-byte
LC_CTYPE locales, and falls back to using wcscmp() in locales with
extended character sets.
(https://opensource.apple.com/source/Libc/Libc-1272.250.1/string/FreeBSD/wcscoll.3.auto.html)
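
If the goal is an order that does not depend on the platform's locale data,
one option is to compute the sort key outside of libc. Below is a minimal
sketch in Python using only the standard `unicodedata` module. The function
name `approx_collation_key` is made up for illustration, and the key is only
a rough two-level approximation (base letters first, accents as a
tie-breaker), not the full Unicode Collation Algorithm. A real UCA
implementation, such as the pyuca library mentioned in this thread, would be
the more faithful choice.

```python
import unicodedata


def approx_collation_key(s):
    """Crude two-level sort key (hypothetical helper, NOT full UCA).

    Level 1: the string with combining marks removed, case-folded.
    Level 2: the fully decomposed string, used only to break ties.
    """
    decomposed = unicodedata.normalize("NFD", s)
    # Primary level: drop combining marks (accents) and ignore case.
    primary = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    # Secondary level: accents and case only matter when the base letters tie.
    return (primary, decomposed)


words = ["cafe", "caff", "caf\u00e9"]  # "caf\u00e9" is "café"

# Plain code point order, which is what a wcscmp()-style fallback produces.
codepoint_order = sorted(words)

# Accent-aware approximation: "café" lands between "cafe" and "caff".
approx_order = sorted(words, key=approx_collation_key)

print(codepoint_order)  # ['cafe', 'caff', 'café']
print(approx_order)     # ['cafe', 'café', 'caff']
```

Because the key is computed from Unicode data shipped with Python rather
than from the system locale, the result should be the same on Mac OS X and
on Linux. It will still differ from true UCA order for many inputs
(punctuation, digits, non-Latin scripts, language-specific tailoring).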
Eric
On Wed, Sep 25, 2019 at 8:59 AM Peng Yu <address@hidden> wrote:
> I want to make my `sort` to be machine-independent and always use the
> correct Unicode sort order. Is there a way to do so?
>
> I don't know how to check where en_US.UTF-8 comes from. Do you know
> how to check it? (I use Mac OS X.)
>
> On 9/25/19, Eric Blake <address@hidden> wrote:
> > On 9/25/19 10:20 AM, Peng Yu wrote:
> >> Hi,
> >>
> >> It seems that "café" should be sorted before "caff" in Unicode.
> >>
> >> https://github.com/jtauber/pyuca
> >>
> >> But `sort` does not do so.
> >>
> >> $ printf '%s\n' cafe caff café | LC_ALL=UTF8 sort
> >> cafe
> >> caff
> >> café
> >> $ printf '%s\n' cafe caff café | LC_ALL=en_US.UTF-8 sort
> >> cafe
> >> caff
> >> café
> >>
> >> How to make `sort` sort according to Unicode order? Thanks.
> >
> > You'll have to write a locale definition where strcoll() sorts in the
> > order you want. Coreutils sort is calling strcoll(), and if it doesn't
> > sort the way you think it should, the bug is in your locale and not in
> > coreutils. You'll want to report this issue to whoever provided your
> > en_US.UTF-8 locale (perhaps glibc?)
> >
> > --
> > Eric Blake, Principal Software Engineer
> > Red Hat, Inc. +1-919-301-3226
> > Virtualization: qemu.org | libvirt.org
> >
>
>
> --
> Regards,
> Peng
>
>