bug-gnu-emacs

bug#18051: 24.3.92; ls-lisp: Sorting; make ls-lisp-string-lessp a normal


From: Michael Albinus
Subject: bug#18051: 24.3.92; ls-lisp: Sorting; make ls-lisp-string-lessp a normal function?
Date: Sat, 16 Aug 2014 23:52:16 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

Michael Albinus <address@hidden> writes:

>>> On systems without glib, we might emulate it partially. Packages
>>> like ls-lisp could use it then for sorting.
>>
>> I think we need our own implementation in any case.  If nothing else,
>> that would solve the issue of encoding strings into UTF-8 before
>> calling external C functions.
>
> Yep. But given the complexity of UCA, we will start slowly with a subset
> of the algorithm only. This and performance considerations will still
> demand for a native C library, if available.

Out of curiosity, I've taken g_utf8_collate from glib for a test. It
works rather well.

I have added two functions `gstring-lessp' and `gstring-equalp', which
are meant to be the collation counterparts of `string-lessp' and
`string-equal'. Here are some tests, taken from UTS#10, chapter 1.1
"Multi-Level Comparison":

--8<---------------cut here---------------start------------->8---
(sort '("role" "roles" "rule") 'string-lessp)
=> ("role" "roles" "rule")

(sort '("role" "roles" "rule") 'gstring-lessp)
=> ("role" "roles" "rule")
--8<---------------cut here---------------end--------------->8---

No surprise that they return the same result: this is a level 1
comparison, in which just the base characters are compared.

--8<---------------cut here---------------start------------->8---
(sort '("role" "rôle" "roles") 'string-lessp)
=> ("role" "roles" "rôle")

(sort '("role" "rôle" "roles") 'gstring-lessp)
=> ("role" "rôle" "roles")
--8<---------------cut here---------------end--------------->8---

Accent differences are typically ignored in collation if the base
letters differ. I have applied further tests from that chapter as well.
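
To make the two levels concrete, here is a rough Lisp sketch. It is
not what g_utf8_collate does internally, and `my-base-letters' and
`my-two-level-lessp' are names invented for this illustration. It
assumes that accents can be split off by NFD normalization (from
ucs-normalize.el) and that they fall into the combining range
U+0300..U+036F, so letters like "æ" or "ø" are not covered:

--8<---------------cut here---------------start------------->8---
(require 'ucs-normalize)

(defun my-base-letters (string)
  "Return STRING with combining diacritics (U+0300..U+036F) removed.
Hypothetical helper: decompose to NFD, then drop the combining marks."
  (replace-regexp-in-string "[\u0300-\u036F]" ""
                            (ucs-normalize-NFD-string string)))

(defun my-two-level-lessp (s1 s2)
  "Return non-nil if S1 sorts before S2 in a simplified 2-level order.
Level 1 compares base letters only; accents matter only on a tie."
  (let ((b1 (my-base-letters s1))
        (b2 (my-base-letters s2)))
    (if (string-equal b1 b2)
        ;; Same base letters: fall back to the accented originals.
        (string-lessp s1 s2)
      ;; Different base letters: accents are ignored.
      (string-lessp b1 b2))))

(sort (list "role" "rôle" "roles") 'my-two-level-lessp)
=> ("role" "rôle" "roles")
--8<---------------cut here---------------end--------------->8---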

The collation rules can even be influenced by the locale
environment. The following example is taken from ISO 14651:2011,
appendix D.3. If LC_COLLATE is set to C.utf8, `string-lessp' and
`gstring-lessp' behave the same:

--8<---------------cut here---------------start------------->8---
(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus")
'string-lessp)
=> ("Aachen" "Aalborg" "Alzheimer" "czar" "cæsium" "cølibat" "Århus")

(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus") 
'gstring-lessp)
=> ("Aachen" "Aalborg" "Alzheimer" "czar" "cæsium" "cølibat" "Århus")
--8<---------------cut here---------------end--------------->8---

With LC_COLLATE set to en_US.utf8, accent differences are ignored
again:

--8<---------------cut here---------------start------------->8---
(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus") 
'gstring-lessp)
=> ("Aachen" "Aalborg" "Alzheimer" "Århus" "cæsium" "cølibat" "czar")
--8<---------------cut here---------------end--------------->8---

With LC_COLLATE set to da_DK.utf8, however, the order differs, because
"cz" is less than "cæ", and "aa" is equivalent to "å" but greater than
"z":

--8<---------------cut here---------------start------------->8---
(sort '("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus") 
'gstring-lessp)
=> ("Alzheimer" "czar" "cæsium" "cølibat" "Aachen" "Aalborg" "Århus")
--8<---------------cut here---------------end--------------->8---
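
Checks like the ones above can be automated with a small helper. This
is only a sketch, and `my-check-sort-order' is a name made up for this
mail: it sorts a reversed copy of the expected list and compares the
result with the expected order.

--8<---------------cut here---------------start------------->8---
(defun my-check-sort-order (predicate expected)
  "Return non-nil if PREDICATE sorts the strings into EXPECTED order.
A reversed copy of EXPECTED is sorted, so EXPECTED itself is not
modified by the destructive `sort'."
  (equal (sort (reverse expected) predicate) expected))

(my-check-sort-order 'string-lessp '("role" "roles" "rule"))
=> t
--8<---------------cut here---------------end--------------->8---

Any two-argument predicate can be passed in, so the same expected
list can be checked again after switching LC_COLLATE.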

Well, for practical use cases it seems worthwhile to include
g_utf8_collate in Emacs. Of course, it can be used only when glib is
linked, so we might still need our own Lisp implementation. I don't
know how well g_utf8_collate works for non-Latin characters, though.

Also, the test files CollationTest_NON_IGNORABLE.txt and
CollationTest_SHIFTED.txt from UTS#10 do not pass completely. I don't
know whether this is due to a limitation of g_utf8_collate, or because
I have taken the latest Unicode 7.0.0 test files, which might include
tests for data that hasn't reached GNU/Linux distributions yet, or
because my implementation is still erroneous.
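
For reference, a check of those files can be sketched roughly as
follows. This is not the code I actually ran, and the `my-...' names
are invented here. It assumes the file format as I understand it: each
data line holds hexadecimal code points separated by spaces, optionally
followed by ";" and a comment; lines starting with "#" are comments;
and every string must collate no lower than the one before it.

--8<---------------cut here---------------start------------->8---
(defun my-collation-test-line-to-string (line)
  "Convert LINE such as \"0041 030A; # comment\" into a string.
Everything from the first `;' or `#' on is ignored.  Comment-only
and empty lines yield the empty string."
  (let ((hex (car (split-string line "[;#]"))))
    (apply #'string
           (mapcar (lambda (h) (string-to-number h 16))
                   (split-string hex "[ \t]+" t)))))

(defun my-collation-test-failures (lines predicate)
  "Return the adjacent pairs from LINES that PREDICATE orders wrongly.
Each element is (PREVIOUS . CURRENT) where PREDICATE claims that
CURRENT sorts before PREVIOUS."
  (let ((prev nil)
        (failures nil))
    (dolist (line lines (nreverse failures))
      (let ((cur (my-collation-test-line-to-string line)))
        ;; Skip comment-only and empty lines.
        (unless (string= cur "")
          (when (and prev (funcall predicate cur prev))
            (push (cons prev cur) failures))
          (setq prev cur))))))

(my-collation-test-failures '("0061" "0062; # b" "0041") 'string-lessp)
=> (("b" . "A"))
--8<---------------cut here---------------end--------------->8---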

Best regards, Michael.




