Re: case-insensitive string comparison


From: Eli Zaretskii
Subject: Re: case-insensitive string comparison
Date: Tue, 26 Jul 2022 16:05:50 +0300

> From: Sam Steingold <sds@gnu.org>
> Date: Mon, 25 Jul 2022 15:39:34 -0400
> 
> > * Eli Zaretskii <ryvm@tah.bet> [2022-07-25 18:58:19 +0300]:
> >
> >> > (string-collate-equalp "a" "A" current-locale-environment t)
> >> > ==> nil
> >> > current-locale-environment
> >> > ==> "en_US.UTF-8"
> >
> > I cannot reproduce this:
> >
> >   (string-collate-equalp "a" "A" current-locale-environment t)
> >     => t
> >   current-locale-environment
> >     => "en_US.UTF-8"
> >
> > What OS is this, and which Emacs version?
> 
> GNU Emacs 29.0.50 (build 5, x86_64-apple-darwin21.5.0, NS appkit-2113.50 
> Version 12.4 (Build 21F79))
>  of 2022-07-25
> Repository revision: ffe12ff2503917e47c0356195b31430996c148f9
> Repository branch: master
> Windowing system distributor 'Apple', version 10.3.2113
> System Description:  macOS 12.4

Could be something macOS-specific.  Maybe your system doesn't define
the __STDC_ISO_10646__ feature?  In that case, string-collate-equalp
(see the doc string) behaves like string-equal, and that one doesn't
have a case-insensitive variant.

> >> So, how do we do case-insensitive string comparison in Emacs?
> >
> > If you want locale-specific collation, as Stefan said, above.
> 
> Do I?
> Is it really true that "UTF-8" without "en_US" does _not_ define case 
> conversion?

string-collate-equalp relies on the implementation in your libc, so
that's something I cannot answer (although I'd expect any reasonable
libc to work as expected here).

In general, locale-specific comparison is a bad idea in Emacs, unless
you are writing a Lisp program that absolutely _must_ meet the
locale's definitions of collation order and equivalence.  That's
because some locales have unexpected requirements, and because
different libc's implement this stuff very differently.  So using
string-collate-equalp and string-collate-lessp makes your program
unpredictable on any machine but your own.

For that reason, I suggest always using compare-strings instead.  That
function uses the Unicode locale-independent case-conversion rules,
and you can predictably control/tailor that if you need by using a
buffer-local case-table.
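
For illustration, a minimal sketch of such a comparison. It assumes the
default case table is in effect; the strings are made-up examples.

```elisp
;; The 7th argument (IGNORE-CASE) non-nil requests case-insensitive
;; comparison.  nil for START/END means "the whole string".
(compare-strings "Hello" nil nil "hELLO" nil nil t)
;; => t

;; When the strings differ, the result is a number rather than nil,
;; so test for equality with (eq t ...), not with a bare `if'.
(eq t (compare-strings "abc" nil nil "ABD" nil nil t))
;; => nil
```

Note that compare-strings returns t only on equality and a non-nil
number otherwise, so its result should not be used directly as a
boolean.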

> but https://docs.python.org/3/library/stdtypes.html#str.casefold says
> 
> >>>>> The casefolding algorithm is described in section 3.13 of the Unicode 
> >>>>> Standard.
> 
> this seems to imply that user locale setting is not relevant.

That conclusion is incorrect.  The collation database is usually
tailored for each locale, and at least glibc indeed loads the tailored
collation tables for each locale you request.

> >> It is okay to add a `string-equal-ignore-case' based on `compare-strings'?
> >> (even though it does not recognize "SS" and "ß" as equal)
> >
> > What's wrong with calling compare-strings directly?
> 
> I want to be able to use `string-equal-ignore-case' as a :test argument
> to things like `cl-find'.

Then write a thin wrapper around compare-strings, and be done.
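
Something along these lines would probably do. This is only a sketch:
the name my-string-equal-ignore-case is a hypothetical placeholder, not
an existing function.

```elisp
(require 'cl-lib)

;; Hypothetical helper: a thin wrapper around `compare-strings'.
(defun my-string-equal-ignore-case (s1 s2)
  "Return t if S1 and S2 are equal ignoring case, nil otherwise."
  ;; `compare-strings' returns t on equality and a number otherwise,
  ;; so normalize the result to a boolean.
  (eq t (compare-strings s1 nil nil s2 nil nil t)))

;; Usable as a :test argument:
(cl-find "FOO" '("bar" "foo" "baz") :test #'my-string-equal-ignore-case)
;; => "foo"
```

Like compare-strings itself, this wrapper does not treat "SS" and "ß"
as equal.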

> And I don't want to have to think about encodings and locales.
> So I want the core Emacs maintainers who know about these things to
> provide me with something that works. Thanks in advance! ;-)

There's nothing to think about: see above.  In the Emacs context, you
get the best results from code that doesn't depend on the locale, and
that's what compare-strings gives you.  No need to know anything about
encodings or locales.

> The fact that there are ***TWO*** core functions that compare strings -
> `string-collate-equalp' and `compare-strings' - does not look right to me.
> _I_ should not have to decide which function to use.

You can always ask.  But the documentation at least hints that the
locale-specific comparison has many hidden aspects:

  This function obeys the conventions for collation order in your locale
  settings.  For example, characters with different coding points but
  the same meaning might be considered as equal, like different grave
  accent Unicode characters:

  (string-collate-equalp (string ?\uFF40) (string ?\u1FEF))
    => t

> >> Or should we first implement something like casefold in Python?
> >
> > Ha! we already have that:
> >
> >   (get-char-code-property ?ß 'special-uppercase)
> >     => "SS"
> 
> Nice, but how does it help me if
> --8<---------------cut here---------------start------------->8---
> (compare-strings "SS" 0 nil "ß" 0 nil t)
> ==> -1
> (string-collate-equalp "SS" "ß" "en_US.UTF-8" t)
> ==> nil
> --8<---------------cut here---------------end--------------->8---
> instead of `t'?

It depends on what you want to do, and why you care about the ß case
in the first place.  AFAIR, you never explained that, nor described
your goal.


