bug#17130: 24.4.50; Deficient Unicode case folding

From: Eli Zaretskii
Subject: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 20:37:38 +0300

> From: Nathan Trapuzzano <address@hidden>
> Cc: address@hidden
> Date: Sat, 29 Mar 2014 11:29:43 -0400
> Eli Zaretskii <address@hidden> writes:
> >> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
> >> fold to σ.
> >
> > So you would need to search all characters to find those which have σ
> > in the CANONICALIZE slot -- not very efficient, to say the least.
> Doesn't this already happen?

No, not when that slot is used for case-insensitive search.  You just
use it to get the canonical equivalent, i.e. use the one-way mapping
that it provides.

> If not, then what is the CANONICALIZE slot doing that couldn't be
> done with the regular upcase/downcase slots by themselves?

If that slot is "trivial", i.e. contains the lower-case variant of the
character, then indeed this slot doesn't add information, I think,
only utility.  But it doesn't have to contain the lower-case variant.

> > IOW, what you suggest will provide a one-way mapping, whereas we need
> > a two-way mapping.
> Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
> least in principle.

It is sufficient for mapping a character to its canonical equivalent,
but not for finding the non-canonical variants of a canonical
character.  IOW, it is not well suited to finding ς given just σ.
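
To illustrate the one-way vs. two-way point, here is a minimal sketch in
Python rather than Emacs Lisp.  The table contents and the names
CANONICALIZE, canonical, variants_by_scan and build_reverse are
hypothetical; this is not how Emacs stores case tables, only a model of
the cost difference between scanning and keeping a reverse index.

```python
# Hypothetical one-way table: each character maps to its canonical form.
CANONICALIZE = {"σ": "σ", "ς": "σ", "Σ": "σ"}


def canonical(ch):
    """Forward lookup: a single table access per character."""
    return CANONICALIZE.get(ch, ch)


def variants_by_scan(canon):
    """Reverse lookup with only the one-way table: scan every entry."""
    return {ch for ch, c in CANONICALIZE.items() if c == canon}


def build_reverse(table):
    """Precompute the reverse mapping once, as a separate structure."""
    reverse = {}
    for ch, c in table.items():
        reverse.setdefault(c, set()).add(ch)
    return reverse


REVERSE = build_reverse(CANONICALIZE)
```

With the one-way table alone, finding ς from σ costs a scan over all
entries; with the separate reverse structure it is a single lookup,
REVERSE["σ"].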

> > Emacs should use this data for up-casing and down-casing as well, for
> > example, so that M-l downcases Σ to ς, not σ, when it is at the end of
> > the word.  Wouldn't users of Greek expect that?
> Maybe.  I'm just saying that Unicode itself doesn't prescribe or even
> recommend such behavior.  It defines case conversions independently of
> ordering.
> That said, making M-l downcase terminal Σ to ς would be a nice feature
> that could be enabled, e.g., by enabling a minor mode or by modifying
> some *-functions variable of functions that get called before the normal
> behavior of M-l is applied, etc.  But it shouldn't have anything to do
> with Unicode-compliant case-insensitive searching.

For searching, you only need the CANONICALIZE slot.  But what about
replacing the search string while keeping the letter case in the
replacement?  For that, CANONICALIZE alone is not enough; you need the
reverse mapping.
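
As a rough sketch of why the forward mapping is enough for searching
(Python, with a hypothetical four-entry table; fold and search are
illustrative names, not Emacs functions): both the text and the search
string are canonicalized character by character and then compared.

```python
# Hypothetical one-way table covering only the letters used below.
CANONICALIZE = {"Σ": "σ", "ς": "σ", "Ο": "ο", "Φ": "φ"}


def fold(text):
    """Canonicalize each character; one character in, one character out."""
    return "".join(CANONICALIZE.get(ch, ch) for ch in text)


def search(haystack, needle):
    """Case-insensitive search: compare canonical forms on both sides.

    Because fold() maps each character to exactly one character, an
    index into the folded text is also an index into the original.
    """
    return fold(haystack).find(fold(needle))
```

Here search("x ΣΟΦΟΣ", "σοφος") finds the word at index 2 even though
the text has Σ where the search string has ς.  Nothing in this sketch
can tell a replacement step which variant to produce, which is where the
reverse mapping comes in.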

> > Personally, I think we need an additional slot for what you want, and
> > code to use it.
> Given the point about ß, you're probably right.  Unless we can make
> entries in the CANONICALIZE slot be strings rather than code points.

This is Lisp; a vector slot can contain any Lisp object.  But using
CANONICALIZE for what you want would be wrong, I think, because it
will screw up case-insensitive search, which expects to find there a
single character.
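
A small Python sketch of the problem, under the assumption that a
string such as "ss" were stored for ß (the table and fold_multi are
hypothetical, and str.lower stands in for the rest of the table): once
an entry expands to more than one character, positions in the folded
text no longer line up with positions in the original.

```python
# Hypothetical table in which one entry is a string, not a single character.
CANONICALIZE = {"ß": "ss"}


def fold_multi(text):
    """Canonicalize per character, allowing multi-character results."""
    return "".join(CANONICALIZE.get(ch, ch.lower()) for ch in text)


original = "Straße 1"
folded = fold_multi(original)

# The folded text is longer than the original, so a match position
# found in the folded text points at the wrong place in the original.
pos_in_folded = folded.find("1")
pos_in_original = original.find("1")
```

In this sketch the folded text is one character longer than the
original, and the two positions differ by one.  Code that assumes a
single character per slot has no way to account for that shift.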
