bug#38235: string-foldcase bug for trailing sigma

From: John Cowan
Subject: bug#38235: string-foldcase bug for trailing sigma
Date: Sun, 17 Nov 2019 13:13:42 -0500

On Sat, Nov 16, 2019 at 3:42 PM Andy Wingo <address@hidden> wrote:
The expected result is "μέλοσ"; see R6RS libraries section 1.2.  However
instead Guile's result is "μέλος".  Note that although Σ usually
downcases to σ, at the end of a string it's ς.

More precisely, it downcases to σ if a letter follows and to ς if not (being at the end of a string is a particular case).  However, this is not actually always Greekly correct: the string "ΦΙΛΟΣ." with a period at the end downcases to "φιλος." if it is the word φίλος 'friend' (without its proper accent) at the end of a sentence, but to "φιλοσ." if it is an abbreviation for φιλοσοφία 'philosophy'.  For this reason, R7RS does not require mapping to ς in this situation as R6RS does.
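
As an illustration only (this is Python, not Guile, and the helper name `downcase` is made up for the example): Python's built-in str.lower() applies a context rule of this kind, so it cannot tell the two readings of "ΦΙΛΟΣ." apart and always picks the final form.

```python
# Sketch: Python's str.lower() chooses between medial and final sigma
# from the surrounding characters only, not from the meaning of the word.
def downcase(text: str) -> str:
    # Hypothetical wrapper, named only for this example.
    return text.lower()

# Sigma followed by a period: treated as word-final, so it becomes ς,
# even if the writer meant the abbreviation of φιλοσοφία.
with_period = downcase("ΦΙΛΟΣ.")

# Sigma followed by a letter: stays medial σ.
full_word = downcase("ΦΙΛΟΣΟΦΙΑ")

print(with_period)
print(full_word)
```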

This test shows a
limitation of defining string-foldcase as simply (string-downcase
(string-upcase str)).

As explained in Unicode section 5.18, the foldcase mappings (in <https://www.unicode.org/Public/UNIDATA/CaseFolding.txt>, the lines with status C and F) actually create a set of equivalence classes that are closed under {upper,lower,title}case mapping, and then choose a single character to represent each class.  This is usually the unique lowercase character, but not always: in Cherokee it is the uppercase character, and in the set {Σ, σ, ς} it is σ.
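
A small sketch of the difference, again in Python rather than Scheme and with made-up helper names: str.casefold() uses the CaseFolding.txt mappings, while the round trip through upper() and lower() ends up at ς again.  The Cherokee lines assume a Python built against Unicode 8.0 or later (3.5 and up).

```python
def naive_foldcase(text: str) -> str:
    # The (string-downcase (string-upcase str)) approach from the quoted text.
    return text.upper().lower()

def foldcase(text: str) -> str:
    # Full case folding as Python implements it.
    return text.casefold()

word = "μέλος"

naive = naive_foldcase(word)   # final sigma comes back as ς
folded = foldcase(word)        # final sigma is folded to σ

# All three sigmas land on the same class representative.
sigma_representatives = {ch.casefold() for ch in "Σσς"}

# Cherokee: the uppercase letter represents the class.
cherokee_upper_a = "\u13a0"    # CHEROKEE LETTER A
cherokee_small_a = "\uab70"    # CHEROKEE SMALL LETTER A
cherokee_folded = cherokee_small_a.casefold()

print(naive, folded, sigma_representatives)
print(cherokee_folded == cherokee_upper_a)
```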

On Sun, Nov 17, 2019 at 6:20 AM <address@hidden> wrote:

Good catch. I think there's even a worse example: dotless
and dotted I [1]. Here it seems even impossible to do
up- and downcase correctly without knowing the language

Language-specific case mappings are explicitly out of Scheme's remit: they have to be performed by specialized libraries.  There is an additional situation in Lithuanian dictionaries (but not running text): an "i" with a tone accent is represented as "i" + dot above + accent, like this:  "i̇́".  However, this dot above must be dropped when uppercasing, producing ordinary "Í".
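
To make the Lithuanian case concrete, here is a deliberately simplified Python sketch.  The function `upper_lt_sketch` and the `SOFT_DOTTED` set are inventions for this example, not any library's API; a real implementation would follow the Lithuanian entries in SpecialCasing.txt, which this only approximates.

```python
import unicodedata

DOT_ABOVE = "\u0307"                 # COMBINING DOT ABOVE
SOFT_DOTTED = {"i", "j", "\u012f"}   # assumption: a small subset of soft-dotted letters

def upper_lt_sketch(text: str) -> str:
    """Uppercase text, dropping a dot above that follows a soft-dotted letter."""
    kept = []
    prev = ""
    for ch in text:
        if ch == DOT_ABOVE and prev in SOFT_DOTTED:
            continue                 # drop the explicit dot before uppercasing
        kept.append(ch)
        prev = ch
    # NFC composes I + combining acute into the single character Í.
    return unicodedata.normalize("NFC", "".join(kept).upper())

accented_i = "i\u0307\u0301"         # "i" + dot above + acute accent

default_upper = accented_i.upper()   # language-neutral mapping keeps the dot
lithuanian_upper = upper_lt_sketch(accented_i)

print(default_upper == "I\u0307\u0301")
print(lithuanian_upper == "\u00cd")
```

The language-neutral upper() leaves the dot above in place, which is why a mapping of this kind has to live in a language-aware library.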
