bug#24603: [RFC 15/18] Base lower- and upper-case tests on Unicode properties

bug-gnu-emacs

From:	Michal Nazarewicz
Subject:	bug#24603: [RFC 15/18] Base lower- and upper-case tests on Unicode properties
Date:	Tue, 4 Oct 2016 03:10:38 +0200

Not all lower case characters have a simple upper case form.
For example, the ﬁ ligature has no one-character upper case mapping.
Similarly, ɕ from the IPA block has no upper case form at all.

It is therefore not sufficient to look up a character’s upper-case form
to determine whether it is lower-case.  As such, rewrite the tests
to be based on Unicode properties.

* src/buffer.h (uppercasep, lowercasep): Delete.
* src/character.c (uppercasep, lowercasep): New functions which base
their test on Unicode character properties rather than on the case table.
* test/src/regex-tests.el (regex-tests-lower-character-class): Include
the letter ß and the ﬁ ligature in the ‘lower’ class test.
---
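A minimal sketch of the problem described above, not code from this
patch.  The table below is a hand-written toy stand-in for the case
table and the Unicode general category; every toy_*, old_* and new_*
name is hypothetical.  It contrasts the deleted case-table tests with
a category-based test for ß and the ﬁ ligature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the case table and Unicode data; NOT Emacs's tables.  */
enum toy_category { TOY_LU, TOY_LL, TOY_OTHER };

struct toy_char
{
  int c;                  /* code point */
  int up;                 /* simple upper-case mapping, or C itself if none */
  int down;               /* simple lower-case mapping, or C itself if none */
  enum toy_category cat;  /* general category, reduced to three values */
};

static const struct toy_char toy_table[] = {
  { 0x0041, 0x0041, 0x0061, TOY_LU },    /* A */
  { 0x0061, 0x0041, 0x0061, TOY_LL },    /* a */
  { 0x00DF, 0x00DF, 0x00DF, TOY_LL },    /* ß: no simple upper-case form */
  { 0xFB01, 0xFB01, 0xFB01, TOY_LL },    /* ﬁ: no one-character upper case */
  { 0x0030, 0x0030, 0x0030, TOY_OTHER }, /* 0 */
};

static const struct toy_char *
toy_lookup (int c)
{
  for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++)
    if (toy_table[i].c == c)
      return &toy_table[i];
  return NULL;
}

static int
toy_upcase (int c)
{
  const struct toy_char *e = toy_lookup (c);
  return e ? e->up : c;
}

static int
toy_downcase (int c)
{
  const struct toy_char *e = toy_lookup (c);
  return e ? e->down : c;
}

/* The deleted tests, expressed against the toy table.  */
static bool
old_uppercasep (int c)
{
  return toy_downcase (c) != c;
}

static bool
old_lowercasep (int c)
{
  return !old_uppercasep (c) && toy_upcase (c) != c;
}

/* Category-based tests in the spirit of the new functions.  */
static bool
new_uppercasep (int c)
{
  const struct toy_char *e = toy_lookup (c);
  return e && e->cat == TOY_LU;
}

static bool
new_lowercasep (int c)
{
  const struct toy_char *e = toy_lookup (c);
  return e && e->cat == TOY_LL;
}

/* Both approaches agree on plain letters and differ on ß and ﬁ.  */
void
toy_demo (void)
{
  assert (old_lowercasep (0x0061) && new_lowercasep (0x0061));
  assert (old_uppercasep (0x0041) && new_uppercasep (0x0041));
  assert (!old_lowercasep (0x00DF) && new_lowercasep (0x00DF));
  assert (!old_lowercasep (0xFB01) && new_lowercasep (0xFB01));
  assert (!new_lowercasep (0x0030) && !new_uppercasep (0x0030));
}
```

With the old tests, a character counts as lower case only if upcasing
changes it, so anything lacking a single-character upper-case mapping
is missed.  The category-based test does not depend on the mapping.
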
 etc/NEWS                |  7 +++++++
 src/buffer.h            | 13 -------------
 src/character.c         | 24 ++++++++++++++++++++++++
 src/character.h         |  2 ++
 test/src/regex-tests.el |  4 +---
 5 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/etc/NEWS b/etc/NEWS
index 4516812..727af59 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -251,6 +251,10 @@ For example, ﬁ ligature is converted to FI when upper cased.
 Strings such as ΌΣΟΣ are now correctly converted to Όσος when
 capitalised to follow rules of Greek spelling.
 
+*** 'upper' and 'lower' classes match characters without simple cased forms.
+For instance, the letter ß and the ﬁ ligature are now matched by the
+[[:lower:]] regular expression.
+
 
 * Changes in Specialized Modes and Packages in Emacs 26.1
 
@@ -512,6 +516,9 @@ of curved quotes in format arguments to functions like 'message' and
 now generate less chatter and more-compact diagnostics.  The auxiliary
 function 'check-declare-errmsg' has been removed.
 
+** 'upper' and 'lower' character classes are unaffected by the case
+table since they are now based purely on Unicode properties.
+
 
 * Lisp Changes in Emacs 26.1
 
diff --git a/src/buffer.h b/src/buffer.h
index 1543f67..aade0ea 100644
--- a/src/buffer.h
+++ b/src/buffer.h
@@ -1358,19 +1358,6 @@ upcase (int c)
   return NATNUMP (up) ? XFASTINT (up) : c;
 }
 
-/* True if C is upper case.  */
-INLINE bool uppercasep (int c)
-{
-  return downcase (c) != c;
-}
-
-/* True if C is lower case.  */
-INLINE bool
-lowercasep (int c)
-{
-  return !uppercasep (c) && upcase (c) != c;
-}
-
 INLINE_HEADER_END
 
 #endif /* EMACS_BUFFER_H */
diff --git a/src/character.c b/src/character.c
index 1e49536..707ae10 100644
--- a/src/character.c
+++ b/src/character.c
@@ -967,6 +967,30 @@ char_unicode_category (int c)
   return INTEGERP (category) ? XINT (category) : UNICODE_CATEGORY_UNKNOWN;
 }
 
+/* Return true if C is an upper case character.  This does not imply it
+   has a lower case form.  */
+bool
+uppercasep (int c)
+{
+  unicode_category_t gen_cat = char_unicode_category (c);
+
+  /* See UTS #18.  There are additional characters that should be
+     here, those designated as Other_uppercase; FIXME.  */
+  return gen_cat == UNICODE_CATEGORY_Lu;
+}
+
+/* Return true if C is a lower case character.  This does not imply it
+   has an upper case form.  */
+bool
+lowercasep (int c)
+{
+  unicode_category_t gen_cat = char_unicode_category (c);
+
+  /* See UTS #18.  There are additional characters that should be
+     here, those designated as Other_lowercase; FIXME.  */
+  return gen_cat == UNICODE_CATEGORY_Ll;
+}
+
 /* Return true if C is an alphabetic character.  */
 bool
 alphabeticp (int c)
diff --git a/src/character.h b/src/character.h
index fc8a0dd..5931c5c 100644
--- a/src/character.h
+++ b/src/character.h
@@ -676,6 +676,8 @@ extern ptrdiff_t lisp_string_width (Lisp_Object, ptrdiff_t,
 extern Lisp_Object Vchar_unify_table;
 extern Lisp_Object string_escape_byte8 (Lisp_Object);
 
+extern bool uppercasep (int);
+extern bool lowercasep (int);
 extern bool alphabeticp (int);
 extern bool alphanumericp (int);
 extern bool graphicp (int);
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index fa66ff1..fc50344 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -70,9 +70,7 @@ regex--test-cc
                 ("digit" "012" "abcABCłąka-, \t\n")
                 ("xdigit" "0123aBc" "łąk-, \t\n")
                 ("upper" "ABCŁĄKAǱĲΣ" "abcß0ﬁσςɕ12-, \t\n")
-                ;; FIXME: ßﬁɕ are all lower case (even though they don’t have
-                ;; (single-character) upper-case form).
-                ("lower" "abcłąkaσς" "ABC012ǱĲΣ-, \t\n")
+                ("lower" "abcłąkaßﬁσς" "ABC012ǱĲΣ-, \t\n")
 
                 ("word" "abcABC012\u2620ǱßĲﬁǲΣσςɕ" "-, \t\n")
 
-- 
2.8.0.rc3.226.g39d4020
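
The FIXME comments in uppercasep and lowercasep mention characters
designated Other_Uppercase and Other_Lowercase.  The following is only
a sketch of how such a check could be layered on top of the general
category, not part of this patch.  The ranges are a small hand-copied
subset chosen for illustration and should be checked against Unicode's
PropList.txt; the sketch_* and toy_* names are hypothetical, and the
general category is passed in by the caller rather than looked up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* General category reduced to the values this sketch needs.  */
enum toy_category { TOY_LU, TOY_LL, TOY_OTHER };

struct toy_range { int first, last; };

/* Illustrative subset of Other_Lowercase; not the complete list.  */
static const struct toy_range other_lowercase[] = {
  { 0x00AA, 0x00AA },  /* feminine ordinal indicator */
  { 0x00BA, 0x00BA },  /* masculine ordinal indicator */
  { 0x2170, 0x217F },  /* small roman numerals */
  { 0x24D0, 0x24E9 },  /* circled latin small letters */
};

/* Illustrative subset of Other_Uppercase; not the complete list.  */
static const struct toy_range other_uppercase[] = {
  { 0x2160, 0x216F },  /* roman numerals */
  { 0x24B6, 0x24CF },  /* circled latin capital letters */
};

/* Return true if C falls inside one of the N ranges in TABLE.  */
static bool
in_ranges (const struct toy_range *table, size_t n, int c)
{
  for (size_t i = 0; i < n; i++)
    if (table[i].first <= c && c <= table[i].last)
      return true;
  return false;
}

/* Lowercase = general category Ll, or listed as Other_Lowercase.  */
bool
sketch_lowercasep (enum toy_category gen_cat, int c)
{
  return gen_cat == TOY_LL
    || in_ranges (other_lowercase,
                  sizeof other_lowercase / sizeof other_lowercase[0], c);
}

/* Uppercase = general category Lu, or listed as Other_Uppercase.  */
bool
sketch_uppercasep (enum toy_category gen_cat, int c)
{
  return gen_cat == TOY_LU
    || in_ranges (other_uppercase,
                  sizeof other_uppercase / sizeof other_uppercase[0], c);
}

/* Roman numerals are not Ll/Lu, yet carry the Other_* properties.  */
void
sketch_demo (void)
{
  assert (sketch_lowercasep (TOY_OTHER, 0x2170));
  assert (sketch_uppercasep (TOY_OTHER, 0x2160));
  assert (!sketch_lowercasep (TOY_OTHER, 0x2160));
  assert (sketch_lowercasep (TOY_LL, 0x0061));
}
```

A real implementation would more likely consult a generated property
table than a linear scan over hard-coded ranges; the scan is used here
only to keep the example short.
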
