From: Michal Nazarewicz
Subject: bug#24603: [RFC 15/18] Base lower- and upper-case tests on Unicode properties
Date: Tue, 4 Oct 2016 03:10:38 +0200
Not all lower-case characters have a simple upper-case form.
For example, the fi ligature (U+FB01) has no one-character upper-case
mapping.  Similarly, ɕ from the IPA block has no upper-case form at all.
It is therefore not sufficient to look up a character’s upper-case form
to determine whether it is lower case.  As such, rewrite the tests
to be based on Unicode properties.
* src/buffer.h (uppercasep, lowercasep): Delete.
* src/character.c (uppercasep, lowercasep): New functions which base
their test on Unicode character properties rather than the case table.
* test/src/regex-tests.el (regex-tests-lower-character-class): Include
ß letter and fi ligature in the test.
---
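To illustrate the reasoning in the commit message, here is a minimal,
self-contained sketch that contrasts the old mapping-based test with a
category-based one.  It is not the Emacs code: toy_table, toy_lookup,
toy_upcase, toy_downcase and the old_*/new_* names are hypothetical
stand-ins for the case table and for char_unicode_category, and the
table covers only a handful of characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for Unicode data.  This is NOT the Emacs case table
   nor char_unicode_category; it only models a few characters.  */
enum toy_category { TOY_Lu, TOY_Ll, TOY_OTHER };

struct toy_char
{
  int code;              /* code point */
  int simple_upper;      /* simple upper-case mapping, CODE if none */
  int simple_lower;      /* simple lower-case mapping, CODE if none */
  enum toy_category cat; /* general category (simplified) */
};

const struct toy_char toy_table[] = {
  { 0x0041, 0x0041, 0x0061, TOY_Lu },    /* A */
  { 0x0061, 0x0041, 0x0061, TOY_Ll },    /* a */
  { 0x00DF, 0x00DF, 0x00DF, TOY_Ll },    /* ß: no simple upper-case form */
  { 0xFB01, 0xFB01, 0xFB01, TOY_Ll },    /* fi ligature: upper case is "FI" */
  { 0x0031, 0x0031, 0x0031, TOY_OTHER }  /* 1 */
};

/* Return the table entry for C, or NULL if C is not modelled.  */
const struct toy_char *
toy_lookup (int c)
{
  for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++)
    if (toy_table[i].code == c)
      return &toy_table[i];
  return NULL;
}

int
toy_upcase (int c)
{
  const struct toy_char *p = toy_lookup (c);
  return p ? p->simple_upper : c;
}

int
toy_downcase (int c)
{
  const struct toy_char *p = toy_lookup (c);
  return p ? p->simple_lower : c;
}

/* Old approach: infer case from whether a mapping changes C.  */
bool
old_uppercasep (int c)
{
  return toy_downcase (c) != c;
}

bool
old_lowercasep (int c)
{
  return !old_uppercasep (c) && toy_upcase (c) != c;
}

/* New approach: ask for the general category directly.  */
bool
new_uppercasep (int c)
{
  const struct toy_char *p = toy_lookup (c);
  return p && p->cat == TOY_Lu;
}

bool
new_lowercasep (int c)
{
  const struct toy_char *p = toy_lookup (c);
  return p && p->cat == TOY_Ll;
}

/* Both approaches agree on plain letters and differ on ß and fi.  */
void
self_check (void)
{
  assert (old_lowercasep (0x0061) && new_lowercasep (0x0061));
  assert (old_uppercasep (0x0041) && new_uppercasep (0x0041));
  assert (!old_lowercasep (0x00DF) && new_lowercasep (0x00DF));
  assert (!old_lowercasep (0xFB01) && new_lowercasep (0xFB01));
}
```

In this toy model the mapping-based test cannot tell ß or the fi
ligature apart from a caseless character such as a digit, because
neither has a simple mapping that differs from itself.  The
category-based test does not depend on any mapping at all.
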
etc/NEWS | 7 +++++++
src/buffer.h | 13 -------------
src/character.c | 24 ++++++++++++++++++++++++
src/character.h | 2 ++
test/src/regex-tests.el | 4 +---
5 files changed, 34 insertions(+), 16 deletions(-)
diff --git a/etc/NEWS b/etc/NEWS
index 4516812..727af59 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -251,6 +251,10 @@ For example, fi ligature is converted to FI when upper cased.
Strings such as ΌΣΟΣ are now correctly converted to Όσος when
capitalised to follow rules of Greek spelling.
+*** 'upper' and 'lower' classes match characters without simple cased forms.
+For instance, the letter ß and the fi ligature are now matched by the
+[[:lower:]] regular expression.
+
* Changes in Specialized Modes and Packages in Emacs 26.1
@@ -512,6 +516,9 @@ of curved quotes in format arguments to functions like 'message' and
now generate less chatter and more-compact diagnostics. The auxiliary
function 'check-declare-errmsg' has been removed.
+** The 'upper' and 'lower' character classes are unaffected by the case
+table since they are now based purely on Unicode properties.
+
* Lisp Changes in Emacs 26.1
diff --git a/src/buffer.h b/src/buffer.h
index 1543f67..aade0ea 100644
--- a/src/buffer.h
+++ b/src/buffer.h
@@ -1358,19 +1358,6 @@ upcase (int c)
return NATNUMP (up) ? XFASTINT (up) : c;
}
-/* True if C is upper case. */
-INLINE bool
-uppercasep (int c)
-{
- return downcase (c) != c;
-}
-
-/* True if C is lower case. */
-INLINE bool
-lowercasep (int c)
-{
- return !uppercasep (c) && upcase (c) != c;
-}
-
INLINE_HEADER_END
#endif /* EMACS_BUFFER_H */
diff --git a/src/character.c b/src/character.c
index 1e49536..707ae10 100644
--- a/src/character.c
+++ b/src/character.c
@@ -967,6 +967,30 @@ char_unicode_category (int c)
return INTEGERP (category) ? XINT (category) : UNICODE_CATEGORY_UNKNOWN;
}
+/* Return true if C is an upper case character.  This does not imply it
+   has a lower case form.  */
+bool
+uppercasep (int c)
+{
+ unicode_category_t gen_cat = char_unicode_category (c);
+
+ /* See UTS #18. There are additional characters that should be
+ here, those designated as Other_uppercase; FIXME. */
+ return gen_cat == UNICODE_CATEGORY_Lu;
+}
+
+/* Return true if C is a lower case character.  This does not imply it
+   has an upper case form.  */
+bool
+lowercasep (int c)
+{
+ unicode_category_t gen_cat = char_unicode_category (c);
+
+ /* See UTS #18. There are additional characters that should be
+ here, those designated as Other_lowercase; FIXME. */
+ return gen_cat == UNICODE_CATEGORY_Ll;
+}
+
/* Return true if C is an alphabetic character. */
bool
alphabeticp (int c)
diff --git a/src/character.h b/src/character.h
index fc8a0dd..5931c5c 100644
--- a/src/character.h
+++ b/src/character.h
@@ -676,6 +676,8 @@ extern ptrdiff_t lisp_string_width (Lisp_Object, ptrdiff_t,
extern Lisp_Object Vchar_unify_table;
extern Lisp_Object string_escape_byte8 (Lisp_Object);
+extern bool uppercasep (int);
+extern bool lowercasep (int);
extern bool alphabeticp (int);
extern bool alphanumericp (int);
extern bool graphicp (int);
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index fa66ff1..fc50344 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -70,9 +70,7 @@ regex--test-cc
("digit" "012" "abcABCłąka-, \t\n")
("xdigit" "0123aBc" "łąk-, \t\n")
("upper" "ABCŁĄKADZIJΣ" "abcß0fiσςɕ12-, \t\n")
- ;; FIXME: ßfiɕ are all lower case (even though they don’t have
- ;; (single-character) upper-case form).
- ("lower" "abcłąkaσς" "ABC012DZIJΣ-, \t\n")
+ ("lower" "abcłąkaßfiσς" "ABC012DZIJΣ-, \t\n")
("word" "abcABC012\u2620DZßIJfiDzΣσςɕ" "-, \t\n")
--
2.8.0.rc3.226.g39d4020