From eb8f72e7d238ee3351411b903898075c2787fc07 Mon Sep 17 00:00:00 2001
From: Julian Graham
Date: Sat, 2 Jan 2010 00:06:23 -0500
Subject: [PATCH] Support for Unicode string normalization functions

* libguile/strings.c, libguile/strings.h (normalize_str,
  scm_string_normalize_nfc, scm_string_normalize_nfd,
  scm_string_normalize_nfkc, scm_string_normalize_nfkd): New functions.
* test-suite/tests/strings.test: Unit tests for `string-normalize-nfc',
  `string-normalize-nfd', `string-normalize-nfkc', and
  `string-normalize-nfkd'.
* doc/ref/api-data.texi (String Comparison): Documentation for
  normalization functions.
---
 doc/ref/api-data.texi         |   64 ++++++++++++++++++++++++++++++++++++
 libguile/strings.c            |   73 +++++++++++++++++++++++++++++++++++++++++
 libguile/strings.h            |    5 +++
 test-suite/tests/strings.test |   40 ++++++++++++++++++++++
 4 files changed, 182 insertions(+), 0 deletions(-)

diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi
index b959ab9..9cc8ea5 100755
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -3252,6 +3252,70 @@ Compute a hash value for @var{S}. the optional argument @var{bound} is a non-ne
 Compute a hash value for @var{S}. the optional argument @var{bound} is a non-negative exact integer specifying the range of the hash function. A positive value restricts the return value to the range [0,bound).
 @end deffn
 
+Because the same visual appearance of an abstract Unicode character can
+be obtained via multiple sequences of Unicode characters, even the
+case-insensitive string comparison functions described above may return
+@code{#f} when presented with strings containing different
+representations of the same character. For example, the Unicode
+character ``LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE'' can be
+represented with a single character (U+1E69) or by the character ``LATIN
+SMALL LETTER S'' (U+0073) followed by the combining marks ``COMBINING
+DOT BELOW'' (U+0323) and ``COMBINING DOT ABOVE'' (U+0307).
+
+For this reason, it is often desirable to ensure that the strings
+to be compared are using a mutually consistent representation for every
+character. The Unicode Standard defines two methods of normalizing the
+contents of strings: Decomposition, which breaks composite characters
+into a set of constituent characters with an ordering defined by the
+Unicode Standard; and composition, which performs the converse.
+
+There are two decomposition operations. ``Canonical decomposition''
+produces character sequences that share the same visual appearance as
+the original characters, while ``compatibility decomposition'' produces
+ones whose visual appearances may differ from the originals but which
+represent the same abstract character.
+
+These operations are encapsulated in the following set of normalization
+forms:
+
+@table @dfn
+@item NFD
+Characters are decomposed to their canonical forms.
+
+@item NFKD
+Characters are decomposed to their compatibility forms.
+
+@item NFC
+Characters are decomposed to their canonical forms, then composed.
+
+@item NFKC
+Characters are decomposed to their compatibility forms, then composed.
+
+@end table
+
+The functions below put their arguments into one of the forms described
+above.
+
+@deffn {Scheme Procedure} string-normalize-nfd s
+@deffnx {C Function} scm_string_normalize_nfd (s)
+Return the @code{NFD} normalized form of @var{s}.
+@end deffn
+
+@deffn {Scheme Procedure} string-normalize-nfkd s
+@deffnx {C Function} scm_string_normalize_nfkd (s)
+Return the @code{NFKD} normalized form of @var{s}.
+@end deffn
+
+@deffn {Scheme Procedure} string-normalize-nfc s
+@deffnx {C Function} scm_string_normalize_nfc (s)
+Return the @code{NFC} normalized form of @var{s}.
+@end deffn
+
+@deffn {Scheme Procedure} string-normalize-nfkc s
+@deffnx {C Function} scm_string_normalize_nfkc (s)
+Return the @code{NFKC} normalized form of @var{s}.
+@end deffn
+
 @node String Searching
 @subsubsection String Searching
 
diff --git a/libguile/strings.c b/libguile/strings.c
index 711da9c..84df48a 100644
--- a/libguile/strings.c
+++ b/libguile/strings.c
@@ -25,6 +25,7 @@ #include