bug-gnu-emacs
bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t


From: Eli Zaretskii
Subject: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Tue, 16 Feb 2016 20:57:41 +0200

> From: Michael Heerdegen <michael_heerdegen@web.de>
> Cc: Marcin Borkowski <mbork@mbork.pl>,  18150@debbugs.gnu.org
> Date: Tue, 16 Feb 2016 19:38:21 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > What do we expect the result to be in the variant below?
> >
> >    (let ((str "ecole")
> >          (case-fold-search t))
> >      (when (string-match "[[:upper:]]" str)
> >        (match-string 0 str)))
> 
> According to the docstring of `case-fold-search', I would expect "e"
> (which the expression returns here).
> 
> Before having thought about it, 70% of me expected `nil'.

That's exactly the point.

If, when case-fold-search is non-nil, we want both [:upper:] and
[:lower:] to match any letter that has a case variant, then the patch
below seems to do the job.  Does anyone see a problem with it?

The gotcha here is that regex.c doesn't know what TRANSLATE does, and
no one promises that TRANSLATE downcases characters.  It could fold
them, for example, or, more generally, transform them in any way the
caller wants.  The patch below does the right thing when TRANSLATE downcases; when
it does something else, the question is: do we want to test the match
only on the result of TRANSLATE (which is what the original code
does), or do we want something else?
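
To make the difference concrete, here is a minimal stand-alone sketch of
the [:upper:] test before and after the patch.  It is not regex.c code:
the translate_downcase and translate_identity callbacks and the
upper_class_* helpers are hypothetical stand-ins for TRANSLATE and the
class test, and the <ctype.h> predicates replace ISUPPER/ISLOWER, so it
only models single-byte characters.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-in for TRANSLATE when it downcases.  */
static int translate_downcase (int c) { return tolower (c); }

/* Hypothetical stand-in for a TRANSLATE that changes nothing.  */
static int translate_identity (int c) { return c; }

/* Original test: only the translated character is examined.  */
static bool
upper_class_orig (int corig, int (*translate) (int))
{
  int c = translate (corig);
  return isupper (c) != 0;
}

/* Patched test: also accept a character that TRANSLATE changed
   into a lower-case one.  */
static bool
upper_class_patched (int corig, int (*translate) (int))
{
  int c = translate (corig);
  return isupper (c) || (corig != c && islower (c));
}
```

With the downcasing callback, upper_class_orig ('E', ...) is false
because only the translated 'e' is tested, while upper_class_patched
('E', ...) is true.  With the identity callback both variants accept 'E'.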

For the unibyte case, re_compile_pattern sets up a bitmap for
characters _after_ TRANSLATE, so things work as expected.  We cannot
do that for multibyte characters -- there are too many of them -- so
this problem arises.  AFAICS, it has existed since Emacs 20.
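
A rough sketch of the unibyte idea, under simplifying assumptions: the
bitmap layout, the fold_byte callback and the function names are made up
for illustration and do not mirror re_compile_pattern.  The point is only
that the class bitmap is filled with translated characters, so the matcher
can translate the input and do a plain bit test.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a downcasing TRANSLATE.  */
static int fold_byte (int c) { return tolower (c); }

/* Fill a 256-bit map for [:upper:]: for every byte in the class,
   set the bit of its translation, not of the byte itself.  */
static void
compile_upper_bitmap (unsigned char *bitmap, int (*translate) (int))
{
  memset (bitmap, 0, 32);
  for (int ch = 0; ch < 256; ch++)
    if (isupper (ch))
      {
        int t = translate (ch) & 0xff;
        bitmap[t / 8] |= 1 << (t % 8);
      }
}

/* Match one byte: translate it, then test its bit.  */
static bool
match_upper (const unsigned char *bitmap, int ch, int (*translate) (int))
{
  int t = translate (ch) & 0xff;
  return (bitmap[t / 8] >> (t % 8)) & 1;
}
```

After compile_upper_bitmap (bm, fold_byte) on a 32-byte array bm, both
match_upper (bm, 'E', fold_byte) and match_upper (bm, 'e', fold_byte) are
true, which is consistent with the "ecole" example returning "e".  A table
like this is feasible for 256 bytes but not for the whole multibyte range.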

diff --git a/src/regex.c b/src/regex.c
index dd3f2b3..27dce8b 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -5444,7 +5444,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
        case charset:
        case charset_not:
          {
-           register unsigned int c;
+           register unsigned int c, corig;
            boolean not = (re_opcode_t) *(p - 1) == charset_not;
            int len;
 
@@ -5473,7 +5473,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
              }
 
            PREFETCH ();
-           c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
+           corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
            if (target_multibyte)
              {
                int c1;
@@ -5517,11 +5517,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
              {
                int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
 
-               if (  (class_bits & BIT_LOWER && ISLOWER (c))
+               if (  (class_bits & BIT_LOWER
+                      && (ISLOWER (c) || (corig != c && ISUPPER(c))))
                    | (class_bits & BIT_MULTIBYTE)
                    | (class_bits & BIT_PUNCT && ISPUNCT (c))
                    | (class_bits & BIT_SPACE && ISSPACE (c))
-                   | (class_bits & BIT_UPPER && ISUPPER (c))
+                   | (class_bits & BIT_UPPER
+                      && (ISUPPER (c) || (corig != c && ISLOWER (c))))
                    | (class_bits & BIT_WORD  && ISWORD  (c))
                    | (class_bits & BIT_ALPHA && ISALPHA (c))
                    | (class_bits & BIT_ALNUM && ISALNUM (c))




