bug-gnulib

[PATCH] regex: Fix fastmap for multibyte character ranges.


From: Paolo Bonzini
Subject: [PATCH] regex: Fix fastmap for multibyte character ranges.
Date: Wed, 25 Nov 2009 11:46:32 +0100

This is another bug in computing the fastmap.  I overlooked it when
fixing the fastmap mess, because it usually does not show up with _LIBC.
However, it is present in that case too.

The bug is that whenever a range appears at the beginning of the regex,
the regex must be tried at every possible multibyte character, so the
fastmap has to contain every multibyte lead byte; currently it does not.
_LIBC masks the problem because there is almost always a collation symbol
for each possible multibyte-character lead byte, so all the lead bytes are
generally already part of the fastmap.

A simple reproducer is the following sed script:

$ echo 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя' | ./bad-sed -e 's/[а-я]/!/g'
абвгдеёжзийклмнопрстуфхцчшщъыьэюя
$ echo 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя' | ./good-sed -e 's/[а-я]/!/g'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

2009-11-25  Paolo Bonzini  <address@hidden>

        * lib/regcomp.c (re_compile_fastmap_iter): Add all multibyte lead
        characters when a multibyte character range is included.
---
 ChangeLog     |    6 ++++++
 lib/regcomp.c |    2 +-
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index fcdf307..54c5514 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,9 @@
+2009-11-25  Paolo Bonzini  <address@hidden>
+
+       regex: Fix fastmap for multibyte character ranges.
+       * lib/regcomp.c (re_compile_fastmap_iter): Add all multibyte lead
+       characters when a multibyte character range is included.
+
 2009-11-22  Andy Wingo  <address@hidden>
 
        version-etc: work also with AM_INIT_AUTOMAKE's no-define option
diff --git a/lib/regcomp.c b/lib/regcomp.c
index 6472ff6..6aef405 100644
--- a/lib/regcomp.c
+++ b/lib/regcomp.c
@@ -383,7 +383,7 @@ re_compile_fastmap_iter (regex_t *bufp, const re_dfastate_t *init_state,
             applies to multibyte character sets; for single byte character
             sets, the SIMPLE_BRACKET again suffices.  */
          if (dfa->mb_cur_max > 1
-             && (cset->nchar_classes || cset->non_match
+             && (cset->nchar_classes || cset->non_match || cset->nranges
 # ifdef _LIBC
                  || cset->nequiv_classes
 # endif /* _LIBC */
-- 
1.6.5.2




