grep branch, master, updated. v2.15-15-g178ed7c

From: Jim Meyering
Subject: grep branch, master, updated. v2.15-15-g178ed7c
Date: Sat, 21 Dec 2013 18:58:54 +0000

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".

The branch, master has been updated
       via  178ed7cc324bc2000c19a3f7a4be649dfa99b44a (commit)
      from  1a8b1b370eace41be892e9fef041f36b72baeefb (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------

commit 178ed7cc324bc2000c19a3f7a4be649dfa99b44a
Author: Santiago Ruano Rincón <address@hidden>
Date:   Fri Dec 13 07:53:37 2013 -0800

    pcre: tell grep -P to relax its stance on invalid multibyte chars

    Do not exit-2 for invalid UTF-8 characters.  Just prior to this
    change, this command would match no lines and fail like this:

      $ printf 'j\x82\nj\n'|LC_ALL=en_US.UTF-8 grep -P j|cat -A; echo $?
      grep: invalid UTF-8 byte sequence in input

    After this change, the same command matches both lines and succeeds.

    * src/pcresearch.c (Pcompile): Use PCRE_NO_UTF8_CHECK, too, and
    add a comment.
    * tests/pcre-utf8: Add a test and a comment.

    This change did not work with Debian unstable pcre-8.31-2
    or with some 8.33 and 8.34-based versions, but does work with
    Fedora 20's 8.33 and with a built-from-latest source library.

    Based on a patch by Santiago Ruano Rincón.

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 7e81a31..664070d 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -62,7 +62,11 @@ Pcompile (char const *pattern, size_t size)
   if (STREQ (nl_langinfo (CODESET), "UTF-8"))
-    flags |= PCRE_UTF8;
+    {
+      /* Enable PCRE's UTF-8 matching, but disable the check that would
+         make an invalid byte sequence *in the input* trigger a failure.   */
+      flags |= PCRE_UTF8 | PCRE_NO_UTF8_CHECK;
+    }
 # endif
   /* FIXME: Remove these restrictions.  */
diff --git a/tests/pcre-utf8 b/tests/pcre-utf8
index b8228d5..a3b9390 100755
--- a/tests/pcre-utf8
+++ b/tests/pcre-utf8
@@ -19,9 +19,15 @@ echo '$' | LC_ALL=en_US.UTF-8 grep -qP '\p{S}' \
 euro='\342\202\254 euro'
 printf "$euro\\n" > in || framework_failure_
+# The euro sign has the unicode "Symbol" property, so this must match:
 LC_ALL=en_US.UTF-8 grep -P '^\p{S}' in > out || fail=1
 compare in out || fail=1
+# This RE must *not* match in the C locale, because the first
+# byte is not a "Symbol".
+LC_ALL=C grep -P '^\p{S}' in > out && fail=1
+compare /dev/null out || fail=1
 LC_ALL=en_US.UTF-8 grep -P '^. euro$' in > out2 || fail=1
 compare in out2 || fail=1


Summary of changes:
 src/pcresearch.c |    6 +++++-
 tests/pcre-utf8  |    6 ++++++
 2 files changed, 11 insertions(+), 1 deletions(-)
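
Note on the byte sequences used above: in 'j\x82', the byte 0x82 is a
UTF-8 continuation byte with no lead byte before it, which is why the
input counts as invalid, while \342\202\254 is the three-byte encoding
of the euro sign, U+20AC. The C sketch below illustrates the kind of
validity check that PCRE_NO_UTF8_CHECK tells the library to skip. It is
not PCRE's implementation; utf8_decode and utf8_valid are hypothetical
names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of grep or PCRE: decode one UTF-8
   sequence starting at S, reading at most N bytes.  Return the sequence
   length (1..4) and store the code point in *CP, or return 0 if the
   bytes do not begin a valid sequence.  */
size_t
utf8_decode (unsigned char const *s, size_t n, unsigned long *cp)
{
  size_t len, i;
  unsigned long c, min;

  assert (s != NULL && cp != NULL);
  if (n == 0)
    return 0;

  if (s[0] < 0x80)
    { len = 1; c = s[0]; min = 0; }
  else if (s[0] < 0xC0)
    return 0;                   /* lone continuation byte, e.g. 0x82 */
  else if (s[0] < 0xE0)
    { len = 2; c = s[0] & 0x1F; min = 0x80; }
  else if (s[0] < 0xF0)
    { len = 3; c = s[0] & 0x0F; min = 0x800; }
  else if (s[0] < 0xF8)
    { len = 4; c = s[0] & 0x07; min = 0x10000; }
  else
    return 0;                   /* 0xF8..0xFF never appear in UTF-8 */

  if (n < len)
    return 0;                   /* truncated sequence */
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;               /* expected a continuation byte */
      c = (c << 6) | (s[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates, and values above U+10FFFF.  */
  if (c < min || c > 0x10FFFF || (0xD800 <= c && c <= 0xDFFF))
    return 0;

  *cp = c;
  return len;
}

/* Return 1 if the N bytes at S are entirely valid UTF-8, else 0.  */
int
utf8_valid (unsigned char const *s, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      unsigned long cp;
      size_t len = utf8_decode (s + i, n - i, &cp);
      if (len == 0)
        return 0;
      i += len;
    }
  return 1;
}
```

With these definitions, utf8_valid rejects the five bytes of
'j\x82\nj\n' from the commit message, and utf8_decode maps the three
bytes \342\202\254 from tests/pcre-utf8 to the single code point U+20AC.
In the C locale those three bytes are instead treated as three separate
characters, which is why '^\p{S}' must not match there.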

