grep-commit
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

grep branch, master, updated. v3.8-15-g5e3b760


From: Jim Meyering
Subject: grep branch, master, updated. v3.8-15-g5e3b760
Date: Sat, 7 Jan 2023 21:26:42 -0500 (EST)

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".

The branch, master has been updated
       via  5e3b760f65f13856e5717e5b9d935f5b4a615be3 (commit)
      from  45e1158a4bb44e507239274535290db61dd27577 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=5e3b760f65f13856e5717e5b9d935f5b4a615be3


commit 5e3b760f65f13856e5717e5b9d935f5b4a615be3
Author: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Date:   Fri Jan 6 19:34:56 2023 -0800

    pcre: use UCP in UTF mode
    
    This fixes a serious bug affecting word-boundary and word-constituent 
regular
    expressions when the desired match involves non-ASCII UTF8 characters.
    * src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF
    * tests/pcre-utf8-w: New file.
    * tests/Makefile.am (TESTS): Add it.
    * NEWS (Bug fixes): Mention this.
    * THANKS.in: Add Gro-Tsen and Karl Petterson.
    Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
    via Karl Pettersson in https://github.com/PCRE2Project/pcre2/issues/185
    This bug was present from grep-2.5, when --perl-regexp (-P) support was 
added.

diff --git a/NEWS b/NEWS
index b404708..24ee084 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,12 @@ GNU grep NEWS                                    -*- outline 
-*-
 
 ** Bug fixes
 
+  With -P, some non-ASCII UTF8 characters were not recognized as
+  word-constituent due to our omission of the PCRE2_UCP flag. E.g.,
+  given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
+  this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
+  After the fix, it prints the correct results: "rú:ú".
+
   When given multiple patterns the last of which has a back-reference,
   grep no longer sometimes mistakenly matches lines in some cases.
   [Bug#36148#13 introduced in grep 3.4]
diff --git a/THANKS.in b/THANKS.in
index 9872bfa..d0d6f92 100644
--- a/THANKS.in
+++ b/THANKS.in
@@ -35,6 +35,7 @@ Gerald Stoller                      gerald_stoller@hotmail.com
 Grant McDorman                      grant@isgtec.com
 Greg Boyd                           gboyd.ccsf@gmail.com
 Greg Louis                          glouis@dynamicro.on.ca
+Gro-Tsen                            https://twitter.com/gro_tsen
 Guglielmo 'bond' Bondioni           g.bondioni@libero.it
 H. Merijn Brand                     h.m.brand@hccnet.nl
 Harald Hanche-Olsen                 hanche@math.ntnu.no
@@ -50,6 +51,7 @@ Joel N. Weber II                    devnull@gnu.org
 John Hughes                         john@nitelite.calvacom.fr
 Jorge Stolfi                        stolfi@dcc.unicamp.br
 Karl Heuer                          kwzh@gnu.org
+Karl Petterson                      karl.pettersson@klpn.se
 Kaveh R. Ghazi                      ghazi@caip.rutgers.edu
 Kazuro Furukawa                     furukawa@apricot.kek.jp
 Keith Bostic                        bostic@bsdi.com
diff --git a/src/pcresearch.c b/src/pcresearch.c
index a107f4d..45b67ee 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -149,7 +149,7 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, 
bool exact)
     {
       if (! localeinfo.using_utf8)
         die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales"));
-      flags |= PCRE2_UTF;
+      flags |= (PCRE2_UTF | PCRE2_UCP);
 #if 0
       /* Do not match individual code units but only UTF-8.  */
       flags |= PCRE2_NEVER_BACKSLASH_C;
diff --git a/tests/Makefile.am b/tests/Makefile.am
index e0b0503..a47cf5c 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -147,6 +147,7 @@ TESTS =                                             \
   pcre-jitstack                                        \
   pcre-o                                       \
   pcre-utf8                                    \
+  pcre-utf8-w                                  \
   pcre-w                                       \
   pcre-wx-backref                              \
   pcre-z                                       \
diff --git a/tests/pcre-utf8-w b/tests/pcre-utf8-w
new file mode 100755
index 0000000..4cd7db6
--- /dev/null
+++ b/tests/pcre-utf8-w
@@ -0,0 +1,28 @@
+#!/bin/sh
+# Ensure non-ASCII UTF-8 characters are correctly identified as word-consituent
+#
+# Copyright (C) 2023 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+LC_ALL=en_US.UTF-8
+export LC_ALL
+require_pcre_
+
+fail=0
+
+echo 'Perú'> in || framework_failure_
+
+echo 'ú' > exp || framework_failure_
+grep -Po '.\b' in > out || fail=1
+compare exp out || fail=1
+
+echo 'rú' > exp || framework_failure_
+grep -Po 'r\w' in > out || fail=1
+compare exp out || fail=1
+
+Exit $fail

-----------------------------------------------------------------------

Summary of changes:
 NEWS              |  6 ++++++
 THANKS.in         |  2 ++
 src/pcresearch.c  |  2 +-
 tests/Makefile.am |  1 +
 tests/pcre-utf8-w | 28 ++++++++++++++++++++++++++++
 5 files changed, 38 insertions(+), 1 deletion(-)
 create mode 100755 tests/pcre-utf8-w


hooks/post-receive
-- 
grep



reply via email to

[Prev in Thread] Current Thread [Next in Thread]