grep branch, master, updated. v2.16-7-g1078b64


From: Paul Eggert
Subject: grep branch, master, updated. v2.16-7-g1078b64
Date: Fri, 17 Jan 2014 22:32:44 +0000

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".

The branch, master has been updated
       via  1078b64302bbf5c0a46635772808ff7f75171dbc (commit)
       via  45284e38cfb07343ab50d20b116375c8a1d64196 (commit)
      from  97d3430c75a9dd82d871eca170b13c1f8d895fad (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=1078b64302bbf5c0a46635772808ff7f75171dbc


commit 1078b64302bbf5c0a46635772808ff7f75171dbc
Author: Paul Eggert <address@hidden>
Date:   Fri Jan 17 14:32:10 2014 -0800

    grep: DFA now uses rational ranges in unibyte locales
    
    Problem reported by Aharon Robbins in <http://bugs.gnu.org/16481>.
    * NEWS:
    * doc/grep.texi (Environment Variables)
    (Character Classes and Bracket Expressions):
    Document this.
    * src/dfa.c (parse_bracket_exp): Treat unibyte locales like multibyte.

diff --git a/NEWS b/NEWS
index 6e46684..589b2ac 100644
--- a/NEWS
+++ b/NEWS
@@ -7,6 +7,14 @@ GNU grep NEWS                                    -*- outline -*-
   grep -i in a multibyte locale is now typically 10 times faster
   for patterns that do not contain \ or [.
 
+  Range expressions in unibyte locales now ordinarily use the rational
+  range interpretation, in which [a-z] matches only lower-case ASCII
+  letters regardless of locale, and similarly for other ranges.  (This
+  was already true for multibyte locales.)  Portable programs should
+  continue to specify the C locale when using range expressions, since
+  these expressions have unspecified behavior in non-GNU systems and
+  are not yet guaranteed to use the rational range interpretation even
+  in GNU systems.
 
 * Noteworthy changes in release 2.16 (2014-01-01) [stable]
 
diff --git a/doc/grep.texi b/doc/grep.texi
index 473a181..42fb9a2 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -960,8 +960,8 @@ They are omitted (i.e., false) by default and become true when specified.
 @cindex national language support
 @cindex NLS
 These variables specify the locale for the @code{LC_COLLATE} category,
-which determines the collating sequence
-used to interpret range expressions like @samp{[a-z]}.
+which might affect how range expressions like @samp{[a-z]} are
+interpreted.
 
 @item LC_ALL
 @itemx LC_CTYPE
@@ -1223,14 +1223,13 @@ For example, the regular expression
 Within a bracket expression, a @dfn{range expression} consists of two
 characters separated by a hyphen.
 It matches any single character that
-sorts between the two characters, inclusive, using the locale's
-collating sequence and character set.
-For example, in the default C
-locale, @samp{[a-d]} is equivalent to @samp{[abcd]}.
-Many locales sort
-characters in dictionary order, and in these locales @samp{[a-d]} is
-typically not equivalent to @samp{[abcd]};
-it might be equivalent to @samp{[aBbCcDd]}, for example.
+sorts between the two characters, inclusive.
+In the default C locale, the sorting sequence is the native character
+order; for example, @samp{[a-d]} is equivalent to @samp{[abcd]}.
+In other locales, the sorting sequence is not specified, and
+@samp{[a-d]} might be equivalent to @samp{[abcd]} or to
+@samp{[aBbCcDd]}, or it might fail to match any character, or the set of
+characters that it matches might even be erratic.
 To obtain the traditional interpretation
 of bracket expressions, you can use the @samp{C} locale by setting the
 @env{LC_ALL} environment variable to the value @samp{C}.
diff --git a/src/dfa.c b/src/dfa.c
index 6ab4e05..5e3140d 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1108,30 +1108,14 @@ parse_bracket_exp (void)
             }
           else
             {
-              /* Defer to the system regex library about the meaning
-                 of range expressions.  */
-              regex_t re;
-              char pattern[6] = { '[', 0, '-', 0, ']', 0 };
-              char subject[2] = { 0, 0 };
               c1 = c;
               if (case_fold)
                 {
                   c1 = tolower (c1);
                   c2 = tolower (c2);
                 }
-
-              pattern[1] = c1;
-              pattern[3] = c2;
-              regcomp (&re, pattern, REG_NOSUB);
-              for (c = 0; c < NOTCHAR; ++c)
-                {
-                  if ((case_fold && isupper (c)))
-                    continue;
-                  subject[0] = c;
-                  if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH)
-                    setbit_case_fold_c (c, ccl);
-                }
-              regfree (&re);
+              for (c = c1; c <= c2; c++)
+                setbit_case_fold_c (c, ccl);
             }
 
           colon_warning_state |= 8;
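
The two added lines in parse_bracket_exp replace a probe of the system
regex library with a plain loop over byte values. The following is a
minimal, standalone sketch of that "rational range" idea, not the code
in src/dfa.c: the sketch_* names and the bitset type are hypothetical
stand-ins for dfa.c's charclass and NOTCHAR, and the case folding done
by setbit_case_fold_c is deliberately left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for dfa.c's NOTCHAR: one slot per byte value.  */
enum { SKETCH_NOTCHAR = 1 << CHAR_BIT };

/* Hypothetical stand-in for dfa.c's charclass: one bit per byte value.  */
typedef struct
{
  unsigned char bits[SKETCH_NOTCHAR / CHAR_BIT];
} sketch_charclass;

/* Mark byte value C as a member of CCL.  */
static void
sketch_setbit (int c, sketch_charclass *ccl)
{
  ccl->bits[c / CHAR_BIT] |= 1u << (c % CHAR_BIT);
}

/* Return true if byte value C is a member of CCL.  */
static bool
sketch_tstbit (int c, const sketch_charclass *ccl)
{
  return (ccl->bits[c / CHAR_BIT] >> (c % CHAR_BIT)) & 1;
}

/* Rational range: add every byte value from C1 through C2, inclusive,
   without consulting the locale.  Adds nothing if C1 > C2.  */
static void
sketch_add_range (int c1, int c2, sketch_charclass *ccl)
{
  assert (0 <= c1 && c1 < SKETCH_NOTCHAR);
  assert (0 <= c2 && c2 < SKETCH_NOTCHAR);
  for (int c = c1; c <= c2; c++)
    sketch_setbit (c, ccl);
}

/* Count the members of CCL.  */
static int
sketch_count (const sketch_charclass *ccl)
{
  int n = 0;
  for (int c = 0; c < SKETCH_NOTCHAR; c++)
    n += sketch_tstbit (c, ccl);
  return n;
}
```

Calling sketch_add_range ('a', 'd', &ccl) on a zeroed set marks only the
bytes between the two endpoints in native character order, which is why
the result no longer depends on LC_COLLATE.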

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=45284e38cfb07343ab50d20b116375c8a1d64196


-----------------------------------------------------------------------

Summary of changes:
 NEWS          |    8 ++++++++
 doc/grep.texi |   19 +++++++++----------
 src/dfa.c     |   20 ++------------------
 src/grep.c    |   14 ++++++++++++++
 4 files changed, 33 insertions(+), 28 deletions(-)
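
The NEWS entry in this push advises portable programs to keep
specifying the C locale when using range expressions. One way to check
what a range matches on a given system is to probe the POSIX regex
functions byte by byte, much as the code removed from src/dfa.c did.
The sketch below assumes a POSIX <regex.h>; the function name
sketch_count_matching_bytes is hypothetical and is not part of grep.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Count the single bytes (values 1 through 255) that PATTERN matches
   after switching the whole process to LOCALE_NAME.  Return -1 if the
   locale cannot be set or the pattern does not compile.  */
static int
sketch_count_matching_bytes (const char *locale_name, const char *pattern)
{
  regex_t re;
  char subject[2] = { 0, 0 };
  int count = 0;

  assert (locale_name != NULL && pattern != NULL);

  if (!setlocale (LC_ALL, locale_name))
    return -1;
  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return -1;

  /* Byte 0 is skipped because it would make the subject string empty.  */
  for (int c = 1; c < 256; c++)
    {
      subject[0] = (char) c;
      if (regexec (&re, subject, 0, NULL, 0) == 0)
        count++;
    }

  regfree (&re);
  return count;
}
```

With locale_name "C" and pattern "[a-d]" the count should be 4, matching
the documented equivalence to [abcd]. With any other locale name the
result is whatever the system regex library decides, which is the
unspecified behavior the documentation change describes.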


hooks/post-receive
-- 
grep


