[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: RRI patches for grep
From: |
Paolo Bonzini |
Subject: |
Re: RRI patches for grep |
Date: |
Fri, 27 Apr 2012 12:27:29 +0200 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120329 Thunderbird/11.0.1 |
Il 27/04/2012 11:07, Aharon Robbins ha scritto:
> Here are the updated RRI patches for grep. First one is for dfa.c and
> doc/grep.texi. NOT handled is removal of hard-locale.[ch] from lib/ and
> from the make infrastructure.
>
> The second patch is for gnulib. Both are relative to master in both
> git repos as of less than an hour ago.
>
> Thanks,
>
> Arnold
> ------------------
> From 9b16fdee4edf2b4ea8fc4cfc6b6c45bde6ec8cd4 Mon Sep 17 00:00:00 2001
> From: Arnold D. Robbins <address@hidden>
> Date: Fri, 27 Apr 2012 12:03:16 +0300
> Subject: [PATCH] Implement/Document Rational Range Interpretation.
>
> ---
> doc/grep.texi | 21 ++++++++++++++++-----
> src/dfa.c | 40 ++++++----------------------------------
> 2 files changed, 22 insertions(+), 39 deletions(-)
>
> diff --git a/doc/grep.texi b/doc/grep.texi
> index 000a844..3af72f3 100644
> --- a/doc/grep.texi
> +++ b/doc/grep.texi
> @@ -958,9 +958,7 @@ They are omitted (i.e., false) by default and become true
> when specified.
> @cindex character type
> @cindex national language support
> @cindex NLS
> -These variables specify the locale for the @code{LC_COLLATE} category,
> -which determines the collating sequence
> -used to interpret range expressions like @samp{[a-z]}.
> +These variables specify the locale for the @code{LC_COLLATE} category.
>
> @item LC_ALL
> @itemx LC_CTYPE
> @@ -1221,7 +1219,12 @@ For example, the regular expression
> Within a bracket expression, a @dfn{range expression} consists of two
> characters separated by a hyphen.
> It matches any single character that
> -sorts between the two characters, inclusive, using the locale's
> +sorts between the two characters, inclusive,
> +using the machine's character set.
> +
> +Up to and including version 2.12 of @command{grep},
> +range expressions would match any single character that sorted between
> +the two characters, inclusive, using the current locale's
> collating sequence and character set.
> For example, in the default C
> locale, @samp{[a-d]} is equivalent to @samp{[abcd]}.
> @@ -1230,9 +1233,17 @@ characters in dictionary order, and in these locales
> @samp{[a-d]} is
> typically not equivalent to @samp{[abcd]};
> it might be equivalent to @samp{[aBbCcDd]}, for example.
> To obtain the traditional interpretation
> -of bracket expressions, you can use the @samp{C} locale by setting the
> +of bracket expressions, it was necessary to use the @samp{C} locale
> +by setting the
> @env{LC_ALL} environment variable to the value @samp{C}.
>
> +Since the current POSIX standard now makes the behavior of range expressions
> +be implementation-defined, instead of requiring the locale's
> +collating order, @command{grep} has reverted to the traditional Unix
> +behavior of defining ranges based on the machine character address@hidden
> +is known as ``Rational Range Interpretation,'' a lovely phrase
> +coined by Karl Berry.}
> +
> Finally, certain named classes of characters are predefined within
> bracket expressions, as follows.
> Their interpretation depends on the @code{LC_CTYPE} locale;
These cannot yet go in, because the documentation would be wrong for
--without-included-regex.
> diff --git a/src/dfa.c b/src/dfa.c
> index 1cbe537..c690e10 100644
> --- a/src/dfa.c
> +++ b/src/dfa.c
> @@ -29,6 +29,7 @@
> #include <limits.h>
> #include <string.h>
> #include <locale.h>
> +#include <stdbool.h>
>
> #define STREQ(a, b) (strcmp (a, b) == 0)
>
> @@ -46,7 +47,7 @@
> #include "gettext.h"
> #define _(str) gettext (str)
>
> -#include "mbsupport.h" /* defines MBS_SUPPORT if appropriate */
> +#include "mbsupport.h" /* defines MBS_SUPPORT to 1 or 0, as
> appropriate */
> #include <wchar.h>
> #include <wctype.h>
>
> @@ -56,7 +57,6 @@
>
> #include "regex.h"
> #include "dfa.h"
> -#include "hard-locale.h"
> #include "xalloc.h"
>
> /* HPUX, define those as macros in sys/param.h */
> @@ -777,7 +777,6 @@ static int laststart; /* True if we're
> separated from beginning or (,
> only by zero-width characters. */
> static size_t parens; /* Count of outstanding left parens. */
> static int minrep, maxrep; /* Repeat counts for {m,n}. */
> -static int hard_LC_COLLATE; /* Nonzero if LC_COLLATE is hard. */
>
> static int cur_mb_len = 1; /* Length of the multibyte representation of
> wctok. */
> @@ -1111,26 +1110,8 @@ parse_bracket_exp (void)
> c1 = tolower (c1);
> c2 = tolower (c2);
> }
> - if (!hard_LC_COLLATE)
> - for (c = c1; c <= c2; c++)
> - setbit_case_fold_c (c, ccl);
> - else
> - {
> - /* Defer to the system regex library about the meaning
> - of range expressions. */
> - regex_t re;
> - char pattern[6] = { '[', c1, '-', c2, ']', 0 };
> - char subject[2] = { 0, 0 };
> - regcomp (&re, pattern, REG_NOSUB);
> - for (c = 0; c < NOTCHAR; ++c)
> - {
> - subject[0] = c;
> - if (!(case_fold && isupper (c))
> - && regexec (&re, subject, 0, NULL, 0) !=
> REG_NOMATCH)
> - setbit_case_fold_c (c, ccl);
> - }
> - regfree (&re);
> - }
> + for (c = c1; c <= c2; c++)
> + setbit_case_fold_c (c, ccl);
Again, this is wrong, and unnecessary if the regex library uses the
desired meaning of range expressions.
If anything, the "if" can be removed, and only the "else" left in.
> }
>
> colon_warning_state |= 8;
> @@ -1878,9 +1859,6 @@ dfaparse (char const *s, size_t len, struct dfa *d)
> lasttok = END;
> laststart = 1;
> parens = 0;
> -#ifdef LC_COLLATE
> - hard_LC_COLLATE = hard_locale (LC_COLLATE);
> -#endif
> if (MB_CUR_MAX > 1)
> {
> cur_mb_len = 0;
> @@ -2966,7 +2944,6 @@ match_mb_charset (struct dfa *d, state_num s, position
> pos, size_t idx)
> with which this operator match. */
> int op_len; /* Length of the operator. */
> char buffer[128];
> - wchar_t wcbuf[6];
>
> /* Pointer to the structure to which we are currently referring. */
> struct mb_char_classes *work_mbc;
> @@ -3039,16 +3016,11 @@ match_mb_charset (struct dfa *d, state_num s,
> position pos, size_t idx)
> }
> }
>
> - wcbuf[0] = wc;
> - wcbuf[1] = wcbuf[3] = wcbuf[5] = '\0';
> -
> /* match with a range? */
> for (i = 0; i < work_mbc->nranges; i++)
> {
> - wcbuf[2] = work_mbc->range_sts[i];
> - wcbuf[4] = work_mbc->range_ends[i];
> -
> - if (wcscoll (wcbuf, wcbuf + 2) >= 0 && wcscoll (wcbuf + 4, wcbuf) >= 0)
> + if (work_mbc->range_sts[i] <= wc &&
> + wc <= work_mbc->range_ends[i])
> goto charset_matched;
> }
>
I'm applying this part of the patch.
Paolo