bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
From: Aharon Robbins
Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
Date: Thu, 27 Feb 2014 22:31:14 +0200
User-agent: Heirloom mailx 12.5 6/20/10
Hi Paul.
> Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
> To: address@hidden
> Date: Thu, 27 Feb 2014 09:34:33 -0800
> From: Paul Eggert <address@hidden>
>
> I'm afraid there are several problems in the dfa code. I still don't
> have a handle on all of them, but here's my first patch to deal with the
> first major one I found. Patterns like [a-[.z.]], which caused 'grep'
> to dump core until recently, still aren't being handled correctly, and
> there are several closely related bugs here. I've taken the liberty of
> pushing the attached patch.
Thanks. This looks promising. A few comments / questions.
> +/* Return true if the current locale is known to be a unibyte locale
> + without multicharacter collating sequences and where range
> + comparisons simply use the native encoding. These locales can be
> + processed more efficiently. */
> +
> +static bool
> +using_simple_locale (void)
> +{
> + /* True if the native character set is known to be compatible with
> + the C locale. The following test isn't perfect, but it's good
> + enough in practice, as only ASCII and EBCDIC are in common use
> + and this test correctly accepts ASCII and rejects EBCDIC. */
> + enum { native_c_charset =
> + ('\b' == 8 && '\t' == 9 && '\n' == 10 && '\v' == 11 && '\f' == 12
> + && '\r' == 13 && ' ' == 32 && '!' == 33 && '"' == 34 && '#' == 35
> + && '%' == 37 && '&' == 38 && '\'' == 39 && '(' == 40 && ')' == 41
> + && '*' == 42 && '+' == 43 && ',' == 44 && '-' == 45 && '.' == 46
> + && '/' == 47 && '0' == 48 && '9' == 57 && ':' == 58 && ';' == 59
> + && '<' == 60 && '=' == 61 && '>' == 62 && '?' == 63 && 'A' == 65
> + && 'Z' == 90 && '[' == 91 && '\\' == 92 && ']' == 93 && '^' == 94
> + && '_' == 95 && 'a' == 97 && 'z' == 122 && '{' == 123 && '|' == 124
> + && '}' == 125 && '~' == 126)
> + };
What a mouthful! Is all that really necessary?
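To make the question concrete, here is a much shorter compile-time check in the same style. It is only a sketch, not the patch's code: the names demo_ascii_compatible and demo_native_is_ascii are made up for illustration, and it tests far fewer characters than the patch does, so it is a weaker test. It would still reject EBCDIC, where the letters are not contiguous.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced variant of the patch's native_c_charset test.
   It only checks the endpoints of the digit and letter ranges, so it
   accepts ASCII-compatible charsets but says nothing about punctuation.  */
enum { demo_ascii_compatible =
       ('0' == 48 && '9' == 57
        && 'A' == 65 && 'Z' == 90
        && 'a' == 97 && 'z' == 122)
};

/* Hypothetical accessor: the constant is folded at compile time, so
   this costs nothing at run time.  */
static bool
demo_native_is_ascii (void)
{
  return demo_ascii_compatible;
}
```

Whether dropping the punctuation tests is safe is exactly what I am asking about.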
> + if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I'd suggest parentheses around the bit with the bitwise operator,
both for readability and to match the rest of the code.
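To be clear, this is cosmetic: in C, & binds tighter than &&, so both spellings mean the same thing. A small sketch shows the two forms side by side. DEMO_CHAR_CLASSES is a made-up flag value for illustration (the real RE_CHAR_CLASSES comes from regex.h), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up flag bit for illustration only; not the real RE_CHAR_CLASSES.  */
#define DEMO_CHAR_CLASSES (1u << 2)

/* The form used in the patch: relies on & binding tighter than &&.  */
static bool
demo_unparenthesized (int c1, unsigned int syntax_bits)
{
  return c1 == ':' && syntax_bits & DEMO_CHAR_CLASSES;
}

/* The suggested form: same meaning, but the grouping is explicit.  */
static bool
demo_parenthesized (int c1, unsigned int syntax_bits)
{
  return c1 == ':' && (syntax_bits & DEMO_CHAR_CLASSES);
}

/* Check that the two forms agree for every combination of a matching
   or non-matching character and a set or clear flag.  */
static void
demo_check_equivalence (void)
{
  static int const chars[] = { ':', '.' };
  static unsigned int const bits[] = { 0, DEMO_CHAR_CLASSES };
  for (int i = 0; i < 2; i++)
    for (int j = 0; j < 2; j++)
      assert (demo_unparenthesized (chars[i], bits[j])
              == demo_parenthesized (chars[i], bits[j]));
}
```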
> @@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
> /* Fetch bracket. */
> FETCH_WC (c, wc, _("unbalanced ["));
> if (c1 == ':')
> - /* build character class. */
> + /* Build character class. POSIX allows character
> + classes to match multicharacter collating elements,
> + but the regex code does not support that, so do not
> + worry about that possibility. */
I thought GLIBC did support them?
I will try this out in gawk sometime in the next few days and
let you know how it goes.
Thanks for the work!
Arnold