bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] fall back to glibc matcher if a MBCSET is found


From: Jim Meyering
Subject: Re: [PATCH] fall back to glibc matcher if a MBCSET is found
Date: Mon, 13 Sep 2010 11:20:38 +0200

Paolo Bonzini wrote:
> On 09/12/2010 02:55 PM, Jim Meyering wrote:
>>> Regarding the failing test-case, the lack of equivalence class support
>>> is in glibc.  Should I commit the patch without the testcase?
>>
>> ??  Keep the test, of course.
>> It's alerting us to a legitimate problem.
>> We don't want to sweep that under the rug.
>
> I thought it's was just an incomplete locale definition, which would
> be legitimate and not really a bug per se.
>
> However, this program passes for me with glibc-2.5-42.el5:
>
> #include <regex.h>
> #include <locale.h>
> #include <stdio.h>
> int main()
> {
>         regex_t reg;
>
>         setlocale (LC_ALL, "en_US.UTF-8");
>         regcomp (&reg, "[[=a=]]", REG_NOSUB);
>         if (regexec(&reg, "\xc3\xa0", 0, NULL, 0) == REG_NOMATCH)
>                 printf ("Equivalence classes fail\n");
>         else
>                 printf ("Equivalence classes work\n");
> }
>
> Can you test it as well on your system?

Sure, it passes there, too.

>> This is a strong argument against using --without-included-regex,
>> unless you know you're using a new-enough glibc.
>
> Why?  Note that the test is unconditionally XFAILed unless you use
> --without-included-regex.

Using --without-included-regex is risky, because it tells grep
to use the libc-supplied regex functions, no matter how many
flaws they have -- flaws that would be avoided if it *did* use
the package-provided version.

>> Your NEWS addition recommends --without-included-regex unconditionally.
>
> I committed now my gnulib patch to m4/regex.m4, which required editing
> the NEWS entry anyway.  See the attached updated 3-patch series.

I was a little surprised to see you push that gnulib change, seeing as
how it is contrary to POSIX, and also goes against our philosophy
of avoiding arbitrary limits, but do see why you wanted it.

>> Finally, since this change induces a performance penalty on at
>> least one system, please adjust or remove this comment in the commit log.
>
> It's a 3-4% performance penalty on one system, versus a 30% gain on
> more recent ones, so I'll make the performance comment less marked but
> I still think it should be there.  Again, see the attachment.
...
> Subject: [PATCH 2/3] dfa: fall back to glibc matcher if a MBCSET is found
>
> This patch enables full support of equivalence classes and multicharacter
> collation symbols.  It can also improve performance problems in some
> cases for multibyte grep.  Both these changes however depend on the

s/Both/Both of/

> glibc version installed in the system.
>
> For UTF-8 it should trigger only in the presence of MBCSET, e.g. [a-z].
> For other character sets all brackets and `.` as well will trigger it.

With the above tweak, you're welcome to push the series.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]