Re: Plan for grep [bug-grep]

From: Charles Levert
Subject: Re: Plan for grep [bug-grep]
Date: Tue, 8 Mar 2005 09:47:26 -0500
User-agent: Mutt/1.4.1i

* On Tuesday 2005-03-08 at 13:13:07 +0000, Tim Waugh wrote:
> On Tue, Mar 08, 2005 at 05:38:36AM -0500, Charles Levert wrote:
> > BTW, is the assumption (in the current code)
> > that any two corresponding uppercase and
> > lowercase Unicode code points have the same
> > UTF-8 octet length (or 8-bit code unit length)
> > always a safe (secure) one?
> Where do you see that assumption?  Is that assumption also in the
> Fedora Core patched grep?

In main() in src/grep.c, it seems to me that
the keys variable is being towlower()ed under
that assumption, as there is only a single i
loop index variable.

In check_multibyte_string() in src/search.c,
same thing for the buf variable.

Please confirm or deny.

There may be other places; I didn't perform an
exhaustive search of the code.  Note that under
a normal "en_US.UTF-8" locale definition, that
assumption is reasonable, hence the formulation
of my original question.
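
One way to probe the assumption is to compare the UTF-8 length of a
code point with that of its lowercase form. The following is only a
sketch: utf8_len() and lowercase_keeps_utf8_len() are hypothetical
helpers, not functions from grep, and the second one assumes that
wchar_t values are Unicode code points (as on glibc). Unicode's own
case data does contain pairs of unequal length, for example U+212A
KELVIN SIGN (three octets) lowercasing to U+006B (one octet); whether
a given locale's towlower() applies such a mapping is not assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wctype.h>

/* Number of octets UTF-8 needs to encode one Unicode code point. */
static size_t utf8_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Nonzero if lower-casing wc in the current locale keeps its UTF-8
   length.  ASSUMPTION: wchar_t holds Unicode code points. */
static int lowercase_keeps_utf8_len(wint_t wc)
{
    return utf8_len((uint32_t)wc) == utf8_len((uint32_t)towlower(wc));
}
```

Running lowercase_keeps_utf8_len() over every code point after a
setlocale(LC_ALL, "") call would list the exceptions, if any, for the
locale in use. An in-place lowering loop with a single index is only
safe when that list is empty.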

> > Since performance is an issue, measuring it could
> > be included in testing, as well as reporting
> > serious discrepancies between the results of
> > identical tests being performed under various
> > different locales.
> As part of 'make check'?  If that's what you mean, better make sure
> not to use wall-clock time to measure against but 'user' as reported
> by time(1)!

Well obviously!  That or user+system, although
system is much less relevant here.

Also, the bash time builtin and /usr/bin/time
don't seem to share the same output syntax,
so we have to be careful with that.
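
A test driver could avoid parsing either output format by reading the
child's user CPU time directly with getrusage(2). This is a minimal
sketch under that assumption; child_user_seconds() is a hypothetical
helper, not part of grep's test suite.

```c
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv as a child process and return the user CPU seconds it
   consumed, or -1.0 on failure.  Wall-clock time is never consulted. */
static double child_user_seconds(char *const argv[])
{
    struct rusage before, after;
    int status;
    pid_t pid;

    /* RUSAGE_CHILDREN is cumulative over all waited-for children,
       so take a snapshot before starting this one. */
    if (getrusage(RUSAGE_CHILDREN, &before) != 0)
        return -1.0;

    pid = fork();
    if (pid < 0)
        return -1.0;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1.0;
    if (getrusage(RUSAGE_CHILDREN, &after) != 0)
        return -1.0;

    return (double)(after.ru_utime.tv_sec - before.ru_utime.tv_sec)
         + (double)(after.ru_utime.tv_usec - before.ru_utime.tv_usec) / 1e6;
}
```

Using ru_stime in the same way would give the system component, should
user+system be wanted instead.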

> > The only danger I see in waiting to do this is
> > that there seem to have been improvements in
> > UTF-8 handling by glibc's regex code.  Maybe all
> > the -i kludges are not even needed anymore.
> > Maybe there are also performance issues (either
> > way) with this.
> > 
> > That's why I previously stated that I saw doing
> > this as a priority:  other items are affected.
> The undeniable improvements in the glibc regex code are very useful --
> however, the current (unpatched) grep multibyte handling is flawed in
> many more ways than you might guess, and *that* is the thing to fix
> first when doing performance testing.  See
> grep-2.5.1-egf-speedup.patch.

