Re: Plan for grep [bug-grep]

From: Charles Levert
Subject: Re: Plan for grep [bug-grep]
Date: Tue, 8 Mar 2005 05:38:36 -0500
User-agent: Mutt/1.4.1i

* On Tuesday 2005-03-08 at 09:07:41 +0100, Stepan Kasal wrote:
> Hello,
> On Mon, Mar 07, 2005 at 02:09:54AM -0500, Charles Levert wrote:
> > Stepan:  It would be nice to devise a summary roadmap
> > including an identification of things that should go in a
> > 2.5.2 milestone release and of other things that should go
> > in a 2.6.0 milestone release (or any in between).
> > Maybe bug fixes vs. new functionality, enhanced performance,
> > or heavy refactoring.
> yes, I should outline that.
> I made one mistake: I wanted to react promptly to newcomers, like
> you and Claudio, in order to give a sense that the grep community
> is alive.  In fact, I should rather spend my time doing things which
> I promised long ago.

I'd say the number one pitfall to avoid is lack
of communication.  Things can be late; I have
no problem with the "when it's done" approach.

The plan presented here can change.  That would
be fine as long as everyone is promptly kept
informed.

Private email discussion can be had.  That's also
fine as long as any decisions that are its
outcome are made public.

There may be other newcomers who will show up
and be eager to help and they should quickly
be able to find out what the current focus is,
so I propose this.

I can prepare a new current-developments web
page, to be linked prominently from both web
sites (we'll figure out how for Savannah),
that explains and documents all this and that
is kept up to date.

> I'm afraid that most of the patches currently on savannah will have
> to wait some time.  I apologize to you, who invested your effort to
> develop them.  Details below.

That's all right as long as my expectations
are set accordingly.  There is also no point
in splitting and finessing patches (or asking
their author to do it) right now if the very code
they modify is to undergo a major revamping or
synching first.

Consequently, we should only be discussing
ideas, design, and algorithms, but not specific
implementation details, for any feature that is
not in the immediate future on the roadmap.

> 2.5.2
> =====
> Our main goal for grep 2.5.2 is to get sane performance with utf-8.
> That can be achieved by the patches written by Tim Waugh for Red Hat.

I have good knowledge of UTF-8, including how
to do proper validation without necessarily
decoding, how to distinguish between and
report the various precise incarnations of
ill-formed input (and help the user in doing
so), the difference between CEF and CES, BOM
and signature, other encodings including the
bastard CESU-8 one, etc.
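
To make "validation without decoding" concrete, here is a
minimal sketch in Python (not grep's C code).  It only
compares octets against the well-formed ranges and never
builds a code point.  The function name and the wording of
the diagnostics are my own assumptions, not anything in the
current sources.

```python
def first_utf8_error(data: bytes):
    """Return (offset, reason) for the first ill-formed sequence, or None.

    Hypothetical helper: only octet ranges are tested, no code point
    is ever assembled.
    """
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII, single octet
            i += 1
            continue
        if lead <= 0xBF:
            return i, "unexpected continuation byte"
        if lead <= 0xC1:                     # C0 and C1 can only start overlong forms
            return i, "overlong encoding"
        if lead > 0xF4:
            return i, "invalid lead byte"
        # Number of continuation octets, and the range allowed for the first one.
        tail = 1 if lead <= 0xDF else 2 if lead <= 0xEF else 3
        lo, hi, reason = 0x80, 0xBF, ""
        if lead == 0xE0:
            lo, reason = 0xA0, "overlong encoding"
        elif lead == 0xED:
            hi, reason = 0x9F, "surrogate code point"
        elif lead == 0xF0:
            lo, reason = 0x90, "overlong encoding"
        elif lead == 0xF4:
            hi, reason = 0x8F, "code point above U+10FFFF"
        for k in range(1, tail + 1):
            if i + k >= n:
                return i, "truncated sequence"
            b = data[i + k]
            if not 0x80 <= b <= 0xBF:
                return i, "missing continuation byte"
            if k == 1 and not lo <= b <= hi:
                return i, reason
        i += tail + 1
    return None
```

The "surrogate code point" case is the one that CESU-8 input
would trigger, since it encodes each half of a surrogate
pair as its own three-octet sequence.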

BTW, is the assumption (in the current code)
that any two corresponding uppercase and
lowercase Unicode code points have the same
UTF-8 octet length (or 8-bit code unit length)
always a safe (secure) one?
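
One way to check this empirically is to scan every code
point and compare octet lengths before and after case
conversion.  The sketch below is Python and relies on the
case mappings of whatever Unicode version the interpreter
bundles, which is an assumption on my part: grep itself
would go through the C library's towlower()/towupper(), and
those tables may differ.  The function name is hypothetical.

```python
def case_length_mismatches():
    """List (code point, char, mapped) where case conversion changes
    the UTF-8 octet length.  Uses Python's str.lower()/str.upper(),
    not the C library tables that grep would consult."""
    mismatches = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:           # surrogates cannot be encoded
            continue
        ch = chr(cp)
        size = len(ch.encode("utf-8"))
        for mapped in (ch.lower(), ch.upper()):
            if mapped != ch and len(mapped.encode("utf-8")) != size:
                mismatches.append((cp, ch, mapped))
                break
    return mismatches
```

At least two counterexamples exist: KELVIN SIGN (U+212A,
three octets) lowercases to "k" (one octet), and dotless i
(U+0131, two octets) uppercases to "I" (one octet).  So the
assumption does not hold in general.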

> Besides that, I can do some changes in the infrastructure, so that
> I can "breathe":
> 1) rewrite the configure.in script, perhaps also Makefile.am
> 2) set up for gnulib-tool --import
> 3) improve the test infrastructure
> I'm afraid I have to do 1) myself, and it is closely tied with 2),
> so they probably have to be done together.
> If someone likes awk and wanted to help with 3), it could help.
> In short, there should be only one awk script for .test-->.script
> rule.  The header of each .test file should state some details,
> like which command to run, e.g. "grep -E".  We also have to invent
> a way to collect the test cases for non-C locales; either by
> running the whole set twice, or by creating separate .test files.
> The "make check" goal should run this, if the computer has a locale
> like en_US.utf8 installed.

Since performance is an issue, measuring it could
be included in testing, as well as reporting
serious discrepancies between the results of
identical tests being performed under various
different locales.  I'd say the same tests should
be run at least for the C locale, an 8-bit-code
one, and a UTF-8 one.
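
As a rough sketch of what I mean (in Python rather than awk
or shell, purely for illustration), the harness below runs
the same command under several locales by setting LC_ALL,
records exit status, output, and wall-clock time, and then
reports the locales that disagree with the C baseline or
are much slower.  The function names and the slowdown
factor are assumptions, and wall-clock time is only a crude
measure.

```python
import os
import subprocess
import time


def run_under_locales(cmd, input_bytes, locales):
    """Run cmd once per locale; map locale -> (status, stdout, seconds)."""
    results = {}
    for loc in locales:
        env = dict(os.environ, LC_ALL=loc)   # LC_ALL overrides all LC_* settings
        start = time.perf_counter()
        proc = subprocess.run(cmd, input=input_bytes, env=env,
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        elapsed = time.perf_counter() - start
        results[loc] = (proc.returncode, proc.stdout, elapsed)
    return results


def discrepancies(results, baseline="C"):
    """Locales whose exit status or output differs from the baseline."""
    expected = results[baseline][:2]
    return [loc for loc, r in results.items() if r[:2] != expected]


def slowdowns(results, baseline="C", factor=10.0):
    """Locales that took more than `factor` times as long as the baseline."""
    limit = factor * results[baseline][2]
    return [loc for loc, r in results.items() if r[2] > limit]
```

A call could look like run_under_locales(["grep", "-i",
"pattern"], data, ["C", "en_US.ISO-8859-1", "en_US.UTF-8"]),
where the last two locale names are placeholders that
depend on what the machine has installed.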

> After completing these, we can:
> 4) check in the patches for the sync of dfa.c with GNU awk

So I will wait before spending any more time
on adapting my re-entrant code for this to the
CVS version.

> 5) other small patches which wait for a test case
> 6) process the RedHat patches
> After 6), I should repeat Tim's measurements and see whether the utf8
> performance improved.
> Independently, I'd like to see
> 7) some _minimal_ cleanup of the grep(), grepdir(), recursion
>    (the "main loop") and fix --directories=read
> 8) mark the -P option clearly as "experimental";
> Well, that'll be perhaps enough for a release.
> 2.5.3
> =====
> Fix the combinations:
>  * -i -o
>  * --colour -i
>  * -o -b
>  * -o and zero-width matches
> Go through the bug list in my mailbox and fix what is fixable.
> Fix bugs reported with 2.5.2.
> 2.6.x
> =====
> The following should go here:
>  - upgrade to current regex.c from glibc,

The only danger I see in waiting to do this is
that there seem to have been improvements in
UTF-8 handling by glibc's regex code.  Maybe all
the -i kludges are not even needed anymore.
Maybe there are also performance issues (either
way) with this.

That's why I previously stated that I saw doing
this as a priority:  other items are affected.

>  - new functionality,
>  - fixes for -P,
>  - heavy refactoring.
> OK, we have a plan.  I'm afraid I should invest my time in these points
> rather than in trying to be a good netizen and answer mails.

