bug-grep

Re: dfa.h / dfa.c diff versus gawk attached


From: Aharon Robbins
Subject: Re: dfa.h / dfa.c diff versus gawk attached
Date: Sat, 20 Oct 2007 22:21:53 +0200

Hi Tony.

> Date: Fri, 19 Oct 2007 00:02:59 -0300 (ADT)
> From: Tony Abou-Assaleh <address@hidden>
> Subject: Re: dfa.h / dfa.c diff versus gawk attached
> To: Aharon Robbins <address@hidden>
> Cc: address@hidden
>
> Hi Arnold,
>
> > Attached is a diff of the grep 2.5.3 dfa.h and dfa.c against the current
> > version of same in the gawk CVS. (Or, it'll be in CVS within an hour or
> > so. :-)
>
> Thanks for the patch. BTW, I still don't see it committed on Savannah
> gawk CVS.

Most of the changes in that patch are years old and have been in gawk's
code for a long time.  The only changes that were put into CVS are
minor cosmetic ones noted here:

Tue Aug 21 17:47:07 2007  Arnold D. Robbins  <address@hidden>

        * main.c (copyleft): Cite version 3 of the license.
        * dfa.c: Minor edits to sync with grep 2.5.3.

This had mainly to do with GPL 3 and the format of the FSF's address.

I had submitted largely the same patch years ago; note the entry in
the TODO file in the grep source code. :-)

> > The changes fall into two categories: bug fixes, mostly having to do
> > with multibyte character sets,
>
> Can you point out or submit some test cases to demonstrate the bug and the
> fix? Including them in the test set would reduce the chances of these
> bugs resurfacing uncaught.

Sorry. There are, I'm sure, tests in the gawk test suite, but I don't
remember which tests go with which fixes.  If you or someone else is
up to working their way back through the gawk ChangeLogs, the dates in the
main ChangeLog can be correlated to the dates of new tests being added
to the test suite.

> > and reviving the DFA matcher's ability
> > to match across newlines, which grep doesn't need but which gawk does.
> > This latter changes the interface to dfaexec.
>
> Could you elaborate on this a bit? Grep already matches across newlines
> with the -z option. If the current implementation is buggy, some test
> cases would be appreciated.

I am not familiar enough with the grep code to really answer this. Here
is the history.  GNU grep has both a fast DFA matcher and a slower regex
matcher.  The DFA matcher cannot match some things (like "\(foo\)bar\1"),
so it needs both.  Gawk for many years has used the DFA matcher for
"does it match" kinds of things, falling back to regex for "where does
it match" operations.
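
A minimal sketch of the two kinds of query described above, written
against the POSIX <regex.h> interface rather than dfa.c or gawk's actual
code. The helper names does_match and match_start are hypothetical. The
pattern is the backreference example from the text, which a pure DFA
matcher cannot handle and which therefore has to go to the regex matcher.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* "Does it match?": a yes/no answer only. REG_NOSUB tells the
   regex library that no match offsets are wanted. */
static int does_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_NOSUB);  /* POSIX basic regex syntax */
    assert(rc == 0);                            /* sketch: treat a bad pattern as fatal */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* "Where does it match?": returns the start offset of the first
   match, or -1 if there is none. */
static long match_start(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    int rc = regcomp(&re, pattern, 0);
    assert(rc == 0);
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    return rc == 0 ? (long)m[0].rm_so : -1;
}
```

For example, does_match("\\(foo\\)bar\\1", "xxfoobarfoo") is true, while
"foobarbaz" does not match because the backreference \1 must repeat "foo".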

Up to but not including grep 2.5, the DFA matcher was able to match across
newline boundaries, if one handed it a string with embedded newlines. This
is critical for gawk.  At 2.5, someone "simplified" things by removing
this capability, since grep normally matches only within single lines.
I very carefully restored the code from the 2.4.x version to do this so
that I could continue using the DFA matcher.  However, I don't really
understand the DFA matcher; I worked mostly by careful pattern matching
of old vs. new code.
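
To make "matching across newline boundaries" concrete, here is a small
sketch using POSIX <regex.h>, not the dfaexec interface itself. It assumes
only the standard REG_NEWLINE flag; the helper name matches is
hypothetical. Without REG_NEWLINE the whole buffer is one string and "."
can match an embedded newline (the behaviour gawk needs); with it, the
matcher treats each line separately (the line-oriented view grep normally
takes).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if pattern matches somewhere in text, else 0.
   line_oriented != 0 compiles with REG_NEWLINE, so "." stops
   matching '\n' and "^"/"$" also anchor at line boundaries. */
static int matches(const char *pattern, const char *text, int line_oriented)
{
    regex_t re;
    int flags = REG_NOSUB | (line_oriented ? REG_NEWLINE : 0);
    int rc = regcomp(&re, pattern, flags);
    assert(rc == 0);                 /* sketch: treat a bad pattern as fatal */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the text "one\ntwo", the pattern "one.two" matches only when
line_oriented is 0, and "^two" matches only when it is nonzero.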

> Does the change to the dfaexec interface have any side effects or
> implications for grep that are not taken care of by the patch?

The mainline code that invokes dfaexec will have to change. You can see
in grep 2.4.x how it used to be done.

> I am not familiar with the code in dfa.[ch] and the patch looks
> non-trivial. If there was some discussion about the patch on the gawk or
> other lists, please point me in that direction.

No discussion, sorry, it just had to be done.  The multibyte character
patches can probably be found individually in the bug-gnu-utils list if
you search back far enough.

It'd be nice if grep could go back to being the canonical source for
dfa.h and dfa.c, just as GLIBC is the canonical source for regex*.[hc].

Thanks,

Arnold



