bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: dfa.h / dfa.c diff versus gawk attached


From: Aharon Robbins
Subject: Re: dfa.h / dfa.c diff versus gawk attached
Date: Sun, 21 Oct 2007 23:55:33 +0200

Tony,

> Does this mean that grep uses dfa if it can?

It should. If it doesn't there's no reason for it to be there, right?  (:-)

> This question has been raised a few times. When does grep use dfa and when
> regex? Can anyone on the list shed some light?

Logically, it should be using dfa only when it needs to answer a "does it
match?" question, such as the standard case of printing matching lines.
It should fall back to regex when it needs something dfa can't do
(like backrefs) or when it needs "where does it match?" answers (e.g.,
highlighting the matching text).

What the code _actually_ does I dunno --- I have my hands full enough
with gawk, and the grep code has bloated, er, _evolved_ (yeah, that's
the ticket :-) over the years, so it's gotten harder to read.  You might
want to find some of the earlier, less, er, featureful, versions to try
to discern the original logic.

> > Up to but not including grep 2.5, the DFA matcher was able to match across
> > newline boundaries, if one handed it a string with embedded newlines.
>
> Is there a way to pass grep a pattern with an embedded new lines from the
> command line? Or do I really have to write some code that uses the dfa
> directly to test this?

You can always use

        grep 'some
        text' file1 ...

to embed newlines. But you'll have to check the syntax bits to see how dfa
and regex will react to the newlines.  (Syntax bits: _that_ is yet another
mind trip.  Recreational drugs are helpful for recovering from the
experience.)

> > The mainline code that invokes dfaexec will have to change. You can see
> > in grep 2.4.x how it used to be done.
>
> Ouch. In my opinion, this makes it all the more crucial that we have some
> test cases in place. I don't feel comfortable committing code that I don't
> understand and can't verify. However, I am willing to work on rectifying
> the situation.

I commend you. I wish I could help more.  In college, maybe, I understood
the dfa theory.  I don't anymore, although any edition of the Dragon Book
would be a good place to start.

> > No discussion, sorry, it just had to be done.  The multibyte character
> > patches can probably be found individually in the bug-gnu-utils list if
> > you search back far enough.
>
> Ouch again. I probably wouldn't be able to recognize it even if I found
> it.

They're often against either grep or gawk...

> Some idea of how we could proceed (in no particular order):
>
> 1) Assume that if it works for gawk then it's good enough for grep. Apply
> the patch and make the necessary changes to make it work in grep.

I'm all for this, as long as it doesn't break any of your regression tests. :-)

> 2) You break the patch into multiple functional units. I (and others
> on the list) try to verify the sub-patches by inspection, and maybe create
> some test cases if things start to make sense. We commit sub-patches that
> have been reviewed and accepted.

I can, MAYBE, split it out into 2 patches: restore-newline-matching
and cumultaive-bug-fixes, but I don't have the energy for any more than
that.

> dfa->broken is used only with ifdef GAWK. I am wondering what implications
> this may have for grep? I.e., is there a case where grep may also want to
> consider this "broken" case?

I don't mind that bit becoming mainline (no #ifdef GAWK), but I didn't feel
that it was my decision to make.

> There are no comments in dfa.c documenting
> what is wrong with the DFA that triggers dfa->broken to be set.

I thought I added a comment. If not, the code is clear enough. A pattern
of the form    ab{0}c   is treated by dfa as   ab{1}c   and incorrectly
does not match "ac".  Regex gets this right.  This is the only case I've
found so far where dfa is broken, but if I find others, at least I now
have a way to signal the caller.

This is very much a band-aid.  The Right Thing is to fix the parser that
builds the dfa; I have tried, but I have not been successful.

> We should probably put a comment in dfa.[ch] that all patches should be
> sent to grep to keep things in sync. A similar note would go in regex.[ch]
> to send patches to gnulib/libc.

Fine with me. (See my other note about regex coming from libc and not gnulib,
at least for gawk.)

Thanks,

Arnold




reply via email to

[Prev in Thread] Current Thread [Next in Thread]