bug-grep

bug#22655: grep -Pz '^' now fails!


From: Stephane Chazelas
Subject: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 18:07:51 +0000
User-agent: Mutt/1.5.21 (2010-09-15)

2016-11-18 09:47:50 -0800, Paul Eggert:
> Stephane Chazelas wrote:
> >$ time grep -Pz '(?-m)^/' ~/a > /dev/null
> 
> It looks like you want "^" to stand for a newline character, not the
> start of a line. That is not how grep -z works. -z causes the null
> byte to be the line delimiter, and "^" should stand for a position
> immediately after a null byte (or at start of file).
[...]

No, sorry if I wasn't very clear, that's the other way round and
it's the whole point of this discussion.

grep had a bug in that it was calling pcre_exec on the content
of each NUL-delimited record with a regex compiled with
PCRE_MULTILINE.

That caused

printf 'a\nb\0' | grep -zP '^b'

to match even though the record doesn't start with a "b".
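To illustrate the semantics at play (not grep's actual code), here is a
minimal sketch using Python's re module as a stand-in for PCRE. The
assumption is that re.MULTILINE treats ^ the way PCRE_MULTILINE does for
this simple case; the variable names are made up for the example.

```python
import re

# The single record that printf 'a\nb\0' produces, without the NUL delimiter.
record = "a\nb"

# With the multiline flag, ^ also matches right after the embedded newline.
multiline_match = re.search(r"^b", record, re.MULTILINE)

# Without it, ^ only matches at the very start of the record.
plain_match = re.search(r"^b", record)

print(multiline_match is not None)  # True: "b" is found after the newline
print(plain_match is not None)      # False: the record starts with "a"
```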

To work around it, you have to disable the PCRE_MULTILINE flag
in the regexp syntax with the (?-m) PCRE operator, or use \A
instead of ^.
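Both workarounds can be sketched the same way, again with Python's re
module standing in for PCRE (an assumption, not what grep runs). As far
as I know Python only accepts the scoped form (?-m:...) rather than a
bare (?-m), so the scoped form is used here as the closest equivalent.

```python
import re

record = "a\nb"  # one NUL-delimited record containing an embedded newline

# Compiled with the multiline flag forced on, as grep was doing.
buggy = re.compile(r"^b", re.MULTILINE)

# Workaround 1: \A always means "start of subject", whatever the flags are.
anchored = re.compile(r"\Ab", re.MULTILINE)

# Workaround 2: switch the m flag off again for the ^ anchor.
scoped = re.compile(r"(?-m:^)b", re.MULTILINE)

print(buggy.search(record) is not None)     # True: the unwanted match
print(anchored.search(record) is not None)  # False
print(scoped.search(record) is not None)    # False
```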

The problem was /fixed/ (and I'm arguing here it's the wrong fix),
by disallowing ^ with -Pz while the obvious fix is to remove
that PCRE_MULTILINE flag.

As it turns out PCRE_MULTILINE is there because in the old days,
before grep -Pz was supported, with grep -P (without -z), grep
would pass more than one line to pcre_exec. If you look at the
grep bug history, 90% of the grep pcre related bugs were caused
by that.

It was fixed/changed in
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=a14685c2833f7c28a427fecfaf146e0a861d94ba
but Paolo forgot to remove the PCRE_MULTILINE flag when the code
was changed to pass one line at a time to pcre_exec, at which
point PCRE_MULTILINE was no longer needed (and it later caused
problems once grep -Pz was supported).

> It might be nice to have a syntax for matching a newline byte with
> -z (or a null byte without -z, for that matter). But that would be a
> new feature.

That feature is already there. That's the (?m) PCRE operator.

That's the whole point. That m flag (PCRE_MULTILINE) is on by
default in GNU grep, and that's what is causing all the
problems.

Once you turn it off *by default*, ^ matches the beginning of
the NUL-delimited record, as it should, and one can use (?m) if
one wants ^ to match the beginning of each line in the
NUL-delimited record instead of just the beginning of the
record.
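A rough sketch of that proposed behaviour, written in Python rather
than taken from grep's source: matching_records is a hypothetical helper
that splits the input on NUL and searches each record with the multiline
flag off unless the pattern itself asks for it with (?m).

```python
import re

def matching_records(pattern, data):
    """Return the NUL-delimited records of data in which pattern matches."""
    records = data.split("\0")
    if records and records[-1] == "":
        records.pop()  # drop the empty piece after the final NUL
    regex = re.compile(pattern)  # multiline flag is NOT set by default
    return [r for r in records if regex.search(r)]

data = "a\nb\0b\nc\0"  # two records: "a\nb" and "b\nc"

# Default: ^ means start of record, so only the second record matches.
default_hits = matching_records(r"^b", data)

# Opt-in: (?m) makes ^ match at the start of every line within a record.
opt_in_hits = matching_records(r"(?m)^b", data)

print(default_hits)  # ['b\nc']
print(opt_in_hits)   # ['a\nb', 'b\nc']
```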

-- 
Stephane




