Re: [bug #37737] Contained subexpressions don't follow POSIX

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug #37737] Contained subexpressions don't follow POSIX

From:	Eric Blake
Subject:	Re: [bug #37737] Contained subexpressions don't follow POSIX
Date:	Wed, 14 Nov 2012 15:11:57 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121029 Thunderbird/16.0.2

[adding gnulib]

On 11/14/2012 02:26 PM, РоманДонченко wrote:
> Here's what POSIX says
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html>:
> 
> 9.3.6 BREs Matching Multiple Characters
> 
> <...> The string matched by a contained subexpression shall be within the
> string matched by the containing subexpression. If the containing
> subexpression does not match, or if there is no match for the contained
> subexpression within the string matched by the containing subexpression, then
> back-reference expressions corresponding to the contained subexpression shall
> not match. <...> For example, <...> the expression "\(a\(b\)*\)*\2" fails to
> match 'abab' <...> .

Thanks for the report.  Yep, I agree you've found a bug in glibc's regex
engine.  Other useful context from that POSIX page:

If the pattern permits a variable number of matching characters and thus
there is more than one such sequence starting at that point, the longest
such sequence is matched...

Consistent with the whole match being the longest of the leftmost
matches, each subpattern, from left to right, shall match the longest
possible string. For this purpose, a null string shall be considered to
be longer than no match at all. For example, matching the BRE "\(.*\).*"
against "abcdef" , the subexpression "(\1)" is "abcdef" , and matching
the BRE "\(a*\)*" against "bc" , the subexpression "(\1)" is the null
string...

A subexpression repeated by an <asterisk> ( '*' ) or an interval
expression shall not match a null expression unless this is the only
match for the repetition or it is necessary to satisfy the exact or
minimum number of occurrences for the interval expression.

My read of POSIX is that given the string 'abab', if you let
'\(a\(b\)*\)*' match 'abab', then \2 would also have to match 'b', which
is not present at that point in the string.  Trying successively shorter
options, if you try to match 'aba', then the second occurrence of the
containing subexpression \(a\(b\)*\) has no matches of the contained
subexpression \(b\), so \2 must fail to match.  If you try to match
'ab', then \2 would still have to match 'b'.  If you let \(b\)* match
the null string, then \(a\(b\)*\)* matches 'a'; but in this case, there
is no match for the contained subexpression '\(b\)' within the
containing subexpression, so any attempt to use \2 in that scenario must
fail.  Finally, if you let '\(a\(b\)*\)*' match the null string, then
you again have a case where \(b\) was not matched, so \2 again fails.
So POSIX is right that 'abab' cannot match, no matter how you slice the
string; and therefore grep has a bug.  More precisely, glibc has a bug;
you can trigger the same bug via other programs, such as m4:

$ printf %s 'regexp(abab, \(a\(b\)*\)*\2, \&-\1-\2)' | m4
abab-a-b

which shows that glibc is treating \2 as the successful backreference to
'b' from the FIRST match of a\(b\)*, even though POSIX requires it to
refer to the second match where \(b\) did not match.

For reference, on Cygwin (where m4 uses gnulib's regex engine, rather
than glibc's) I get correct behavior:

$ printf %s 'regexp(abab, \(a\(b\)*\)*\2, \&-\1-\2)' | m4
-1

> And yet...
> 
> $ src/grep '\(a\(b\)*\)*\2' <<< 'abab'
> abab
> 
> I'm guessing that it assumes that \1 covers character #3, but \2 covers
> character #2, which is against the spec.

Yep, your analysis is matching mine.

We should get glibc to fix their bug; but we should also get gnulib to
work around the bug.

-- 
Eric Blake   address@hidden    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

[Prev in Thread]

Current Thread

[Next in Thread]

[bug #37737] Contained subexpressions don't follow POSIX, РоманДонченко, 2012/11/14
- [bug #37737] Contained subexpressions don't follow POSIX, РоманДонченко, 2012/11/14
  - [bug #37737] Contained subexpressions don't follow POSIX, РоманДонченко, 2012/11/15
- Re: [bug #37737] Contained subexpressions don't follow POSIX, Eric Blake <=
  - Re: [bug #37737] Contained subexpressions don't follow POSIX, Роман Донченко, 2012/11/14
    - Re: [bug #37737] Contained subexpressions don't follow POSIX, Eric Blake, 2012/11/14

Prev by Date: [bug #37737] Contained subexpressions don't follow POSIX
Next by Date: Re: [bug #37737] Contained subexpressions don't follow POSIX
Previous by thread: [bug #37737] Contained subexpressions don't follow POSIX
Next by thread: Re: [bug #37737] Contained subexpressions don't follow POSIX
Index(es):
- Date
- Thread