bug-grep

[Fwd: [^\]] in basic regexes]


From: Wacek Kusnierczyk
Subject: [Fwd: [^\]] in basic regexes]
Date: Sat, 14 Feb 2009 09:56:58 +0100
User-agent: Thunderbird 2.0.0.19 (X11/20090105)

yesterday i posted the below, but it seems not to have gone through the
system.  i have just registered, and maybe the post was rejected -- just
in case, i resend it here, with further examples.

-------- Original Message --------
Subject:        [^\]] in basic regexes
Date:   Fri, 13 Feb 2009 16:10:47 +0100
From:   Wacek Kusnierczyk <address@hidden>
To:     address@hidden



hello,

i observe a behaviour of grep that i am not sure is correct, possibly
due to my misunderstanding.

i've recently reviewed code written in some language where the intent was
to match a sequence of any number of non-']' characters.  the matching
was done with an underlying regex library, and i have tried the pattern
directly with grep.

with grep, the pattern '[^]]' matches one non-] character:

grep '[^]]' <<< '[\]'
# match
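
for reference, a minimal sketch of the original intent (runs of non-]
characters), assuming a POSIX shell and a grep that supports -o.  the
function name is made up for illustration, and the pattern is the
one-or-more variant '[^]][^]]*', so that empty matches do not come into play:

```shell
# hypothetical helper: print each run of non-] characters on its own line
non_bracket_runs() {
    # ] comes right after ^, so it is an ordinary member of the class
    printf '%s\n' "$1" | grep -o -e '[^]][^]]*'
}

non_bracket_runs 'ab]cd'
# ab
# cd
```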

however, in that code the pattern was '[^\]]*' (with the idea that the
character ']' is a metacharacter and therefore must be escaped). 
according to the docs i know, it is not necessary to escape ']' within a
character class when it's the first character there (as in '[]]'), since
it then is not considered meta;  but escaping it shouldn't be harmful.  it
happens that this pattern won't do:

grep '[^\]]' <<< '[\]'
# no match

this seems strange;  i'd read the pattern as 'one character that is not
]'.  clearly, the data has two such characters.  alternatively, the
pattern could be read as 'one character that is neither \ nor ]', but
this would require the backslash to be treated as a regular character
(not a meta):

grep '[\]' <<< '[\]'
# match
grep '[^\]' <<< '[\]'
# match
grep '[^\[]' <<< '[\]'
# match

in fact, the third pattern above yields exactly one match, so it is read
as 'one non-\ non-[' rather than as 'one non-[':

grep -o '[^\[]' <<< '[\]'
# ]
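
the same reading can be checked one character at a time.  the sketch
below is only an illustration: probe is a made-up helper, and it assumes
a POSIX shell and grep -x (whole-line match).  each candidate character
goes on its own line, and grep prints the ones the class accepts:

```shell
# hypothetical helper: list which of the characters [ \ ] ^ a pattern matches
probe() {
    # one character per line; -x requires the pattern to match the whole line
    printf '%s\n' '[' '\' ']' '^' | grep -x -e "$1"
}

probe '[^\[]'
# ]
# ^
probe '[\]'
# \
```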

so the 'one non-\ non-]' reading of '[^\]]' is not implausible;  then,
there would be one match, but there is none.

it actually appears that the pattern is read as 'one non-\ followed by
one ]':

grep -o '[^\]]' <<< '[]'
# []

that is, the first ] is not escaped (coherently with the case of
'[^\[]') but rather closes the character class, and the second
(unescaped!) ] does not close any class, but is taken literally! 
(should this not be an invalid regex, with an unmatched class-closing
bracket?)
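
a small sketch that is consistent with this guess, assuming a POSIX shell
and grep -x;  two_elements is a hypothetical helper name.  if '[^\]]' is
read as the class [^\] followed by a literal ], then only two-character
lines that end in ] and do not start with a backslash should match in full:

```shell
# hypothetical helper: which whole lines does '[^\]]' match?
two_elements() {
    # candidates: 'a]' and '[]' fit "non-backslash, then ]"; the others do not
    printf '%s\n' 'a]' '\]' ']' 'a' '[]' | grep -x -e '[^\]]'
}

two_elements
# a]
# []
```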

i haven't looked at the sources of grep, so these are plain guesses, but
is the behaviour of grep with '[^\]]' correct and intended, or is it a bug?

grep -V
# GNU grep 2.5.3

regards,
wacek



ps. here are some further experiments, which seem to indicate that grep gets 
confused with some combinations of [, ], ^, and \.

# [[] should match one opening bracket
grep -o '[[]' <<< '[^\]'
# [
# OK

# []] should match one closing bracket
grep -o '[]]' <<< '[^\]'
# ]
# OK

# [][] should match one bracket
grep -o '[][]' <<< '[^\]'
# [
# ]
# OK

# [[]] should match one bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[[]]' <<< '[^\]'
# WRONG (?) -- neither a match nor an error

# [\] should match one backslash
grep -o '[\]' <<< '[^\]'
# \
# OK

# [\[] should match one backslash or opening bracket
grep -o '[\[]' <<< '[^\]'
# [
# \
# OK

# [\]] should match one backslash or closing bracket 
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\]]' <<< '[^\]'
# \]
# WRONG (?) -- matches *two* characters

# [[^] should match one opening bracket or caret
grep -o '[[^]' <<< '[^\]'
# [
# ^
# OK

# [[^\] should match one opening bracket, caret, or backslash
grep -o '[[^\]' <<< '[^\]'
# [
# ^
# \
# OK

# [[^\]] should match one opening bracket, caret, backslash, or closing bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[[^\]]' <<< '[^\]'
# \]
# WRONG (?) -- matches *two* characters

# [\ ]] should match one backslash, space, or closing bracket 
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\ ]]' <<< '[^\]'
# \]
# WRONG (?) -- matches *two* characters

# [\ ] ] should match one backslash, space, or closing bracket 
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\ ] ]' <<< '[^\]'
# WRONG (?) -- neither a match nor an error
grep -o '[\ ] ]' <<< '[^\ ]'
# \ ]
# WRONG (?) -- matches *three* characters

# [\] ] should match one backslash, closing bracket, or space
# alternatively (preferred?), report invalid expression (unmatched second ])
grep -o '[\] ]' <<< '[^\]'
# WRONG (?) -- neither a match nor an error
grep -o '[\] ]' <<< '[^\ ]'
# \ ]
# WRONG (?) -- matches *three* characters

# [^] should report invalid regex (void ^ or unmatched [)
grep -o '[^]' <<< '[^\]'
# grep: Unmatched [ or [^
# OK

# [^]\] should match one character that is neither a closing bracket nor a backslash
# alternatively, report invalid regex (void ^)
grep -o '[^]\]' <<< '[^\]'
# [
# ^
# WRONG (?) -- matches *two* characters, seemingly inappropriately
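
to tell one two-character match from two one-character matches in the
outputs above, each match can be wrapped in markers.  a sketch only,
assuming a POSIX shell, grep -o, and sed;  show is a made-up helper:

```shell
# hypothetical helper: show PATTERN INPUT prints every -o match wrapped in <>
show() {
    # grep -o prints one match per line; sed wraps each line in < and >
    printf '%s\n' "$2" | grep -o -e "$1" | sed 's/.*/<&>/'
}

show '[\]]' '[^\]'
# <\]>
show '[^]\]' '[^\]'
# <[>
# <^>
show '[[]]' '[]'
# <[]>
```

in the second call the two matches come out separately, i.e. the class
matched twice, one character each time.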




-- 
-------------------------------------------------------------------------------
Wacek Kusnierczyk, MD PhD

Email: address@hidden
Phone: +47 73591875, +47 72574609

Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303

Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060

-------------------------------------------------------------------------------




