bug-gawk

Re: Clang-built Gawk 5.2.1 regex oddity


From: arnold
Subject: Re: Clang-built Gawk 5.2.1 regex oddity
Date: Sun, 01 Jan 2023 12:40:58 -0700
User-agent: Heirloom mailx 12.5 7/5/10

Hi.

> > In any case, in the gawk repo in helpers/testdfa.c is a program that
> > may be useful for further isolating the problem, since it extracts
> > the regex building and matching from the rest of gawk's code. If
> > the problem persists with that program, it will be of more use
> > in making a bug report to the clang team.
> > 
>
> Unfortunately, no matter what input I give to testdfa,
> it seems to say "malloc failed", e.g.

OK, I have fixed testdfa.c and pushed it to the Git repo. Now that it
works:

| $ ./testdfa "[[][:blank:]]" < in
| Ignorecase: false
| Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
| Pattern: /[[][:blank:]]/, len = 13
| After setup_pattern(), len = 13
| MB_CUR_MAX = 6
| Calling dfacomp([[][:blank:]], 13, 0x55964afe0760, true)
| dfa warning: character class syntax is [[:space:]], not [:space:]
| data: <>
| re_search with NULL returned position -1 (false)
| re_search returned position -1 (false)
| dfaexec returned NULL

In particular, notice the warning; it seems that dfa is unhappy
with [[]. Swapping the 2nd and 3rd characters:

| $ ./testdfa "[][[:blank:]]" < in
| Ignorecase: false
| Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
| Pattern: /[][[:blank:]]/, len = 13
| After setup_pattern(), len = 13
| MB_CUR_MAX = 6
| Calling dfacomp([][[:blank:]], 13, 0x555ee760a7c0, true)
| data: <>
| re_search with NULL returned position -1 (false)
| re_search returned position -1 (false)
| dfaexec returned 0 ()

We see that dfa is happier.

I tend to doubt, however, that this will fix the original problem,
as RS matching uses regex, not dfa, and regex was happy in both
cases.  Although one never knows.

I was always taught that to get a ] into a [...] expression it
had to be the very first character, which is why I think the
swapped version is more correct.  But that may be just me.
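
To make the difference between the two spellings concrete, here is a
minimal sketch. It assumes the system's POSIX <regex.h> (regcomp and
regexec with REG_EXTENDED), not gawk's bundled regex or dfa code, so
it only illustrates the bracket-expression rules and says nothing
about the clang problem itself. The helper names ere_matches and
compare_patterns are made up for this illustration. As far as I can
tell from the POSIX rules, "[[][:blank:]]" parses as the bracket
expression "[[]", then the bracket expression "[:blank:]" (the
literal characters ':', 'b', 'l', 'a', 'n', 'k'), then a literal ']',
whereas "[][[:blank:]]" is a single bracket expression containing
']', '[' and the blank class.

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/* Return true if the POSIX extended regex `pattern` matches anywhere
   in `text`.  Returns false if the pattern does not compile. */
bool ere_matches(const char *pattern, const char *text)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;

    bool found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}

/* Print how both spellings treat a few sample strings. */
void compare_patterns(void)
{
    const char *patterns[] = { "[[][:blank:]]", "[][[:blank:]]" };
    const char *samples[]  = { " ", "]", "[", "[a]" };

    for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; p++)
        for (size_t s = 0; s < sizeof samples / sizeof samples[0]; s++)
            printf("/%s/ vs \"%s\": %s\n", patterns[p], samples[s],
                   ere_matches(patterns[p], samples[s]) ? "match" : "no match");
}
```

Calling compare_patterns() from a main() should show the swapped
pattern matching a lone space, ']' or '[', while the original
spelling only matches three-character sequences such as "[a]". That
would be consistent with the dfa warning above.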

So, thanks for the report on testdfa, and maybe this little
analysis will help, and maybe it won't.  But at least testdfa
is now useful if the clang team wants to use it.

Arnold
