bug-gnu-utils

Re: ^ in FS


From: Dave B
Subject: Re: ^ in FS
Date: Tue, 25 Nov 2008 20:29:04 +0100
User-agent: Thunderbird 2.0.0.18 (X11/20081124)

Stepan Kasal wrote:

> Hello,
> 
> this is an incomplete answer to your mail.
> 
>> I'm having trouble in understanding the behavior of ^ in FS, [...]
> 
> It's a simple consequence of the straightforward implementation.
> One example is worth 1000 words:
> 
> $ echo 'XXf1 , f2, XXf3' | awk -v FS='^X+| *, *' \
>       '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
> --><--
> -->f1<--
> -->f2<--
> --><--
> -->f3<--
> 
> After the third field ("f2") has been found, awk moves past it and
> its delimiter (", "), so the remaining string is "XXf3".
> That string is passed to the regexp matcher.  Since the matcher is
> not told we are not at the beginning of the string, it finds "XX",
> which delimits the empty "fourth field".

This is actually another good example. As in the other case, two different
awk implementations output two different results with your test case:

GNU awk 3.1.6:

--><--
-->f1<--
-->f2<--
--><--
-->f3<--

(imho wrong, since the second XX should not be matched by that FS)

Bell Labs' original awk:

--><--
-->f1<--
-->f2<--
-->XXf3<--

(imho correct)
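
Stepan's description can be turned into a small model. The Python sketch
here is only that, a model of the loop as he describes it, not gawk's
actual code. naive_split and leftmost_longest are made-up names, and FS is
passed as a list of its top-level alternatives because Python's | takes the
first alternative that matches, not the longest one as POSIX regexps do.
Handing only the unscanned remainder to the matcher lets ^ match right
after a delimiter, which reproduces the gawk output:

```python
import re

def leftmost_longest(alternatives, text):
    """Return (begin, end) of the leftmost, then longest, match, or None.

    Each top-level alternative of FS is searched separately to imitate
    POSIX leftmost-longest matching.
    """
    spans = [m.span() for alt in alternatives
             if (m := re.search(alt, text)) is not None]
    # Smallest start wins; among equal starts, the largest end wins.
    return min(spans, key=lambda s: (s[0], -s[1]), default=None)

def naive_split(alternatives, record):
    """Split record, handing only the unscanned remainder to the matcher."""
    if not record:
        return []
    fields, field_start, scan = [], 0, 0
    while scan < len(record):
        # Slicing makes ^ match at scan, as if a new string started there.
        hit = leftmost_longest(alternatives, record[scan:])
        if hit is None:
            break
        begin, end = hit
        if begin == end:      # an empty match cannot delimit anything:
            scan += 1         # skip one character and ask again
            continue
        fields.append(record[field_start:scan + begin])
        field_start = scan = scan + end
    fields.append(record[field_start:])
    return fields

print(naive_split(["^X+", " *, *"], "XXf1 , f2, XXf3"))
# ['', 'f1', 'f2', '', 'f3']
```

The one-character skip for empty matches is part of the model but is not
exercised by this input, since neither alternative can match the empty
string.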

However, I can't find the precise circumstances under which the problem
shows up, since this one works fine in all implementations:

$ echo 'XXf1XXf2XXf3' | awk -v FS='^X+' '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
--><--
-->f1XXf2XXf3<--

If the matcher were not told that it's not at the beginning of the string,
as you suppose, it should produce wrong results in gawk, which it doesn't here.

In any case, it seems that alternation _with something that can match the
empty string_ in the regex used as FS has something to do with the problem,
because if I introduce it the error happens again:

$ echo 'XXf1XXf2XXf3' | gawk -v FS='^X+|k*' '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
--><--
-->f1<--
-->f2<--
-->f3<--

But not if it cannot match the empty string:

$ echo 'XXf1XXf2XXf3' | gawk -v FS='^X+|k+' '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
--><--
-->f1XXf2XXf3<--

Again, Bell Labs' awk works fine here:

$ echo 'XXf1XXf2XXf3' | bell_awk -v FS='^X+|k*' '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
--><--
-->f1XXf2XXf3<--

(bell_awk is just the name I gave to Bell Labs' awk on my system to run the
tests)

This is just another example of the same behavior:

$ echo 'XXf1XXf2XXf3' | gawk -v FS='^X+|f*' '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
--><--
--><--
-->1<--
--><--
-->2<--
--><--
-->3<--

Here, the "XX"s in the middle of the string are incorrectly matched by FS.
But if FS is changed so that it cannot match the empty string, then things
are normal again:

$ echo 'XXf1XXf2XXf3' | gawk -v FS='^X+|f+' '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
--><--
--><--
-->1XX<--
-->2XX<--
-->3<--

(again, Bell Labs' awk produces correct results in all these cases)
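
One possible reading of these results, if Stepan's description of the loop
is right: what matters is which remainders the matcher gets to see. The
Python sketch here lists every matcher call for a given FS. It is a model
under that assumption, not gawk's code; trace_matcher_calls is a made-up
name, and FS is again given as a list of its top-level alternatives so that
the longest alternative can be picked, as a POSIX matcher would do.

```python
import re

def trace_matcher_calls(alternatives, record):
    """Yield (remainder, delimiter) for every call to the matcher.

    delimiter is the matched text, "" for an empty match, None for no match.
    Models the naive loop: the matcher only ever sees the remainder.
    """
    scan = 0
    while scan < len(record):
        rest = record[scan:]
        spans = [m.span() for alt in alternatives
                 if (m := re.search(alt, rest)) is not None]
        # Leftmost match wins; among equal starts, the longest one.
        hit = min(spans, key=lambda s: (s[0], -s[1]), default=None)
        if hit is None:
            yield rest, None
            return
        begin, end = hit
        yield rest, rest[begin:end]
        # Empty match: skip one character. Otherwise jump past the delimiter.
        scan += 1 if begin == end else end

for fs in (["^X+"], ["^X+", "k*"], ["^X+", "k+"]):
    print(fs, list(trace_matcher_calls(fs, "XXf1XXf2XXf3")))
```

With '^X+' and '^X+|k+' the matcher is called twice, and the second call,
on "f1XXf2XXf3", finds nothing because that remainder does not start with
an X. With '^X+|k*' the second call reports an empty match, so the loop
skips one character at a time, and two characters later the remainder is
"XXf2XXf3", where ^X+ matches. In this model the empty match is what lets
an inner "XX" end up at the start of the string handed to the matcher.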

> Now to your example:
>> $ echo '  f1 ,  f2,f3  ,  f  4  ,f5' | awk -v FS='^ *| *, *' '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
>> --><--
>> -->f1<--
>> -->f2<--
>> -->f3<--
>> -->f<--
>> -->4<--
>> -->f5<--
> 
> The difference is that your regexp can match the empty string at the
> beginning.  So after the fourth field and its delimiter have been removed,
> when we have "f  4  ,f5" the answer from the matcher is "empty string
> at the beginning".  But that cannot be a valid delimiter, so gawk
> skips one char and calls the matcher again on "  4  ,f5" .
> Now, the delimiter is "  " so it is not empty, and it is taken as the
> delimiter.

Makes sense, also considering the above results.

>> don't know whether it can be called "bug".
> 
> I'm afraid there are two bugs involved:
> gawk does not tell the matcher
> 1) that empty matches should be ignored (skipping one char to get
> past them is a kludge)
> 2) that it is not at the beginning of the string
> 
> But I'm afraid the regexp matcher(s) inside gawk cannot handle those
> features.  Consequently, the bugs are probably hard to fix.
> 
> Stay tuned, better answers may come later... ;-)
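
For comparison, here is a rough Python sketch of a splitter that does both
things from the list above: it ignores empty matches outright and never
lets ^ match anywhere but the start of the record. It is an illustration,
not a patch for gawk and not how Bell Labs' awk is implemented;
anchored_split and first_nonempty are made-up names. It relies on the fact
that a compiled Python pattern searched from a given pos still ties ^ to
the real start of the string, and it is only meant for simple FS values
like the ones in this thread.

```python
import re

def first_nonempty(alternatives, record, pos):
    """Leftmost, then longest, non-empty match at or after pos, or None.

    Passing pos (instead of slicing) keeps ^ tied to index 0 of the record.
    Each top-level alternative is scanned separately because Python's | is
    leftmost-first rather than POSIX leftmost-longest.
    """
    spans = []
    for alt in alternatives:
        for m in re.compile(alt).finditer(record, pos):
            if m.end() > m.start():   # ignore empty matches outright
                spans.append(m.span())
                break
    return min(spans, key=lambda s: (s[0], -s[1]), default=None)

def anchored_split(alternatives, record):
    """Split record on non-empty FS matches, with ^ meaning start of record."""
    if not record:
        return []
    fields, pos = [], 0
    while (hit := first_nonempty(alternatives, record, pos)) is not None:
        fields.append(record[pos:hit[0]])
        pos = hit[1]
    fields.append(record[pos:])
    return fields

print(anchored_split(["^X+", " *, *"], "XXf1 , f2, XXf3"))
# ['', 'f1', 'f2', 'XXf3']
```

For the two inputs where the Bell Labs output is quoted above ('^X+| *, *'
and '^X+|k*'), the sketch gives the same fields. With FS='^ *| *, *' on my
original example it keeps "f  4" as a single field, since the inner blanks
are neither at the start of the record nor next to a comma.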

Thank you very much for your explanations!

-- 
D.



