Re: [bug-gawk] regexp RS mangling input


From: Aharon Robbins
Subject: Re: [bug-gawk] regexp RS mangling input
Date: Fri, 10 Aug 2012 13:27:23 +0300
User-agent: Heirloom mailx 12.4 7/29/08

Hello. Apologies for the long delay in replying to this.

> Date: Sun, 20 May 2012 01:05:52 -0400
> From: Jay Michael <address@hidden>
> To: address@hidden
> Subject: [bug-gawk] regexp RS mangling input
>
>       I'm using a regular expression as RS to soak up everything I don't 
> want to see while parsing my input.  I want the record terminator to 
> include possibly multi-line expanses enclosed in braces.
>
>       The first problem I had, gawk seemed to be returning the same 
> string for several consecutive internal records.  When I tried to track 
> down what I was doing wrong, my reduced test case caused gawk to include 
> what should have been the first record in the first record's terminator, 
> while ending the terminator before the end of the second "comment". 
> Then, gawk acted like each character was a record terminator.
>
>       I'm running GNU Awk 3.1.3 under Windows XP.  I don't know who 
> built it, I don't remember where I got it.  I tried on a UNIX/Linux 
> shell to which I have access.  It was running 3.1.1 (or so), it behaved 
> the same way as the version on my PC.

You can get a current version of gawk (4.0.1) for Windows from:

        http://sourceforge.net/projects/ezwinports/files/

which I highly recommend doing. 3.1.3 is almost ten years old.

>       I have attached my program (d.awk) and input (d.i).  d.log is not 
> really a log file -- I pasted pieces and then appended the output of
> "gawk -f d.awk d.i".

I think the problem is that your regex for RS is too inclusive. You
have

    RS = "([ \\n]|(" re_bcom "))*" ;

I believe that the space is giving you problems; it causes each space
to act as a record separator, which is likely not what you want.

Crafting a regular expression can be difficult if you are trying to
match highly variable input.  You may want to use a simpler RS and
use sub, gsub, or gensub to remove the stuff you don't want from each
record before processing it.
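
A minimal sketch of that approach, assuming the unwanted text is
brace-delimited comments that may span lines but never nest (the actual
d.awk and d.i are not shown here, so this is a guess at their shape).
It keeps the default RS, so each line is a record, and uses sub and
gsub to strip comments before the record is used. The names
strip_comments and in_comment are hypothetical, chosen for illustration.

```shell
# strip_comments: hypothetical helper wrapping a POSIX awk program.
# Assumes brace-delimited comments that do not nest.
strip_comments() {
    awk '
    {
        line = $0
        if (in_comment) {
            # Still inside a comment opened on an earlier line.
            if (!sub(/^[^}]*[}]/, "", line))
                next                     # no closing brace yet, drop the line
            in_comment = 0
        }
        gsub(/[{][^}]*[}]/, "", line)    # comments opened and closed on this line
        if (sub(/[{].*$/, "", line))     # comment opens here, continues below
            in_comment = 1
        gsub(/^[ \t]+|[ \t]+$/, "", line)  # trim leftover blanks
        if (line != "")
            print line                   # real processing would go here
    }' "$@"
}
```

Something like "strip_comments d.i" would then print only the text
outside the braces, one cleaned record per line. Keeping RS simple
means the comment handling lives in ordinary pattern matching, which
is easier to debug than a single large separator regex.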

HTH,

Arnold


