
Re: [bug-gawk] FIELDWIDTHS can miscount the number of fields


From: Wolfgang Laun
Subject: Re: [bug-gawk] FIELDWIDTHS can miscount the number of fields
Date: Sun, 21 May 2017 21:46:26 +0200

I haven't read through all the details of this discussion, but perhaps you should consider the origin of fixed-width fields. Before Unix, nobody had to worry about text file lines being truncated by the omission of trailing spaces. Those were the days when records in certain text files typically had 80, perhaps 96 or even 132 characters, and your (COBOL) structure would be defined by column or character counts. (This fixed-field structure did survive in many applications, and perhaps that was the rationale for including this feature when awk was invented.) The traditional approach was, of course, to have trailing empty fields set to all spaces. Partially non-blank fields would have full length, padded with trailing spaces.

If someone wants to have that old behaviour, omitting trailing fields altogether and truncating an incomplete field would create a nuisance. Announcing a rigid field structure using FIELDWIDTHS is (IMHO) the sign that just this traditional behaviour is desired.
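
To make the traditional behaviour concrete, here is a minimal sketch in Python. It is only a model of the semantics described above, not gawk code; the name split_padded is made up for illustration. The record is padded with spaces to the full declared width before splitting, so every field always has its full length.

```python
def split_padded(record, widths):
    """Split a record the traditional way (hypothetical helper).

    The record is first padded with trailing spaces to the total
    declared width, so trailing fields are never missing and a
    partially filled field is space-padded to full length.
    """
    padded = record.ljust(sum(widths))
    fields = []
    pos = 0
    for w in widths:
        fields.append(padded[pos:pos + w])
        pos += w
    return fields


print(split_padded("aabb", [2, 3, 4]))  # ['aa', 'bb ', '    ']
```

With this model the number of fields is always the number of declared widths, whatever the length of the input line.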

Cheers
Wolfgang






On 21 May 2017 at 21:12, Arnold Robbins <address@hidden> wrote:
Hi.

> Date: Sun, 21 May 2017 14:58:55 -0400
> From: "Andrew J. Schorr" <address@hiddeninvestments.com>
> To: Arnold Robbins <address@hidden>
> Cc: address@hidden
> Subject: Re: [bug-gawk] FIELDWIDTHS can miscount the number of fields
>
> Hi,
>
> On Sun, May 21, 2017 at 09:52:49PM +0300, Arnold Robbins wrote:
> > You ask tough questions.
>
> :-) Sorry to be a pain.

No, it's good.

> > To me it seems obvious that if all the
> > requested data isn't there then the number of fields should be smaller,
> > allowing a check against what's expected to make it possible to weed
> > out bad data.
>
> I agree with you, but it is a change in behavior. I think it's probably
> safe to do this, but we should document how this stuff works.

Right.

> > It opens up the question of what if there is short data for an
> > individual field - FIELDWIDTHS says field 2 is 5 characters but only
> > three are there.
>
> Yes. Wolfgang's example is on point.
>
> > None of this is well defined in the documentation, nor, obviously
> > was it well thought out to start with. :-(
> >
> > I have written the code to handle the suggested '*' at the end to
> > mean "the rest of the record", which in and of itself is probably
> > a good idea.
>
> Agreed.
>
> > I guess I need to define what happens in these corner cases and
> > put it up for discussion here and then go with whatever seems to
> > make the most sense after the discussion is done. Sigh.
>
> I don't expect much disagreement over this, but one never knows. I think
> we should state clearly what we are doing, and then we should be OK.
> It seems clear that nobody has yet written code with FIELDWIDTHS that
> depends on the subtle NF behavior that you are discussing, so I doubt
> we will break anything.
>
> Regards,
> Andy

So here are my thoughts.

Q2/A2 is the biggest real open question.

Arnold
-----------------------------------------------------------------
Sun May 21 21:54:06 IDT 2017
============================

Some thoughts on better definitions of the behavior for FIELDWIDTHS.

Q1. Given FIELDWIDTHS = "2 3 4" and input data "aabb", how many fields
   should there be?
   A. Two, since that's all the data that's there
   B. Three, with $3 == "", since it's supposed to be all fixed width data

A1. Gawk currently says three. Arnold leans towards two, since it reflects
    the actual data and allows code expecting three fields to weed out
    bad records.
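
The two answers to Q1 can be compared with a small Python model. This is a sketch of the semantics only, not gawk's implementation; the function name and the keep_empty_trailing flag are invented for the illustration.

```python
def split_trailing(record, widths, keep_empty_trailing):
    """Model the two Q1 options (hypothetical helper).

    keep_empty_trailing=False -> option A: stop when the data runs out.
    keep_empty_trailing=True  -> option B: always one field per width,
                                 with missing fields set to "".
    """
    fields = []
    pos = 0
    for w in widths:
        if pos >= len(record) and not keep_empty_trailing:
            break
        fields.append(record[pos:pos + w])
        pos += w
    return fields


print(split_trailing("aabb", [2, 3, 4], False))  # ['aa', 'bb']
print(split_trailing("aabb", [2, 3, 4], True))   # ['aa', 'bb', '']
```

Under option A, a program expecting three fields can reject the record with a simple check on the field count, which is the point made in A1.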

Q2. Given FIELDWIDTHS = "2 3 4" and input data "aab", should $2 have a
    value?
    A. No - we're expecting three characters and they weren't all there
    B. Yes - something was there, make it available

A2. Gawk currently says "yes".  Arnold isn't sure what's right here.
    Input is welcome.
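
For discussion, the two Q2 options can be modelled the same way. Again this is a Python sketch under stated assumptions, not gawk code: it assumes option A of Q1 (no empty trailing fields), and the keep_partial flag is hypothetical.

```python
def split_partial(record, widths, keep_partial):
    """Model the two Q2 options (hypothetical helper).

    keep_partial=False -> option A: a field with fewer characters than
                          declared is dropped.
    keep_partial=True  -> option B: whatever is there becomes the field.
    """
    fields = []
    pos = 0
    for w in widths:
        chunk = record[pos:pos + w]
        if not chunk:
            break  # no data left at all
        if len(chunk) < w and not keep_partial:
            break  # short field, option A discards it
        fields.append(chunk)
        pos += w
    return fields


print(split_partial("aab", [2, 3, 4], False))  # ['aa']
print(split_partial("aab", [2, 3, 4], True))   # ['aa', 'b']
```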

Q3. Given FIELDWIDTHS = "2 3 4" and input data "aabbbccccdddd" what should
    be done with the dddd?
    A. Nothing - it's extra, ignore it. NF should be set to 3. Code that
       wants to know if there's something extra can use length() and
       substr() to get it out of the record.
    B. Stick it into $4 anyway.

A3. Arnold and gawk agree on (A).
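
Option (A) of Q3 can be sketched as follows, with the leftover recovered by the Python equivalent of a length() and substr() computation on the record. This is an illustrative model with invented names, not gawk's implementation.

```python
def split_and_leftover(record, widths):
    """Split by widths and report any extra data separately (hypothetical helper)."""
    fields = []
    pos = 0
    for w in widths:
        if pos >= len(record):
            break
        fields.append(record[pos:pos + w])
        pos += w
    # Anything past the declared widths is not a field; the caller can
    # still get at it by slicing the record, as with substr().
    leftover = record[sum(widths):]
    return fields, leftover


fields, extra = split_and_leftover("aabbbccccdddd", [2, 3, 4])
print(len(fields), extra)  # 3 dddd
```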

Q4. Consider the idea of using "*" at the end of FIELDWIDTHS to mean
    "the rest of the record". With FIELDWIDTHS = "2 3 4 *" and input
    data "aabbbccccdddd", the dddd would go into $4. The final data
    would be optional.  Is there any reason not to add this to gawk?
    It seems to be actually useful and not just theoretically useful.

A4. Arnold thinks it's right to add it.
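
The proposed "*" can be modelled like this. It is a Python sketch of the idea as stated in Q4, not the code written for gawk; the function name is invented, and the handling of short records assumes option A of Q1.

```python
def split_with_rest(record, spec):
    """Split by a FIELDWIDTHS-style spec whose last token may be "*" (hypothetical helper)."""
    tokens = spec.split()
    take_rest = bool(tokens) and tokens[-1] == "*"
    if take_rest:
        tokens = tokens[:-1]
    widths = [int(t) for t in tokens]

    fields = []
    pos = 0
    for w in widths:
        if pos >= len(record):
            break
        fields.append(record[pos:pos + w])
        pos += w
    # "*" means the rest of the record; it is optional, so no field is
    # created when nothing is left.
    if take_rest and pos < len(record):
        fields.append(record[pos:])
    return fields


print(split_with_rest("aabbbccccdddd", "2 3 4 *"))  # ['aa', 'bbb', 'cccc', 'dddd']
print(split_with_rest("aabbbcccc", "2 3 4 *"))      # ['aa', 'bbb', 'cccc']
```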


