Re: [bug-gawk] FPAT bug?

bug-gawk
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [bug-gawk] FPAT bug?

From:	Arnold Robbins
Subject:	Re: [bug-gawk] FPAT bug?
Date:	Wed, 12 Apr 2017 12:44:39 +0300
User-agent:	Heirloom mailx 12.5 6/20/10
Hi All.

> Subject: Re: [bug-gawk] FPAT bug?
> To: address@hidden
> Cc: address@hidden
> From: Manuel Collado <address@hidden>
> Date: Wed, 5 Apr 2017 23:04:47 +0200
>
> El 03/04/2017 a las 8:29, address@hidden escribi?:
> >[...]
> > I will get around to investigating the issue with FPAT.
>
> Perhaps a problem with the FPAT feature is how the parsing algorithm is 
> structured. Thinking about it, I feel there is a simpler way to 
> structure the code.
>
> Attached is what can be an awk reference implementation of the FPAT 
> parser. It differs from the current implementation in some ways:
>
> - Each iteration of the parsing loop identifies exactly one
>        (separator[n-1],field[n])
>    pair. The last iteration may deliver just a final separator.
> - Each (separator/field) pair must consume some input characters.
>    Except for the first (sep[0]/field[1]) that can be both nulls.
> - So a null field is only valid as a first field or after
>    a non-null separator.
>
> This avoid the use of flags to remember some state from the previous 
> loop iteration.
>
> The attached code has been tested with the (adapted) set of fpat*.awk 
> tests, and some more.
>
> Please advice if it is worth the pain of trying to restructure the 
> current code, or if there is a simpler way to fix the FPAT bug.

Thanks Manuel.

For the bug-gawk list, Manuel contributed a code fix based on
his awk reference implementation and I have integrated his changes
into master.

Below is the essence of his fix (without some additional tests)
that might apply to gawk-4.1-stable without too much trouble.

In any case, his code is now in master and will thus be part of the
next major release.

Thanks!

Arnold
-------------------------------
diff --git a/ChangeLog b/ChangeLog
index 31c77de..fd75980 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2017-04-12         Manuel Collado        <address@hidden>
+
+       Fix the FPAT bug reported by Ed Morton in the gawk-bug mailing list.
+
+       * awk.h (Regexp): Remove the non_empty flag.
+       * field.c (fpat_parse_field): Restructure the code to reduce complexity
+       and document the new structure.
+
 2017-04-10         Andrew J. Schorr     <address@hidden>
 
        * awk.h (enum opcodeval): For the avoidance of doubt, specify that
diff --git a/awk.h b/awk.h
index 88a495a..127eb92 100644
--- a/awk.h
+++ b/awk.h
@@ -210,7 +210,6 @@ typedef struct Regexp {
        struct re_pattern_buffer pat;
        struct re_registers regs;
        struct dfa *dfareg;
-       bool non_empty;         /* for use in fpat_parse_field */
        bool has_meta;          /* re has meta chars so (probably) isn't simple 
string */
        bool maybe_long;        /* re has meta chars that can match long text */
 } Regexp;
diff --git a/doc/ChangeLog b/doc/ChangeLog
index 82ae2ce..61d09e2 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -1,3 +1,7 @@
+2017-04-12         Manuel Collado        <address@hidden>
+
+       * gawktexi.in, gawk.1: Small clarification of the patsplit behavior.
+
 2017-04-11         Arnold D. Robbins     <address@hidden>
 
        * gawktexi.in: Minor style edits.
diff --git a/doc/gawk.1 b/doc/gawk.1
index a4f691d..b04cb01 100644
--- a/doc/gawk.1
+++ b/doc/gawk.1
@@ -2977,9 +2977,11 @@ that matched
 .IR r .
 The value of
 .BI seps[ i ]
-is the separator that appeared in
-front of
-.BI a[ i +1]\fR.
+is the possibly null separator that appeared after
+.BI a[ i ]\fR.
+The value of
+.B seps[0]
+is the possibly null leading separator.
 \&\fRIf
 .I r
 is omitted,
diff --git a/doc/gawktexi.in b/doc/gawktexi.in
index f4fe259..f991432 100644
--- a/doc/gawktexi.in
+++ b/doc/gawktexi.in
@@ -17267,7 +17267,7 @@ using a third argument is a fatal error.
 @cindexgawkfunc{patsplit}
 @cindex split string into array
 Divide
address@hidden into pieces defined by @var{fieldpat}
address@hidden into pieces (or ``fields'') defined by @var{fieldpat}
 and store the pieces in @var{array} and the separator strings in the
 @var{seps} array.  The first piece is stored in
 @address@hidden, the second piece in @address@hidden, and so
@@ -17278,9 +17278,11 @@ It may be either a regexp constant or a string.
 If @var{fieldpat} is omitted, the value of @code{FPAT} is used.
 @code{patsplit()} returns the number of elements created.
 @address@hidden@var{i}]} is
-the separator string
-between @address@hidden@var{i}]} and @address@hidden@var{i}+1]}.
-Any leading separator will be in @address@hidden
+the possibly null separator string
+after @address@hidden@var{i}]}.
+The possibly null leading separator will be in @address@hidden
+So a non-null @var{string} with @var{n} fields will have @var{n+1} separators.
+A null @var{string} will not have neither fields nor separators.
 
 The @code{patsplit()} function splits strings into pieces in a
 manner similar to the way input lines are split into fields using @code{FPAT}
diff --git a/field.c b/field.c
index a3be977..8145141 100644
--- a/field.c
+++ b/field.c
@@ -1502,101 +1502,65 @@ incr_scan(char **scanp, size_t len, mbstate_t *mbs)
  * via (*parse_field)().  This variation is for when FPAT is a regular
  * expression -- use the value to find field contents.
  *
- * This was really hard to get right.  It happens to bear many resemblances
- * to issues I had with getting gsub right with null matches. When dealing
- * with that I prototyped in awk and had the foresight to save the awk code
- * over in the C file.  Starting with that as a base, I finally got to this
- * awk code to do what I needed, and then translated it into C. Fortunately
- * the C code bears a closer correspondance to the awk code here than over
- * by gsub.
+ * The FPAT parsing logic is a bit difficult to specify. In particular
+ * to allow null fields at certain locations. To make the code as robust
+ * as possible, an awk reference implementation was written and tested
+ * as a first step, and later recoded in C, preserving its structure as
+ * much as possible.
  *
- * BEGIN {
- *     false = 0
- *     true = 1
- *
- *     fpat[1] = "([^,]*)|(\"[^\"]+\")"
- *     fpat[2] = fpat[1]
- *     fpat[3] = fpat[1]
- *     fpat[4] = "aa+"
- *     fpat[5] = fpat[4]
- *
- *     data[1] = "Robbins,,Arnold,"
- *     data[2] = "Smith,,\"1234 A Pretty Place, 
NE\",Sometown,NY,12345-6789,USA"
- *     data[3] = "Robbins,Arnold,\"1234 A Pretty Place, 
NE\",Sometown,NY,12345-6789,USA"
- *     data[4] = "bbbaaacccdddaaaaaqqqq"
- *     data[5] = "bbbaaacccdddaaaaaqqqqa" # should get trailing qqqa
- *
- *     for (i = 1; i in data; i++) {
- *             printf("Splitting: <%s>\n", data[i])
- *             n = mypatsplit(data[i], fields, fpat[i], seps)
- *             print "n =", n
- *             for (j = 1; j <= n; j++)
- *                     printf("fields[%d] = <%s>\n", j, fields[j])
- *             for (j = 0; j in seps; j++)
- *                     printf("seps[%s] = <%s>\n", j, seps[j])
- *     }
- * }
- *
- * function mypatsplit(string, array, pattern, seps,
- *                     eosflag, non_empty, nf) # locals
+ * # Reference implementation of the FPAT record parsing.
+ * #
+ * # Each loop iteration identifies a (separator[n-1],field[n]) pair.
+ * # Each loop iteration must consume some characters, except for the first 
field.
+ * # So a null field is only valid as a first field or after a non-null 
separator.
+ * # A null record has no fields (not a single null field).
+ * 
+ * function refpatsplit(string, fields, pattern, seps,
+ *         parse_start, sep_start, field_start, field_length, field_found, nf) 
# locals
  * {
- *     delete array
- *     delete seps
- *     if (length(string) == 0)
- *             return 0
- *
- *     eosflag = non_empty = false
- *     nf = 0
- *     while (match(string, pattern)) {
- *             if (RLENGTH > 0) {      # easy case
- *                     non_empty = true
- *                     if (! (nf in seps)) {
- *                             if (RSTART == 1)        # match at front of 
string
- *                                     seps[nf] = ""
- *                             else
- *                                     seps[nf] = substr(string, 1, RSTART - 1)
- *                     }
- *                     array[++nf] = substr(string, RSTART, RLENGTH)
- *                     string = substr(string, RSTART+RLENGTH)
- *                     if (length(string) == 0)
- *                             break
- *             } else if (non_empty) {
- *                     # last match was non-empty, and at the
- *                     # current character we get a zero length match,
- *                     # which we don't want, so skip over it
- *                     non_empty = false
- *                     seps[nf] = substr(string, 1, 1)
- *                     string = substr(string, 2)
- *             } else {
- *                     # 0 length match
- *                     if (! (nf in seps)) {
- *                             if (RSTART == 1)
- *                                     seps[nf] = ""
- *                             else
- *                                     seps[nf] = substr(string, 1, RSTART - 1)
- *                     }
- *                     array[++nf] = ""
- *                     if (! non_empty && ! eosflag) { # prev was empty
- *                             seps[nf] = substr(string, 1, 1)
- *                     }
- *                     if (RSTART == 1) {
- *                             string = substr(string, 2)
- *                     } else {
- *                             string = substr(string, RSTART + 1)
- *                     }
- *                     non_empty = false
- *             }
- *             if (length(string) == 0) {
- *                     if (eosflag)
- *                             break
- *                     else
- *                             eosflag = true
- *             }
- *     }
- *     if (length(string) > 0)
- *             seps[nf] = string
- *
- *     return length(array)
+ *     # Local state variables:
+ *     # - parse_start: pointer to the first not yet consumed character
+ *     # - sep_start: pointer to the beginning of the parsed separator
+ *     # - field start: pointer to the beginning of the parsed field
+ *     # - field length: length of the parsed field
+ *     # - field_found: flag for succesful field match
+ *     # - nf: Number of fields found so far
+ *     
+ *     # Prepare for parsing
+ *     parse_start = 1   # first not yet parsed char
+ *     nf = 0            # fields found so far
+ *     delete fields
+ *     delete seps
+ * 
+ *     # Loop that consumes the whole record
+ *     while (parse_start <= length(string)) {  # still something to parse
+ *     
+ *         # first attempt to match the next field
+ *         sep_start = parse_start
+ *         field_found = match(substr(string, parse_start), pattern)
+ *         
+ *         # check for an invalid null field and retry one character away
+ *         if (nf > 0 && field_found && RSTART==1 && RLENGTH==0) {
+ *             parse_start++
+ *             field_found = match(substr(string, parse_start), pattern)
+ *         }
+ *         
+ *         # store the (sep[n-1],field[n]) pair
+ *         if (field_found) {
+ *             field_start = parse_start + RSTART - 1
+ *             field_length = RLENGTH
+ *             seps[nf] = substr(string, sep_start, field_start-sep_start)
+ *             fields[++nf] = substr(string, field_start, field_length)
+ *             parse_start = field_start + field_length
+ *             
+ *         # store the final extra sep after the last field
+ *         } else {
+ *             seps[nf] = substr(string, sep_start)
+ *             parse_start = length(string) + 1
+ *         }
+ *     }
+ *     
+ *     return nf
  * }
  */
 static long
@@ -1615,10 +1579,9 @@ fpat_parse_field(long up_to,     /* parse only up to 
this field number */
        char *start;
        char *end = scan + len;
        int regex_flags = RE_NEED_START;
-       bool need_to_set_sep;
-       bool non_empty;
-       bool eosflag;
        mbstate_t mbs;
+       char* field_start;
+       bool field_found;
 
        memset(&mbs, 0, sizeof(mbstate_t));
 
@@ -1631,90 +1594,48 @@ fpat_parse_field(long up_to,    /* parse only up to 
this field number */
        if (rp == NULL) /* use FPAT */
                rp = FPAT_regexp;
 
-       if (in_middle) {
-               regex_flags |= RE_NO_BOL;
-       }
-       non_empty = rp->non_empty;
+       while (scan <= end && nf < up_to) {  /* still something to parse */
 
-       eosflag = false;
-       need_to_set_sep = true;
-       start = scan;
-       while (research(rp, scan, 0, (end - scan), regex_flags) != -1
-              && nf < up_to) {
+               /* first attempt to match the next field */
+               start = scan;
+               field_found = research(rp, scan, 0, (end - scan), regex_flags) 
!= -1;
+
+               /* check for an invalid null field and retry one character away 
*/ 
+               if (nf > 0 && field_found && REEND(rp, scan) == 0) { /* invalid 
null field */
+                       increment_scan(& scan, end - scan);
+                       field_found = research(rp, scan, 0, (end - scan), 
regex_flags) != -1;
+               }
 
-               if (REEND(rp, scan) > RESTART(rp, scan)) { /* if (RLENGTH > 0) 
*/
-                       non_empty = true;
-                       if (sep_arr != NULL && need_to_set_sep) {
-                               if (RESTART(rp, scan) == 0) /* match at front */
-                                       set_element(nf, start, 0L, sep_arr);
+               /* store the (sep[n-1],field[n]) pair */
+               if (field_found) {
+                       field_start = scan + RESTART(rp, scan);
+                       if (sep_arr != NULL) { /* store the separator */
+                               if (field_start == start) /* match at front */
+                                       set_element(nf, start, 0L, sep_arr);
                                else
-                                       set_element(nf,
+                                       set_element(nf,
                                                start,
-                                               (long) RESTART(rp, scan),
+                                               (long) (field_start - start),
                                                sep_arr);
                        }
                        /* field is text that matched */
                        (*set)(++nf,
-                               scan + RESTART(rp, scan),
+                               field_start,
                                (long)(REEND(rp, scan) - RESTART(rp, scan)),
                                n);
-
                        scan += REEND(rp, scan);
-                       if (scan >= end)
-                               break;
-                       need_to_set_sep = true;
-               } else if (non_empty) { /* else if non_empty */
+
+               } else {
                        /*
-                        * last match was non-empty, and at the
-                        * current character we get a zero length match,
-                        * which we don't want, so skip over it
+                        * No match, store the final extra separator after
+                        * the last field.
                         */
-                       non_empty = false;
-                       if (sep_arr != NULL) {
-                               need_to_set_sep = false;
-                               set_element(nf, start, 1L, sep_arr);
-                       }
-                       increment_scan(& scan, end - scan);
-               } else {
-                       /* 0 length match */
-                       if (sep_arr != NULL && need_to_set_sep) {
-                               if (RESTART(rp, scan) == 0) /* RSTART == 1 */
-                                       set_element(nf, start, 0L, sep_arr);
-                               else
-                                       set_element(nf, start,
-                                                       (long) RESTART(rp, 
scan),
-                                                       sep_arr);
-                       }
-                       need_to_set_sep = true;
-                       (*set)(++nf, scan, 0L, n);
-                       if (! non_empty && ! eosflag) { /* prev was empty */
-                               if (sep_arr != NULL) {
-                                       set_element(nf, start, 1L, sep_arr);
-                                       need_to_set_sep = false;
-                               }
-                       }
-                       if (RESTART(rp, scan) == 0)
-                               increment_scan(& scan, end - scan);
-                       else {
-                               scan += RESTART(rp, scan);
-                       }
-                       non_empty = false;
-               }
-               if (scan >= end) { /* length(string) == 0 */
-                       if (eosflag)
-                               break;
-                       else
-                               eosflag = true;
+                       if (sep_arr != NULL)
+                               set_element(nf, start, (long) (end - start), 
sep_arr);
+                       scan = end + 1;
                }
-
-               start = scan;
-       }
-       if (scan < end) {
-               if (sep_arr != NULL)
-                       set_element(nf, scan, (long) (end - scan), sep_arr);
        }
 
        *buf = scan;
-       rp->non_empty = non_empty;
        return nf;
 }
diff --git a/test/patsplit.ok b/test/patsplit.ok
index cda8319..02387d8 100644
--- a/test/patsplit.ok
+++ b/test/patsplit.ok
@@ -8,6 +8,7 @@ seps[0] = <>
 seps[1] = <,>
 seps[2] = <,>
 seps[3] = <,>
+seps[4] = <>
 Splitting: <Smith,,"1234 A Pretty Place, NE",Sometown,NY,12345-6789,USA>
 n = 7
 fields[1] = <Smith>
@@ -24,6 +25,7 @@ seps[3] = <,>
 seps[4] = <,>
 seps[5] = <,>
 seps[6] = <,>
+seps[7] = <>
 Splitting: <Robbins,Arnold,"1234 A Pretty Place, 
NE",Sometown,NY,12345-6789,USA>
 n = 7
 fields[1] = <Robbins>
@@ -40,6 +42,7 @@ seps[3] = <,>
 seps[4] = <,>
 seps[5] = <,>
 seps[6] = <,>
+seps[7] = <>
 Splitting: <bbbaaacccdddaaaaaqqqq>
 n = 2
 fields[1] = <aaa>
[Prev in Thread]
Current Thread
[Next in Thread]
Re: [bug-gawk] FPAT bug?, (continued)
- Re: [bug-gawk] FPAT bug?, Ed Morton, 2017/04/01
Prev by Date: Re: [bug-gawk] Memory leak
Next by Date: [bug-gawk] gawk/master fails to build
Previous by thread: Re: [bug-gawk] FPAT bug?
Next by thread: Re: [bug-gawk] FPAT bug?
Index(es):
- Date
- Thread