[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [bug-gawk] FPAT bug?
From: |
Arnold Robbins |
Subject: |
Re: [bug-gawk] FPAT bug? |
Date: |
Wed, 12 Apr 2017 12:44:39 +0300 |
User-agent: |
Heirloom mailx 12.5 6/20/10 |
Hi All.
> Subject: Re: [bug-gawk] FPAT bug?
> To: address@hidden
> Cc: address@hidden
> From: Manuel Collado <address@hidden>
> Date: Wed, 5 Apr 2017 23:04:47 +0200
>
> El 03/04/2017 a las 8:29, address@hidden escribi?:
> >[...]
> > I will get around to investigating the issue with FPAT.
>
> Perhaps a problem with the FPAT feature is how the parsing algorithm is
> structured. Thinking about it, I feel there is a simpler way to
> structure the code.
>
> Attached is what can be an awk reference implementation of the FPAT
> parser. It differs from the current implementation in some ways:
>
> - Each iteration of the parsing loop identifies exactly one
> (separator[n-1],field[n])
> pair. The last iteration may deliver just a final separator.
> - Each (separator/field) pair must consume some input characters.
> Except for the first (sep[0]/field[1]) that can be both nulls.
> - So a null field is only valid as a first field or after
> a non-null separator.
>
> This avoid the use of flags to remember some state from the previous
> loop iteration.
>
> The attached code has been tested with the (adapted) set of fpat*.awk
> tests, and some more.
>
> Please advice if it is worth the pain of trying to restructure the
> current code, or if there is a simpler way to fix the FPAT bug.
Thanks Manuel.
For the bug-gawk list, Manuel contributed a code fix based on
his awk reference implementation and I have integrated his changes
into master.
Below is the essence of his fix (without some additional tests)
that might apply to gawk-4.1-stable without too much trouble.
In any case, his code is now in master and will thus be part of the
next major release.
Thanks!
Arnold
-------------------------------
diff --git a/ChangeLog b/ChangeLog
index 31c77de..fd75980 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2017-04-12 Manuel Collado <address@hidden>
+
+ Fix the FPAT bug reported by Ed Morton in the gawk-bug mailing list.
+
+ * awk.h (Regexp): Remove the non_empty flag.
+ * field.c (fpat_parse_field): Restructure the code to reduce complexity
+ and document the new structure.
+
2017-04-10 Andrew J. Schorr <address@hidden>
* awk.h (enum opcodeval): For the avoidance of doubt, specify that
diff --git a/awk.h b/awk.h
index 88a495a..127eb92 100644
--- a/awk.h
+++ b/awk.h
@@ -210,7 +210,6 @@ typedef struct Regexp {
struct re_pattern_buffer pat;
struct re_registers regs;
struct dfa *dfareg;
- bool non_empty; /* for use in fpat_parse_field */
bool has_meta; /* re has meta chars so (probably) isn't simple
string */
bool maybe_long; /* re has meta chars that can match long text */
} Regexp;
diff --git a/doc/ChangeLog b/doc/ChangeLog
index 82ae2ce..61d09e2 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -1,3 +1,7 @@
+2017-04-12 Manuel Collado <address@hidden>
+
+ * gawktexi.in, gawk.1: Small clarification of the patsplit behavior.
+
2017-04-11 Arnold D. Robbins <address@hidden>
* gawktexi.in: Minor style edits.
diff --git a/doc/gawk.1 b/doc/gawk.1
index a4f691d..b04cb01 100644
--- a/doc/gawk.1
+++ b/doc/gawk.1
@@ -2977,9 +2977,11 @@ that matched
.IR r .
The value of
.BI seps[ i ]
-is the separator that appeared in
-front of
-.BI a[ i +1]\fR.
+is the possibly null separator that appeared after
+.BI a[ i ]\fR.
+The value of
+.B seps[0]
+is the possibly null leading separator.
\&\fRIf
.I r
is omitted,
diff --git a/doc/gawktexi.in b/doc/gawktexi.in
index f4fe259..f991432 100644
--- a/doc/gawktexi.in
+++ b/doc/gawktexi.in
@@ -17267,7 +17267,7 @@ using a third argument is a fatal error.
@cindexgawkfunc{patsplit}
@cindex split string into array
Divide
address@hidden into pieces defined by @var{fieldpat}
address@hidden into pieces (or ``fields'') defined by @var{fieldpat}
and store the pieces in @var{array} and the separator strings in the
@var{seps} array. The first piece is stored in
@address@hidden, the second piece in @address@hidden, and so
@@ -17278,9 +17278,11 @@ It may be either a regexp constant or a string.
If @var{fieldpat} is omitted, the value of @code{FPAT} is used.
@code{patsplit()} returns the number of elements created.
@address@hidden@var{i}]} is
-the separator string
-between @address@hidden@var{i}]} and @address@hidden@var{i}+1]}.
-Any leading separator will be in @address@hidden
+the possibly null separator string
+after @address@hidden@var{i}]}.
+The possibly null leading separator will be in @address@hidden
+So a non-null @var{string} with @var{n} fields will have @var{n+1} separators.
+A null @var{string} will not have neither fields nor separators.
The @code{patsplit()} function splits strings into pieces in a
manner similar to the way input lines are split into fields using @code{FPAT}
diff --git a/field.c b/field.c
index a3be977..8145141 100644
--- a/field.c
+++ b/field.c
@@ -1502,101 +1502,65 @@ incr_scan(char **scanp, size_t len, mbstate_t *mbs)
* via (*parse_field)(). This variation is for when FPAT is a regular
* expression -- use the value to find field contents.
*
- * This was really hard to get right. It happens to bear many resemblances
- * to issues I had with getting gsub right with null matches. When dealing
- * with that I prototyped in awk and had the foresight to save the awk code
- * over in the C file. Starting with that as a base, I finally got to this
- * awk code to do what I needed, and then translated it into C. Fortunately
- * the C code bears a closer correspondance to the awk code here than over
- * by gsub.
+ * The FPAT parsing logic is a bit difficult to specify. In particular
+ * to allow null fields at certain locations. To make the code as robust
+ * as possible, an awk reference implementation was written and tested
+ * as a first step, and later recoded in C, preserving its structure as
+ * much as possible.
*
- * BEGIN {
- * false = 0
- * true = 1
- *
- * fpat[1] = "([^,]*)|(\"[^\"]+\")"
- * fpat[2] = fpat[1]
- * fpat[3] = fpat[1]
- * fpat[4] = "aa+"
- * fpat[5] = fpat[4]
- *
- * data[1] = "Robbins,,Arnold,"
- * data[2] = "Smith,,\"1234 A Pretty Place,
NE\",Sometown,NY,12345-6789,USA"
- * data[3] = "Robbins,Arnold,\"1234 A Pretty Place,
NE\",Sometown,NY,12345-6789,USA"
- * data[4] = "bbbaaacccdddaaaaaqqqq"
- * data[5] = "bbbaaacccdddaaaaaqqqqa" # should get trailing qqqa
- *
- * for (i = 1; i in data; i++) {
- * printf("Splitting: <%s>\n", data[i])
- * n = mypatsplit(data[i], fields, fpat[i], seps)
- * print "n =", n
- * for (j = 1; j <= n; j++)
- * printf("fields[%d] = <%s>\n", j, fields[j])
- * for (j = 0; j in seps; j++)
- * printf("seps[%s] = <%s>\n", j, seps[j])
- * }
- * }
- *
- * function mypatsplit(string, array, pattern, seps,
- * eosflag, non_empty, nf) # locals
+ * # Reference implementation of the FPAT record parsing.
+ * #
+ * # Each loop iteration identifies a (separator[n-1],field[n]) pair.
+ * # Each loop iteration must consume some characters, except for the first
field.
+ * # So a null field is only valid as a first field or after a non-null
separator.
+ * # A null record has no fields (not a single null field).
+ *
+ * function refpatsplit(string, fields, pattern, seps,
+ * parse_start, sep_start, field_start, field_length, field_found, nf)
# locals
* {
- * delete array
- * delete seps
- * if (length(string) == 0)
- * return 0
- *
- * eosflag = non_empty = false
- * nf = 0
- * while (match(string, pattern)) {
- * if (RLENGTH > 0) { # easy case
- * non_empty = true
- * if (! (nf in seps)) {
- * if (RSTART == 1) # match at front of
string
- * seps[nf] = ""
- * else
- * seps[nf] = substr(string, 1, RSTART - 1)
- * }
- * array[++nf] = substr(string, RSTART, RLENGTH)
- * string = substr(string, RSTART+RLENGTH)
- * if (length(string) == 0)
- * break
- * } else if (non_empty) {
- * # last match was non-empty, and at the
- * # current character we get a zero length match,
- * # which we don't want, so skip over it
- * non_empty = false
- * seps[nf] = substr(string, 1, 1)
- * string = substr(string, 2)
- * } else {
- * # 0 length match
- * if (! (nf in seps)) {
- * if (RSTART == 1)
- * seps[nf] = ""
- * else
- * seps[nf] = substr(string, 1, RSTART - 1)
- * }
- * array[++nf] = ""
- * if (! non_empty && ! eosflag) { # prev was empty
- * seps[nf] = substr(string, 1, 1)
- * }
- * if (RSTART == 1) {
- * string = substr(string, 2)
- * } else {
- * string = substr(string, RSTART + 1)
- * }
- * non_empty = false
- * }
- * if (length(string) == 0) {
- * if (eosflag)
- * break
- * else
- * eosflag = true
- * }
- * }
- * if (length(string) > 0)
- * seps[nf] = string
- *
- * return length(array)
+ * # Local state variables:
+ * # - parse_start: pointer to the first not yet consumed character
+ * # - sep_start: pointer to the beginning of the parsed separator
+ * # - field start: pointer to the beginning of the parsed field
+ * # - field length: length of the parsed field
+ * # - field_found: flag for succesful field match
+ * # - nf: Number of fields found so far
+ *
+ * # Prepare for parsing
+ * parse_start = 1 # first not yet parsed char
+ * nf = 0 # fields found so far
+ * delete fields
+ * delete seps
+ *
+ * # Loop that consumes the whole record
+ * while (parse_start <= length(string)) { # still something to parse
+ *
+ * # first attempt to match the next field
+ * sep_start = parse_start
+ * field_found = match(substr(string, parse_start), pattern)
+ *
+ * # check for an invalid null field and retry one character away
+ * if (nf > 0 && field_found && RSTART==1 && RLENGTH==0) {
+ * parse_start++
+ * field_found = match(substr(string, parse_start), pattern)
+ * }
+ *
+ * # store the (sep[n-1],field[n]) pair
+ * if (field_found) {
+ * field_start = parse_start + RSTART - 1
+ * field_length = RLENGTH
+ * seps[nf] = substr(string, sep_start, field_start-sep_start)
+ * fields[++nf] = substr(string, field_start, field_length)
+ * parse_start = field_start + field_length
+ *
+ * # store the final extra sep after the last field
+ * } else {
+ * seps[nf] = substr(string, sep_start)
+ * parse_start = length(string) + 1
+ * }
+ * }
+ *
+ * return nf
* }
*/
static long
@@ -1615,10 +1579,9 @@ fpat_parse_field(long up_to, /* parse only up to
this field number */
char *start;
char *end = scan + len;
int regex_flags = RE_NEED_START;
- bool need_to_set_sep;
- bool non_empty;
- bool eosflag;
mbstate_t mbs;
+ char* field_start;
+ bool field_found;
memset(&mbs, 0, sizeof(mbstate_t));
@@ -1631,90 +1594,48 @@ fpat_parse_field(long up_to, /* parse only up to
this field number */
if (rp == NULL) /* use FPAT */
rp = FPAT_regexp;
- if (in_middle) {
- regex_flags |= RE_NO_BOL;
- }
- non_empty = rp->non_empty;
+ while (scan <= end && nf < up_to) { /* still something to parse */
- eosflag = false;
- need_to_set_sep = true;
- start = scan;
- while (research(rp, scan, 0, (end - scan), regex_flags) != -1
- && nf < up_to) {
+ /* first attempt to match the next field */
+ start = scan;
+ field_found = research(rp, scan, 0, (end - scan), regex_flags)
!= -1;
+
+ /* check for an invalid null field and retry one character away
*/
+ if (nf > 0 && field_found && REEND(rp, scan) == 0) { /* invalid
null field */
+ increment_scan(& scan, end - scan);
+ field_found = research(rp, scan, 0, (end - scan),
regex_flags) != -1;
+ }
- if (REEND(rp, scan) > RESTART(rp, scan)) { /* if (RLENGTH > 0)
*/
- non_empty = true;
- if (sep_arr != NULL && need_to_set_sep) {
- if (RESTART(rp, scan) == 0) /* match at front */
- set_element(nf, start, 0L, sep_arr);
+ /* store the (sep[n-1],field[n]) pair */
+ if (field_found) {
+ field_start = scan + RESTART(rp, scan);
+ if (sep_arr != NULL) { /* store the separator */
+ if (field_start == start) /* match at front */
+ set_element(nf, start, 0L, sep_arr);
else
- set_element(nf,
+ set_element(nf,
start,
- (long) RESTART(rp, scan),
+ (long) (field_start - start),
sep_arr);
}
/* field is text that matched */
(*set)(++nf,
- scan + RESTART(rp, scan),
+ field_start,
(long)(REEND(rp, scan) - RESTART(rp, scan)),
n);
-
scan += REEND(rp, scan);
- if (scan >= end)
- break;
- need_to_set_sep = true;
- } else if (non_empty) { /* else if non_empty */
+
+ } else {
/*
- * last match was non-empty, and at the
- * current character we get a zero length match,
- * which we don't want, so skip over it
+ * No match, store the final extra separator after
+ * the last field.
*/
- non_empty = false;
- if (sep_arr != NULL) {
- need_to_set_sep = false;
- set_element(nf, start, 1L, sep_arr);
- }
- increment_scan(& scan, end - scan);
- } else {
- /* 0 length match */
- if (sep_arr != NULL && need_to_set_sep) {
- if (RESTART(rp, scan) == 0) /* RSTART == 1 */
- set_element(nf, start, 0L, sep_arr);
- else
- set_element(nf, start,
- (long) RESTART(rp,
scan),
- sep_arr);
- }
- need_to_set_sep = true;
- (*set)(++nf, scan, 0L, n);
- if (! non_empty && ! eosflag) { /* prev was empty */
- if (sep_arr != NULL) {
- set_element(nf, start, 1L, sep_arr);
- need_to_set_sep = false;
- }
- }
- if (RESTART(rp, scan) == 0)
- increment_scan(& scan, end - scan);
- else {
- scan += RESTART(rp, scan);
- }
- non_empty = false;
- }
- if (scan >= end) { /* length(string) == 0 */
- if (eosflag)
- break;
- else
- eosflag = true;
+ if (sep_arr != NULL)
+ set_element(nf, start, (long) (end - start),
sep_arr);
+ scan = end + 1;
}
-
- start = scan;
- }
- if (scan < end) {
- if (sep_arr != NULL)
- set_element(nf, scan, (long) (end - scan), sep_arr);
}
*buf = scan;
- rp->non_empty = non_empty;
return nf;
}
diff --git a/test/patsplit.ok b/test/patsplit.ok
index cda8319..02387d8 100644
--- a/test/patsplit.ok
+++ b/test/patsplit.ok
@@ -8,6 +8,7 @@ seps[0] = <>
seps[1] = <,>
seps[2] = <,>
seps[3] = <,>
+seps[4] = <>
Splitting: <Smith,,"1234 A Pretty Place, NE",Sometown,NY,12345-6789,USA>
n = 7
fields[1] = <Smith>
@@ -24,6 +25,7 @@ seps[3] = <,>
seps[4] = <,>
seps[5] = <,>
seps[6] = <,>
+seps[7] = <>
Splitting: <Robbins,Arnold,"1234 A Pretty Place,
NE",Sometown,NY,12345-6789,USA>
n = 7
fields[1] = <Robbins>
@@ -40,6 +42,7 @@ seps[3] = <,>
seps[4] = <,>
seps[5] = <,>
seps[6] = <,>
+seps[7] = <>
Splitting: <bbbaaacccdddaaaaaqqqq>
n = 2
fields[1] = <aaa>
- Re: [bug-gawk] FPAT bug?, (continued)
- Re: [bug-gawk] FPAT bug?, Andrew J. Schorr, 2017/04/01
- Re: [bug-gawk] FPAT bug?, Ed Morton, 2017/04/01
- Re: [bug-gawk] FPAT bug?, Andrew J. Schorr, 2017/04/02
- Re: [bug-gawk] FPAT bug?, david kerns, 2017/04/02
- Re: [bug-gawk] FPAT bug?, Andrew J. Schorr, 2017/04/02
- Re: [bug-gawk] FPAT bug?, Manuel Collado, 2017/04/02
- Re: [bug-gawk] FPAT bug?, Ed Morton, 2017/04/02
- Re: [bug-gawk] FPAT bug?, arnold, 2017/04/03
- Re: [bug-gawk] FPAT bug?, Ed Morton, 2017/04/03
- Re: [bug-gawk] FPAT bug?, Manuel Collado, 2017/04/05
- Re: [bug-gawk] FPAT bug?,
Arnold Robbins <=
Re: [bug-gawk] FPAT bug?, Ed Morton, 2017/04/01