bug-gawk

Re: FPAT documentation. The CSV example.


From: arnold
Subject: Re: FPAT documentation. The CSV example.
Date: Sun, 12 Apr 2020 11:44:04 -0600
User-agent: Heirloom mailx 12.5 7/5/10

Thanks. I have added your text, almost verbatim, into the
manual as a new section, including the test data and program.
I have credited you, of course.  Do a 'git pull' to see it.

Looking forward to your CSV library.

Arnold

Manuel Collado <address@hidden> wrote:

> The FPAT and FIELDWIDTHS documentation in the gawk-5 manual
> has been greatly enhanced w.r.t. gawk-4. But a small
> inaccuracy still remains in the example about CSV processing.
> It says:
>
> [... each field is either "anything that is not a comma," or
> "a double quote, anything that is not a double quote, and a
> closing double quote." ...]
>
> And the first proposed FPAT is /([^,]+)|("[^"]+")/, later
> amended as /([^,]*)|("[^"]+")/ to accept empty fields.
>
> But in addition to commas, a CSV field can also contain
> quotes, which have to be escaped by doubling them. The
> proposed regexps fail to accept quoted fields with both
> commas and quotes inside. Perhaps the simplest FPAT
> expression that recognizes this kind of field is
> /([^,]*)|("([^"]|"")+")/. The following code tests these
> variants.
>
> $ cat sample.csv
> p,"q,r",s
> p,"q""r",s
> p,"q,""r",s
> p,"",s
> p,,s
>
> $ cat fpat.awk
> BEGIN {
>      fp[0] = "([^,]+)|(\"[^\"]+\")"
>      fp[1] = "([^,]*)|(\"[^\"]+\")"
>      fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
>      FPAT =  fp[fpat+0]
> }
>
> {
>      print "<" $0 ">"
>      printf("NF = %s ", NF)
>      for (i = 1; i <= NF; i++) {
>          printf("<%s>", $i)
>      }
>      print ""
> }
>
> $ gawk -f fpat.awk sample.csv
> <p,"q,r",s>
> NF = 3 <p><"q,r"><s>
> <p,"q""r",s>
> NF = 3 <p><"q""r"><s>
> <p,"q,""r",s>
> NF = 4 <p><"q,"><"r"><s>
> <p,"",s>
> NF = 3 <p><""><s>
> <p,,s>
> NF = 2 <p><s>
>
> $ gawk -v fpat=1 -f fpat.awk sample.csv
> <p,"q,r",s>
> NF = 3 <p><"q,r"><s>
> <p,"q""r",s>
> NF = 3 <p><"q""r"><s>
> <p,"q,""r",s>
> NF = 4 <p><"q,"><"r"><s>
> <p,"",s>
> NF = 3 <p><""><s>
> <p,,s>
> NF = 3 <p><><s>
>
> $ gawk -v fpat=2 -f fpat.awk sample.csv
> <p,"q,r",s>
> NF = 3 <p><"q,r"><s>
> <p,"q""r",s>
> NF = 3 <p><"q""r"><s>
> <p,"q,""r",s>
> NF = 3 <p><"q,""r"><s>
> <p,"",s>
> NF = 3 <p><""><s>
> <p,,s>
> NF = 3 <p><><s>
>
> Besides that, it is often said that awk is not the right
> tool to process CSV data. This is not true for recent gawk
> versions. The FPAT and BEGINFILE/ENDFILE features provide
> enough power to process CSV data in an effective way. I'm
> polishing a gawk source library that mimics the gawkextlib
> csv extension. Hopefully, it can be made publicly available
> in the near future.
>
> Regards.
> -- 
> Manuel Collado - http://mcollado.z15.es
