Re: Support for CSV file format on sort
From: Grigoriy Sokolik
Subject: Re: Support for CSV file format on sort
Date: Sun, 31 Jan 2021 01:11:42 +0200
> If you implement csv in sort you’ll have to implement it in head, tail, uniq,
> join, wc, etc. etc. etc...
Could the format processing logic be extracted? Also, maybe that's a place
for some kind of abstraction, like a format processor, an unquoted format
processor, etc.?
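
To make the idea a bit more concrete, here is a rough sketch of such an
extraction, written in Python rather than C for brevity. It is only an
illustration under my own assumptions: the names with_csv, sort_rows and
head are hypothetical and nothing like this exists in coreutils. The
point is that the quoting rules live in one place and each tool only
sees rows of fields.

```python
import csv
import io
from typing import Callable, Iterable, List

Row = List[str]


def with_csv(transform: Callable[[List[Row]], Iterable[Row]],
             text: str, delimiter: str = ",") -> str:
    """Parse CSV text, apply a row-level transform, serialize the result.

    Hypothetical "format processor": all quoting logic is handled here,
    so the transform never has to know about quotes or delimiters.
    """
    # newline="" keeps embedded line breaks inside quoted fields intact.
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(transform(rows))
    return out.getvalue()


def sort_rows(rows: List[Row]) -> List[Row]:
    """Stand-in for sort: order rows field by field."""
    return sorted(rows)


def head(n: int) -> Callable[[List[Row]], List[Row]]:
    """Stand-in for head: keep the first n rows."""
    return lambda rows: rows[:n]


if __name__ == "__main__":
    data = 'b,"x,y"\na,"line1\nline2"\n'
    print(with_csv(sort_rows, data), end="")
    print(with_csv(head(1), data), end="")
```

With something like this, sort, head, uniq and so on would each supply
only the row-level part, and the format handling would be shared.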
On Sun 31. Jan 2021 at 0.59, Erik Auerswald <auerswal@unix-ag.uni-kl.de>
wrote:
> Hi,
>
> On 30.01.21 21:28, Eric Fischer wrote:
> > A couple of years ago I went down this route of thinking I would add CSV
> > support to sort, and then let myself get distracted into trying to follow
> >
> https://paulfitz.github.io/2017/01/24/the-year-of-poop-on-the-desktop.html
>
> Well, but not everyone is using PSV format, many are using some
> kind of CSV format. I sometimes use CSV (or SSV, semicolon
> separated values ;) as a simple compatibility format when working
> with people not using the GNU operating system.
>
> Even with ASCII there are seldom used characters that look helpful
> for character separated value files, e.g., "Unit Separator" (0x1f),
> to practically get rid of the need for quoted fields.
>
> But since not everybody uses those characters already, a tool that
> bridges the worlds of RFC 4180 CSV(*) and GNU Coreutils might be
> handy.
>
> Seldom used ASCII (i.e., single byte) characters could be used as
> field separator to enable working with GNU tools, even if this is
> just used in a pipeline, but never seen by the user:
>
> csvconv -f, -t$'\x1f' data.csv | sort -t$'\x1f' | csvconv -f$'\x1f' -t,
>
> (This uses an imaginary CSV tool "csvconv" to convert from (-f) one
> separator to (-t) another while observing CSV quoting rules.)
>
> Disclaimer: I did not check if sort works correctly with "-t$'\x1f'".
>
> To allow newlines inside a field one could terminate each row of CSV
> data with NUL, and use "sort -z". Thus the imaginary csvconv could
> use "--input-zero-terminated" and "--output-zero-terminated" options
> as well.
>
> The imaginary "csvconv"'s adherence to (generalized) CSV quoting
> rules would be the primary difference to "tr", "sed", or "awk".
>
> Thanks,
> Erik
>
> (*) RFC 4180 requires CRLF instead of LF as end-of-line sequence, but
> many implementations just use the native end-of-line sequence.
>
> --
Thanks!
Best regards,
Grigorii
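
P.S. The imaginary csvconv from the quoted message could be sketched
roughly as below, again in Python. This is only a sketch under stated
assumptions: the function names csv_to_separated and separated_to_csv
are hypothetical, the Unit Separator (0x1f) is the field separator, NUL
terminates each row (as for "sort -z"), and a field that already
contains either character is rejected instead of being escaped. The
sorting step uses Python's sorted() as a stand-in for the real sort; as
in the quoted disclaimer, I did not check sort itself with these
options.

```python
import csv
import io


def csv_to_separated(text: str, sep: str = "\x1f", row_end: str = "\0") -> str:
    """Convert quoted CSV into unquoted, separator-delimited records."""
    out = []
    # newline="" lets the csv module see line breaks inside quoted fields.
    for row in csv.reader(io.StringIO(text, newline="")):
        for field in row:
            if sep in field or row_end in field:
                raise ValueError("field contains a reserved separator")
        out.append(sep.join(row) + row_end)
    return "".join(out)


def separated_to_csv(data: str, sep: str = "\x1f", row_end: str = "\0") -> str:
    """Convert separator-delimited records back into quoted CSV."""
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for record in data.split(row_end):
        # Empty pieces (such as the one after the final terminator)
        # are skipped, so blank rows are not preserved by this sketch.
        if record:
            writer.writerow(record.split(sep))
    return out.getvalue()


if __name__ == "__main__":
    text = 'id,note\n2,"b,c"\n1,"multi\nline"\n'
    encoded = csv_to_separated(text)
    # Stand-in for the sort step in the pipeline: order the
    # NUL-terminated records, then restore the terminators.
    records = sorted(r for r in encoded.split("\0") if r)
    print(separated_to_csv("".join(r + "\0" for r in records)), end="")
```

Commas and newlines inside fields survive the round trip because they
are ordinary characters between the two conversions; only the writer on
the way back decides what needs quoting.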