
Re: csv with multiline records


From: Ole Tange
Subject: Re: csv with multiline records
Date: Fri, 16 Dec 2016 03:57:10 +0100

On Thu, Dec 15, 2016 at 3:35 AM, Ryan Brothers <ryan.brothers@gmail.com> wrote:
> On Wed, Dec 14, 2016 at 2:36 PM, Ole Tange <ole@tange.dk> wrote:
> > But if you can somehow replace the record separator, then you can use 
> > --recend.
> >
> > Given your input this might work:
> >
> >     parallel --pipe --recend '"\n"'
> >
> > assuming a good part of the records have a last column with newlines.
>
> Thank you for your help.  I can't assume the last column will always
> have newlines, but your suggestion with --recend gave me an idea to do
> something like:
>
> cat file.csv | php reformat.php | parallel --pipe --recend '@@@' \
>     --remove-rec-sep wc

Unless @@@ can appear in your real data, that should work just fine.
I often use \0 (NUL) because it is very hard for even a malicious
user to enter. \0 only fails if the data is binary.
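
Whether a text separator such as @@@ collides with the data can be
checked before relying on it. A minimal sketch, assuming a grep that
supports -F and -q; the function name separator_is_safe is made up for
this illustration and is not part of GNU parallel. It only works for
text separators, since a NUL byte cannot be passed as a shell argument.

```shell
#!/bin/sh
# separator_is_safe SEP FILE
# Hypothetical helper: succeeds only if the literal string SEP does not
# occur anywhere in FILE. grep exits 0 on a match, 1 on no match, and
# 2 on an error (e.g. unreadable file), so only status 1 counts as safe.
separator_is_safe() {
    status=0
    grep -F -q -e "$1" -- "$2" || status=$?
    [ "$status" -eq 1 ]
}

# Usage (file.csv as in the thread):
#   separator_is_safe '@@@' file.csv || echo 'pick another separator' >&2
```

Note that this checks the original file; the separator must also not be
produced by anything the reformatting step adds.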

> reformat.php is a PHP script that reads the csv and writes it out to
> stdout with @@@ in-between each record.
>
> That seems to work great except I don't believe I can use --pipepart
> with this method because the csv with @@@ is generated on the fly.

True. So it is:

cat file.csv | php reformat.php | parallel --pipe --recend '@@@' \
    --remove-rec-sep wc

Or:

cat file.csv | php reformat.php >tmpfile
parallel --pipepart --recend '@@@' --remove-rec-sep wc :::: tmpfile

My bet is that the first is faster, as you avoid saving to disk.

> I would have to save the reformatted csv file to disk.  Do you have any
> thoughts to get around that?  If not, generating a new csv file in
> this format would also be ok for my use case.

How does reformat.php determine where the @@@ should be placed?
Can you use a combination of --recend/--recstart to do that? With
--regexp?

If your input lines all start the same way like:

> row1,"1
> 2
> 3"
> row2,4

Then this might work (allowing up to row999999):

--recend '\n' --recstart 'row\d{1,6},' --regexp

It *will* of course break if the quoted string contains
"\nrow123,this is not a new row".
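
For comparison, a reformatting step that tracks quotes does not have
this problem. The sketch below is a hypothetical stand-in for
reformat.php (whose actual logic is not shown in this thread), written
with awk. It assumes standard CSV quoting, where a literal quote inside
a quoted field is written as "", so a record is complete whenever the
running number of double quotes is even. The name mark_records is made
up for this example.

```shell
#!/bin/sh
# mark_records [SEP]
# Reads CSV on stdin and writes it to stdout with SEP (default @@@)
# appended after every complete record.
mark_records() {
    awk -v sep="${1:-@@@}" '
        {
            line = $0
            # gsub returns the number of replacements, which here is
            # the number of double quotes on this physical line.
            quotes += gsub(/"/, "", line)
            printf "%s\n", $0
            # Even quote count: we are not inside a quoted field,
            # so the record ends here.
            if (quotes % 2 == 0) {
                printf "%s", sep
                quotes = 0
            }
        }'
}

# Usage, with the parallel options from earlier in this mail:
#   mark_records '@@@' < file.csv |
#       parallel --pipe --recend '@@@' --remove-rec-sep wc
```

With the four-line sample above this puts one @@@ after the line 3" and
one after row2,4, and a quoted "\nrow123," is left alone because the
quote count is odd at that point.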


/Ole


