Re: GNU Parallel seems to drop


From: Ole Tange
Subject: Re: GNU Parallel seems to drop
Date: Tue, 25 Sep 2012 15:11:05 +0200

On Tue, Sep 25, 2012 at 1:50 PM, Dirk Eddelbuettel <edd@debian.org> wrote:

> Well a little "apt-get install gawk-doc" and two seconds of searching lead to
> the '>>' operator to append to files ... and tada, it now works.

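Depending on how awk appends, that may not work. Do you know for sure
that it flushes after every record? If it only flushes when its output
buffer fills, jobs running in parallel can write partial lines into the
same file and you get half-records.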
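
For illustration, here is a minimal sketch of an awk script that appends and flushes after every record. The script.awk used in this thread is not shown, so flush.awk, the temporary directory and the sample data are all made up for the demonstration. Whether appends from several processes can still interleave depends on the OS and the filesystem, so treat this as a way to reduce the risk, not as a guarantee.

```shell
#!/bin/sh
# Sketch only: hypothetical stand-in for script.awk (not shown in the thread).
workdir=$(mktemp -d)
mkdir "$workdir/out"

cat > "$workdir/flush.awk" <<'EOF'
{
    out = path "/" $1 ".txt"   # one output file per value of the first field
    print $0 >> out            # append instead of truncating
    fflush(out)                # hand this record to the OS before reading the next
}
EOF

# Tiny sample input: key in field 1, payload in field 2.
printf 'A 1\nB 2\nA 3\nC 4\nB 5\n' > "$workdir/data.txt"

awk -v path="$workdir/out" -f "$workdir/flush.awk" < "$workdir/data.txt"

# Show how many records landed in each file.
wc -l "$workdir"/out/*
```

With the sample input this leaves two lines in A.txt, two in B.txt and one in C.txt. The same flush.awk could be passed to awk under parallel --pipe in place of script.awk.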

> edd@max:/tmp/parallel$ rm dataSerial/* dataParallel/*
> edd@max:/tmp/parallel$
> edd@max:/tmp/parallel$ cat data.txt | \
>          awk -v path=dataSerial '{print $0 > (path "/" $1 ".txt")}'
> edd@max:/tmp/parallel$ cat data.txt | \
>          parallel --pipe -- awk -v path=dataParallel -f script.awk
> edd@max:/tmp/parallel$ wc -l dataSerial/*
>   199762 dataSerial/A.txt
>   200031 dataSerial/B.txt
>   200283 dataSerial/C.txt
>   199845 dataSerial/D.txt
>   200079 dataSerial/E.txt
>  1000000 total
> edd@max:/tmp/parallel$ wc -l dataParallel/*
>   199762 dataParallel/A.txt
>   200031 dataParallel/B.txt
>   200283 dataParallel/C.txt
>   199845 dataParallel/D.txt
>   200079 dataParallel/E.txt
>  1000000 total

If the two commands below give the same output, then you are golden.
If not, you may have half-records in the parallel data.

parallel -k --tag 'sort {} | md5sum' ::: dataSerial/*
parallel -k --tag 'sort {} | md5sum' ::: dataParallel/*
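
The comparison can also be scripted so you do not have to eyeball the checksums. The sketch below does this without parallel, on made-up data: the directory names, file contents and the sorted_sum helper are assumptions for the demonstration, not part of the original session. It checksums the sorted contents of every file in two directories and diffs the results.

```shell
#!/bin/sh
# Sketch: compare two output directories file by file, ignoring line order.
# All names and data below are hypothetical.
demo=$(mktemp -d)
mkdir "$demo/serial" "$demo/par"
printf 'A 1\nA 2\nA 3\n' > "$demo/serial/A.txt"
printf 'A 3\nA 1\nA 2\n' > "$demo/par/A.txt"   # same records, different order
printf 'B 1\nB 2\n'      > "$demo/serial/B.txt"
printf 'B 1B\n 2\n'      > "$demo/par/B.txt"   # same bytes, but a torn record

sorted_sum() {
    # Print "<file name> <md5 of sorted contents>" for every file in directory $1.
    for f in "$1"/*; do
        printf '%s\t%s\n' "$(basename "$f")" "$(sort "$f" | md5sum)"
    done
}

sorted_sum "$demo/serial" > "$demo/serial.sum"
sorted_sum "$demo/par"    > "$demo/par.sum"

# diff exits 0 only if every file has the same sorted checksum in both directories.
if diff "$demo/serial.sum" "$demo/par.sum" > /dev/null; then
    result=same
else
    result=differ
fi
echo "$result"
```

Here A.txt matches because only the line order differs, while B.txt is reported as different because one record was split across two lines.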


/Ole


