
Re: What am I doing wrong?


From: Rasmus Villemoes
Subject: Re: What am I doing wrong?
Date: Tue, 12 May 2015 11:27:53 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)

On Mon, May 11 2015, Arun Seetharam <arnstrm@gmail.com> wrote:

> Hi all,
>
> I am trying to use parallel for regular Linux commands, as I have to deal
> with huge files on a daily basis. But the few times I have tried it, I
> don't see any improvement. Is there a file-size threshold above which
> parallel becomes beneficial? Or am I doing it wrong?
> E.g.,
>
> $ time head -n 1000000 huge.vcf | parallel --pipe "awk '{print \$123}'" |
> wc -l
> 1000000
>
> Wall Time       0m29.326s
> User Mode       0m22.489s
> Kernel Mode     17m55.061s
> CPU Usage       3745.90%
>
> $ time head -n 1000000 huge.vcf | awk '{print $123}' | wc -l
> 1000000
>
> Wall Time       0m10.329s
> User Mode       0m12.447s
> Kernel Mode     0m4.540s
> CPU Usage       164.46%

Two things spring to mind: First, when comparing two runs like this, always
ensure that (the relevant part of) the file is in the page cache before
both runs; otherwise what you see in the first test may be entirely due
to reading the file from disk, and the second run then benefits greatly
from reading the file from RAM. Could you try repeating the above, but
start by doing 'time head -n 1000000 huge.vcf > /dev/null' before each?
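
Concretely, the comparison could look something like this (just a sketch,
reusing your filename and field number):

  # Warm the page cache so the timed run reads from RAM, not disk
  head -n 1000000 huge.vcf > /dev/null
  time head -n 1000000 huge.vcf | awk '{print $123}' | wc -l

  # Warm it again before timing the parallel variant
  head -n 1000000 huge.vcf > /dev/null
  time head -n 1000000 huge.vcf | \
    parallel --pipe "awk '{print \$123}'" | wc -l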

Second, how long are the lines in huge.vcf? If the lines are _extremely_
long (say, 50k characters), each awk instance ends up being passed only a
few lines: with the default --block size of 1M, a chunk holds only about
20 such lines, so almost all the time is spent on overhead (spawning and
reaping subprocesses and managing their output). See the --block option
if this is an issue.
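
For example (again just a sketch; 10M is an arbitrary illustrative value),
this hands each awk roughly 10 MB of input per chunk:

  head -n 1000000 huge.vcf | \
    parallel --pipe --block 10M "awk '{print \$123}'" | wc -l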

However, I'm not sure either of these could explain the huge CPU usage
in the first case.

Rasmus


