Re: What am I doing wrong?
From: Rasmus Villemoes
Subject: Re: What am I doing wrong?
Date: Tue, 12 May 2015 11:27:53 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)
On Mon, May 11 2015, Arun Seetharam <arnstrm@gmail.com> wrote:
> Hi all,
>
> I am trying to use parallel for regular Linux commands, as I have to deal
> with huge files on a daily basis. But the few times I have tried it, I don't
> see any improvement. Is there a threshold file size above which parallel
> becomes beneficial? Or am I doing it wrong?
> Eg.,
>
> $ time head -n 1000000 huge.vcf | parallel --pipe "awk '{print $123}'" |
> wc -l
> 1000000
>
> Wall Time 0m29.326s
> User Mode 0m22.489s
> Kernel Mode 17m55.061s
> CPU Usage 3745.90%
>
> $ time head -n 1000000 huge.vcf | awk '{print $123}' | wc -l
> 1000000
>
> Wall Time 0m10.329s
> User Mode 0m12.447s
> Kernel Mode 0m4.540s
> CPU Usage 164.46%
Two things spring to mind. First, when comparing two runs like this, always
ensure that (the relevant part of) the file is in the page cache before
both runs; otherwise what you see in the first test may be dominated by
reading the file from disk, while the second run then benefits greatly
from reading the file from RAM. Could you try repeating the above, but
start by doing 'time head -n 1000000 huge.vcf > /dev/null' before each?
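To make the comparison reproducible, here is a minimal sketch using a small
synthetic file ('sample.vcf' is just a stand-in for your huge.vcf): generate
lines with 130 whitespace-separated fields, do a throwaway read to warm the
page cache, and only then time the pipeline.

```shell
# Generate 10000 lines of 130 fields each; 'sample.vcf' is a synthetic
# stand-in for huge.vcf (awk with only a BEGIN rule reads no input).
awk 'BEGIN { for (n = 1; n <= 10000; n++) {
        line = ""
        for (i = 1; i <= 130; i++) line = line i " "
        print line
} }' > sample.vcf

# Warm-up read: after this, both timed runs read the file from RAM,
# so neither is penalized by disk I/O.
cat sample.vcf > /dev/null

time head -n 10000 sample.vcf | awk '{print $123}' | wc -l
```

Run the same warm-up before the parallel variant as well, so the two
timings measure processing cost rather than cache state.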
Second, how long are the lines in huge.vcf? If the lines are _extremely_
long (say, 50k characters), each awk instance ends up being passed only a
few lines per chunk, which means that almost all the time is spent on
overhead (spawning and reaping subprocesses and managing their output). See
the --block option if this is an issue.
However, I'm not sure either of these could explain the huge CPU usage
in the first case.
Rasmus