
Re: parallel issue


From: Ole Tange
Subject: Re: parallel issue
Date: Fri, 11 Mar 2011 12:05:47 +0100

Your code looks fine.

The reason why you are not seeing 100% utilization on all 4 cores may
be that your disk cannot deliver data fast enough.

On most disks it is faster to read file1 sequentially and then file2
sequentially instead of reading both file1 and file2 in parallel (as
the latter will cause a lot of disk seeks).

To see if your disks are the limiting factor try:

A: time parallel -u cat ::: files* >/dev/null
B: time parallel -j1 -u cat ::: files* >/dev/null

Remember to flush the disk cache between runs as the disk cache may
make a huge difference.

If B runs faster than A your disk is the limiting factor. If A and B
run at the same speed your disks are not the limiting factor.
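
The comparison above could be scripted roughly as follows. This is only a sketch: the names flush_cache, seconds, compare and the RUN_BENCH switch are made up for illustration, the cache flush assumes Linux with sudo available, and whole-second timing is coarse, so treat small differences as noise.

```shell
#!/bin/sh
# Sketch: time run A (parallel reads) and run B (one file at a time),
# dropping the Linux page cache before each run.

# Flush the page cache. Linux-only and needs root (assumption: sudo works).
flush_cache() {
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
}

# Print the wall-clock seconds taken by the command given as arguments.
seconds() {
    start=$(date +%s)
    "$@" >/dev/null
    end=$(date +%s)
    echo $((end - start))
}

# compare A_SECONDS B_SECONDS: apply the rule from the text.
compare() {
    if [ "$2" -lt "$1" ]; then
        echo "B faster than A: disk is likely the limiting factor"
    else
        echo "B not faster than A: disk is probably not the limiting factor"
    fi
}

# Only measure when asked to, since this needs root and GNU parallel.
if [ "${RUN_BENCH:-0}" = 1 ]; then
    flush_cache; a=$(seconds parallel -u cat ::: files*)
    flush_cache; b=$(seconds parallel -j1 -u cat ::: files*)
    compare "$a" "$b"
fi
```

Run it as RUN_BENCH=1 sh script.sh from the directory holding the files.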

Your work can be done by parallelizing on the file level (which is
what you have done), but it can also be parallelized on the record
level (your record is a line).

Parallelizing on record level is done using --pipe:

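$ cat ~/weblog/nowidget/deals_apache_log.201102*clean |
parallel -k --pipe 'grep "&subscriber_id="' > log.2011Feb.sub_only

This will chop the input into 1 MB chunks (split at line boundaries),
spawn one grep per chunk, and pass that chunk to the grep on stdin.
The redirection is outside the quotes so that parallel collects the
output of all the greps (-k keeps it in input order); inside the
quotes every grep would truncate the same file.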

If grep is slow to start you may want to use a larger block size: --block-size 10M

If grep is fast for some blocks and slow for other blocks it may be a
good idea to start more processes than you have cores: -j300%

$ cat ~/weblog/nowidget/deals_apache_log.201102*clean |
parallel -j300% --block-size 10M -k --pipe 'grep "&subscriber_id="' \
> log.2011Feb.sub_only

This might be quicker.
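
If you want to convince yourself that splitting on record level does not lose or duplicate lines, a small self-check along these lines may help. It is a sketch on made-up data (the numbers 1 to 200000, one per line), not your weblogs; the 100K block size is chosen only so that the small input is split into several chunks, and the variable names are arbitrary.

```shell
#!/bin/sh
# Sketch: check that record-level splitting gives the same matches as one grep.
input=$(mktemp)
seq 1 200000 > "$input"

# Reference result from a single serial grep: lines ending in 7.
expected=$(grep -c '7$' "$input")

actual=""
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # Each chunk goes to its own grep; the output is collected outside
    # the quoted command, in input order because of -k.
    actual=$(parallel -k --pipe --block-size 100K 'grep "7\$"' < "$input" |
             wc -l | tr -d ' ')
    echo "serial: $expected  parallel --pipe: $actual"
else
    echo "GNU parallel not found; serial count is $expected"
fi
rm -f "$input"
```

Both counts should be identical; if they are, the same pattern applies to the grep on the log files.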


/Ole

On Fri, Mar 11, 2011 at 1:14 AM, Li Hong <cefs99@gmail.com> wrote:
> Not sure if I am using parallel the right way but I am not seeing all the
> four core are utilized (2 dual-core CPU):
>
> $ ls ~/weblog/nowidget/deals_apache_log.201102*clean |time  parallel --eta
> 'grep "&subscriber_id=" {} > log.2011Feb.sub_only'
>
> Computers / CPU cores / Max jobs to run
> 1:local / 4 / 4
>
> Computer:jobs running/jobs completed/%of started jobs/Average seconds to
> complete
>
>
> ETA: 2096s 21left 62.50avg  local:4/2/100%/62.5s s
>
> ETA: 1106s 20left 47.00avg  local:4/3/100%/47.0s
>
>
> ----
> Tasks: 150 total,   1 running, 140 sleeping,   9 stopped,   0 zombie
> Cpu0  :  2.6%us,  6.6%sy,  0.0%ni,  0.0%id, 88.4%wa,  0.3%hi,  2.0%si,
> 0.0%st
> Cpu1  :  0.0%us,  0.3%sy,  0.0%ni, 82.3%id, 17.3%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Mem:  12307784k total, 12260708k used,    47076k free,    13596k buffers
> Swap:   499992k total,    11968k used,   488024k free, 10378244k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> COMMAND
> 27461 li        18   0 60228  720  592 D    3  0.0   0:01.07
> grep
> 27412 li        18   0 60224  716  592 D    2  0.0   0:01.93
> grep
> 27458 li        18   0 60224  716  592 D    2  0.0   0:01.18
> grep
> 27456 li        18   0 60228  720  592 D    2  0.0   0:01.21
> grep
>   370 root      10  -5     0    0    0 S    1  0.0 600:10.33
> kswapd0
>   371 root      10  -5     0    0    0 S    0  0.0 441:16.01
> kswapd1
>   613 root      10  -5     0    0    0 D    0  0.0  54:42.09
> kjournald


