Re: Optimizing -j parameter?
Thu, 11 Jan 2018 02:18:57 +0100
On Tue, Jan 9, 2018 at 11:57 PM, PD <address@hidden> wrote:
> I'm trying to download a large number of files from AWS S3,... but AWS does
> have throttling rules that
> will start to reject requests if you send them too fast.
If you know what "too fast" means, then `-j0 --delay X` sounds like
the way to go.
> If I change the contents of loadfile.txt to '20', the numbers in the ETA
> display change, but it's not clear whether the final ETA is based on all
> of the jobs that have gone before, or just the ones since loadfile.txt was
> changed, or something else.
The ETA is based on an exponential average. It takes all runtimes
into account, but puts more emphasis on the runtime of the most
recent job than on the jobs that finished before it. On top of that
there is a bit of smoothing so the average does not jump all the
time. It then looks at how many jobs are still unfinished, and simply
multiplies the smoothed exponential average by the number of
unfinished jobs.
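As a rough illustration of the exponential-average idea described above, here is a minimal Python sketch. It is not GNU Parallel's actual code: the function name `eta_seconds` and the weighting factor `alpha` are assumptions made for this example, and the extra smoothing step is left out.

```python
def eta_seconds(runtimes, jobs_remaining, alpha=0.3):
    """Estimate remaining time from an exponential average of job runtimes.

    alpha is an assumed weighting factor, not a value taken from GNU Parallel.
    """
    if not runtimes:
        return None  # nothing has finished yet, so there is nothing to average
    avg = runtimes[0]
    for rt in runtimes[1:]:
        # The newest runtime gets weight alpha; older history decays geometrically.
        avg = alpha * rt + (1 - alpha) * avg
    # Multiply the average runtime by the number of unfinished jobs.
    return avg * jobs_remaining


if __name__ == "__main__":
    print(eta_seconds([10.0, 10.0, 20.0], jobs_remaining=5))
```

A larger `alpha` makes the estimate react faster to the latest job, at the cost of a jumpier ETA.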
> In general, how do you find the optimal number of jobs to run in parallel?
In general that is a question that is impossible to answer: You may in
theory mix jobs that are heavy CPU with jobs that are heavy I/O with
jobs that are neither.
But let us assume that all jobs are the same type:
* CPU hungry (e.g. bzip2)
Here --load 100% should find the optimal number of jobs - even if the
system is used by others.
* RAM hungry (e.g. convert on hi-res images)
Here --memfree can help. If you know that max usage of a single job is
1 GB, just make sure to have at least 1 GB free before spawning
another job. If the job does not use the full 1 GB immediately, it can
be combined with --delay.
* Network hungry (e.g. wget)
Here you can often just use -j0 - unless the server you are getting
data from is very small.
* Disk I/O hungry (e.g. cat)
This is a tricky one, as disks are very different in nature: Single
magnetic disk, single SSD, RAID of magnetic disks, RAID of SSD, tiered
disks (SSD caching with magnetic backing).
I wrote a post about that
* None of the above (e.g. sleep)
Typically -j0 will be fine here.
> Is there a way to graph the number of processes and the job rate over
> time? (I'm a visual kind of guy.)
The typical situation is running a fixed number of processes over
time, which makes for a boring graph. But you should be able to do a
graph based on --joblog: Here you find both start time and run time.
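One possible way to turn a --joblog file into data for such a graph is sketched below in Python. It assumes the usual tab-separated joblog layout with a header line containing the Starttime and JobRuntime columns; the function name `concurrency_timeline` is made up for this example.

```python
import csv
import io


def concurrency_timeline(joblog_text):
    """Return (time, running_jobs) points from the text of a --joblog file."""
    events = []
    reader = csv.DictReader(
        io.StringIO(joblog_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        start = float(row["Starttime"])
        end = start + float(row["JobRuntime"])
        events.append((start, 1))   # a job starts
        events.append((end, -1))    # a job ends
    # Sort by time; at equal times, ends (-1) sort before starts (+1).
    events.sort()
    running = 0
    timeline = []
    for t, delta in events:
        running += delta
        timeline.append((t, running))
    return timeline


if __name__ == "__main__":
    sample = (
        "Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
        "1\t:\t100.0\t5.0\t0\t0\t0\t0\tsleep 5\n"
        "2\t:\t101.0\t2.0\t0\t0\t0\t0\tsleep 2\n"
    )
    for point in concurrency_timeline(sample):
        print(point)
```

The resulting points can be fed to any plotting tool as a step graph of running jobs over time.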
> Does Parallel have an automatic optimizer? After all, it's got every
> other feature under the sun, why not this? :-)
How should that work?
One of the more obvious problems is that if GNU Parallel spawns too
many jobs, you may descend into swap-hell. How can we avoid that?
But you can try:
  parallel --memfree 1G --load 100% -j0
which should be reasonable for locally run jobs.
For your S3 jobs you have the constraint that you not only want the
jobs to run in the optimal time, you also want them to succeed: a job
that finishes with an error is not good enough for you. This
constraint makes the algorithm even harder.
Maybe we can have '-j =optimizeforsuccess' do:
  if success: slots = slots * 1.1 else slots = slots / 1.1;
This would require Amazon not to hold a grudge: If you spawn too fast,
they must not keep rejecting for the next X minutes, but only until
you spawn at a lower rate.
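To make the idea concrete, here is a small Python simulation of the rule above. This is purely hypothetical: '-j =optimizeforsuccess' is only a suggestion, not an existing option, and the names `next_slots` and `simulate`, the `min_slots` floor, and the "server" that rejects whenever slots exceed `limit` are all assumptions for the sketch.

```python
def next_slots(slots, success, factor=1.1, min_slots=1.0):
    """Grow slots by `factor` after a success, shrink by it after a failure."""
    if success:
        return slots * factor
    # Assumed floor so that the slot count never drops below one job.
    return max(min_slots, slots / factor)


def simulate(limit, steps, slots=1.0):
    """Pretend the server rejects a job whenever slots exceed `limit`."""
    history = []
    for _ in range(steps):
        success = slots <= limit   # hypothetical server with no grudge
        slots = next_slots(slots, success)
        history.append(round(slots, 2))
    return history


if __name__ == "__main__":
    print(simulate(limit=20, steps=40))
```

In this toy model the slot count climbs to the limit and then oscillates around it, which only works if rejections stop as soon as the rate drops.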
But let us discuss ideas on how such an optimization algorithm might