From: Dr. David Alan Gilbert
Subject: Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?
Date: Sun, 9 Nov 2008 23:06:36 +0000
User-agent: Mutt/1.5.13 (2006-08-11)

* Andrew McGill (address@hidden) wrote:
> Greetings coreutils folks,
> There are a number of interesting filesystems (glusterfs, lustre? ... NFS) 
> which could benefit from userspace utilities doing certain operations in 
> parallel.  (I have a very slow glusterfs installation that makes me think 
> that some things can be done better.)
> For example, copying a number of files is currently done in series ...
>       cp a b c d e f g h dest/
> but, on certain filesystems, it would be roughly twice as efficient if 
> implemented in two parallel threads, something like:
>       cp a c e g dest/ &
>       cp b d f h dest/
> since the source and destination files can be stored on multiple physical 
> volumes.  

Of course you can't do that by hand, since each argument might be a
directory with an unbalanced number of files etc. - so you are right,
something smarter is needed (my pet hate is 'tar' or 'cp' working its way
through a source tree of thousands of small files).
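
A minimal sketch of one "smarter" approach, assuming a shared work queue
rather than a fixed split of the arguments: every file in the tree becomes
one task, and whichever worker is idle takes the next one, so an unbalanced
directory does not leave a thread with nothing to do. The function names
and the default of 4 workers are made up for this illustration; this is
not how cp or tar behave.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def list_files(src_root):
    """Yield every regular file under src_root as a path relative to it."""
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Symlinks and special files are skipped to keep the sketch short.
            if os.path.isfile(full) and not os.path.islink(full):
                yield os.path.relpath(full, src_root)


def copy_one(src_root, dst_root, rel):
    """Copy a single file, creating its parent directory if needed."""
    dst = os.path.join(dst_root, rel)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(os.path.join(src_root, rel), dst)
    return rel


def parallel_copy_tree(src_root, dst_root, workers=4):
    """Copy all regular files through a shared pool of worker threads.

    Returns the sorted list of relative paths copied. Empty directories
    are not recreated.
    """
    os.makedirs(dst_root, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_one, src_root, dst_root, rel)
                   for rel in list_files(src_root)]
        # result() re-raises any exception from the worker thread.
        return sorted(f.result() for f in futures)
```

The worker count is still a guess the caller has to make; the queue only
fixes the balancing problem, not the question of how wide to go.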

> Similarly, ls -l . will readdir(), and then stat() each file in the directory. 
> On a filesystem with high latency, it would be faster to issue the stat() 
> calls asynchronously, and in parallel, and then collect the results for 
> display.  (This could improve performance for NFS, in proportion to the 
> latency and the number of threads.)

I think, as you are suggesting, you end up having to do the threading
in the userland code, which to me seems mad, since the code doesn't
really know how wide to go and the threads are a fair overhead.  In
addition, this behaviour can be really bad if you get it wrong - for example,
if 'dest' is a single disc then having multiple writers writing
large files at the same time leads to fragmentation on many filesystems.
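
For concreteness, the userland threading under discussion would look
roughly like this for the 'ls -l' case. It is only a sketch: the function
name and the default of 8 workers are assumptions for the example, and
that 'workers' number is exactly the "how wide to go" guess the program
has no good way of making.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_lstat(directory, workers=8):
    """Return [(name, stat_result), ...] for a directory, sorted by name.

    The lstat() calls are issued from a pool of threads so that several
    can be waiting on the filesystem at once. A file removed between the
    listing and its lstat() raises FileNotFoundError; a real tool would
    have to handle that.
    """
    names = sorted(os.listdir(directory))
    paths = [os.path.join(directory, name) for name in names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map() returns results in the same order as 'paths'.
        stats = list(pool.map(os.lstat, paths))
    return list(zip(names, stats))
```

On a local disc with a warm cache the thread overhead will most likely
outweigh any gain; the idea only makes sense when each stat() is a
network round trip.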

I once tried to write a backup system that streamed data from tens of machines,
trying to write a few MB at a time on Linux, each machine being a separate
process; unfortunately the kernel was too smart and ended up writing a few
KB from each process before moving on to the next, leading to *awful* throughput.

> Question:  Is there already a set of "improved" utilities that implement this 
> kind of technique?  If not, would this kind of performance enhancement be 
> considered useful?  (It would mean introducing threading into programs which 
> are currently single-threaded.)
> One could also optimise the text utilities like cat by doing the open() and 
> stat() operations in parallel and in the background -- userspace read-ahead 
> caching.  All of the utilities which process multiple files could get 
> small speed boosts from this -- rm, cat, chown, chmod ... even tail, head, 
> wc -- but probably only on network filesystems.

I keep wondering if the OS level needs a better interface: an 'openv' or
'statv', or (what I'm currently wondering about) a combined call - something
which would stat a path and, if it's a normal file, open it, read up to a
buffer's worth and, if that finished the file, close it. It might work
nicely for small files.
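
The semantics of such a combined call can be mocked up in userspace. This
is purely illustrative: 'stat_and_slurp' and its 64 KB default buffer are
hypothetical, no such system call exists, and here it still costs several
separate system calls, which is the very thing a kernel-level version
would be meant to avoid.

```python
import os
import stat


def stat_and_slurp(path, bufsize=64 * 1024):
    """Stat 'path'; for a regular file also open it and read up to bufsize.

    Returns (st, data, fd):
      - not a regular file: (st, None, None)
      - whole file fitted in the buffer: (st, data, None), already closed
      - file is larger: (st, first bufsize bytes, open fd to continue from)
    """
    st = os.stat(path)
    if not stat.S_ISREG(st.st_mode):
        return st, None, None

    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        remaining = bufsize
        eof = False
        while remaining > 0:
            chunk = os.read(fd, remaining)
            if not chunk:          # empty read means end of file
                eof = True
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        data = b"".join(chunks)
    except BaseException:
        os.close(fd)
        raise

    if eof or len(data) >= st.st_size:
        os.close(fd)               # finished: nothing left for the caller
        return st, data, None
    return st, data, fd            # caller keeps reading and must close fd
```

For a tree of small files, nearly every call would take the middle path
and hand back the complete contents with nothing left open.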

 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
