Re: Processing files from a tar archive in parallel


From: Benjamin R. Haskell
Subject: Re: Processing files from a tar archive in parallel
Date: Tue, 29 Mar 2011 11:41:23 -0400 (EDT)
User-agent: Alpine 2.01 (LNX 1266 2009-07-14)

On Tue, 29 Mar 2011, Hans Schou wrote:

On Tue, 29 Mar 2011, Jay Hacker wrote:

I have a large gzipped tar archive containing many small files; just untarring it takes a lot of time and space. I'd like to be able to process each file in the archive, ideally without untarring the whole thing first, and I'd like to process several files in parallel. Is there a recipe for this with GNU Parallel?

tar xvf big-file.tar.gz | parallel echo "Proc this file {}"

Parallel will start when the first file is untarred.


Wow. Glad I was (hopefully) so wrong. (I should point out that the last time I wanted to do this, I'd not yet discovered Parallel.)

Hans, you left off the 'z' in 'tar zxvf':

tar zxvf big-file.tar.gz | parallel echo "Proc this file {}"

Jay, you probably also want to 'rm' the files as you go, since space sounds like an issue.

And, unfortunately, it seems as though there's a timing issue: 'tar' prints each name before it has finished writing that file. If the individual files are large, a job might start before its file is fully there.

Tested via:

$ cd /tmp
$ mkdir foo
# create 5 1-GB files
$ seq 1 5 | parallel dd if=/dev/zero of=foo/{} bs=1G count=1
$ tar -zcvf foo.tgz foo/*
$ rm -rf foo
$ tar -zxvf foo.tgz | parallel 'ls -l {} && rm {}'
parallel: Warning: Starting 10 extra processes takes > 2 sec.
Consider adjusting -j. Press CTRL-C to stop.
-rw------- 1 bhaskell users 1073741824 2011-03-29 11:28 foo/1
tar: foo/2: Cannot utime: No such file or directory
-rw------- 1 bhaskell users 4652032 2011-03-29 11:38 foo/2
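
One possible workaround for the timing issue, sketched under the assumption that 'tar' prints a name when it starts on that file (so seeing the next name means the previous file is finished): hold each name back by one line before handing it to parallel. The function names delay_one_line and process_tar_members are made up for illustration, and I haven't tried this on directories or other special entries.

```shell
#!/bin/sh
# Hypothetical helper: copy stdin to stdout unchanged, but release each line
# only once the following line (or end of input) has been seen.
delay_one_line() {
    awk 'NR > 1 { print prev; fflush() }
         { prev = $0 }
         END { if (NR > 0) print prev }'
}

# Hypothetical wrapper: extract the gzipped archive $1 and run the GNU Parallel
# command template $2 on each name, lagging one file behind tar.
# Assumes tar prints a name when it *starts* extracting that file.
process_tar_members() {
    tar -zxvf "$1" | delay_one_line | parallel "$2"
}
```

With those defined, the test above would become something like: process_tar_members foo.tgz 'ls -l {} && rm {}'. The last file is only released once tar exits, so it should be complete as well.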


