Re: Processing files from a tar archive in parallel
From: Jay Hacker
Subject: Re: Processing files from a tar archive in parallel
Date: Wed, 30 Mar 2011 13:18:39 -0400
With tar files, to extract the file you want, you first have to read
through all the files before it. So reading the k-th file takes time
proportional to k (or the size of the first k files, anyway). If you
read all N files like that separately, it will take time 1 + 2 + 3 +
... + N, which is O(N^2). So if it's slow just untarring the file
once, doing it N times is going to be *really* unpleasant. :)
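For contrast, here is a minimal sketch of reading the archive only once, so the total work stays proportional to the archive size rather than O(N^2). It assumes GNU tar, whose --to-command option and TAR_FILENAME variable are GNU extensions; the demo file names and the "tr" stand-in for someCommandThatReadsFromStdIn are made up for illustration.

```shell
#!/bin/sh
# Sketch only: builds a tiny demo archive, then handles every member in ONE pass.
# Assumes GNU tar (--to-command and TAR_FILENAME are GNU extensions).
set -eu

# Hypothetical demo data so the script is self-contained.
workdir=$(mktemp -d)
printf 'alpha\n' > "$workdir/a.txt"
printf 'beta gamma\n' > "$workdir/b.txt"
tar -czf "$workdir/demo.tar.gz" -C "$workdir" a.txt b.txt

# tar decompresses and walks the archive once; for each regular file it runs
# the command through /bin/sh with that member's bytes on standard input.
# "tr a-z A-Z" stands in for someCommandThatReadsFromStdIn.
tar -xzf "$workdir/demo.tar.gz" \
    --to-command='echo "== $TAR_FILENAME"; tr a-z A-Z' > "$workdir/out.txt"

cat "$workdir/out.txt"
```

Note that tar runs the command for one member at a time, so this avoids the repeated reads but is not parallel by itself.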
On Tue, Mar 29, 2011 at 5:41 PM, Cook, Malcolm <MEC@stowers.org> wrote:
> ooops, more like:
>
> tar -tf big-file.tar.gz | parallel tar -O -x -f big-file.tar.gz {} '|'
> someCommandThatReadsFromStdIn
>
>
> Malcolm Cook
> Stowers Institute for Medical Research - Bioinformatics
> Kansas City, Missouri USA
>
>
>
>> -----Original Message-----
>> From: parallel-bounces+mec=stowers.org@gnu.org
>> [mailto:parallel-bounces+mec=stowers.org@gnu.org] On Behalf
>> Of Cook, Malcolm
>> Sent: Tuesday, March 29, 2011 4:35 PM
>> To: 'Ole Tange'; 'Jay Hacker'
>> Cc: 'parallel@gnu.org'
>> Subject: RE: Processing files from a tar archive in parallel
>>
>> Hmmm
>>
>> Use tar -t to list the filenames, pipe that into parallel to
>> call tar again to extract just that file, and pipe it to some
>> other command:
>>
>> tar -t big-file.tar.gz | parallel tar -f big-file.tar.gz -
>> '|' someCommandThatReadsFromStdIn
>>
>> Malcolm Cook
>> Stowers Institute for Medical Research - Bioinformatics
>> Kansas City, Missouri USA
>>
>>
>>
>> > -----Original Message-----
>> > From: parallel-bounces+mec=stowers.org@gnu.org
>> > [mailto:parallel-bounces+mec=stowers.org@gnu.org] On Behalf Of Ole
>> > Tange
>> > Sent: Tuesday, March 29, 2011 4:14 PM
>> > To: Jay Hacker
>> > Cc: parallel@gnu.org
>> > Subject: Re: Processing files from a tar archive in parallel
>> >
>> > On Tue, Mar 29, 2011 at 10:14 PM, Jay Hacker <jayqhacker@gmail.com>
>> > wrote:
>> > > On Tue, Mar 29, 2011 at 11:20 AM, Hans Schou <chlor@schou.dk> wrote:
>> > >> On Tue, 29 Mar 2011, Jay Hacker wrote:
>> > >>
>> > >>> I have a large gzipped tar archive containing many small files;
>> > >>> just untarring it takes a lot of time and space. I'd like to be
>> > >>> able to process each file in the archive, ideally without
>> > >>> untarring the whole thing first,
>> > :
>> > >> tar xvf big-file.tar.gz | parallel echo "Proc this file {}"
>> > >>
>> > >> Parallel will start when the first file is untarred.
>> > :
>> > > That is a great idea. However, can I be sure the file is completely
>> > > written to disk before tar prints the filename?
>> >
>> > While I loved Hans' idea, it does indeed have a race condition. This
>> > should run 'ls -l' on each file after decompressing and clearly fails
>> > now and then:
>> >
>> > $ tar xvf ../i.tgz | parallel ls -l > ls-l
>> > ls: cannot access 1792: No such file or directory
>> > ls: cannot access 209: No such file or directory
>> > ls: cannot access 21: No such file or directory
>> > ls: cannot access 2256: No such file or directory
>> > ls: cannot access 2349: No such file or directory
>> > ls: cannot access 2363: No such file or directory
>> > ls: cannot access 246: No such file or directory
>> > ls: cannot access 2712: No such file or directory
>> >
>> > But you could unpack in a new dir and use:
>> > http://www.gnu.org/software/parallel/man.html#example__gnu_parallel_as_dir_processor
>> >
>> > That seems to work.
>> >
>> > /Ole
>> >
>> >
>>