[Bug-tar] fast extract of single file from large tar.gz archive

From: Dilip Chhetri
Subject: [Bug-tar] fast extract of single file from large tar.gz archive
Date: Wed, 13 Aug 2014 22:39:41 -0700

I have many large .tar.gz files and need to extract a single small file from each. I deliberately put that file at the front of the .tar file so that extraction would be fast, but when the archive is compressed, 'tar' wants to read the whole .tgz file before exiting.

To illustrate the problem, consider the following two runs:

1) regular extract
desktop1:/tmp$ time dd if=linux-3.4.2.tar.bz2 bs=1k|bunzip2|tar x linux-3.4.2/Documentation/ABI/README
78284+1 records in
78284+1 records out
80162970 bytes (80 MB) copied, 8.96967 s, 8.9 MB/s

real    0m8.983s
user    0m9.057s
sys    0m0.549s

* Performance is the same for "tar jxf linux-3.4.2.tar.bz2 linux-3.4.2/Documentation/ABI/README".

2) crude way of fast extract
desktop1:/tmp$ time dd if=linux-3.4.2.tar.bz2 bs=1k count=1000|bunzip2|tar x linux-3.4.2/Documentation/ABI/README
1000+0 records in
1000+0 records out
1024000 bytes (1.0 MB) copied, 0.0980247 s, 10.4 MB/s

bunzip2: Compressed file ends unexpectedly;
    perhaps it is corrupted?  *Possible* reason follows.
bunzip2: Inappropriate ioctl for device
    Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

/tmp/tar: Unexpected EOF in archive
/tmp/tar: Error is not recoverable: exiting now

real    0m0.105s
user    0m0.104s
sys    0m0.009s

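As you can see, with method (2) I can still extract the single file in about 0.1 seconds (versus about 9 seconds with method 1), even though the truncated input makes bunzip2 and tar report errors. It looks to me as though 'tar' keeps reading the whole archive from stdin even after it has finished extracting.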

Q: Is there a special option to make this fast? If not, it would be a really good enhancement (I have seen many people asking for it on the web). If someone could post a patch to fix this behaviour, that would be really nice. I spent some time reading the tar source code, but the fix is not obvious to me.
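
One candidate worth trying, sketched here under the assumption that GNU tar is in use: the GNU tar manual documents an --occurrence[=N] option which, when member names are given, processes only the N-th occurrence of each name and stops without scanning to the end of the archive. The script below builds a throwaway archive (the demo/... paths and file names are hypothetical, chosen only for illustration) and extracts the small leading member through a decompressor pipe, in the same style as the commands above. Whether this gives the same speed-up on the real archives has not been measured here.

```shell
#!/bin/sh
# Sketch only: assumes GNU tar (--occurrence is a GNU extension) and gzip.
set -eu

work=$(mktemp -d)
cd "$work"

# Build a throwaway archive: the small wanted file first, bulk data after it.
# All names under demo/ are hypothetical.
mkdir -p demo/docs demo/bulk
echo "hello" > demo/docs/README
head -c 5000000 /dev/urandom > demo/bulk/blob
tar -czf demo.tar.gz demo/docs/README demo/bulk/blob

# Extract only the first occurrence of the named member into ./out.
# Once it is found, tar has no reason to keep reading the stream.
mkdir out
gzip -dc demo.tar.gz | tar -xf - -C out --occurrence=1 demo/docs/README

cat out/demo/docs/README
```

Because tar may exit while the decompressor is still writing, the decompressor can be terminated by a broken pipe; the exit status of the pipeline is that of tar.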

