Re: [Bug-tar] High per file overhead?

From: Phillip Susi
Subject: Re: [Bug-tar] High per file overhead?
Date: Sat, 25 Feb 2006 12:26:58 -0500
User-agent: Mail/News 1.5 (X11/20060213)

Joerg Schilling wrote:
Phillip Susi <address@hidden> wrote:

Can anyone explain this?

~$: du -bsh Maildir/
98M     Maildir/
~$: tar cf Maildir.tar Maildir/
~$: du -bsh Maildir.tar
112M    Maildir.tar
~$: find Maildir | cpio -o -H newc > Maildir.cpio
204433 blocks
~$: du -bsh Maildir.cpio
100M    Maildir.cpio

Why does tar have 12M more overhead than cpio? This Maildir is the lkml since Jan 1, so it contains ~20,000 messages/files, but ~734 bytes per file seems like a bit much for overhead.
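For reference, the per-file figure can be checked with a quick calculation. This is only a sketch: it assumes the sizes printed by du -h are in MiB and takes the file count as exactly 20,000 (the post only says "~20,000"). The variable and function names are illustrative.

```python
MIB = 1024 * 1024
FILES = 20_000  # assumption: the post only says "~20,000"

# Sizes as reported (and rounded) by `du -bsh` in the transcript above.
du_size = 98 * MIB     # Maildir/ on disk
tar_size = 112 * MIB   # Maildir.tar
cpio_size = 100 * MIB  # Maildir.cpio


def per_file(extra_bytes, files=FILES):
    """Average number of extra bytes per archived file."""
    return extra_bytes / files


tar_vs_du = per_file(tar_size - du_size)
tar_vs_cpio = per_file(tar_size - cpio_size)

print(f"tar vs. du:   {tar_vs_du:.0f} bytes/file")
print(f"tar vs. cpio: {tar_vs_cpio:.0f} bytes/file")
```

Under these assumptions, the ~734 bytes figure matches the 14M gap between the tar archive and the du total, while the 12M gap between tar and cpio works out to roughly 629 bytes per file.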

As cpio does not offer a -H newc format, let me assume that you are talking about the -c or -H crc format...

Yes, it does have a newc format, see the info page. It is also the format used by the linux kernel for initramfs images.

cpio is unblocked and thus has problems resyncing after a part of the archive
that appears to be corrupted.

du only counts the file content and a part of the metadata (not counting e.g.
the "inode" - see: /usr/include/sys/fs/ufs_inode.h)

Right, but the timestamps, owner, and mode only take up a handful of bytes, which cpio also stores.

cpio -Hcrc writes a 110 byte header + the file path name + the file content.
tar in the historical format or POSIX.1-1988 writes a 512 byte header + the file content rounded up to the next 512 byte boundary. Recent tar (POSIX.1-2001, aka "pax") writes at least 1 KB per file in addition.
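The two layouts described above can be compared per file with a small size model. This is a rough sketch, not a description of any particular implementation: it covers only the plain 512 byte tar header (no pax extended headers, no end-of-archive blocks) and the 110 byte cpio header, and it assumes the cpio newc/crc rule that the header plus NUL-terminated name, and the file data, are each padded to a 4 byte boundary. The function names are made up for illustration.

```python
def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


def tar_member_size(data_len):
    """Bytes used by one file in historical/POSIX.1-1988 tar:
    a 512 byte header plus the content padded to 512 bytes."""
    return 512 + round_up(data_len, 512)


def cpio_newc_member_size(name, data_len):
    """Bytes used by one file in cpio newc/crc format (assumed layout):
    110 byte header + NUL-terminated name, padded to 4 bytes,
    followed by the content padded to 4 bytes."""
    header_and_name = round_up(110 + len(name.encode()) + 1, 4)
    return header_and_name + round_up(data_len, 4)


# Hypothetical 5000 byte message with a short path name.
name, size = "Maildir/cur/msg", 5000
print("tar overhead: ", tar_member_size(size) - size)              # 632
print("cpio overhead:", cpio_newc_member_size(name, size) - size)  # 128
```

If file sizes are spread evenly relative to the 512 byte boundary, the tar padding averages about 256 bytes, so the expected tar cost is roughly 512 + 256 = 768 bytes per file, which is in the same range as the overhead observed for the Maildir.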

I see. And the purpose for this is to try and recover from bad sectors since a file will always start on a sector boundary, so only the file contained in the bad sector will be lost?

Conclusion: if you write more metadata, you have more overhead.
But in real world use this has no relevance:

star -cPM -time f=/dev/null -C /usr .
star: 107825 blocks + 6656 bytes (total of 1104134656 bytes = 1078256.50k).
star: Total time 136.532sec (7897 kBytes/sec)

star -cPM -Hasc -time f=/dev/null -C /usr .
star: 104818 blocks + 2560 bytes (total of 1073338880 bytes = 1048182.50k).
star: Total time 134.415sec (7798 kBytes/sec)

The additional overhead that results from the tar format is typically less
than 3%. If you compress the result and use an archiver that takes care of
best compressibility (as star does), even the small "advantage" of the cpio
format will go away.

If you compress the result, the remaining difference is less than 1%.
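The "less than 3%" figure can be reproduced from the two byte totals in the star output above. A minimal sketch, assuming the first run is the tar-type archive and the second (-Hasc) run is the cpio-type archive, as the surrounding text implies:

```python
# Byte totals copied from the two star runs quoted above.
tar_total = 1104134656   # star -cPM ...
cpio_total = 1073338880  # star -cPM -Hasc ...

extra_bytes = tar_total - cpio_total
overhead_pct = 100.0 * extra_bytes / cpio_total

print(f"extra bytes: {extra_bytes}")
print(f"overhead:    {overhead_pct:.2f}%")
```

For this /usr tree the difference is about 2.9% of the cpio-type archive size. A tree made up of many small files, such as a Maildir, would be expected to show a larger relative difference because the per-file cost is fixed.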

I'd say archiving my Maildir is a rather real world use, so this is somewhat relevant. I did notice though, that once compressed, the difference in size is greatly diminished.

