[Gnu-arch-users] archive storage format comments on the size

From: Andrea Arcangeli
Subject: [Gnu-arch-users] archive storage format comments on the size
Date: Mon, 29 Sep 2003 18:35:11 +0200
User-agent: Mutt/1.4.1i

I tried to convert two random cvs repositories to arch archives and the
sizes came out like this:

repo 1:  arch  2176k    cvs  376k
repo 2:  arch 29984k    cvs 6208k

Note that the cvs sizes are uncompressed.

At around >2000 patchsets, arch slows to a crawl on commits. It seems to
spend all its time doing a flood of lstat calls on every patchset file
in the {arch} directory inside the working dir, so commit is basically
O(N) where N is the number of patchsets in the archive. cscvs spends all
its time in wait4 (waiting for tla to return). Note that Linux should be
fairly efficient at the stats, thanks to the dcache, independently of
the filesystem.
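A quick sketch of the effect (the directory layout and file names below are made up for illustration, not tla's actual on-disk format): if every commit stats each patch-log entry, the syscall count grows linearly with the number of patchsets.

```shell
# Build a dummy patch-log with 2000 entries (hypothetical names):
work=$(mktemp -d)
mkdir "$work/patch-log"
for i in $(seq 1 2000); do : > "$work/patch-log/patch-$i"; done

# One stat per entry per commit, i.e. O(N) in the patchset count;
# `stat` here stands in for the lstat flood a commit triggers:
time stat "$work"/patch-log/* > /dev/null
```

Even with the dcache keeping each individual stat cheap, 2000+ syscalls per commit add up.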

Then I tried to gunzip all the tar.gz patchsets of the 29M archive, and
the size grew to around 70M, so gzip is achieving a mere x2 compression
on plain text. That's very low (though admittedly still significant for
the network, but not nearly as much as what you could really achieve
with a properly designed ad-hoc network sync protocol a la rsync, plus a
proper database for local storage, instead of this flood of tiny tar.gz
packages).

After uncompressing all the tar.gz patchsets, I made a tar.gz archive of
the project directory:

address@hidden:/dev/shm> du -s *
69368   aa-neural--mainline
3224    aa-neural--mainline.tar.gz

so the compression would be x20, not x2, and if compressed as a whole
the archive would fit in 3M.

Compressing the original tar.gz files is less effective: it leads to a
~5M archive. You have to decompress them first.
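The effect is easy to reproduce with synthetic data (the file contents below are stand-ins, not real patchsets): gzip applied per-file pays its header overhead on every file and cannot exploit redundancy across files, while one gzip over the concatenated plain text can.

```shell
work=$(mktemp -d)
cd "$work"

# Five stand-in patchsets that share all of their text:
for i in 1 2 3 4 5; do
  seq 1 2000 > "patch-$i"
  gzip -c "patch-$i" > "patch-$i.gz"   # compressed individually
done

separate=$(cat patch-*.gz | wc -c)
cat patch-1 patch-2 patch-3 patch-4 patch-5 | gzip -c > all.gz
combined=$(wc -c < all.gz)

echo "separate gzips: $separate bytes, one gzip over plain text: $combined bytes"
```

The combined stream comes out far smaller because the later copies are encoded as back-references into the first one; per-file gzip sees each copy in isolation.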

NOTE: the compressed cvs repository fits in 1M, so arch would still be 3
times bigger even in backup-compressed form.

Ideally the stored patches should even be diff -u0; however, when a
merge produces rejects, arch should regenerate the patch against its
original tree with the usual -u3 context to make reject resolving
easier, and I'm unsure whether that would be slow to generate.
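For a one-line change in a larger file, the context lines dominate the patch size; a small sketch (GNU diff's -U flag is the long form of -u's context count):

```shell
work=$(mktemp -d)
cd "$work"
seq 1 100 > old
sed 's/^50$/fifty/' old > new

# Default 3 lines of context on each side of the change:
diff -U3 old new | wc -l   # header + hunk line + changed pair + 6 context lines
# Zero context, the minimal representation the archive could store:
diff -U0 old new | wc -l   # header + hunk line + changed pair only
```

The -u0 form carries exactly the same change, just without the surrounding context that only matters when a human (or a fuzzy patch) needs to re-anchor a reject.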

It would also be interesting to analyze what else could be done to
reduce the size of a patchset. This was a small repository, and the
kernel is an order of magnitude bigger; 69M uncompressed for tla vs 6M
uncompressed for cvs isn't very nice, and I believe there's lots of room
for optimization in this area (diff -u0 and removing the replication in
the file names being the first two ideas, though I doubt the overlong
filenames would save much disk space; that's more a speed/RAM matter
than a disk matter). I tend to believe the diff -u0 change could be
significant.

cscvs works great too. BTW, I'm running everything on a 64bit platform
with a 64bit userspace, so it's all 64bit-clean code, and I didn't run
into any trouble during compilation.


Andrea
