bug-tar
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Bug-tar] Re: tar 1.22.90 memory usage


From: Solar Designer
Subject: [Bug-tar] Re: tar 1.22.90 memory usage
Date: Thu, 17 Dec 2009 13:47:19 +0300
User-agent: Mutt/1.4.2.3i

On Thu, Dec 17, 2009 at 10:46:14AM +0200, Sergey Poznyakoff wrote:
> Solar Designer <address@hidden> ha escrit:
> 
> > It turns out that canonicalize_filename_mode(), which is used by the
> > "new functions" mentioned above, allocates memory for at least
> > PATH_MAX bytes
> 
> Thanks for reporting.  I overlooked that.  I have forwarded your patch
> to bug-gnulib mailing list (that function is part of gnulib, not the
> tar proper).

Indeed.  I had left that sort of coordination to you. ;-)

> In the meantime, I am going to install the attached change.

This looks OK to me, except that you probably want to use xrealloc(),
not just realloc().  Also, I did not check if there are possibly any
other uses of canonicalize_filename_mode() in tar - did you?

> > Besides fixing canonicalize_filename_mode() or its uses, maybe "struct
> > directory" could be made more compact again?  For example, maybe "name"
> > could be allocated along with the struct like it was before instead of
> > being referenced through a pointer?
> 
> Perhaps this would save a couple of bytes, but I'm afraid it will make
> the code less manageable.

Yes, I had the same concern too, but I think it's more than just a
couple of bytes and I tried to propose relatively simple changes only.

> > Maybe "caname" could be allocated along with the struct too?  The
> > pointer field would have to remain, but at least we'd save on malloc
> > overhead (we will have fewer allocations).
> 
> Caname is allocated using normalize_filename, so packing it into the
> structure won't diminish the number of allocations.  Besides,
> it would mean one more call to free, which will add its overhead.

Aren't you calling free() on this pointer eventually anyway?  If so,
you'd only call free() sooner, so the number of allocations and frees
performed will stay the same, but the number of allocated regions will
stay lower, resulting in memory savings (less memory wasted to malloc
overhead).

BTW, if tar doesn't free some of its allocations till the very end of
the process lifetime, you could opt to make those allocations from your
own pool (large mallocs, like of 64 KB each) and not free them at all
(or free the large chunks).  This is what I am doing in John the Ripper,
and it has proven to save memory quite well while keeping the code
simple - just invoke a custom malloc()-like function where appropriate.
Another source of memory savings is through telling this new
malloc()-like function what alignment is required.  When allocating
memory for a character string, no alignment is required.  You can see
the function prototypes here:

http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/john/john/src/memory.h

Here's yet another idea: instead of storing full pathnames, reference
directory paths and final pathname components via separate pointers.
For example, right now you seem to be storing strings like:

/dir1/dir2/dir3/dirA
/dir1/dir2/dir3/dirB
/dir1/dir2/dir3/dirC
/dir1/dir2/dir3/dirC/dirP
/dir1/dir2/dir3/dirC/dirQ

You could instead store:

/dir1/dir2/dir3
dirA
dirB
/dir1/dir2/dir3/dirC - and have an extra pointer to "dirC" in this string
dirP
dirQ

Since typical directory names are longer than those I used in the
example above and since directory hierarchies are often much deeper, I
expect that the savings on pathname strings would be very substantial -
like 2x to 10x.  When you need to obtain a full pathname, you'll
concatenate exactly two strings obtained via two pointers found in the
same struct.  Sounds easy enough.

> Nevertheless, thanks for the suggestions, I will try to diminish
> the memory consumption anyway.

Thank you.

> Actually, I wonder whether it would not be better to keep the
> directory status information in some kind of external cache (e.g.
> in a DBM file).  It would certainly solve the memory consumption
> problem, at the expense of slightly longer execution time and an extra
> software dependency.  What do you think?

I dislike this.  If this is all we could reasonably do, then I'd rather
leave things as-is.

Frankly, I'm not sure I understand all of tar's reasons for keeping all
of this info in memory (or in some other temporary storage).  My current
understanding is that it needs this for three reasons:

1. Not to overwrite the "snar file" until an incremental run is known to
have succeeded.  (This is the file specified with "-g" or
"--listed-incremental"; I don't know why the texinfo documentation for
tar calls it "snar" - do you?)

2. To match directories on the filesystem against snar file entries.

3. To detect hard-links.

Are there any other reasons?

For #1, it could write to a temporary file of the same format (or of
another format if that is somehow preferable - but I dislike the idea of
introducing a dependency on some database support library).

#2 might be tricky if the file is not loaded into memory (frankly, I
don't know if it currently is - I haven't checked).  It sounds easy when
the order in which the directories are traversed is that same as the
order in which they're recorded in the snar file, but this does not have
to be the case.

For #3, it would only need to keep track of entries with st_nlink of 2
or greater.  Hopefully, there won't be too many of non-directory entries
like this, so they could be kept in memory.

Given my limited understanding (more like some guesses), I am not ready
to make suggestions on how to rework this, if at all (probably not...)

One thing that could probably be done is bringing tar's memory usage
closer to the snar file size.  For 1.22.90 with my patch, we're getting
400 MB memory usage with an 80 MB snar file.  I think this suggests that
there's room for memory savings without a major redesign.

Thanks again,

Alexander




reply via email to

[Prev in Thread] Current Thread [Next in Thread]