
Re: feature request: gzip/bzip support for sort

From: Jim Meyering
Subject: Re: feature request: gzip/bzip support for sort
Date: Sun, 21 Jan 2007 22:41:11 +0100

Dan Hipschman <address@hidden> wrote:

> On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
>> Not to look a gift horse in the mouth, but it'd be nice
>> if you wrote ChangeLog entries, too.  And even (gasp! :-)
>> a test case or two.  Of course, we'd expect such a test case
>> (probably named tests/misc/sort-compress, and based on
>> tests/sample-test) to have this line in it:
>>   . $srcdir/../very-expensive
>> If you don't have time for that, I'll take care of it, eventually.
> I'm not going to stop you :-)  I just haven't had the time to look into
> it yet, so I've just been running the coreutils tests and then my own
> tests.  I was planning on adding the tests, as you say, "eventually".
>> Default to just "gzip", not /bin/gzip.  The latter may not exist;
>> your patch already handles that, but /bin/gzip may not be the first
>> gzip in PATH.  Also, don't bother with the access X_OK check.
>> There's no point in incurring even such a small overhead in the
>> general case, when no temporary is used.
> The reason I put the access check in there is that if we default to gzip
> and it doesn't exist, then of course the exec will fail, the child will

This is a good argument for using libz by default, not a separate
gzip program.  Why incur the overhead of an exec when we don't need to?
Now, I'm convinced that sort should provide built-in support for both
gzip and bzip2.  How to select between the built-in support and
actually exec'ing a program is something to think about...
Of course it should still be possible to specify some other program.

> fail, and this will cause sort to fail when it really should just not do
> the compression, or try another default if something suitable exists
> (what about compress?).  How about we just delay the determination of

compress?  no thank you! :-)

> the compress program until it's actually needed (e.g., in create_temp
> right before "if (compress_program)" we have "if (compress_program_known)"
> and inside the body we check the environment variable and/or do access
> checks on possible defaults)?
>> But please address the FIXME I've added.
> If we can't fork a compression process, it's not the end of the world.
> We just don't compress that file and sort will still work.  If we can't
> fork a decompression process, we can't continue the sort.  So I figure
> we'll just try twice to fork compression processes, and if we can't do
> it after 1 sec, we're probably wasting more time waiting to fork than we
> would doing disk access.  However, we really need to be able to fork
> decompression processes, so we can afford to wait a really long time for
> it.  I was considering making the number of tries for decompression
> processes even larger (now, it'll wait about 2 min before giving up).
>> Have you considered using the gnulib hash module rather than
>> rolling your own?  There are examples in many of coreutils/src/*.c.
> I'm not familiar with gnulib, so I didn't know a hash module existed or
> think to look for one.  Looking at it now, though, it seems it will be
> slower because of its abstraction, unless the table fills up to the

Performance isn't the issue here, but code-reuse.
I would be very surprised if changing hash table implementations
has any measurable effect on sort's performance.

> point where it would be faster to access if it grew.  I'd prefer (since
> it seems to be my time we're talking about), to leave it the way it is
> because it's simple, and see if the gnulib module is faster later.
