Re: sort: new feature: use environment variable to set buffer size


From: Pádraig Brady
Subject: Re: sort: new feature: use environment variable to set buffer size
Date: Thu, 30 Aug 2012 02:44:05 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0

On 08/29/2012 09:50 PM, Assaf Gordon wrote:
> Hello,
> 
> I'd like to suggest a new feature to sort: the ability to set the buffer size 
> (-S/--buffer-size X) using an environment variable.
> 
> In summary:
>  $ export SORT_BUFFER_SIZE=20G 
>  $ someprogram | sort -k1,1 > output.txt
>  # sort will use 20G of RAM, as if "--buffer-size 20G" was specified.
> 
> 
> The rationale:
> recent commits improved the guessed buffer size when sort is given an input
> file, but these don't apply when sort is used as part of a pipeline, with a
> pipe as input, e.g.
>   some | program | sort | other | programs > file
> 
> (Tested with v8.19 on Linux 2.6.32: sort consumes only a few MBs of RAM,
> even though many GBs are available.)
> This results in many small temporary files being created.
> 
> The script (which uses sort) is not under my direct control, but even if it
> were, I wouldn't want to hard-code the amount of memory used, so as to keep
> it portable to different servers.
> 
> AFAIK, there are four aspects of sort that affect performance:
> 1. number of threads:
> changeable with "--parallel=X" and with environment variable OMP_NUM_THREADS.
> 
> 2. temporary files location:
> changeable with "--temporary-directory=DIR" and with environment variable TMPDIR.
> 
> 3. memory usage:
> changeable with "--buffer-size=SIZE" but not with environment variable.
> 
> 4. compression program:
> changeable with "--compression-program=PROG" but not with environment variable
> (but at the moment, I do not address this aspect).
> 
> 
> With the attached patch, sort will read an environment variable named 
> "SORT_BUFFER_SIZE", and will treat it as if "--buffer-size" was specified 
> (but only if "--buffer-size" wasn't used on the command line).
> 
> If this is conceptually acceptable, I'll prepare a proper patch (with NEWS, 
> help, docs, etc.).
> 
> Regards,
>  -gordon

Thanks for the detailed rationale. However,
the existing env variables are significant to more utils than sort(1),
i.e. they're generally system level settings rather than command level.
Also sort -S is very portable, even though not standardised:
Solaris' sort(1) has -S, and GNU sort, which is used on most other platforms,
has had -S available since TEXTUTILS-2_0_10-58-gbf86c62.
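
Since -S is that widely available, something close to the requested behaviour
can be had outside of sort itself. The following is only a rough sketch, on
the assumption that a script can be made to call a wrapper instead of sort
directly: the function name env_sort is made up for illustration, and
SORT_BUFFER_SIZE is the variable name from the proposal, which sort does not
read on its own.

```shell
# Hypothetical wrapper, not part of coreutils: sort itself ignores
# SORT_BUFFER_SIZE, so the wrapper turns it into an explicit -S option.
env_sort() {
    if [ -n "${SORT_BUFFER_SIZE:-}" ]; then
        # Variable is set and non-empty: pass it through as the buffer size.
        command sort -S "$SORT_BUFFER_SIZE" "$@"
    else
        # Variable unset: let sort pick its own default buffer size.
        command sort "$@"
    fi
}
```

With that defined, "SORT_BUFFER_SIZE=20G; someprogram | env_sort -k1,1" would run
sort with -S 20G. Unlike the proposed patch, this sketch does not check whether
the caller already passed -S on the command line.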

Note also this thread on the selection of a default buffer size for pipes:
http://thread.gmane.org/gmane.comp.gnu.coreutils.general/878/focus=887

So currently I'd be 70:30 against adding such a variable.

cheers,
Pádraig.
