[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: sort: new feature: use environment variable to set buffer size
From: |
Pádraig Brady |
Subject: |
Re: sort: new feature: use environment variable to set buffer size |
Date: |
Thu, 30 Aug 2012 02:44:05 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0 |
On 08/29/2012 09:50 PM, Assaf Gordon wrote:
> Hello,
>
> I'd like to suggest a new feature to sort: the ability to set the buffer size
> (-S/--buffer-size X) using an environment variable.
>
> In summary:
> $ export SORT_BUFFER_SIZE=20G
> $ someprogram | sort -k1,1 > output.txt
> # sort will use 20G of RAM, as if "--buffer-size 20G" was specified.
>
>
> The rational:
> recent commits improved the guessed buffer size when sort is given an input
> file,
> but these don't apply if sort is used as part of a pipe line, with a pipe as
> input, e.g.
> some | program | sort | other | programs > file
>
> (Tested with v8.19 on linux 2.6.32, sort consumes few MBs of RAM, even though
> many GBs are available).
> This results in many small temporary files being created.
>
> The script (which uses sort) is not under my direct control, but even if it
> was,
> I don't want to hard-code the amount of memory used, to keep it portable to
> different servers.
>
> AFAIK, there are four aspects of sort the affect performance:
> 1. number of threads:
> changeable with "--parallel=X" and with environment variable OMP_NUM_THREADS.
>
> 2. temporary files location:
> changeable with "--temporary-directory=DIR" and with environment variable
> TMPDIR.
>
> 3. memory usage:
> changeable with "--buffer-size=SIZE" but not with environment variable.
>
> 4. compression program:
> changeable with "--compression-program=PROG" but not with environment
> variable.
> (but at the moment, I do not address this aspect).
>
>
> With the attached patch, sort will read an environment variable named
> "SORT_BUFFER_SIZE", and will treat it as if "--buffer-size" was specified
> (but only if "--buffer-size" wasn't used on the command line).
>
> If this is conceptually acceptable, I'll prepare a proper patch (with NEWS,
> help, docs, etc.).
>
> Regards,
> -gordon
Thanks for the detailed rationale, however
the existing env variables are significant to more utils than sort(1).
I.E. they're generally system level settings, rather than command level.
Also sort -S is very portable, even though not standardised.
solaris' sort(1) has -S and GNU sort is used on most other platforms,
which has -S available since TEXTUTILS-2_0_10-58-gbf86c62
Note also this thread on the selection of a default buffer size for pipes:
http://thread.gmane.org/gmane.comp.gnu.coreutils.general/878/focus=887
So currently I'd be 70:30 against adding such a variable.
cheers,
Pádraig.