Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency


From: Pádraig Brady
Subject: Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Date: Mon, 14 Mar 2011 15:31:00 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

On 12/03/11 15:34, Jim Meyering wrote:
> Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Jim Meyering wrote:
>>>> Running "make -j25 check" on a nominal-12-core F14 system would
>>>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>>>> It worked fine yesterday.  I tracked it down to all of the make processes
>>>> working on the "built_programs.list" (in src/Makefile.am) rule
>>>>
>>>> built_programs.list:
>>>>    @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>>>      | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>>>
>>>> Which made me realize we were running that submake over 400 times,
>>>> once per test script (including skipped ones).  That's well worth
>>>> avoiding, even if it means a new temporary file.
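(The temporary-file approach could be as simple as the sketch below; the
shell variables are illustrative stand-ins for the make variables, not the
actual Makefile.am change.)

    # compute the deduplicated program list once, into a real file that the
    # ~400 per-test invocations can then simply read instead of re-running make
    echo "$PROGRAMS $SCRIPTS" | tr ' ' '\n' \
      | sed 's/\.exe$//' \
      | LC_ALL=C sort -u | tr '\n' ' ' > built_programs.list
    # (the .exe suffix stripped here stands in for $(EXEEXT))
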
>>>>
>>>> I don't know the root cause of the OOM-kill (preceded by interminable
>>>> minutes of a seemingly hung and barely responsive system) or why it started
>>>> happening today (afaics, none of the programs involved was updated),
>>>> but this does fix it...
>>>
>>> FYI,
>>> I've tracked this down a little further.
>>> The horrid performance (hung system and eventual OOM-kill)
>>> are related to the use of sort above.  This is the definition:
>>>
>>>     ASSORT = LC_ALL=C sort
>>>
>>> If I revert my earlier patch and instead simply
>>> insist that sort not do anything in parallel,
>>>
>>>     ASSORT = LC_ALL=C sort --parallel=1
>>>
>>> then there is no hang, and things finish in relatively good time.
>>>
>>> I don't have a good stand-alone reproducer yet
>>> and am out of time for today.
> 
> After updating to a new F14 kernel,
> I never managed to reproduce that, but maybe
> that's just a coincidence...
> 
>> Well, right after writing that, I realized the key to the misbehavior:
>> sort was reading from a *pipe*:
>>
>> # This finishes right away, reading from input file "k":
>> seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done
>>
>> # When reading from a pipe, it's a very different story:
>> # Without the "timeout 7" prefix, the following would render an N-core
>> # system (5<N) unusable for many minutes.  As it is, be prepared:
>> # my system goes unresponsive after 1 second, and doesn't return until timeout.
>> for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done
>>
>> Occasionally, the above jobs all complete quickly.
>>
>> My first question was why *any* threads were being spawned to handle
>> such a small input file.  The first answer is in the first hunk:
>>
>> diff --git a/src/sort.c b/src/sort.c
>> index 13954cb..b9316e7 100644
>> --- a/src/sort.c
>> +++ b/src/sort.c
>> @@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
>>  /* Heuristic value for the number of lines for which it is worth
>>     creating a subthread, during an internal merge sort, on a machine
>>     that has processors galore.  Currently this number is just a guess.
>> -   This value must be at least 4.  We don't know of any machine where
>> -   this number has any practical effect.  */
>> -enum { SUBTHREAD_LINES_HEURISTIC = 4 };
>> +   This value must be at least 4.  */
>> +enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };
>>
>>  /* The number of threads after which there are
>>     diminishing performance gains.  */
>>
>> The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
>> of 5 or more lines would be split in two and sorted via two (or more)
>> separate threads.  Thus, with just 40 lines, you could get the maximum
>> of 8 threads working.  That is obviously not efficient, unless lines are
>> so pathologically long that the cost of comparing two of them approaches
>> the cost of creating a new thread.
>>
>> With the above, sort would use a more reasonable number.
>> Tests on high-end hardware and using very short lines
>> suggest that a value like 200,000 would still be conservative.
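(On a given machine, the crossover can be probed with something like the
sketch below.  Note it uses seq output, whose lines are much shorter than
gensort records, so the threshold you see will differ; the file path is
just an example.)

    # rough timing comparison: one vs. two sort threads on an N-line input;
    # raise or lower N to bracket the point where two threads start to win
    N=2000000
    seq $N | shuf > /tmp/lines.txt
    time sort --parallel=1 /tmp/lines.txt -o /dev/null
    time sort --parallel=2 /tmp/lines.txt -o /dev/null
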
> 
> Here's a complete patch for that:
> 
> From b2db5675bfeb3fe7e87bcc12934f34057ee26704 Mon Sep 17 00:00:00 2001
> From: Jim Meyering <address@hidden>
> Date: Thu, 10 Feb 2011 08:48:27 +0100
> Subject: [PATCH] sort: spawn fewer threads for small inputs
> 
> * src/sort.c (SUBTHREAD_LINES_HEURISTIC): Do not spawn a new thread
> for every 4 lines.  Increase this from 4 to 128K.  128K lines seems
> appropriate for a 5-year-old dual-core laptop, but it is too low for
> some common combinations of short lines and/or newer systems.
> * NEWS (Bug fixes): Mention it.
> ---
>  NEWS       |    9 ++++++---
>  src/sort.c |   16 ++++++++++------
>  2 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index 3157ef2..5770410 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -4,13 +4,16 @@ GNU coreutils NEWS                                    -*- outline -*-
> 
>  ** Bug fixes
> 
> -  du would infloop when given --files0-from=DIR
> -  [bug introduced in coreutils-7.1]
> -
>    cut could segfault when invoked with a user-specified output
>    delimiter and an unbounded range like "-f1234567890-".
>    [bug introduced in coreutils-5.3.0]
> 
> +  du would infloop when given --files0-from=DIR
> +  [bug introduced in coreutils-7.1]
> +
> +  sort no longer spawns 7 worker threads to sort 16 lines
> +  [bug introduced in coreutils-8.6]
> +
>    wc would dereference a NULL pointer upon an early out-of-memory error
>    [bug introduced in coreutils-7.1]
> 
> diff --git a/src/sort.c b/src/sort.c
> index 13954cb..9b8666a 100644
> --- a/src/sort.c
> +++ b/src/sort.c
> @@ -109,12 +109,16 @@ struct rlimit { size_t rlim_cur; };
>     and is responsible for merging TOTAL lines. */
>  #define MAX_MERGE(total, level) (((total) >> (2 * ((level) + 1))) + 1)
> 
> -/* Heuristic value for the number of lines for which it is worth
> -   creating a subthread, during an internal merge sort, on a machine
> -   that has processors galore.  Currently this number is just a guess.
> -   This value must be at least 4.  We don't know of any machine where
> -   this number has any practical effect.  */
> -enum { SUBTHREAD_LINES_HEURISTIC = 4 };
> +/* Heuristic value for the number of lines for which it is worth creating
> +   a subthread, during an internal merge sort.  I.e., it is a small number
> +   of "average" lines for which sorting via two threads is faster than
> +   sorting via one on an "average" system.  On a dual-core 2.0 GHz i686
> +   system with 3GB of RAM and 2MB of L2 cache, a file containing 128K
> +   lines of gensort -a output is sorted slightly faster with --parallel=2
> +   than with --parallel=1.  By contrast, using --parallel=1 is about 10%
> +   faster than using --parallel=2 with a 64K-line input.  */
> +enum { SUBTHREAD_LINES_HEURISTIC = 128 * 1024 };
> +verify (4 <= SUBTHREAD_LINES_HEURISTIC);
> 
>  /* The number of threads after which there are
>     diminishing performance gains.  */
> --
> 1.7.4.1.299.ga459d

I've not fully analyzed this yet, and I'm not saying it's wrong,
but the above change seems to have a large effect on thread
creation when smaller buffers are used (you hinted previously
that being less aggressive with the amount of mem used by default
might be appropriate, and I agree).

Anyway, with the above I seem to need a buffer size of more
than 10M to have any threads created at all.

Testing the original 4-line heuristic with the following shows
(note I only get > 4 threads after 4M of input, not 7 threads for
16 lines as indicated in NEWS):

$ for i in $(seq 30); do
>   j=$((2<<$i))
>   yes | head -n$j > t.sort
>   strace -c -e clone sort --parallel=16 t.sort -o /dev/null 2>&1 |
>    join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
>    sed -n "s/\([0-9]*\) clone/$j\t\1/p"
> done
4       1
8       2
16      3
32      4
64      4
128     4
256     4
512     4
1024    4
2048    4
4096    4
8192    4
16384   4
32768   4
65536   4
131072  4
262144  4
524288  4
1048576 4
2097152 4
4194304 8
8388608 16

When I restrict the buffer size with '-S 1M', many more threads
are created (a max of 16 in parallel with the above command):
4       1
8       2
16      3
32      4
64      4
128     4
256     4
512     4
1024    4
2048    4
4096    4
8192    4
16384   8
32768   12
65536   24
131072  44
262144  84
524288  167
1048576 332
2097152 660
4194304 1316
8388608 2628

After increasing the heuristic to 128K, I get _no_ threads until -S > 10M,
and this seems to be independent of line length.
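The same interaction between buffer size and thread creation can be checked
directly with something like this (a rough sketch; clone counts will vary
with the sort version and system, and as above strace is run without -f):

    # fixed 1M-line input; vary only the in-memory buffer size and count clone()
    # calls (with the 128K heuristic, the small buffers should show none at all,
    # per the observation above)
    yes | head -n1048576 > t.sort
    for s in 1M 4M 16M 64M; do
      n=$(strace -c -e clone sort -S $s --parallel=16 t.sort -o /dev/null 2>&1 |
          awk '$NF == "clone" {print $4}')
      printf '%s\t%s\n' "$s" "${n:-0}"
    done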

cheers,
Pádraig.


