From: Jim Meyering
Subject: Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Date: Wed, 09 Feb 2011 23:04:35 +0100
Jim Meyering wrote:
> Jim Meyering wrote:
>> Running "make -j25 check" on a nominal-12-core F14 system would
>> cause serious difficulty, leading to an OOM kill -- and this is brand new.
>> It worked fine yesterday. I tracked it down to all of the make processes
>> working on the "built_programs.list" (in src/Makefile.am) rule
>>
>> built_programs.list:
>> @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>> | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>
>> Which made me realize we were running that submake over 400 times,
>> once per test script (including skipped ones). That's well worth
>> avoiding, even if it means a new temporary file.
>>
>> I don't know the root cause of the OOM-kill (preceded by interminable
>> minutes of a seemingly hung and barely responsive system) or why it started
>> happening today (afaics, none of the programs involved was updated),
>> but this does fix it...
>
> FYI,
> I've tracked this down a little further.
> The horrid performance (hung system and eventual OOM-kill)
> is related to the use of sort above. This is the definition:
>
> ASSORT = LC_ALL=C sort
>
> If I revert my earlier patch and instead simply
> insist that sort not do anything in parallel,
>
> ASSORT = LC_ALL=C sort --parallel=1
>
> then there is no hang, and things finish in relatively good time.
>
> I don't have a good stand-alone reproducer yet
> and am out of time for today.
Well, right after writing that, I realized the key to the misbehavior:
sort was reading from a *pipe*:
# This finishes right away, reading from input file "k":
seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done
# When reading from a pipe, it's a very different story:
# Without the "timeout 7" prefix, the following would render an N-core
# system (5<N) unusable for many minutes. As it is, be prepared:
# my system goes unresponsive after 1 second and doesn't return until the timeout expires.
for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done
Occasionally, the above jobs all complete quickly.
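The file-vs-pipe distinction matters because sort can learn a regular
file's size up front via fstat, while a pipe gives it nothing to go on.
Here's a minimal stand-alone illustration (my own sketch; this is not
code from sort.c):

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main (void)
{
  struct stat st;
  if (fstat (STDIN_FILENO, &st) != 0)
    {
      perror ("fstat");
      return 1;
    }
  if (S_ISREG (st.st_mode))
    /* Regular file: st_size is meaningful, so a reader can size
       its input buffer precisely.  */
    printf ("regular file, %lld bytes\n", (long long) st.st_size);
  else
    /* Pipe, tty, etc.: st_size is useless, so sort must fall
       back on a guessed size (more on that below).  */
    printf ("non-regular input; size must be guessed\n");
  return 0;
}

Run it as "./a.out <k" and then as "seq 88 | ./a.out" to see the two cases.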
My first question was why *any* threads were being spawned to sort
such a small input. The first answer is in the first hunk:
diff --git a/src/sort.c b/src/sort.c
index 13954cb..b9316e7 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
 /* Heuristic value for the number of lines for which it is worth
    creating a subthread, during an internal merge sort, on a machine
    that has processors galore.  Currently this number is just a guess.
-   This value must be at least 4.  We don't know of any machine where
-   this number has any practical effect.  */
-enum { SUBTHREAD_LINES_HEURISTIC = 4 };
+   This value must be at least 4.  */
+enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };

 /* The number of threads after which there are
    diminishing performance gains.  */
The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
of 5 or more lines would be split in two and sorted via two (or more)
separate threads. Thus, with just 40 lines, you could get the maximum
of 8 threads working. That is obviously not efficient, unless lines are
so pathologically long that the cost of comparing two of them approaches
the cost of creating a new thread.
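To see how quickly that heuristic fans out, here's a rough stand-alone
model of the splitting (my own sketch, not sort.c's actual code): each
level halves the line count and the thread budget until a piece falls
below SUBTHREAD_LINES_HEURISTIC or the budget runs out.

#include <stdio.h>

/* Model how many worker threads an internal merge sort spawns for
   NLINES lines, a budget of NTHREADS, and the given heuristic.  */
static unsigned int
workers (unsigned long nlines, unsigned int nthreads, unsigned long heuristic)
{
  if (nlines < heuristic || nthreads < 2)
    return 1;
  /* Split the lines in two; each half gets half the thread budget.  */
  return (workers (nlines / 2, nthreads / 2, heuristic)
          + workers (nlines - nlines / 2, nthreads - nthreads / 2, heuristic));
}

int
main (void)
{
  /* A budget of 8 matches the "maximum of 8 threads" noted above.  */
  printf ("40 lines, heuristic 4:       %u thread(s)\n", workers (40, 8, 4));
  printf ("40 lines, heuristic 32*1024: %u thread(s)\n",
          workers (40, 8, 32 * 1024));
  return 0;
}

It prints 8 for the old heuristic and 1 for the new one.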
With the above, sort would use a more reasonable number.
Tests on high-end hardware using very short lines
suggest that even a value of 200,000 would still be conservative.
But even with that change, the above for-loop would sometimes hang.
Pursuing it further, I discovered that sort was allocating 130MB
(INPUT_FILE_SIZE_GUESS * 129, with the former being 1MiB) of space
for its input buffer, simply because it was reading from a pipe.
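For reference, the arithmetic (the factor of 129 is as observed above;
I haven't re-derived it from the code):

#include <stdio.h>

int
main (void)
{
  /* Input-buffer allocation when reading from a pipe: guess * 129.  */
  long long old_buf = 1024LL * 1024 * 129;  /* 1 MiB guess */
  long long new_buf = 64LL * 1024 * 129;    /* 64 KiB guess */
  printf ("old: %lld MiB per sort\n", old_buf / (1024 * 1024));  /* 129 */
  printf ("new: %lld MiB per sort\n", new_buf / (1024 * 1024));  /* 8 */
  /* With the 33 concurrent sorts from the reproducer above, the old
     guess asks for roughly 33 * 129 MiB -- over 4 GiB at once --
     which goes a long way toward explaining the hung system.  */
  return 0;
}

Hence the second change: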
diff --git a/src/sort.c b/src/sort.c
index 13954cb..92c8d4e 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -316,7 +315,7 @@ static size_t merge_buffer_size = MAX (MIN_MERGE_BUFFER_SIZE, 256 * 1024);
 static size_t sort_size;

 /* The guessed size for non-regular files.  */
-#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
+#define INPUT_FILE_SIZE_GUESS (64 * 1024)

 /* Array of directory names in which any temporary files are to be created.  */
 static char const **temp_dirs;
With that, the for-loop does not hang any more.
Note again that this is on a multi-core system.
I'll investigate more tomorrow.