[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
parallel sort strangeness
From: |
Jim Meyering |
Subject: |
parallel sort strangeness |
Date: |
Thu, 28 Jul 2011 14:16:07 +0200 |
For some reason, sort is now very slow here (Fedora 15, x86_64, with
little free RAM), even for tiny inputs, but only some of the time,
and not when I set OMP_NUM_THREADS=1.
(I noticed because ./bootstrap was taking *forever*, traced it,
which showed gnulib-tool taking all the time, then traced gnulib-tool,
which showed slow sort -- it runs thousands of small ones.)
time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" 0.00s user 3.18s system 98%
cpu 3.218 total
$ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" 0.01s user 39.90s system 99%
cpu 40.077 total
$ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" 0.00s user 5.65s system 99%
cpu 5.679 total
$ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" 0.00s user 34.16s system 13%
cpu 4:12.40 total
Then I can run the very same command 40 times in a row, with it
terminating in less than .3s total.
Then once more and it takes 5 seconds.
I managed to get some info out of one via strace -p PID:
$ strace -p 21925
Process 21925 attached - interrupt to quit
fstat(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x7f8fded12000
read(0, "a\nb\n", 131072) = 4
read(0, "", 126976) = 0
fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x7f8fded11000
write(1, "a\nb\n", 4) = 4
munmap(0x7f8fddcd0000, 16912384) = 0
lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
close(0) = 0
munmap(0x7f8fded12000, 4096) = 0
close(1) = 0
munmap(0x7f8fded11000, 4096) = 0
close(2) = 0
exit_group(0) = ?
Process 21925 detached
That fstat (STDIN_FILENO, ... call
comes from sort->merge's call to avoid_trashing_input.
Top reported this:
top - 23:18:49 up 25 days, 15:21, 38 users, load average: 0.11, 0.30, 0.45
Tasks: 285 total, 1 running, 278 sleeping, 5 stopped, 1 zombie
Cpu(s): 0.1%us, 4.4%sy, 0.1%ni, 95.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12323704k total, 11880864k used, 442840k free, 84120k buffers
Swap: 4194300k total, 4194252k used, 48k free, 958840k cached
Unfortunately, the above was last night,
and this morning I've been unable to reproduce it.
This strace -c diff shows how the default with 6/12 cores
(equiv. to OMP_NUM_THREADS=12) each of sort's munmap call takes 50% longer
than when I set OMP_NUM_THREADS=1:
--- /t/sort-omp1.log 2011-07-28 10:48:39.284061665 +0200
+++ /t/sort.log 2011-07-28 10:48:27.088061664 +0200
@@ -1,6 +1,6 @@
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
-100.00 0.000761 127 6 munmap
+100.00 0.001151 192 6 munmap
0.00 0.000000 0 6 read
0.00 0.000000 0 1 write
0.00 0.000000 0 5 open
@@ -16,8 +16,9 @@
0.00 0.000000 0 1 execve
0.00 0.000000 0 4 getrlimit
0.00 0.000000 0 1 arch_prctl
+ 0.00 0.000000 0 1 sched_getaffinity
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 1 fadvise64
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
-100.00 0.000761 95 3 total
+100.00 0.001151 96 3 total
This shows how with OMP_NUM_THREADS=12 the max-resident RAM
is nearly triple what you'll see when using OMP_NUM_THREADS=1:
$ :|env time env LC_ALL=C OMP_NUM_THREADS=12 sort
0.00user 0.01system 0:00.01elapsed 88%CPU (0avgtext+0avgdata
17076maxresident)k
0inputs+0outputs (0major+4496minor)pagefaults 0swaps
$ :|env time env LC_ALL=C OMP_NUM_THREADS=1 sort
0.00user 0.00system 0:00.00elapsed 83%CPU (0avgtext+0avgdata 6780maxresident)k
0inputs+0outputs (0major+1937minor)pagefaults 0swaps
Comparing the two, I see where the extra memory is being allocated,
and will probably remove this pre-optimization to fix it:
[I'll adjust the log, but I doubt I'll be able to add a test]
>From fe20dc12a0bacb81e0474c12dbb29a1afa8b28d1 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Thu, 28 Jul 2011 14:03:09 +0200
Subject: [PATCH] sort: FIXME: do not allocate more space initially for
multi-thread
---
src/sort.c | 17 +----------------
1 files changed, 1 insertions(+), 16 deletions(-)
diff --git a/src/sort.c b/src/sort.c
index 03f63fb..2aa5da2 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -3793,22 +3793,7 @@ sort (char *const *files, size_t nfiles, char const
*output_file,
char const *file = *files;
FILE *fp = xfopen (file, "r");
FILE *tfp;
-
- size_t bytes_per_line;
- if (nthreads > 1)
- {
- /* Get log P. */
- size_t tmp = 1;
- size_t mult = 1;
- while (tmp < nthreads)
- {
- tmp *= 2;
- mult++;
- }
- bytes_per_line = (mult * sizeof (struct line));
- }
- else
- bytes_per_line = sizeof (struct line) * 3 / 2;
+ size_t bytes_per_line = sizeof (struct line) * 3 / 2;
if (! buf.alloc)
initbuf (&buf, bytes_per_line,
--
1.7.6.639.g50fb9
- parallel sort strangeness,
Jim Meyering <=