RE: Multithreaded sort hangs on Solaris
From: McFarland, Jeffrey
Subject: RE: Multithreaded sort hangs on Solaris
Date: Tue, 12 Mar 2013 16:22:37 +0000
Honestly, the sort command is generated by another script, so I'm not sure why
the `sort -t\n` syntax was chosen. However, that part seems to work as it is.
It breaks on newlines as it should.
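
For reference, here is a minimal sketch of what the shell hands to sort for the unquoted form. The helper name show_args is made up for illustration; it only prints each argument as the invoked program would receive it.

```shell
# show_args is a hypothetical helper: it prints every argument it
# receives, one per line, wrapped in brackets.
show_args() {
    printf '[%s]\n' "$@"
}

# Unquoted: the shell removes the backslash, so the program sees -tn.
show_args -t\n

# Single-quoted: the backslash survives, so the program sees -t\n.
show_args '-t\n'
```

The first case matches the `sort -tn` shown in the pstack output quoted below. Note that -t sets the field separator rather than the line delimiter, so sort splits its input on newlines regardless of this option.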
Where are you suggesting adding some sleeps? I haven't gotten into the sort
code and I'm not sure that I'll have a lot more time to put into it.
I have noticed a couple more related oddities. First, I found that even
though I set the batch-size to 100 it always creates 104 files when parallel is
not set to 1. It creates 103 files of the same size, then starts merging
them into the 104th file, then finally into the final file. When parallel is
set to 1 then it creates only 95 temp files. Secondly, I have tested this on 3
machines now (all with the same OS) and I've noticed up to a 15% increase in
performance when running with parallel set to 1.
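
A rough sketch of how repeated runs like these could be scripted to count hangs is below. It assumes the timeout utility from coreutils is on the PATH; the function name count_hangs and its arguments are hypothetical, and the deadline would need to be well above the normal run time.

```shell
# count_hangs RUNS DEADLINE COMMAND...   (hypothetical helper)
# Runs COMMAND RUNS times under timeout(1) and prints how many runs
# were stopped for exceeding DEADLINE (in seconds).
count_hangs() {
    runs=$1
    deadline=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        status=0
        timeout "$deadline" "$@" >/dev/null 2>&1 || status=$?
        # timeout(1) exits with status 124 when the deadline expires.
        if [ "$status" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs"
}
```

For example, `count_hangs 10 3600 ./sort --parallel=1 infile -o infile.sorted` would run the single-threaded sort ten times with a one-hour deadline, and the same call without `--parallel=1` would give the multithreaded count to compare against.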
-----Original Message-----
From: Pádraig Brady [mailto:address@hidden]
Sent: Tuesday, March 12, 2013 5:07 AM
To: McFarland, Jeffrey
Cc: address@hidden
Subject: Re: Multithreaded sort hangs on Solaris
On 03/11/2013 03:47 PM, McFarland, Jeffrey wrote:
> I have come across some odd results regarding the sort utility in coreutils
> version 8.20. I've looked through the archives and don't see any similar
> issues so it may be something specific to our systems.
>
>
>
> System: SunOS 5.10 Generic_147440-26 sun4u sparc SUNW,Sun-Fire-V890
>
>
>
> Issue: When running sort on a 22.5 GB file I found that about 1 out of 10
> times the process seems to hang (out of 100+ tests). The process is still
> running but the temp files are no longer changing and the final file either
> has not been created or is a 0 byte file. When this happens the temp files
> are never in the exact same state as a previous run. On this machine a
> complete sort normally takes about 20 minutes. On one occasion the process
> hung for over 48 hours before I killed it. Running top shows no significant
> load on the system.
>
>
>
> Command run:
>
> ./sort -t\n -S 256M --batch-size=100 -T /disk/craiwk01/prod/SORTWK -T
> /disk/craiwk02/prod/SORTWK -T /disk/craiwk03/prod/SORTWK -T
> /disk/craiwk04/prod/SORTWK -T /disk/craiwk06/prod/SORTWK -k1.1,1.10
> infile -o infile.sorted
>
>
>
>>: ps
>
> PID TTY TIME CMD
>
> 16328 pts/3 5:06 sort
>
> 12697 pts/3 0:00 ps
>
>
>
>>: sudo truss -rall -wall -f -p 16328
>
> 16328: lwp_park(0x00000000, 0) (sleeping...)
>
>
>
>>: sudo pstack 16328
>
> 16328: /usr/local/abacus/etsort/sort -tn -S 295063 --batch-size=100
> -T /disk/
>
> ----------------- lwp# 1 / thread# 1 --------------------
>
> ffffffff7d4d8818 lwp_park (0, 0, 0)
>
> 0000000100009c74 sortlines (111b56580, 111c56080, ffffffff7fffeab0,
> 10012a321, ffffffff7fffead0, 10012a328) + 514
>
> 000000010000a5cc sortlines (111558380, 2, ffffffff7fffeab0, 1121765e0,
> 0, ffffffff7fffeab0) + e6c
>
> 000000010000a5cc sortlines (111956f80, 4, ffffffff7fffeab0, 112176420,
> 0, ffffffff7fffeab0) + e6c
>
> 000000010000a5cc sortlines (112154760, 8, ffffffff7fffeab0, 1121760a0,
> 1, ffffffff7fffeab0) + e6c
>
> 000000010000c070 sort (10012a740, 0, ffffffff7fffead0, 23, 10012cddd,
> 112154760) + 350
>
> 000000010000e6e8 main (13, ffffffff7ffff148, 0, 10012c220, fffd,
> 10012b1e0) + 1ee8
>
> 00000001000041bc _start (0, 0, 0, 0, 0, 0) + 7c
>
> ----------------- lwp# 240 / thread# 240 --------------------
>
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
>
> ** zombie (exited, not detached, not yet joined) **
>
> ----------------- lwp# 241 / thread# 241 --------------------
>
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
>
> ** zombie (exited, not detached, not yet joined) **
>
> ----------------- lwp# 242 / thread# 242 --------------------
>
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
>
> ** zombie (exited, not detached, not yet joined) **
>
>
>
> If I change the sort to run as a single threaded process (add "--parallel=1"
> to above command) then it doesn't hang. This makes me think that it's most
> likely a threading issue. I ran the same tests on a LINUX machine and it did
> not have the same hanging issue so it's most likely only an issue with
> Solaris.
>
>
>
> I initially found this issue using coreutils 8.9 and I changed to 8.20 to see
> if a fix had been made but no luck.
>
>
>
> Is this a known issue? Are there any additional tests I should run to
> further narrow down this issue?
I can't think of anything TBH.
There may possibly be some portability issues with --compress and --parallel
(due to possibly non async safe functions being called after a fork), but
you're not using --compress, so we can discount that at least.
No matter if the bug is in coreutils or solaris, adding some sleeps may help
trigger a race more quickly?
BTW the `sort -t\n` looks strange. Did you mean: sort -t$'\n' ?
thanks,
Pádraig.