Re: [PATCH] sort: parallel external sort implementation

bug-coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] sort: parallel external sort implementation

From:	Joey Degges
Subject:	Re: [PATCH] sort: parallel external sort implementation
Date:	Thu, 4 Mar 2010 16:39:20 -0800

2010/3/4 Pádraig Brady <address@hidden>

> On 04/03/10 06:01, Joey Degges wrote:
>
>> Hello,
>>
>> Matt (cc'd, not on list) and I have developed a patch that parallelizes
>> sort
>> using pthreads. The threading approach taken here is to break up input
>> files
>> into groups based on physical device and sort each of those groups in
>> parallel.
>> This allows for storage and processing resources to be more fully utilized
>> during the sort process.
>>
>>
>  multidisk sort with 4x 50M inputs
>> $ time -v sort_multidisk /q/0/50M /q/1/50M /q/2/50M /q/3/50M>  /dev/null
>>         User time (seconds): 10.89
>>         System time (seconds): 0.83
>>         Percent of CPU this job got: 180%
>>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.50
>>         Major (requiring I/O) page faults: 20
>>         Minor (reclaiming a frame) page faults: 72843
>>         Voluntary context switches: 1936
>>         Involuntary context switches: 149
>>         File system inputs: 393080
>>
>> current sort with 4x 50M inputs
>> $ time -v sort_current /q/0/50M /q/1/50M /q/2/50M /q/3/50M>  /dev/null
>>         User time (seconds): 11.05
>>         System time (seconds): 0.32
>>         Percent of CPU this job got: 86%
>>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.19
>>         Major (requiring I/O) page faults: 18
>>         Minor (reclaiming a frame) page faults: 72532
>>         Voluntary context switches: 915
>>         Involuntary context switches: 48
>>         File system inputs: 198936
>>
>
>  diff --git a/src/sort.c b/src/sort.c
>> +      --threads=N           use no more than N threads to improve
>> parallelism\n\
>>
>
> Threads is a bit of an implementation detail?
> Perhaps it would be better to use --parallel so one
> could move to multiprocess in future if desired?


> Since this is a new option, it requires documentation
> in docs/coreutils.texi which would include examples
> of how and why you would use this option.
>
>  +
>> +/* Sort NFILES FILES onto OUTPUT_FILE.
>> +
>> +   Threading approach: Break FILES up into several groups where each
>> contains
>> +   only files that can be found on the same physical device (according to
>> +   device_cmp()). Spawn threads to execute do_sort() on each group of
>> files in
>> +   parallel.
>> +
>> +   This allows for all concerned resources (storage devices and
>> processors) to
>> +   be more fully utilized.
>> +*/
>> +
>> +static void
>> +sort_multidisk (char * const *files, size_t nfiles, char const
>> *output_file,
>> +                unsigned long int nthreads)
>>
>
> Have you considered the seek characteristics of SSDs
> and how they might affect things (with consideration
> that mechanical disks will be replaced by SSDs quite quickly).
> There still would be some benefit splitting per SSD,
> but it would be worth measuring to see.
>

I will post some results testing with various flash keys but I do not have
any proper SSD drives to play with. The extreme case here would be sorting
from multiple ramdisks in which case there is likely to be no improvements
whatsoever -- supposing the underlying "do_sort" function can process a
single file in parallel. In this worst case it might be useful to expose a
"--no-multidisk" flag allowing the user to disable this feature (or a
"--multidisk" flag to enable it).



>  @@ -3691,7 +3996,21 @@ main (int argc, char **argv)
>>       IF_LINT (free (sortfiles));
>>     }
>>   else
>> -    sort (files, nfiles, outfile);
>> +    {
>> +      if (!nthreads)
>> +        {
>> +          /* The default NTHREADS is 2 ** floor (log2 (number of
>> processors)).
>> +             If comparisons do not vary in cost and threads are
>> +             scheduled evenly, with the current algorithm there is no
>> +             performance advantage to using a number of threads that
>> +             is not a power of 2.  */
>> +          unsigned long int np2 = num_processors (NPROC_CURRENT) / 2;
>> +          for (nthreads = 1; nthreads <= np2; nthreads *= 2)
>> +            continue;
>> +        }
>>
>
> You probably want NPROC_CURRENT_OVERRIDABLE ?
>

Would we want to use the OpenMP environmental variable to affect the number
of pthreads that are used? A more generic PARALLEL variable might be better
suited.


Thanks,
Joey

[Prev in Thread]

Current Thread

[Next in Thread]

[PATCH] sort: parallel external sort implementation, Joey Degges, 2010/03/04
- Re: [PATCH] sort: parallel external sort implementation, Pádraig Brady, 2010/03/04
  - Re: [PATCH] sort: parallel external sort implementation, Chen Guo, 2010/03/04
    - Re: [PATCH] sort: parallel external sort implementation, Pádraig Brady, 2010/03/04
    - Re: [PATCH] sort: parallel external sort implementation, Joey Degges, 2010/03/04
- Re: [PATCH] sort: parallel external sort implementation, Pádraig Brady, 2010/03/04
  - Re: [PATCH] sort: parallel external sort implementation, Chen Guo, 2010/03/04
    - Re: [PATCH] sort: parallel external sort implementation, Pádraig Brady, 2010/03/04
  - Re: [PATCH] sort: parallel external sort implementation, Joey Degges <=
    - Re: [PATCH] sort: parallel external sort implementation, Pádraig Brady, 2010/03/04
  - Re: [PATCH] sort: parallel external sort implementation, Ralf Wildenhues, 2010/03/07
    - Re: [PATCH] sort: parallel external sort implementation, Chen Guo, 2010/03/07

Prev by Date: say that cut can't handle more than one field demarcator
Next by Date: Re: you are not going to be able to sort this by the fifth field.
Previous by thread: Re: [PATCH] sort: parallel external sort implementation
Next by thread: Re: [PATCH] sort: parallel external sort implementation
Index(es):
- Date
- Thread