bug-parallel

Re: Cannot specify the number of threads for parsort


From: Mario Roy
Subject: Re: Cannot specify the number of threads for parsort
Date: Wed, 22 Feb 2023 23:16:30 -0600


I ran the following commands to capture the number of sort processes in another terminal window. Seeing more than 1,100 processes caught me off guard, and I experienced the system briefly locking up while running parsort (many files). That is definitely a second wish-list item: why so many processes? I witnessed individual processes consuming 11% CPU.

parsort

while true; do ps -ef | grep sort | grep -v parsort | wc -l; sleep 1; done

mcesort

while true; do ps -ef | grep sort | grep -v mcesort | wc -l; sleep 1; done
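A side note on the counters above: `ps -ef | grep sort` also matches the `grep` process itself and any other command line containing "sort", so the count can be off by one or two. A tighter sketch, assuming `pgrep` is available, matches the exact process name:

```shell
# Count processes whose name is exactly "sort"; -c prints the count,
# -x requires an exact name match (no substring hits like "parsort").
count_sorts() {
  pgrep -cx sort || true   # pgrep exits non-zero when the count is 0
}

count_sorts
```

Wrap it in the same `while true; do …; sleep 1; done` loop to watch the count over time.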



On Wed, Feb 22, 2023 at 10:58 PM Mario Roy <marioeroy@gmail.com> wrote:
Aloha,

Congratulations on supporting parsort --parallel. I was wondering why parsort creates such a high number of processes (many files) until I read the 20230222 release notes. Now I understand.

First and foremost, mcesort is simply a parsort variant that uses a mini-MCE parallel engine integrated into the script. I reduced the MCE code to the essentials (fewer than 1,500 lines). The main application is 400 lines, so mcesort is currently under 1,900 lines in total.

1) mcesort supports -A (sets LC_ALL=C) and -j, --parallel with N, N%, or max (e.g. -j12, -j50%, -jmax).

2) Currently, mcesort does not allow -S, --buffer-size. In testing, specifying -S or --buffer-size led to more memory consumption and degraded performance. Is -S, --buffer-size helpful from a parsort/mcesort perspective?

3) mcesort runs -z, --zero-terminated in parallel, unlike parsort, which consumes one core for it.

4) mcesort accepts --check, -c, -C, --debug, and --merge [--batch-size]. For these it simply passes through and runs sort serially, never returning, when checking or merging already-sorted input or debugging incorrect key usage.

    exec('sort', @ARGV) if $pass_through;
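For illustration, the same pass-through decision could be sketched in shell; the option list is taken from point 4 above, and the function name is my own, not mcesort's:

```shell
# Decide whether the given sort arguments imply check/merge/debug mode,
# which mcesort hands off entirely to a single serial sort(1) via exec.
is_pass_through() {
  for arg in "$@"; do
    case "$arg" in
      -c|-C|--check|--check=*|--debug|--merge) return 0 ;;
    esac
  done
  return 1
}

# In the real script the positive case ends with: exec sort "$@"
if is_pass_through "$@"; then
  echo "pass-through: running sort serially"
fi
```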

Respectfully, I captured results using parsort 20230222 and mcesort (to be released soon). 

#################################################################
~ List of Files (total: 6 * 92 = 552 files), 17 GB
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

parsort
~~~~~~~
  $ time LC_ALL=C parsort --parallel=64 -k1 \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* | cksum
  867518687 17463513600

    1,109 processes created (brief system lockup)
    physical memory consumption peak 7.79 GB

    real   1m59.565s
    user   1m27.735s
    sys    0m22.013s

mcesort
~~~~~~~
  $ time LC_ALL=C mcesort --parallel=64 -k1 \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
      1 sort and 1 merge per worker
    physical memory consumption peak 2.92 GB

    real   1m57.209s
    user   21m55.152s
    sys    1m15.790s


#################################################################
~ Single File 17 GB
~   cat /dev/shm/big* >> /dev/shm/huge (6 times)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
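The concatenation step above can be sketched as a small helper; `replicate_into` is my name for it, not part of either tool:

```shell
# replicate_into DEST N SRC... : write the SRC files into DEST, repeated
# N times over, mirroring "cat /dev/shm/big* >> /dev/shm/huge (6 times)".
replicate_into() {
  dest=$1 n=$2
  shift 2
  : > "$dest"                 # start from an empty destination
  i=0
  while [ "$i" -lt "$n" ]; do
    cat "$@" >> "$dest"
    i=$((i + 1))
  done
}

# e.g. replicate_into /dev/shm/huge 6 /dev/shm/big*
```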

parsort
~~~~~~~
  $ time LC_ALL=C parsort --parallel=64 -k1 /dev/shm/huge | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
    physical memory consumption peak 2.90 GB

    real   2m11.056s
    user   1m39.646s
    sys    0m22.040s

mcesort
~~~~~~~
  $ time LC_ALL=C mcesort --parallel=64 -k1 /dev/shm/huge | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
    physical memory consumption peak 2.83 GB

    real   1m53.255s
    user   23m52.807s
    sys    0m58.450s

#################################################################
Standard Input
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

parsort
~~~~~~~
  $ time cat \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  | LC_ALL=C parsort --parallel=64 -k1 | cksum
  867518687 17463513600

    193 processes created (no system lockup, fluid)
    physical memory consumption peak 3.05 GB

    real   2m18.442s
    user   1m39.051s
    sys    0m27.548s


mcesort
~~~~~~~
  $ time cat \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  | LC_ALL=C mcesort --parallel=64 -k1 | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
    physical memory consumption peak 2.75 GB

    real   1m57.487s
    user   22m16.476s
    sys    1m15.481s




On Sat, Feb 18, 2023 at 3:42 AM Mario Roy <marioeroy@gmail.com> wrote:
Are you in the high memory consumption scenario which Nigel describes?

The issue is running parsort on large-scale machines. Running on all cores is often undesirable for memory-intensive applications; the memory channels eventually become the bottleneck.

The mcesort variant has reached the incubator stage (code 100% complete). It supports -j (short option) and --parallel. Note that specifying 1% resolves to at least 1 CPU core.

-jN              integer value
-jN%             percentage value; e.g. -j1% .. -j100%
-jmax or -jauto  same as 100%: all N available logical cores
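As a sketch of how those forms might resolve to a worker count (the clamping rule is my reading of the 1% remark above, not mcesort's actual code):

```shell
# resolve_jobs SPEC NCPU : turn a -j argument (N, N%, max, auto) into a
# concrete worker count for a machine with NCPU logical cores.
resolve_jobs() {
  spec=$1 ncpu=$2
  case "$spec" in
    max|auto)
      echo "$ncpu" ;;
    *%)
      pct=${spec%\%}
      n=$((ncpu * pct / 100))
      if [ "$n" -lt 1 ]; then n=1; fi   # 1% still means at least 1 core
      echo "$n" ;;
    *)
      echo "$spec" ;;
  esac
}
```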

The test file is a mockup of randomly generated key-value pairs. There are 323+ million rows.

$ ls -lh /dev/shm/huge
-rw-r--r-- 1 mario mario 2.8G Feb 18 00:48 /dev/shm/huge

$ wc -l /dev/shm/huge
323398400 /dev/shm/huge
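For reference, a tiny generator in the same spirit as that mockup; the key format and tab separator here are illustrative assumptions, not the actual test data:

```shell
# gen_kv N : emit N pseudo-random key-value rows, one per line.
gen_kv() {
  awk -v n="$1" 'BEGIN {
    srand()
    for (i = 0; i < n; i++)
      printf "%08d\t%d\n", int(rand() * 100000000), i
  }'
}

# e.g. gen_kv 323398400 > /dev/shm/huge
```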


Using parsort, one cannot specify the number of cores used to process a file, so it spawns 64 workers on this machine. The Perl MCE variant performs similarly. I get better throughput running 38 workers versus 64.

$ time parsort /dev/shm/huge | cksum
3409526408 2910585600

real 0m18.147s
user 0m13.920s
sys 0m3.660s

$ time mcesort -j64 /dev/shm/huge | cksum
3409526408 2910585600

real 0m18.081s
user 2m52.082s
sys 0m10.860s

$ time mcesort -j38 /dev/shm/huge | cksum
3409526408 2910585600

real 0m16.788s
user 2m21.384s
sys 0m8.263s


Regarding standard input, I can run parsort using a wrapper script (given at the top of this email thread). Notice how parsort has better throughput running 38 workers.

$ time parsort -j64 </dev/shm/huge | cksum
3409526408 2910585600

real 0m19.553s
user 0m14.030s
sys 0m3.520s


$ time mcesort -j64 </dev/shm/huge | cksum
3409526408 2910585600

real 0m18.312s
user 2m42.042s
sys 0m11.546s

$ time parsort -j38 </dev/shm/huge | cksum
3409526408 2910585600

real 0m17.609s
user 0m11.856s
sys 0m3.451s


$ time mcesort -j38 </dev/shm/huge | cksum
3409526408 2910585600

real 0m16.819s
user 2m21.108s
sys 0m9.523s



I find it interesting that parsort's reported user time does not reflect the tally of all workers' time.

This was a challenge, but I can see the finish line; I hope to release by next week.



On Fri, Feb 17, 2023 at 2:49 PM Rob Sargent <robjsargent@gmail.com> wrote:
On 2/17/23 13:41, Mario Roy wrote:
It looks like we may not get what we kindly asked for, so I started making "mcesort" using Perl MCE's chunking engine.

On Thu, Feb 16, 2023 at 5:08 AM Nigel Stewart <nigels@nigels.com> wrote:
Can you elaborate on what I am missing from the picture?

Ole,

Perhaps your workloads are more CPU- and I/O-intensive, and latency is less of a priority.
If the workload is memory-intensive, that can be a more important constraint than
the number of available cores. If the workload is interactive (latency-sensitive), it's
undesirable to have too many jobs in flight competing for CPU and I/O, delaying each other.

- Nigel
 
Are you in the high memory consumption scenario which Nigel describes?

If you're going to develop it anyway, you could try submitting a patch to GNU Parallel. 
