bug-parallel

Re: Cannot specify the number of threads for parsort


From: Mario Roy
Subject: Re: Cannot specify the number of threads for parsort
Date: Wed, 22 Feb 2023 23:16:30 -0600


I ran the following commands to capture the number of sort processes in another terminal window. Seeing more than 1,100 processes caught me off guard, and I experienced the system briefly locking up while running parsort (many files). That is definitely a second wish-list item: why so many processes? I witnessed individual processes consuming 11% CPU.

parsort

while true; do ps -ef | grep sort | grep -v parsort | wc -l; sleep 1; done

mcesort

while true; do ps -ef | grep sort | grep -v mcesort | wc -l; sleep 1; done
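A side note on the counters above: `ps -ef | grep sort` also matches the `grep` process itself and any other command line containing "sort", so the count can be off by one or two. A tighter sketch, assuming `pgrep` is available, matches the exact process name:

```shell
# Count processes whose name is exactly "sort"; -c prints the count,
# -x requires an exact name match (no substring hits like "parsort").
count_sorts() {
  pgrep -cx sort || true   # pgrep exits non-zero when the count is 0
}

count_sorts
```

Wrap it in the same `while true; do …; sleep 1; done` loop to watch the count over time.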



On Wed, Feb 22, 2023 at 10:58 PM Mario Roy <marioeroy@gmail.com> wrote:
Aloha,

Congratulations on supporting parsort --parallel. I was wondering why parsort creates such a high number of processes (many files) until I read the 20230222 release notes. Now I understand.

First and foremost, mcesort is simply a parsort variant that uses a mini-MCE parallel engine integrated into the script. I reduced the MCE code to the essentials (fewer than 1,500 lines). The main application is 400 lines, so mcesort is currently under 1,900 lines in total.

1) mcesort supports -A (sets LC_ALL=C) and -j, --parallel with N, N%, or max (e.g. -j12, -j50%, -jmax).

2) Currently, mcesort does not allow -S, --buffer-size. In testing, specifying -S or --buffer-size led to more memory consumption and degraded performance. Is -S, --buffer-size helpful from a parsort/mcesort perspective?

3) mcesort runs -z, --zero-terminated in parallel, unlike parsort, which consumes one core for it.

4) mcesort accepts --check, -c, -C, --debug, and --merge [--batch-size]. For these it simply passes through and runs sort serially, never returning, when checking or merging already-sorted input or debugging incorrect key usage.

    exec('sort', @ARGV) if $pass_through;
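For illustration, the same pass-through decision could be sketched in shell; the option list is taken from point 4 above, and the function name is my own, not mcesort's:

```shell
# Decide whether the given sort arguments imply check/merge/debug mode,
# which mcesort hands off entirely to a single serial sort(1) via exec.
is_pass_through() {
  for arg in "$@"; do
    case "$arg" in
      -c|-C|--check|--check=*|--debug|--merge) return 0 ;;
    esac
  done
  return 1
}

# In the real script the positive case ends with: exec sort "$@"
if is_pass_through "$@"; then
  echo "pass-through: running sort serially"
fi
```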

Respectfully, I captured results using parsort 20230222 and mcesort (to be released soon). 

#################################################################
~ List of Files (total: 6 * 92 = 552 files), 17 GB
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

parsort
~~~~~~~
  $ time LC_ALL=C parsort --parallel=64 -k1 \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* | cksum
  867518687 17463513600

    1,109 processes created (brief system lockup)
    physical memory consumption peak 7.79 GB

    real   1m59.565s
    user   1m27.735s
    sys    0m22.013s

mcesort
~~~~~~~
  $ time LC_ALL=C mcesort --parallel=64 -k1 \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
      1 sort and 1 merge per worker
    physical memory consumption peak 2.92 GB

    real   1m57.209s
    user   21m55.152s
    sys    1m15.790s


#################################################################
~ Single File 17 GB
~   cat /dev/shm/big* >> /dev/shm/huge (6 times)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
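The concatenation step above can be sketched as a small helper; `replicate_into` is my name for it, not part of either tool:

```shell
# replicate_into DEST N SRC... : write the SRC files into DEST, repeated
# N times over, mirroring "cat /dev/shm/big* >> /dev/shm/huge (6 times)".
replicate_into() {
  dest=$1 n=$2
  shift 2
  : > "$dest"                 # start from an empty destination
  i=0
  while [ "$i" -lt "$n" ]; do
    cat "$@" >> "$dest"
    i=$((i + 1))
  done
}

# e.g. replicate_into /dev/shm/huge 6 /dev/shm/big*
```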

parsort
~~~~~~~
  $ time LC_ALL=C parsort --parallel=64 -k1 /dev/shm/huge | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
    physical memory consumption peak 2.90 GB

    real   2m11.056s
    user   1m39.646s
    sys    0m22.040s

mcesort
~~~~~~~
  $ time LC_ALL=C mcesort --parallel=64 -k1 /dev/shm/huge | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
    physical memory consumption peak 2.83 GB

    real   1m53.255s
    user   23m52.807s
    sys    0m58.450s

#################################################################
Standard Input
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

parsort
~~~~~~~
  $ time cat \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  | LC_ALL=C parsort --parallel=64 -k1 | cksum
  867518687 17463513600

    193 processes created (no system lockup, fluid)
    physical memory consumption peak 3.05 GB

    real   2m18.442s
    user   1m39.051s
    sys    0m27.548s


mcesort
~~~~~~~
  $ time cat \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  | LC_ALL=C mcesort --parallel=64 -k1 | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
    physical memory consumption peak 2.75 GB

    real   1m57.487s
    user   22m16.476s
    sys    1m15.481s




On Sat, Feb 18, 2023 at 3:42 AM Mario Roy <marioeroy@gmail.com> wrote:
Are you in the high memory consumption scenario which Nigel describes?

The issue is running parsort on large-scale machines. Running on all cores is often undesirable for memory-intensive applications; the memory channels eventually become the bottleneck.

The mcesort variant has reached the incubator stage (code 100% complete). It supports -j (short option) and --parallel. Note that specifying 1% resolves to at least 1 CPU core.

-jN              integer value
-jN%             percentage value; e.g. -j1% .. -j100%
-jmax or -jauto  same as 100%: all N available logical cores
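As a sketch of how those forms might resolve to a worker count (the clamping rule is my reading of the 1% remark above, not mcesort's actual code):

```shell
# resolve_jobs SPEC NCPU : turn a -j argument (N, N%, max, auto) into a
# concrete worker count for a machine with NCPU logical cores.
resolve_jobs() {
  spec=$1 ncpu=$2
  case "$spec" in
    max|auto)
      echo "$ncpu" ;;
    *%)
      pct=${spec%\%}
      n=$((ncpu * pct / 100))
      if [ "$n" -lt 1 ]; then n=1; fi   # 1% still means at least 1 core
      echo "$n" ;;
    *)
      echo "$spec" ;;
  esac
}
```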

The test file is a mockup of randomly generated key-value pairs. There are 323+ million rows.

$ ls -lh /dev/shm/huge
-rw-r--r-- 1 mario mario 2.8G Feb 18 00:48 /dev/shm/huge

$ wc -l /dev/shm/huge
323398400 /dev/shm/huge
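For reference, a tiny generator in the same spirit as that mockup; the key format and tab separator here are illustrative assumptions, not the actual test data:

```shell
# gen_kv N : emit N pseudo-random key-value rows, one per line.
gen_kv() {
  awk -v n="$1" 'BEGIN {
    srand()
    for (i = 0; i < n; i++)
      printf "%08d\t%d\n", int(rand() * 100000000), i
  }'
}

# e.g. gen_kv 323398400 > /dev/shm/huge
```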


Using parsort, one cannot specify the number of cores used to process a file, so it spawns 64 workers on this machine. The Perl MCE variant performs similarly. I get better throughput running 38 workers versus 64.

$ time parsort /dev/shm/huge | cksum
3409526408 2910585600

real 0m18.147s
user 0m13.920s
sys 0m3.660s

$ time mcesort -j64 /dev/shm/huge | cksum
3409526408 2910585600

real 0m18.081s
user 2m52.082s
sys 0m10.860s

$ time mcesort -j38 /dev/shm/huge | cksum
3409526408 2910585600

real 0m16.788s
user 2m21.384s
sys 0m8.263s


Regarding standard input, I can run parsort using a wrapper script (given at the top of this email thread). Notice how parsort has better throughput running 38 workers.

$ time parsort -j64 </dev/shm/huge | cksum
3409526408 2910585600

real 0m19.553s
user 0m14.030s
sys 0m3.520s


$ time mcesort -j64 </dev/shm/huge | cksum
3409526408 2910585600

real 0m18.312s
user 2m42.042s
sys 0m11.546s

$ time parsort -j38 </dev/shm/huge | cksum
3409526408 2910585600

real 0m17.609s
user 0m11.856s
sys 0m3.451s


$ time mcesort -j38 </dev/shm/huge | cksum
3409526408 2910585600

real 0m16.819s
user 2m21.108s
sys 0m9.523s



I find it interesting that parsort's reported user time does not reflect the tally of all workers' time.

This was a challenge, but I can see the finish line; I hope to release by next week.



On Fri, Feb 17, 2023 at 2:49 PM Rob Sargent <robjsargent@gmail.com> wrote:
On 2/17/23 13:41, Mario Roy wrote:
It looks like we may not get what we kindly asked for, so I started making "mcesort" using Perl MCE's chunking engine.

On Thu, Feb 16, 2023 at 5:08 AM Nigel Stewart <nigels@nigels.com> wrote:
Can you elaborate on what I am missing from the picture?

Ole,

Perhaps your workloads are more CPU- and I/O-intensive, and latency is less of a priority.
If the workload is memory-intensive, that can be a more important constraint than
the number of available cores. If the workload is interactive (latency-sensitive), it's
undesirable to have too many jobs in flight competing for CPU and I/O, delaying each other.

- Nigel
 
Are you in the high memory consumption scenario which Nigel describes?

If you're going to develop it anyway, you could try submitting a patch to GNU Parallel. 
