Proposal for GNU APL work
David B. Lamkins
20140308

Goal
----

Have GNU APL take advantage of multiple CPU cores when available.

Implementation Sketch
---------------------

Have a fixed pool of worker threads at the ready. Use pthreads
[pthreads(7)].

For the initial implementation, start with scalar functions whose result
has the same shape as their argument(s), since that is the simplest
case. Once the ravel of the data exceeds a limit, give each worker
thread its own segment of the ravel to compute.

Use ⎕syl to specify the ravel size beyond which parallel processing is
applied. This is a read-write setting. Negative values disable parallel
execution; all negative values, when written, are clamped to -1. A value
of 0 causes all parallelizable tasks to be run in a worker thread rather
than the main thread. Normally the parallelism threshold will be much
larger than zero, so that parallelism is applied only to tasks large
enough to overcome the overhead of running in a worker thread. This
setting may be changed at any time, although changing it to a negative
value will not affect any active parallel tasks.

Use ⎕syl to configure the number of worker threads and the worker
threads' stack size. These are both read-write values. Attempting to set
an unreasonable value (i.e. outside of hard limits determined at
./configure time) results in a DOMAIN error. The number of worker
threads and the stack size may only be changed when no parallel tasks
are active. (This won't be a concern for the initial implementation, but
it may become relevant if we later parallelize user-defined functions.)
Attempting a change while parallel tasks are active raises a new error,
tentatively named THREADS_ACTIVE. Changing the stack size causes the
thread pool to be rebuilt, since a thread's stack size may only be set
at creation. Changing the size of the thread pool while holding the
stack size constant results in incremental changes to the pool. (A
sketch of these setting rules appears at the end of this section.)

GNU APL uses a loop() macro for iteration control. Many uses of loop()
don't involve iterating over the ravel of an array; keep those as-is,
and create a ploop() macro to handle the parallel cases. The ploop()
macro starts with a fast check of the ravel size against the parallel
processing threshold. At or below the threshold, run loop() in its
original form. Above the threshold, transfer to a new function that
partitions the ravel, runs each partition on a worker thread and waits
for all threads to complete. (A dispatch sketch also appears at the end
of this section.)

There are about a half-dozen distinct small blocks inside the
parallelizable loop()s of the initial implementation. These blocks will
be extracted as functions, which must then be wrapped in order to be
run by worker threads (which expect parameterless functions). Below the
parallel processing threshold, ploop() calls the extracted functions
in-place (i.e. without a worker thread) to avoid the overhead of
scheduling a thread for a small task.

Note that nested arrays can cause recursive evaluation. We can run more
threads than cores and trust the scheduler to sort things out; threads
on outer arrays will block until those on inner arrays finish. Don't
use CPU affinity for the threads; let the scheduler do its job.
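The ⎕syl rules above amount to a small amount of clamping and
validation logic. The following C++ sketch restates them; all names,
limits and defaults are hypothetical stand-ins, not GNU APL's actual
code:

    // Sketch of the proposed ⎕syl setting rules. All names, limits and
    // defaults here are illustrative placeholders.
    #include <stdexcept>

    // Hard limits that would be fixed at ./configure time.
    enum { MIN_WORKERS = 0, MAX_WORKERS = 256 };
    enum { MIN_STACK = 64*1024, MAX_STACK = 64*1024*1024 };

    static long parallel_threshold = 4096;       // illustrative default
    static long worker_count = 4;
    static long worker_stack_size = 1024*1024;
    static bool parallel_tasks_active = false;

    void set_parallel_threshold(long value) {
        // All negative values are clamped to -1 (parallelism disabled);
        // 0 sends every parallelizable task to a worker thread. This
        // setting may be changed at any time.
        parallel_threshold = (value < 0) ? -1 : value;
    }

    void set_worker_count(long value) {
        if (value < MIN_WORKERS || value > MAX_WORKERS)
            throw std::domain_error("DOMAIN ERROR");     // hard limits
        if (parallel_tasks_active)
            throw std::runtime_error("THREADS_ACTIVE");  // proposed error
        worker_count = value;   // pool is grown or shrunk incrementally
    }

    void set_worker_stack_size(long value) {
        if (value < MIN_STACK || value > MAX_STACK)
            throw std::domain_error("DOMAIN ERROR");
        if (parallel_tasks_active)
            throw std::runtime_error("THREADS_ACTIVE");
        worker_stack_size = value;
        // A stack size takes effect only at pthread_create() time, so
        // changing it forces the whole pool to be torn down and rebuilt.
    }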
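And here is a sketch of the ploop()-style dispatch itself: the fast
threshold check, the partitioning of the ravel, and the wrapping of an
extracted block function into the single-void*-argument form a pthread
entry point requires. For brevity it creates and joins threads per
call; the proposed fixed pool would instead hand these tasks to
already-running workers. All names are illustrative:

    // Sketch of ploop()-style dispatch over the ravel of an array.
    #include <pthread.h>
    #include <cstddef>
    #include <vector>

    // Stand-in for an extracted loop body: apply a scalar function to
    // the segment [from, to) of the ravel.
    typedef void (*SegmentFun)(const double* arg, double* result,
                               size_t from, size_t to);

    // Bundles a SegmentFun with its arguments so the thread entry
    // point below needs nothing beyond its single void* parameter.
    struct SegmentTask {
        SegmentFun fun;
        const double* arg;
        double* result;
        size_t from, to;
    };

    static void* run_segment(void* vp) {
        SegmentTask* t = static_cast<SegmentTask*>(vp);
        t->fun(t->arg, t->result, t->from, t->to);
        return 0;
    }

    // At or below the threshold (or when it is negative), run the body
    // in-place; above it, give each worker its own segment and wait
    // for all of them to complete.
    void parallel_ravel(SegmentFun fun, const double* arg, double* result,
                        size_t ravel_len, long threshold,
                        size_t worker_count) {
        if (threshold < 0 || (long)ravel_len <= threshold) {
            fun(arg, result, 0, ravel_len);   // small task: no thread
            return;
        }
        if (worker_count == 0) worker_count = 1;

        std::vector<SegmentTask> tasks(worker_count);
        std::vector<pthread_t> threads(worker_count);
        const size_t chunk = (ravel_len + worker_count - 1) / worker_count;

        for (size_t w = 0; w < worker_count; ++w) {
            size_t from = w * chunk;
            size_t to = (from + chunk < ravel_len) ? from + chunk
                                                   : ravel_len;
            if (from > to) from = to;    // more workers than elements
            SegmentTask t = { fun, arg, result, from, to };
            tasks[w] = t;
            pthread_create(&threads[w], 0, run_segment, &tasks[w]);
        }
        for (size_t w = 0; w < worker_count; ++w)
            pthread_join(threads[w], 0);
    }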
Windows Port
------------

GNU APL already runs on Windows. It would be a shame not to preserve
parity with the Linux platform as we add support for parallelism. Look
into using a compatibility library to provide a pthreads API on
Windows. See, for example: https://www.sourceware.org/pthreads-win32/ .

Concerns
--------

These concerns must be investigated and resolved.

- Are there any C++ exception exits from parallelizable loop()s? C++
  will not propagate an exception from a worker thread to the main
  thread; the threads are on different stacks, and an exception left
  uncaught in a worker terminates the process. Either find a way to
  avoid throwing these exceptions altogether, or install a local catch
  in each worker thread (assuming that this is even possible) and have
  the main thread roll up the detected exceptions signalled by the
  workers. (A containment sketch follows this list.)

- Likewise, do the parallelizable loop()s have any non-reentrant
  functions, shared data or I/O?

- How best to set thread stack size? Recursive APL functions may use
  large stacks. In the absence of APL recursion, the stack space is
  roughly proportional to the depth of array nesting.
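For the exception concern, the "local catch" option might look like the
following sketch. It assumes a C++11 toolchain (std::exception_ptr); an
older toolchain would have to reduce the exception to an error code
instead. Names are illustrative:

    // Sketch of the "local catch in each worker" option: the worker
    // never lets an exception escape its entry point; it captures the
    // failure, and the main thread re-raises it on its own stack after
    // all workers have joined.
    #include <cstddef>
    #include <exception>

    struct WorkerResult {
        bool failed;
        std::exception_ptr error;   // captured exception, if any
    };

    // pthread entry point: run the assigned work under a local catch.
    static void* guarded_worker(void* vp) {
        WorkerResult* r = static_cast<WorkerResult*>(vp);
        r->failed = false;
        try {
            // ... run the assigned segment of the ravel here ...
        }
        catch (...) {
            r->failed = true;
            r->error = std::current_exception();
        }
        return 0;
    }

    // After pthread_join() on every worker, roll up the detected
    // failures; here the first one found is re-raised in the main
    // thread, where the interpreter's normal handling can see it.
    void roll_up(const WorkerResult* results, size_t count) {
        for (size_t w = 0; w < count; ++w)
            if (results[w].failed)
                std::rethrow_exception(results[w].error);
    }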