Proposal for GNU APL work
David B. Lamkins
20140308

Goal
----

Have GNU APL take advantage of multiple CPU cores when available.

Implementation Sketch
---------------------

Have a fixed pool of worker threads at the ready. Use pthreads
[pthreads(7)].

For the initial implementation, start with scalar functions whose result
has the same shape as their argument(s), since that is the simplest
case. Once the ravel of the data exceeds a limit, give each worker
thread its own segment of the ravel to compute.

Use ⎕syl to specify the ravel size beyond which parallel processing is
applied. This is a read-write setting. Negative values disable parallel
execution; all negative values, when written, are clamped to -1. A value
of 0 causes all parallelizable tasks to be run in a worker thread rather
than the main thread. Normally the parallelism threshold will be much
larger than zero, so that parallelism is applied only to tasks large
enough to overcome the overhead of running in a worker thread. This
setting may be changed at any time, although changing it to a negative
value will not affect any active parallel tasks.

Use ⎕syl to configure the number of worker threads and the worker
threads' stack size. These are both read-write values. Attempting to set
an unreasonable value (i.e. outside of hard limits determined at
./configure time) results in a DOMAIN error. The number of worker
threads and the stack size may only be changed when no parallel tasks
are active. (This won't be a concern for the initial implementation, but
it may become relevant if we later parallelize user-defined functions.)
Attempting a change while parallel tasks are active raises a new error,
tentatively named THREADS_ACTIVE. Changing the stack size causes the
thread pool to be rebuilt, since a thread's stack size may only be set
at creation. Changing the size of the thread pool while holding the
stack size constant results in incremental changes to the pool. (A
sketch of these setting rules appears at the end of this section.)

GNU APL uses a loop() macro for iteration control. Many uses of loop()
don't involve iterating over the ravel of an array; keep those as-is,
and create a ploop() macro to handle the parallel cases. The ploop()
macro starts with a fast check of the ravel size against the parallel
processing threshold. At or below the threshold, run loop() in its
original form. Above the threshold, transfer to a new function that
partitions the ravel, runs each partition on a worker thread and waits
for all threads to complete. (A dispatch sketch also appears at the end
of this section.)

There are about a half-dozen distinct small blocks inside the
parallelizable loop()s of the initial implementation. These blocks will
be extracted as functions, which must then be wrapped in order to be
run by worker threads (which expect parameterless functions). Below the
parallel processing threshold, ploop() calls the extracted functions
in-place (i.e. without a worker thread) to avoid the overhead of
scheduling a thread for a small task.

Note that nested arrays can cause recursive evaluation. We can run more
threads than cores and trust the scheduler to sort things out; threads
on outer arrays will block until those on inner arrays finish. Don't
use CPU affinity for the threads; let the scheduler do its job.
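The ⎕syl rules above amount to a small amount of clamping and
validation logic. The following C++ sketch restates them; all names,
limits and defaults are hypothetical stand-ins, not GNU APL's actual
code:

    // Sketch of the proposed ⎕syl setting rules. All names, limits and
    // defaults here are illustrative placeholders.
    #include <stdexcept>

    // Hard limits that would be fixed at ./configure time.
    enum { MIN_WORKERS = 0, MAX_WORKERS = 256 };
    enum { MIN_STACK = 64*1024, MAX_STACK = 64*1024*1024 };

    static long parallel_threshold = 4096;       // illustrative default
    static long worker_count = 4;
    static long worker_stack_size = 1024*1024;
    static bool parallel_tasks_active = false;

    void set_parallel_threshold(long value) {
        // All negative values are clamped to -1 (parallelism disabled);
        // 0 sends every parallelizable task to a worker thread. This
        // setting may be changed at any time.
        parallel_threshold = (value < 0) ? -1 : value;
    }

    void set_worker_count(long value) {
        if (value < MIN_WORKERS || value > MAX_WORKERS)
            throw std::domain_error("DOMAIN ERROR");     // hard limits
        if (parallel_tasks_active)
            throw std::runtime_error("THREADS_ACTIVE");  // proposed error
        worker_count = value;   // pool is grown or shrunk incrementally
    }

    void set_worker_stack_size(long value) {
        if (value < MIN_STACK || value > MAX_STACK)
            throw std::domain_error("DOMAIN ERROR");
        if (parallel_tasks_active)
            throw std::runtime_error("THREADS_ACTIVE");
        worker_stack_size = value;
        // A stack size takes effect only at pthread_create() time, so
        // changing it forces the whole pool to be torn down and rebuilt.
    }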
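And here is a sketch of the ploop()-style dispatch itself: the fast
threshold check, the partitioning of the ravel, and the wrapping of an
extracted block function into the single-void*-argument form a pthread
entry point requires. For brevity it creates and joins threads per
call; the proposed fixed pool would instead hand these tasks to
already-running workers. All names are illustrative:

    // Sketch of ploop()-style dispatch over the ravel of an array.
    #include <pthread.h>
    #include <cstddef>
    #include <vector>

    // Stand-in for an extracted loop body: apply a scalar function to
    // the segment [from, to) of the ravel.
    typedef void (*SegmentFun)(const double* arg, double* result,
                               size_t from, size_t to);

    // Bundles a SegmentFun with its arguments so the thread entry
    // point below needs nothing beyond its single void* parameter.
    struct SegmentTask {
        SegmentFun fun;
        const double* arg;
        double* result;
        size_t from, to;
    };

    static void* run_segment(void* vp) {
        SegmentTask* t = static_cast<SegmentTask*>(vp);
        t->fun(t->arg, t->result, t->from, t->to);
        return 0;
    }

    // At or below the threshold (or when it is negative), run the body
    // in-place; above it, give each worker its own segment and wait
    // for all of them to complete.
    void parallel_ravel(SegmentFun fun, const double* arg, double* result,
                        size_t ravel_len, long threshold,
                        size_t worker_count) {
        if (threshold < 0 || (long)ravel_len <= threshold) {
            fun(arg, result, 0, ravel_len);   // small task: no thread
            return;
        }
        if (worker_count == 0) worker_count = 1;

        std::vector<SegmentTask> tasks(worker_count);
        std::vector<pthread_t> threads(worker_count);
        const size_t chunk = (ravel_len + worker_count - 1) / worker_count;

        for (size_t w = 0; w < worker_count; ++w) {
            size_t from = w * chunk;
            size_t to = (from + chunk < ravel_len) ? from + chunk
                                                   : ravel_len;
            if (from > to) from = to;    // more workers than elements
            SegmentTask t = { fun, arg, result, from, to };
            tasks[w] = t;
            pthread_create(&threads[w], 0, run_segment, &tasks[w]);
        }
        for (size_t w = 0; w < worker_count; ++w)
            pthread_join(threads[w], 0);
    }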
Windows Port
------------

GNU APL already runs on Windows. It would be a shame not to preserve
parity with the Linux platform as we add support for parallelism. Look
into using a compatibility library to provide a pthreads API on
Windows. See, for example: https://www.sourceware.org/pthreads-win32/ .

Concerns
--------

These concerns must be investigated and resolved.

- Are there any C++ exception exits from parallelizable loop()s? C++
  will not propagate an exception from a worker thread to the main
  thread; the threads are on different stacks, and an exception left
  uncaught in a worker terminates the process. Either find a way to
  avoid throwing these exceptions altogether, or install a local catch
  in each worker thread (assuming that this is even possible) and have
  the main thread roll up the detected exceptions signalled by the
  workers. (A containment sketch follows this list.)

- Likewise, do the parallelizable loop()s have any non-reentrant
  functions, shared data or I/O?

- How best to set thread stack size? Recursive APL functions may use
  large stacks. In the absence of APL recursion, the stack space is
  roughly proportional to the depth of array nesting.
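For the exception concern, the "local catch" option might look like the
following sketch. It assumes a C++11 toolchain (std::exception_ptr); an
older toolchain would have to reduce the exception to an error code
instead. Names are illustrative:

    // Sketch of the "local catch in each worker" option: the worker
    // never lets an exception escape its entry point; it captures the
    // failure, and the main thread re-raises it on its own stack after
    // all workers have joined.
    #include <cstddef>
    #include <exception>

    struct WorkerResult {
        bool failed;
        std::exception_ptr error;   // captured exception, if any
    };

    // pthread entry point: run the assigned work under a local catch.
    static void* guarded_worker(void* vp) {
        WorkerResult* r = static_cast<WorkerResult*>(vp);
        r->failed = false;
        try {
            // ... run the assigned segment of the ravel here ...
        }
        catch (...) {
            r->failed = true;
            r->error = std::current_exception();
        }
        return 0;
    }

    // After pthread_join() on every worker, roll up the detected
    // failures; here the first one found is re-raised in the main
    // thread, where the interpreter's normal handling can see it.
    void roll_up(const WorkerResult* results, size_t count) {
        for (size_t w = 0; w < count; ++w)
            if (results[w].failed)
                std::rethrow_exception(results[w].error);
    }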