Re: [Bug-apl] Proposal for initial multicore support

Hello David,

You misunderstood (or more likely, I wan't very clear about) what I meant about the job construction. I was merely using the EACH operator together with a lambda to illustrate the different ways a simple _expression_ such as 1+2×A can be decomposed.

My first example illustrates how such the jobs should be formed. The second example shows how they would be formed if you do not perform any type of job construction optimisation.

I think the solution is to coalesce multiple functions that work on individual cells (i.e. they don't return some aggregate of multiple cells). These functions include:

Algebraic functions (+, -, etc...)
Comparative functions (<, ≤, =, etc...)
EACH operator (assuming none of the functions called use the assignment operator ←)
Circle functions: N○R
Ceiling, floor functions, right and left tack, etc...

What makes these functions special is that the only operate on a single cell in isolation. This also means that a sequence of these can be coalesced into a single function that can be dispatched as a multithreaded job without having to do any intermediate synchronisation.

This works very well with the "temp value" optimisation[1] that was recently implemented. The idea being that if all of these functions are updated to take advantage of this optimisation the coalesced job can be set to work on a part of the array and then all the interpreted have to do is to wait for the jobs to finish and the array will have been changed in-place with the result.

I've been thinking about implementing a proof-of-concept that does multithreaded dispatching. However, I've had more important things to do, and frankly I didn't want to spend too much time on it knowing that Jürgen has spent more time thinking about this stuff than me.

Footnotes:

[1] http://lists.gnu.org/archive/html/bug-apl/2014-02/msg00000.html

Regards,

Elias

On 11 March 2014 02:19, David Lamkins <address@hidden> wrote:

That's a good analysis of the larger landscape, Elias. Thanks for posting.

I intend to avoid many of those issues by working on the (I think)
simplest case of scalar functions that don't reshape the result. That
obviously doesn't help with the more general cases...

Regarding your first example of parallelizing the each operator:

{1+2×⍵}¨A

It seems that it's suboptimal to decompose the lambda into individual
operations. What if the lambda could (somehow) be executed as a whole
rather than decomposed? TBH, I don't know enough about the interpreter
to know whether this is even feasible.

Also, I admit that I haven't given more than passing thought to
consequences of intermediate results in the lambda. It might be useful
to think in terms of doing a parallel each only in the case where the
function is a composition of scalar functions. (That simplification
would address your concern about side-effects, too.)

On Mon, Mar 10, 2014 at 7:31 AM, Elias Mårtenson <address@hidden> wrote:
> This is something that I have been thinking about as well. And the thing
> that mostly concerns me is how multiple operations can be optimised. Part of
> that thinking is actually part of my idea to keep track of temporary objects
> so that they can be modified in-place. The core of that was already
> implemented by Jürgen. Thanks for that. :-) (by the way, there are plenty
> of more functions that could benefit from in-place modification)
>
> About multithreading. I have a few concerns, and I'll outline them all here
> for posterity:
>
> Task dispatch and job construction
>
> If you have a large array A (say, 8k rows) and you perform, say, the
> operation 1+2×A. If you have 8 cores, you want to perform the following
> operations:
>
> Core 1: {1+2×⍵}¨A[⍳1000;]
> Core 2: {1+2×⍵}¨A[1000+⍳1000;]
> Core 3: {1+2×⍵}¨A[2000+⍳1000;]
> etc...
>
> Note that with in-place modifications this can be quite efficient.
>
> However, note that there must be some intelligent algorithm figuring out
> that the sequence 1+2× can be merged into a single threadable job.
>
> If this optimisation is not done, then we'll probably see inefficiencies
> like this:
>
> With a plain non-optimised multi-core dispatching we would be wasting a lot
> of time doing dispatch and subsequent collections of results:
>
> Core 1: {2×⍵}¨A[⍳1000;]
> Core 2: {2×⍵}¨A[1000+⍳1000;]
> etc...
> Then, collect the result of the multiplication followed by another dispatch
> to execute:
> Core 1: {1×⍵}¨A[⍳1000;]
> Core 2: {1×⍵}¨A[1000+⍳1000;]
> etc...
>
> Unless the arrays that you are working on are very large, you are going to
> spend more time synchronising the threads than actually doing real work.
>
> Thread model
>
> Unless all threadable operations always take the same amount of time (i.e.
> no branching) a simple thread-pool will not be enough. One would need some
> kind of job-stealing queue system. There are several available for C++, but
> as I'm not a C++ developer I don't have any experience with them. However,
> Threading Building Blocks looks promising.
>
> Synchronisation
>
> Having the EACH operator being threadable sound incredibly tempting, but how
> should the system deal with synchronisation? It's clear that most operations
> does not have to worry about synchronisation since they only work on its
> small slice of an array, but what about something like this:
>
> N←0 ◊ {⍵+N←N+1}¨⍳100
>
> The easy way around this would be to simply not multi-thread functions that
> contains the assignment operator. The benefit of this is that one can do
> away with locking almost completely.
>
> Multithreaded memory management
>
> Right now, instances of Value are allocated using the default allocator.
> This allocator calls down to malloc(), which is not the fastest in the
> world. Most implementations of malloc() also have a single lock, making the
> allocation operation effectively single-threaded. Since GNU APL does a lot
> of memory allocations, this will definitely be a bottleneck. I would suggest
> using an allocator that has a thread-local memory pool, ensuring that two
> parallel memory allocations will not interfere with each other.
>
> I'm sure there are more issues, but these are the ones I have been thinking
> about so far. Comments welcome.
>
> Regards,
> Elias
>
>
> On 10 March 2014 19:44, Juergen Sauermann <address@hidden>
> wrote:
>>
>> Hi David,
>>
>> I think your ideas are correct. I have planned multicore support for GNU
>> APL 1.4 and
>> every help is welcome.
>>
>> Actually parallel processing was one of the main reasons for me to write
>> GNU APL.
>> I published a Ph.D thesis "A parallel APL computer" (in German) back in
>> 1990. We had
>> built a prototype based on multiple MC68020 processors and demonstrated
>> that most
>> APL primitive functions and operators can benefit from parallel execution.
>> The only
>> thing missing at that time was the interpreter (which is now GNU APL).
>>
>> My thoughts so far were going in the direction of OpenMP which uses a
>> similar approach
>> as the one you describe. But I haven't worked with it yet and there are
>> other solutions around.
>>
>> The most important step is to find the best approach for GNU APL. That
>> probably means a lot
>> of benchmarking and experimenting.
>>
>> In GNU APL every primitive function is a separate object so we can decide
>> on a per-function
>> basis at what length we go multi-core.
>>
>> /// Jürgen
>>
>>
>>
>>
>>
>> On 03/10/2014 06:33 AM, David B. Lamkins wrote:
>>>
>>> This weekend I spent a few hours poking around in the GNU APL source
>>> code while thinking about what it might take to exploit a multicore CPU.
>>> I've collected my thoughts on what it might take to do an initial
>>> implementation (see attached).
>>>
>>> Juergen (and anyone else who has worked in the code base), I'd
>>> appreciate some feedback regarding the feasibility of my proposal, as
>>> well as your comments about things I've overlooked.
>>>
>>> This is something I'd be willing to work on in my spare time.
>>>
>>>
>>
>>
>

--
"The secret to creativity is knowing how to hide your sources."
Albert Einstein

http://soundcloud.com/davidlamkins
http://reverbnation.com/lamkins
http://reverbnation.com/lcw
http://lamkins-guitar.com/
http://lamkins.net/
http://successful-lisp.com/

From:	Elias Mårtenson
Subject:	Re: [Bug-apl] Proposal for initial multicore support
Date:	Tue, 11 Mar 2014 10:54:18 +0800