Re: [Bug-apl] Experimental OpenMP patch

From: Juergen Sauermann
Subject: Re: [Bug-apl] Experimental OpenMP patch
Date: Wed, 12 Mar 2014 12:18:12 +0100
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130330 Thunderbird/17.0.5

Hi Elias,

I believe we should first find out how big the thread dispatch effort actually is,
because coalescing can also backfire by creating unequally distributed intermediate results.

For scalar functions you have a parallel execution time of:

a + b×⌈N÷P where a = startup time (thread dispatch and clean-up), b = cost per cell, N = data size, and P = core count.

In e.g. A + B + C, coalescing would reduce the time from 2×(a + b×⌈N÷P) to a + 2×(b×⌈N÷P).

On the other hand in A + B ⍴ C things could be completely different because ⍴ can create a very unevenly sized right
argument of +.
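
The cost model above can be written down as a small sketch. The function names below are made up for illustration and are not part of GNU APL; a and b are taken as abstract integer time units, and all values are assumed to be positive.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling of n / p for positive integers (the APL expression ⌈N÷P).
constexpr std::int64_t ceil_div(std::int64_t n, std::int64_t p)
{
    return (n + p - 1) / p;
}

// Estimated time of one parallel scalar function: a + b×⌈N÷P
// a = startup time, b = cost per cell, n = data size, p = core count.
constexpr std::int64_t parallel_time(std::int64_t a, std::int64_t b,
                                     std::int64_t n, std::int64_t p)
{
    return a + b * ceil_div(n, p);
}

// A + B + C executed as two separate parallel passes: 2×(a + b×⌈N÷P)
constexpr std::int64_t separate_time(std::int64_t a, std::int64_t b,
                                     std::int64_t n, std::int64_t p)
{
    return 2 * parallel_time(a, b, n, p);
}

// A + B + C coalesced into one parallel pass: a + 2×(b×⌈N÷P)
constexpr std::int64_t coalesced_time(std::int64_t a, std::int64_t b,
                                      std::int64_t n, std::int64_t p)
{
    return a + 2 * b * ceil_div(n, p);
}
```

In this model the saving from coalescing is exactly one startup time a, regardless of N and P, so it only matters when a is large compared to b×⌈N÷P. The model does not capture the uneven distribution mentioned for ⍴.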

I guess we have to look into the details of every function and operator to see what can be done in terms of parallel execution.
Starting with scalar functions seems to be a good strategy and I believe we should finish that first before looking into
more complex scenarios.

/// Jürgen

On 03/11/2014 04:24 PM, Elias Mårtenson wrote:
Oh and one more thing: Have you given any thought to my comments re. the coalescing of certain functions to reduce thread dispatch effort? (also, add some more functions to the no-copy optimisation?)


On 11 March 2014 23:22, Elias Mårtenson <address@hidden> wrote:
I agree. I just wanted to point out that without a runtime option, delivering binary versions will be hard, forcing the package maintainers to choose a default that will surely be wrong for the majority of users.

That said, being able to choose a compile-time value is good too.


On 11 March 2014 23:20, Juergen Sauermann <address@hidden> wrote:

we could do it similar to the LOG macro where you can choose between
more efficient compile-time settings and less efficient run-time settings.
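
A minimal sketch of such a compile-time/run-time switch, assuming a hypothetical macro CORES_STATIC that ./configure could define. Neither the macro nor the function names below exist in GNU APL; they only illustrate the idea.

```cpp
#include <cassert>

#ifdef CORES_STATIC
// Compile-time setting: the core count is a constant, so the compiler
// can fold it into loop bounds. Requests to change it are ignored.
constexpr int core_count() { return CORES_STATIC; }
inline void set_core_count(int) {}
#else
// Run-time setting: the core count lives in a variable that the user
// (or an automatic probe at start-up) can change. Defaults to 1.
inline int & core_count_setting()
{
    static int cores = 1;
    return cores;
}

inline int core_count() { return core_count_setting(); }

// Values below 1 are clamped to 1 (no parallel execution).
inline void set_core_count(int n)
{
    core_count_setting() = n < 1 ? 1 : n;
}
#endif
```

Code that calls core_count() looks the same in both builds; only the build with CORES_STATIC turns the call into a constant.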

It is important that we do these things properly from the outset to avoid
too many changes later on.

/// Jürgen

On 03/11/2014 04:10 PM, Elias Mårtenson wrote:
May I suggest that being able to choose the number of cores at runtime should actually be the default. Remember that most Linux distributions will not compile the source on the local machine and instead distribute binaries.

Having some #ifdefs would be good, and having runtime user-selected (or automatically based on cores) number of threads as default is important for this reason.


On 11 March 2014 23:07, Juergen Sauermann <address@hidden> wrote:
Hi David,

looks good! Some comments, though.

1. You could adapt src/testcases/Performance.pt with some longer
scalar functions in order to get some performance figures. You can start it like this:

./apl -T testcases/Performance.pt

2. I believe we should not bother the user with specifying parallelization parameters in ⎕SYL.
I would rather ./configure CORES=n with n=1 meaning no parallel execution, CORES=auto
being the number of cores on the build machine, and explicit numbers n>1 meaning that
n cores shall be used. This would generate slightly faster code than computing array bounds
at runtime. It's a bit more hassle for the user, but may pay off soon.

3. Yes, GNU APL throws many exceptions (almost every APL error is thrown from somewhere),
and I was expecting that we have to catch them on the throwing processor. Not too difficult if
we do it on the top level.

4. It would be good to understand how the OpenMP loops work. I could imagine one of two strategies:

- in loop(j, MAX)   thread j executes iteration j, j+CORES, ...
- thread j executes iterations j*MAX/CORES ... (j+1)*MAX/CORES

The first strategy interleaves the data and is more intuitive,
while the second uses blocks of data, is more cache-friendly, and therefore probably
gives better performance.

5. Not sure if your earlier comment on letting the scheduler decide is correct. I have been doing
pthread programming in the past and I have seen cases where the scheduler fooled itself and
led to cases where the same problem took more than double the capacity compared to explicit
affinity on a 4-core CPU. I would expect that APL generates very fine-grained and short-lived
pieces of execution and the scheduler may not be optimized for that. I guess we have to try that out.
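
The two strategies from item 4 can be spelled out as a small sketch that lists which iterations thread j would execute. The function names are invented for this illustration and this is not how OpenMP is implemented. As far as I know, OpenMP's schedule(static, 1) clause gives the interleaved distribution and schedule(static) without a chunk size gives roughly equal blocks, but the schedule used when no clause is given is implementation-defined.

```cpp
#include <cassert>
#include <vector>

// Strategy 1 (interleaved): thread j executes iterations j, j+cores, ...
std::vector<int> interleaved_iterations(int j, int max, int cores)
{
    std::vector<int> result;
    for (int i = j; i < max; i += cores)
        result.push_back(i);
    return result;
}

// Strategy 2 (blocks): thread j executes iterations
// j*max/cores ... (j+1)*max/cores - 1 (the upper bound is exclusive).
std::vector<int> block_iterations(int j, int max, int cores)
{
    // long long avoids overflow in the multiplication for large max
    const long long begin = static_cast<long long>(j) * max / cores;
    const long long end   = static_cast<long long>(j + 1) * max / cores;

    std::vector<int> result;
    for (long long i = begin; i < end; ++i)
        result.push_back(static_cast<int>(i));
    return result;
}
```

With max = 10 and cores = 4, thread 1 gets iterations 1, 5, 9 under the first strategy and 2, 3, 4 under the second. In the block strategy each thread touches one contiguous range of cells, which is where the cache-friendliness comes from.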

/// Jürgen

On 03/11/2014 08:02 AM, David B. Lamkins wrote:
Juergen's suggestion prompted me to attempt an implementation using
OpenMP rather than the by-hand coding that I had been anticipating.
Attached is a quick-and-dirty patch to enable GNU APL to be built with
OpenMP support.

./configure --with-openmp

There are many rough edges, both in the Makefile and the code.

--with-openmp would ideally check to see whether the compiler supports
OpenMP. It may be necessary to check the compiler version, as different
compilers support different versions of OpenMP. Also, I've assumed
compilation on/for Linux despite the fact that GNU APL and OpenMP should
be buildable with the right Windows compiler.

As one might expect, OpenMP requires that any throw from a worker thread
must be caught by the same thread. I'm almost certain that this
restriction could be violated by GNU APL code as currently written.
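
One common way to satisfy this restriction is to catch everything inside the loop body, remember one exception, and rethrow it after the parallel region has ended. The sketch below is not taken from the patch: for_each_cell is a made-up helper, and std::domain_error in the usage example merely stands in for whatever GNU APL actually throws. When compiled without OpenMP the pragmas are ignored and the loop runs serially.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <vector>

// Applies fun to every cell. An exception thrown by fun is caught inside
// the loop body, i.e. by the same thread that threw it. One of the caught
// exceptions is remembered and rethrown after the loop has finished.
template <typename Fun>
void for_each_cell(std::vector<double> & cells, Fun fun)
{
    std::exception_ptr error;   // shared between all threads
    const long count = static_cast<long>(cells.size());

    #pragma omp parallel for
    for (long i = 0; i < count; ++i)
    {
        try
        {
            cells[i] = fun(cells[i]);
        }
        catch (...)
        {
            // only one thread at a time may update the shared pointer
            #pragma omp critical
            {
                if (!error)
                    error = std::current_exception();
            }
        }
    }

    // back on the calling thread: report the error as usual
    if (error)
        std::rethrow_exception(error);
}
```

For example, for_each_cell(v, [](double x) { if (x > 3) throw std::domain_error("DOMAIN ERROR"); return x; }) lets the caller catch std::domain_error as in serial code. Note that the remaining iterations still run after the first failure, so cells may be partially updated when the exception reaches the caller.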

The good news, though, is that the changes are benign; in the absence of
--with-openmp, GNU APL's behavior is unchanged.

With OpenMP support, ⎕syl is extended to access some of OpenMPs

I've done only trivial testing at this point; just enough to verify that
compiling OpenMP support doesn't obviously break GNU APL.

I haven't confirmed that the OpenMP #pragmas on the key loops in
SkalarFunction.cc have any effect on execution time or processor core
utilization. I hope to do more testing later this week.

Best wishes,
