Hi Louis,
most people will assume that
A←A,1 2 3
simply appends 1 2 3 to A. But don't
bet your life on that:
      )CLEAR
CLEAR WS
      ∇Z←A B
[1]   ⊣⎕EX 'A' ◊ Z←42,B
[2]   ∇
      A←A,1 2 3
      A
42 1 2 3
The optimizations that Elias and I tried earlier worked for the most part,
but failed on pathological cases like the one above. As the example shows,
such cases cannot be detected statically (say, at function definition time);
the detection has to happen at runtime. The question then is: will the
effort spent on the detection be amortized by the copies saved? My gut
feeling is that it will not, but one can always construct a benchmark in
which it is.

Our conclusion at the time was that the correctness of the result
is more important than speed.
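As an aside, the runtime detection in question can be sketched in Python,
using CPython reference counts as a stand-in for whatever ownership data an
APL interpreter would keep. This is a hypothetical illustration, not GNU APL
code; append_maybe_in_place and the calibration trick are invented here:

```python
import sys

def _calibrate(value):
    # How many references does a list with exactly one outside name show
    # when passed straight into a function?  The exact number differs
    # between CPython versions, so we measure it instead of hard-coding it.
    return sys.getrefcount(value)

_probe = []
SOLE_OWNER = _calibrate(_probe)
del _probe

def append_maybe_in_place(value, items):
    # Hypothetical runtime detection: extend in place only when the
    # caller's name is the sole owner of 'value'; otherwise copy.
    if sys.getrefcount(value) <= SOLE_OWNER:
        value.extend(items)            # safe: nobody else can see it
        return value
    return value + list(items)         # shared: correctness first

a = [1, 2, 3]
a = append_maybe_in_place(a, [4, 5])   # sole owner: extended in place

b = [1, 2, 3]
c = b                                  # the value is now shared
b = append_maybe_in_place(b, [4, 5])   # copied, so c keeps 1 2 3
```

The check itself is cheap, but as noted above it runs on every append,
which is exactly the amortization question.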
Best Regards,
Jürgen Sauermann
On 07/05/2017 02:20 AM, Louis de Forcrand wrote:
On the subject of copying arrays, I was wondering where, in GNU
APL, A is copied in code such as
A ← A,1 2 3
B ← A,4 5 6
A ← A,7 8 9
(in that order). It has to be copied in the third line (unless
some weird bookkeeping is done), since B referred to A's previous
value, but it can be modified in place in the first two lines.
Also, consider
1 2 3 + 3 4 5
where the result of + could overwrite one of its two arguments
(the same goes for any scalar function).
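This argument-recycling idea can also be sketched in Python, again with
CPython reference counts standing in for real ownership tracking; the name
scalar_plus and the whole mechanism are hypothetical, not GNU APL's:

```python
import sys

def _calibrate(a):
    return sys.getrefcount(a)

_probe = []
NAMED = _calibrate(_probe)   # count shown by an argument that also has
del _probe                   # exactly one name at the call site

def scalar_plus(a, b):
    # Hypothetical sketch: write the sum over 'a' only when 'a' is a
    # nameless temporary (it shows fewer references than a named one).
    if sys.getrefcount(a) < NAMED:
        for i, y in enumerate(b):
            a[i] += y                       # recycle the dead temporary
        return a
    return [x + y for x, y in zip(a, b)]    # 'a' is still visible: copy

v = [1, 2, 3]
scalar_plus(v, [3, 4, 5])            # v is named, so v itself stays intact
scalar_plus([1, 2, 3], [3, 4, 5])    # literal: its storage can be reused
```

For what it's worth, NumPy reportedly performs a similar elision of large
temporaries in expressions like a+b+c.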
What is the in-place situation of GNU APL?
Thanks,
Louis
Hi Blake,
The state of GNU APL is, I believe, this:
I am not aware of any unnecessary copying of arrays in GNU APL. It was
suspected earlier that some copying could be avoided, but it then turned
out that removing those copies would corrupt other values under certain
circumstances (because different functions would modify the same value).
We therefore had to revert the "unnecessary copies" to a state that now
seems safe.
Regarding parallel processing, the situation seems to be that a multi-core
CPU cannot be significantly faster than a single-core CPU with the same
memory (interface). A significant speedup requires that the bandwidth
between the cores and the memory grow with the number of cores. I built a
machine like that with some students back in 1990, but current PCs with
multi-core CPUs simply do not provide that.
If you make parallel1 and then run ScalarBenchmark.apl
(which comes with GNU APL)
then you get, for example:
-------------- Mix_IRC + Mix1_IRC --------------
average sequential startup cost: 530 cycles
average parallel startup cost: 1300 cycles
per item cost sequential: 122 cycles
per item cost parallel: 195 cycles
parallel break-even length: not reached
This means that the additional start-up cost for a parallel computation
is 1300-530 = 770 cycles, or about 240 nanoseconds on a 3.2 GHz machine.
This is actually a pretty good value. Before writing my own
core-synchronization functions I used a standard parallelization library
that took about 20,000 cycles. I believe it was libmpi, but that was many
years ago, so I don't quite remember. This startup cost also includes
what Elias refers to as bookkeeping.
What remains (and is the show-stopper) is the per-item cost: a single
core needs 122 cycles for adding two numbers, while 4 cores need
195 cycles per core = 780 cycles in total.
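The arithmetic behind these figures is easy to check (all numbers are the
ones quoted above, assuming the 3.2 GHz clock mentioned earlier):

```python
# Sanity-check of the benchmark arithmetic quoted above (3.2 GHz clock).
seq_startup, par_startup = 530, 1300          # cycles
extra_cycles = par_startup - seq_startup      # additional parallel startup
extra_ns = extra_cycles / 3.2                 # cycles / GHz gives ns

per_item_seq, per_item_par, cores = 122, 195, 4
total_parallel = per_item_par * cores         # cycles spent across all cores
seq_item_ns = per_item_seq / 3.2              # the single-core per-item time

print(extra_cycles, int(extra_ns), total_parallel, int(seq_item_ns))
# 770 240 780 38
```

The last figure, about 38 ns per item on one core, is the "~40 ns"
referred to further down.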
The code in both cases is exactly the same. Putting it differently: if I
run alone, some function takes 122 cycles; when my colleagues work in
parallel on something that has nothing to do with my work, then I need
780 cycles.
Once the parallel startup has finished, the cores work independently and
without any locks or the like between them. This cannot be explained at
the software level; rather, it suggests that some common resource
(memory?!) slows each core down while other cores are working at the same
time. The 122 cycles (~40 ns) in the single-core case is roughly the time
for one DRAM access in page mode. In other words, the main memory
bandwidth (at least of my machine) is just enough to feed one core, but
far too low for 4 cores.
GNU APL can do little to fix this bottleneck. I believe I have done
everything that can be done in software (although new ideas are welcome),
but if we hit hardware bottlenecks, then that's it. I believe GNU APL
would run perfectly on a 4-core CPU with a 4-port main memory, but that
will probably remain a dream in my lifetime.
Best Regards,
Jürgen Sauermann
On 07/04/2017 06:15 PM, Blake McBride wrote:
Hi Elias,
I remember your earlier work and comments. In fact,
that is the source of my knowledge of this.
Putting more features on a base with known issues is
problematic for the following reasons:
1. The time and effort it would take to secure the
base is being spent elsewhere.
2. Adding new features is, in effect, ignoring the
problem and moving on.
3. APL is good and useful as it is. If it's not being used as-is,
then there is some other issue (like efficiency). I remember another
post about unnecessary bloat caused by the unnecessary copying, which
was making the poster's machine page unnecessarily. The unnecessary
copying is major in terms of performance and memory efficiency. This
is big.
4. APL was developed with the promise of parallel
processing. This is something that wasn't practical
back then, but is now. I am under the impression that
this was one of the big reasons Juergen created GNU APL
- to take advantage of parallel processing.
Just one opinion.
Blake