bug-apl
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-apl] performance in of ¨


From: Juergen Sauermann
Subject: Re: [Bug-apl] performance in of ¨
Date: Fri, 17 Jun 2016 19:41:44 +0200
User-agent: Mozilla/5.0 (X11; Linux i686; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

Hi,

I can confirm that something is strange with the macro based ¨ but I can't yet see what.
I wrote a simple measurement function to test A+B in different ways:

A+B        (direct scalar)
A+¨B       (uses built-in each)
A {⍺+⍵} B  (uses macro each)


The measurement program is this:

      )clear
∇MEASURE B;A;D;R;T
 A←⍳B ◊ T←7⍴0
 T[1]←⎕FIO ¯1 ◊ ⊣ A + A              ◊ T[2]←⎕FIO ¯1 ◊ ⊣ A + A
 T[3]←⎕FIO ¯1 ◊ ⊣ A +¨ A             ◊ T[4]←⎕FIO ¯1 ◊ ⊣ A +¨ A
 T[5]←⎕FIO ¯1 ◊ ⊣ A {⍺+⍵}¨ A         ◊ T[6]←⎕FIO ¯1 ◊ ⊣ A {⍺+⍵}¨ A
 T[7]←⎕FIO ¯1 ◊ D←T-1⌽T ◊ R←¯1↓D÷D[3]
 ⍉2 6⍴(,2/'A + B' 'A +¨ B' 'A {⍺+⍵}¨ B') , (⊂8 2)⍕¨R

∇Z←A (LO EACH_1) B;rho_Z;N;N_max
rho_Z←⍴B ◊ N←⎕IO ◊ A←,A ◊ N_max←N+⍴B←,B ◊ Z←N_max⍴0
LOOP: Z[N]←⊂(⊃A[N]) LO ⊃B[N] ◊ →(N_max>N←N+1)⍴LOOP
Z←rho_Z⍴Z

∇Z←A (LO EACH_2) B;rho_Z;N;N_max
rho_Z←⍴B ◊ N←⎕IO ◊ A←,A ◊ N_max←N+⍴B←,B ◊ Z←N_max⍴0
LOOP: (N⌷Z)←⊂(N⌷A) LO N⌷B ◊ →(N_max>N←N+1)⍴LOOP


Every case is measured twice to see fluctuations, caching and the like.
The numbers are CPU cycles relative to the third line
The results are quite puzzling:


            MEASURE 100
 A + B           .18
 A + B           .14
 A +¨ B         1.00
 A +¨ B          .90
 A {⍺+⍵}¨ B    17.81
 A {⍺+⍵}¨ B    11.81


            MEASURE 1000
 A + B           .13
 A + B           .12
 A +¨ B         1.00
 A +¨ B         1.01
 A {⍺+⍵}¨ B    18.03
 A {⍺+⍵}¨ B    11.67

            MEASURE 10000
 A + B           .18
 A + B           .17
 A +¨ B         1.00
 A +¨ B          .29
 A {⍺+⍵}¨ B   111.45
 A {⍺+⍵}¨ B   111.85

So the macro based EACH becomes slower when the values grow. But why?
The macro code for the EACH macro is quite simple (and ⎕FXing it gives the same
behavior as the built-in macro.

      ⎕CR 'μ-Z__vA_LO_EACH_vB'
Z←A (LO Z__vA_LO_EACH_vB) B;rho_Z;N;N_max
rho_Z←⍴B ◊ N←⎕IO ◊ A←,A ◊ N_max←N+⍴B←,B ◊ Z←N_max⍴0
LOOP: Z[N]←⊂(⊃A[N]) LO ⊃B[N] ◊ →(N_max>N←N+1)⍴LOOP
Z←rho_Z⍴Z


I can't see anything in that code which would explain why the macro performance
drops when the values get larger. The difference between
A +¨ B andA {⍺+⍵}¨ B is caused
by two independent factors:

one uses the built-in + while the other uses the user defined lambda
{⍺+⍵}, and
one uses the built-in ¨ while the other uses the user defined macro
Z__vA_LO_EACH_vB

The impact of the second factor can be determined by running the same measurement with an
older version of GNU APL:

            MEASURE 100 
 A + B           .20
 A + B           .15
 A +¨ B         1.00
 A +¨ B          .94
 A {⍺+⍵}¨ B     5.53
 A {⍺+⍵}¨ B     5.36

            MEASURE 1000
 A + B           .14
 A + B           .10
 A +¨ B         1.00
 A +¨ B          .98
 A {⍺+⍵}¨ B     4.87
 A {⍺+⍵}¨ B     4.49

            MEASURE 10000
 A + B           .26
 A + B           .25
 A +¨ B         1.00
 A +¨ B          .42
 A {⍺+⍵}¨ B     2.42
 A {⍺+⍵}¨ B     2.41


Here the time decreases rather than increases with the vector lengths.
Thus macros are about 3 times slower at vectors of length 100 and 30 times
slower at vectors of length 10000.

And the LOOP line in the macro does the same as the corresponding C code in the
built-in EACH case.

I have no idea yet what is going on here. The memory consumption is probably a hint,
but I can't see where the memory goes. The A, B and Z variables are allocated only once
per macro call and the variables in the loop are small (which means that GNU APL is
recycling them internally rather than malloc()ing them).

Any ideas welcome!

/// Jürgen



On 06/17/2016 12:46 AM, Xiao-Yong Jin wrote:
Hi,

In my script I was using ⍎¨ on a vector of character vector from each line of a file.  After some testing, I guess it is not really the ⍎, that is slow.  The time scaling is strange.

In the following, the first row and the third row (using ¨ and ⍣ respectively) scales like n*3, while the problem size is only n*2.  My script ends up taking more than a minute to read a 1000×1000 numeric matrix from a text file.  On the other hand, the naive {⍵+1⊣⍎v}⍣n 1 indeed scales like n*2.

In addition, gnu apl is taking nearly a gigabytes of memory to run the following code, while dyalog is on the order of megabytes.

      te¨200×⍳3
  ⍎¨u 536     ⍎¨u 4129     ⍎¨u 13962  
  ⍎ v   0     ⍎ v    0     ⍎ v     0  
  ⍎u⍣ 518     ⍎u⍣ 4114     ⍎u⍣ 13889  
  ⍎v⍣  15     ⍎v⍣   58     ⍎v⍣   131  
      ∇te[⎕]∇
    ∇
[0]   r←te n;s;u;v;t
[1]   s←1E¯18×?n n⍴1E18
[2]   u←⍕¨⊂[2]s
[3]   v←↑u
[4]   t←⎕AI◊⊣⍎¨u◊r←(⊂'⍎¨u'),2⌷⎕AI-t
[5]   t←⎕AI◊⊣⍎v◊r←r,[.5](⊂'⍎ v'),2⌷⎕AI-t
[6]   t←⎕AI◊⊣{⍵+1⊣⍎⊃u[⍵]}⍣n 1◊r←r,[1](⊂'⍎u⍣'),2⌷⎕AI-t
[7]   t←⎕AI◊⊣{⍵+1⊣⍎v}⍣n 1◊r←r,[1](⊂'⍎v⍣'),2⌷⎕AI-t
    ∇

Best,
Xiao-Yong





reply via email to

[Prev in Thread] Current Thread [Next in Thread]