Re: [Bug-apl] Spell corrector

On 12 September 2016 at 18:34, Ala'a Mohammad <address@hidden> wrote:

Thanks for the alternative. I tried to run it, but got a RANK ERROR:

RANK ERROR
λ1[1] λ←⍵[⍒⍵[;2];]
^ ^

How can I help debug this?

Regards,

Ala'a
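
One way to narrow this down is to evaluate the suggested expression piece by piece on a tiny input and look at the shape of each intermediate, since desc indexes ⍵[;2] and therefore needs a matrix argument. A minimal sketch, assuming a made-up six-word list in place of the parsed big.txt; the names s and c are introduced here only for illustration:

```apl
w ← 'the' 'cat' 'the' 'dog' 'cat' 'the'   ⍝ toy stand-in for the parsed words
u ← ∪w                                    ⍝ unique words
x ← u⍳w                                   ⍝ index of each word within u
s ← x[⍋x]                                 ⍝ sorted indices, equal values adjacent
c ← ≢¨s⊂s                                 ⍝ same as ≢¨⊂⍨s: split runs of equal indices, tally each
⍴u                                        ⍝ number of unique words
⍴c                                        ⍝ should equal ⍴u if the counting step worked
⍴⍴(⍪u),c                                  ⍝ rank of the table; desc needs this to be 2
```

If ⍴c differs from ⍴u, or the last line does not report rank 2, the step that first gives an unexpected shape is the one worth reporting along with the interpreter version.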

On Mon, Sep 12, 2016 at 5:32 PM, Jay Foad <address@hidden> wrote:
> Hi Ala'a,
>
> How about replacing the last line with this? It runs in about 1 minute on my
> machine:
>
> desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
>
> Jay.
>
> On 11 September 2016 at 19:23, Ala'a Mohammad <address@hidden> wrote:
>>
>> Just an update for reference: I'm now able to parse the big.txt file
>> (without WS FULL or a killed process), but it takes around 2 hours and
>> 20 minutes, ±10 minutes (around 1M words, of which 30K are unique). The
>> process reaches 1 GiB (after parsing the words), and adds another
>> 100 MiB on top of that during the sequential 'Each' (thus a max of 1.1 GiB).
>>
>> The only change is scanning each unique word against the whole words
>> vector.
>>
>> Below is the code with a sample timed run.
>>
>> Regards,
>>
>> Ala'a
>>
>> ⍝ fhist.apl
>> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9
>> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> alphamask ← { ~ ⍵ ∊ nonalpha }
>> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> desc ← {⍵[⍒⍵[;2];]}
>> ftxt ← { ⎕FIO[26] ⍵ }
>>
>> file ← '/misc/big.txt' ⍝ ~ 6.2M
>> ⎕ ← ⍴w ← words ftxt file
>> ⎕ ← ⍴u ← ∪w
>> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
>> )OFF
>>
>> : time apl -s -f fhist.apl
>> 1098281
>> 30377
>> the 80003
>> of 40025
>> to 28760
>> in 22048
>> for 6936
>> by 6736
>> be 6154
>> or 5349
>> all 4141
>> this 4058
>> are 3627
>> other 1488
>> before 1363
>> should 1297
>> over 1282
>> your 1276
>> any 1204
>> our 1065
>> holmes 450
>> country 417
>> world 355
>> project 286
>> gutenberg 262
>> laws 233
>> sir 176
>> series 128
>> sure 123
>> sherlock 101
>> ebook 85
>> copyright 69
>> changing 44
>> check 38
>> arthur 30
>> adventures 17
>> redistributing 7
>> header 7
>> doyle 5
>> downloading 5
>> conan 4
>>
>> apl -s -f fhist.apl 8901.96s user 5.78s system 99% cpu 2:28:38.61 total
>>
>> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <address@hidden>
>> wrote:
>> > Thanks to all for the input.
>> >
>> > Replacing Find and Each OR with Match helped; now I'm parsing a 159K
>> > (~1545 lines) text file (a sample chunk from big.txt).
>> >
>> > The strange thing that I'm trying to understand is that the APL
>> > process (when fed the 159K text file) starts allocating memory until it
>> > reaches 2.7GiB, then settles down to 50MiB after printing the result.
>> > Why do I need 2.7GiB? Are there any memory utilities (e.g. a garbage
>> > collection utility) which can be used to mitigate this issue?
>> >
>> > Here is the updated code:
>> >
>> > a ← 'abcdefghijklmnopqrstuvwxyz'
>> > A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> > downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> > nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9
>> > nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> > alphamask ← { ~ ⍵ ∊ nonalpha }
>> > words ← { (alphamask ⍵) ⊂ downcase ⍵ }
>> > hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
>> > desc ← {⍵[⍒⍵[;2];]}
>> > ftxt ← { ⎕FIO[26] ⍵ }
>> > fhist ← { hist words ftxt ⍵ }
>> >
>> > file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
>> > ⎕ ← ⍴w ← words ftxt file
>> > ⎕ ← ⍴u ← ∪w
>> > desc 39 2 ⍴ fhist file
>> >
>> > And here is a sample run
>> > : apl -s -f fhist.apl
>> > 30186
>> > 4155
>> > the 1560
>> > to 804
>> > of 781
>> > in 493
>> > for 219
>> > be 173
>> > holmes 164
>> > your 132
>> > this 114
>> > all 99
>> > by 97
>> > are 97
>> > or 73
>> > other 56
>> > over 51
>> > our 48
>> > should 47
>> > before 43
>> > sherlock 39
>> > any 35
>> > sir 26
>> > sure 13
>> > country 9
>> > project 6
>> > gutenberg 6
>> > ebook 5
>> > adventures 5
>> > world 5
>> > arthur 4
>> > conan 4
>> > doyle 4
>> > series 2
>> > copyright 2
>> > laws 2
>> > check 2
>> > header 2
>> > changing 1
>> > downloading 1
>> > redistributing 1
>> >
>> > I've also attached the sample input file.
>> >
>> > Regards,
>> >
>> > On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <address@hidden>
>> > wrote:
>> >> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
>> >>> the errors happened inside the 'hist' function, and I presume they are
>> >>> mostly due to the jot dot find (if I understand correctly, operating on
>> >>> a matrix of size equal to unique-length * words-length)
>> >>
>> >> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
>> >>
>> >> -k
>>
>
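
The memory growth reported in the quoted 10 September message is at least consistent with the size of the outer product: (∪⍵)∘.≡⍵ builds one element per pair of unique word and word, which for the 4155 unique words and 30186 words of the sample file is about 125 million elements. A rough back-of-the-envelope sketch follows; the figure of 24 bytes per element is an assumption made here for illustration, not something stated in the thread, and the names cells and bytes are hypothetical:

```apl
cells ← 4155 × 30186      ⍝ unique words × total words in the sample file
cells                     ⍝ number of elements in the outer product
bytes ← 24 × cells        ⍝ 24 bytes per element is an assumed figure
bytes ÷ 2*30              ⍝ approximate size in GiB under that assumption
30377 × 1098281           ⍝ the same product for the full big.txt run
```

Under that assumption the sample file comes out at roughly 2.8 GiB, and the full file would need a matrix of about 33 billion elements, which may be why the later script applies Match one unique word at a time with Each instead.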

From:	Jay Foad
Subject:	Re: [Bug-apl] Spell corrector - APL
Date:	Tue, 13 Sep 2016 16:25:10 +0100
