
Re: [Bug-apl] Use with word2vec


From: Fred Weigel
Subject: Re: [Bug-apl] Use with word2vec
Date: Mon, 01 May 2017 01:10:37 -0400

Juergen

This is useful -- I was looking at LApack.cc already. It is in line with what I need (as a template).

I am not worried about saving these things, but I have a 3000000x300 array of C float,
and a "typical" processing step is a 300-element dot product against each of the 3 million rows.
I don't want to convert to C double (that would increase memory from 3.6GB to 7.2GB).
In fact, I don't really want to copy the data at all! I can generate a descriptor to the data (memory pointer, dimensions).
I think I want to plant the data into a shared memory region (and, in future, pass it to a GPU).
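As a rough sketch of that processing step (the names and the descriptor layout here are my own illustration, not GNU APL internals), keeping everything in C float and working through a descriptor rather than a copy:

```cpp
// Sketch of the per-row operation described above: a dim-element query
// vector dotted against each row of a rows x dim float matrix, with the
// data owned externally (e.g. mmap'ed or in shared memory) -- no copy,
// no conversion to double. All names are illustrative.
#include <cassert>
#include <cstddef>
#include <vector>

// Descriptor for externally owned float data: pointer plus dimensions.
struct FloatMatrix {
    const float* data;   // rows * dim floats, row-major
    std::size_t  rows;
    std::size_t  dim;
};

// Dot product of `query` (length m.dim) with every row; O(rows * dim).
std::vector<float> row_dots(const FloatMatrix& m, const float* query)
{
    std::vector<float> out(m.rows);
    for (std::size_t r = 0; r < m.rows; ++r) {
        const float* row = m.data + r * m.dim;
        float acc = 0.0f;
        for (std::size_t c = 0; c < m.dim; ++c)
            acc += row[c] * query[c];
        out[r] = acc;
    }
    return out;
}
```

For the 3000000x300 case this touches each float exactly once and allocates only the 3-million-element result.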

I think I want to run some specific functions on the data -- right now I pass row sets to GNU APL using
the API, and execute APL code through the API. However, control lives exclusively outside APL,
which means I cannot analyze the data experimentally from within APL.

I can work on the model given by LApack.cc, and supply some functions which (basically) provide
a "virtual memory/workspace".

The main problem with these array sizes is saving and loading -- this array would be around 30GB inside
GNU APL (as far as I can tell), and if ever saved it would take around 300GB. I can convert from float to double
and create the Cell structures, but I would really want to simply mmap() the data into GNU APL (and, of course,
never have it participate in memory management). Again, I was leaning towards partial mapping,
because once I start with tensors the arrays will be sparse.
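A minimal sketch of the mmap() idea, assuming a file of raw C floats; the struct and function names are mine, purely for illustration (this is not GNU APL code):

```cpp
// Hedged sketch: map a file of raw C floats read-only so a large matrix
// can be used in place, never entering normal allocation or memory
// management. POSIX only; names and file layout are assumptions.
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedFloats {
    const float* data;
    std::size_t  count;   // number of floats in the file
    std::size_t  bytes;   // mapping length, needed later for munmap()
};

// Map an entire file of raw C floats; returns {nullptr, 0, 0} on failure.
MappedFloats map_float_file(const char* path)
{
    MappedFloats m{nullptr, 0, 0};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, (std::size_t)st.st_size, PROT_READ,
                       MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data  = static_cast<const float*>(p);
            m.bytes = (std::size_t)st.st_size;
            m.count = m.bytes / sizeof(float);
        }
    }
    close(fd);   // the mapping remains valid after the fd is closed
    return m;
}
```

The pages are then demand-paged by the kernel, which is effectively the "partial mapping" mentioned above: untouched rows never occupy RAM.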

So, two real problems -- (1) how to deal with LARGE non-sparse matrices, and (2) how to deal with
LARGE sparse matrices.

I really like the expressiveness afforded by APL.

It may be possible to use the APL parser, and provide new implementations of primitives -- thanks
for that idea.

LApack.cc seems to provide something I can start with -- the actual LARGE arrays won't change,
so this provides a good demarcation point and a workable starting place.

Thanks!
Fred Weigel




On Sat, 2017-04-29 at 13:04 +0200, Juergen Sauermann wrote:
Hi Fred,

I have not fully understood what you want to do exactly, but it looks to me as if you want to go for
native GNU APL functions. Native functions provide the means to bypass the GNU APL interpreter
itself to the extent desired. For example you can use APL variables but not the APL parser, or the
APL parser but not the implementation of primitives, or whatever else you are up to.

As to plain double vectors, it is very difficult to introduce them as a new built-in data type because that
change would affect: every APL primitive, every APL operator, )LOAD, )SAVE, )DUMP, and a lot
more.

However, you can have a look at (the top level of) the implementation of the matrix divide primitive, which
may be doing what you are after. The implementation of matrix divide expects either a double vector or
a complex<double> vector as argument(s) and returns such a vector as result. Before and after the computation
of matrix divide, a conversion between APL values and the plain double or complex vector is performed.
This conversion is very lightweight. If you have a homogeneous GNU APL value, say all ravel items being double,
then that value is almost like a C double *. The only difference is a gap between adjacent ravel elements. In other
words (expressed in APL):

C_vector ←→ 1 0 1 0 ... / APL_vector
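In C++ terms, the compress above might be modelled like this (ToyCell is a stand-in for illustration only, not the actual GNU APL Cell class; the real layout and field sizes differ):

```cpp
// Toy model of the layout described above: each ravel element carries its
// double payload plus bookkeeping, so a homogeneous APL value is "almost"
// a C double* -- the packed C vector is a strided gather, i.e. the APL
// compress  1 0 1 0 ... / APL_vector  expressed in C.
#include <cstddef>
#include <vector>

struct ToyCell {           // assumption: one double plus one word of tag info
    double      value;     // the numeric payload
    std::size_t tag;       // stand-in for type tag / vptr ("the gap")
};

// Lightweight "conversion": gather every payload into a dense double vector.
std::vector<double> pack_ravel(const ToyCell* ravel, std::size_t n)
{
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = ravel[i].value;   // step over the gap between elements
    return out;
}
```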

I can provide you with more information if you want to go along this path.

/// Jürgen




On 04/29/2017 03:19 AM, Fred Weigel wrote:
Juergen, and other GNU APL experts,

I am exploring neural nets, word2vec and some other AI related areas.

Right now, I want to tie in Google's word2vec trained models (the
billion-word one, GoogleNews-vectors-negative300.bin.gz).

This is a binary file containing a lot of floating point data -- about
3.5GB of data. These are words, followed by cosine distances. I could
attempt to feed this in a slow way and put it into an APL workspace.
But... I also intend to attempt feeding the data to a GPU. So, what I
am looking for is a modification to GNU APL (and yes, I am willing to do
the work) -- to allow for the complete suppression of normal C++
allocations, etc., and allow the introduction of simple float/double
vectors or matrices (it would also help to allow "C"-ish or UTF-8-ish
strings): the data is (C string containing word name) (fixed number of
floating point values)... repeated LOTS of times.
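A hedged sketch of a reader for that layout, assuming the commonly described word2vec binary format (an ASCII "<vocab_count> <dim>\n" header, then per entry a space-terminated word followed by dim raw C floats); all names here are mine, and this makes no attempt at the mmap/no-copy path discussed elsewhere in the thread:

```cpp
// Sketch of a word2vec .bin reader under the format assumptions above.
// Reads everything into ordinary vectors -- fine for small files, only a
// starting point for the 3.5GB case.
#include <cstdio>
#include <string>
#include <vector>

struct W2VEntry {
    std::string        word;
    std::vector<float> vec;
};

// Read all entries; returns an empty vector on a malformed header.
std::vector<W2VEntry> read_word2vec_bin(const char* path)
{
    std::vector<W2VEntry> out;
    FILE* f = std::fopen(path, "rb");
    if (!f) return out;
    long vocab = 0, dim = 0;
    if (std::fscanf(f, "%ld %ld", &vocab, &dim) == 2) {
        for (long i = 0; i < vocab; ++i) {
            W2VEntry e;
            int c;
            // word: characters up to the separating space; skip newlines
            while ((c = std::fgetc(f)) != EOF && c != ' ')
                if (c != '\n') e.word.push_back((char)c);
            e.vec.resize((std::size_t)dim);
            if (std::fread(e.vec.data(), sizeof(float),
                           (std::size_t)dim, f) != (std::size_t)dim)
                break;          // truncated file
            out.push_back(std::move(e));
        }
    }
    std::fclose(f);
    return out;
}
```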

The data set(s) may be compressed, so I don't want to read them directly --
possibly from a shared memory region (64-bit systems only, of course), or
perhaps using shared variables... but I don't think that would be fast
enough.

Anyway, this begins to allow the push into "big data" and AI
applications. Just looking for some input and ideas here.

Many thanks
Fred Weigel



