pspp-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: linked lists and trees


From: Jason Stover
Subject: Re: linked lists and trees
Date: Fri, 26 Sep 2008 11:18:01 -0400
User-agent: Mutt/1.5.18 (2008-05-17)

On Fri, Sep 26, 2008 at 10:56:16AM -0400, Jason Stover wrote:
> On Fri, Sep 26, 2008 at 10:51:31AM -0400, Jason Stover wrote:
> > So now I'm thinking hashing won't work. The problem is that I do
> > not know in advance the range of the hash function, nor how to
> > avoid collisions. 
> > 
> > If the keys were just, say, of the form (variable1, variable2),
> > I could make a simple hash function like
> > 
> >   key = variable1->idx + variable2->idx * n_vars
> > 
> > This would be fine for numeric variables.
> > 
> > The problem is with the categorical variables. I need distinct
> > entries in the hash table for each of the values of the categorical
> > variables, and I haven't passed the data yet, so I don't know how many
> > values there are, nor what they may be. Which means, I think, that I
> > can't write a hash function in advance that will be guaranteed to be
> > one-to-one.
> 
> I mis-spoke here. The problem isn't so much the one-to-one'ness of the 
> hash as it is that I just can't write a hash function in advance that
> knows what to do with the categorical values, since I haven't seen them
> yet.

Now I'm having second (or third or fourth) thoughts. I guess I could 
write such a hash function, but it would have to handle different cases,
and I don't immediately see how to do that with the categorical values.

The hash function needs a quadruple as a key: (var1, var2, value1, value2).

The cases are as follows: 

1. var1 and var2 are both numeric. In this case, value1 and value2 are
   irrelevant. Different values hash to the same node, or element, or
   whatever, in the hash table. Only the variables var1 and var2 matter.

2. Exactly one of var1 or var2 is numeric. In this case, exactly one
   value is relevant, meaning distinct values give distinct hash table
   elements.

3. var1 and var2 are both categorical. Both value1 and value2 are
   relevant. Distinct values value1 and value2 will hash to distinct
   elements in the hash table.

Remember that I haven't passed the data yet, so I have no idea what
values value1 and value2 may be.

I guess I can spend the time to come up with a hash function if
necessary.  But I'm not sure it's worth the effort. Would it be easier
to just use a binary tree and search every time I need a node?

About size: I expect this data structure to need O(n_vars^2)
nodes. One node takes about 32 bytes. There is no way to reduce this.
It all has to fit in memory anyway to perform the linear regression.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]