Re: NaN-toolbox much faster now

From: Alois Schlögl
Subject: Re: NaN-toolbox much faster now
Date: Thu, 19 Mar 2009 14:07:16 +0100
User-agent: Thunderbird (X11/20090105)

Jason Riedy wrote:
> And Alois Schlögl writes:
>> The only difference is that one implementation gives NaN (indicating
>> no result) more often than the other implementation. Maybe it's not
>> wrong to return NaN, but it's not what you want. You want the best
>> possible estimate.
> Well, sometimes it is what *I* want.  There are so many options of
> what to do with NaNs that predicting what users want is tricky.
> The "best possible estimate" with missing data can be computed in
> different ways!  One involves an iterative process of replacing the
> NaNs with an estimate based on the rest of the data.  You'll find
> that method in some experimental design literature, and you can
> even find reference to software using signaling NaNs for that
> purpose long, long ago.  


Yes, I'm aware of this. You will also find these techniques under the
keyword "imputation methods". The existence of alternative methods does
not make the proposed approach invalid.

The proposed approach does not prevent anyone from implementing such
algorithms. Actually, an efficient sumskipnan() will be beneficial for
such iterative methods, because it provides an efficient way to obtain
the initial value.
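
A minimal sketch of how a NaN-skipping sum supplies that initial value.
It uses only base Octave so that it runs without the toolbox; the mask
and the two sums stand in for the sum and the count of non-NaN elements
that sumskipnan() reports. The data and the variable names are made up
for illustration.

```octave
% Sketch only: base Octave stands in for sumskipnan() here.
x = [2 NaN 4 6 NaN];      % made-up sample; NaN marks a missing value

valid = ~isnan(x);        % logical mask of the observed entries
s  = sum(x(valid));       % sum over the observed entries only
n  = sum(valid);          % number of observed entries
m0 = s / n;               % NaN-skipping mean, used as the initial estimate

x_filled = x;
x_filled(~valid) = m0;    % starting point for an iterative imputation scheme
disp(x_filled)
```

An iterative method would then refine the filled-in entries; only the
starting point is shown here.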

> Without such tricks, returning NaN as the
> mean is a great way to identify when you should apply the iterative
> method to your data set.

There are alternatives: you could check for NaNs before you compute the
mean. Or, if the NaN-toolbox is in place, you can check afterwards with
flag_nan_occured(), and you already have the initial value for your
iterative approach available.

The second solution in particular is much better, because it does not
waste time computing a mean only to get a NaN.
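
The two alternatives can be sketched roughly as follows. This is base
Octave only; the variable nan_occurred is a hypothetical stand-in for
what one would learn by querying flag_nan_occured() with the toolbox
installed, and the data are made up.

```octave
x = [1 2 NaN 4];                  % made-up data

% Alternative 1: check for NaN before computing anything.
has_nan_before = any(isnan(x));

% Alternative 2 (emulated): compute the NaN-skipping mean first, then ask
% whether a NaN was encountered. The estimate is already available.
valid = ~isnan(x);
m = sum(x(valid)) / sum(valid);   % NaN-skipping mean
nan_occurred = ~all(valid);       % stand-in for the toolbox flag

fprintf("m = %g, nan_occurred = %d\n", m, nan_occurred);
```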

> But, on the flip side, if you're using the mean to re-center the
> data, then you absolutely do *not* want a NaN result.  Subtracting
> NaN from every entry will wipe out the entire data set and leave
> you no trace of the original NaN.

That's one of the cases where skipping NaN is the right thing to do, and
sumskipnan() provides an efficient way to do it.
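
A small illustration of the re-centering case, in base Octave with
made-up data. The plain mean is written out as sum(x)/numel(x) so that
the result does not depend on whether a NaN-skipping mean() is on the
load path.

```octave
x = [1 NaN 3];                      % made-up data with one missing value

plain_mean = sum(x) / numel(x);     % NaN, because the sum includes a NaN
c_plain = x - plain_mean;           % every entry becomes NaN

valid = ~isnan(x);
skip_mean = sum(x(valid)) / sum(valid);   % mean of the observed values
c_skip = x - skip_mean;             % observed entries centered, NaN stays put

disp(c_plain)
disp(c_skip)
```

With the NaN-skipping mean, the position of the original missing value
is still visible after centering.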

> Barring more sophisticated handling of exceptional events and data,
> the most reliable choices are to provide either a per-call optional
> argument or two different routines.  R takes the former route[1],
> and we took the latter for max(a, b) and min(a, b) in the IEEE-754
> standard.

I do not know about R, but the last time I looked at IEEE-754r, the
draft at http://www.validlab.com/754R/drafts/archive/2006-10-04.pdf
(p. 28) said about minNum(x,y) and maxNum(x,y): " ...[returns] the
canonicalized floating point number if one operand is a floating-point
number and the other a NaN". My understanding is that if one value is
NaN, the other value is returned. In a vectorized form, all NaNs are
skipped. I do not find an alternative implementation (one that
propagates NaN) for min/max.
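
My reading of that rule, written out element-wise as a sketch. This is
an illustration of the behaviour described above, not text from the
standard; the operands are made up.

```octave
a = [5 NaN 2 NaN];             % made-up operands
b = [7 3 NaN NaN];

r = a;                         % start from the first operand
take_b = isnan(a) | (b < a);   % use b where a is missing or b is smaller
r(take_b) = b(take_b);         % NaN survives only where both operands are NaN

disp(r)
disp(min(a, b))                % Octave's built-in two-argument min, for comparison
```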

The current implementations of min() and max() in Octave and Matlab
ignore all NA/NaN values, too.

octave:187> min([5,NA,NaN])
ans =  5
octave:188> max([5,NA,NaN])
ans =  5

The NaN-toolbox is just applying the same principle to other statistical
functions.

> A global flag that is not locally scoped and can be forgotten is
> downright dangerous in this context.  Such a flag will wander into
> code where it was not intended and wreak havoc.  Please use another
> method.

I see these alternatives:

(1) In order to avoid wandering of the flag into code, I've added the
following warning:
"warning: flag_implicit_skipnan(0): You are warned!!! You have turned
off skipping NaN in sumskipnan. This is not recommended. Make sure you
really know what you do."

Is this sufficient to address your concern?

(2) Remove the flag, here is the patch:
 diff  sumskipnan.m sumskipnan.m.bak
< if ~isa(x,'float') || ~flag_implicit_skip_nan(),
---
> if ~isa(x,'float'),

and flag_implicit_skip_nan becomes useless and can be removed.

(3) There are already different functions (e.g. nanmean, nanstd, nanvar,
etc.) available. In order to get superior performance, you only need to
modify these to use sumskipnan. In that case, you do not need the
NaN-toolbox, you only need sumskipnan. But you are on your own in
deciding whether to use mean or nanmean, std or nanstd, var or nanvar, etc.

(4) Spend a lot of time implementing checks of the input arguments,
maintaining, and eventually changing default values, etc. I do not see a
need for it. If someone else thinks this is useful, (s)he should go
ahead and implement it.

(5) It's also ok to leave sumskipnan() out of Octave. Those who are
interested have several options to install it:
o) octave-nan package from Debian/Ubuntu
o) the octave-forge repository
o) pkg load nan (not tested by me)
o) from here http://hci.tugraz.at/~schloegl/matlab/NaN/
o) it's also distributed with "BioSig for Octave and Matlab" available
from here: http://biosig.sourceforge.net/download.html
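
For alternative (3), here is a rough sketch of how NaN-skipping mean,
variance and standard deviation follow from three quantities: the sum,
the count, and the sum of squares of the non-NaN elements. As far as I
know these are the quantities sumskipnan() can return, but treat that
as an assumption; the sketch computes them in base Octave with made-up
data so that it runs on its own.

```octave
x = [1 NaN 2 3 NaN 6];           % made-up data

valid = ~isnan(x);
n   = sum(valid);                % count of observed values
s   = sum(x(valid));             % their sum
ssq = sum(x(valid) .^ 2);        % their sum of squares

m  = s / n;                      % NaN-skipping mean
v  = (ssq - s^2 / n) / (n - 1);  % NaN-skipping sample variance
sd = sqrt(v);                    % NaN-skipping standard deviation

fprintf("mean = %g, var = %g, std = %g\n", m, v, sd);
```

The one-pass variance formula is used here for brevity; it can lose
precision when the mean is large relative to the spread of the data.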

