pspp-users

Re: Import data from other file formats and Histogram question


From: Gaj Vidmar
Subject: Re: Import data from other file formats and Histogram question
Date: Wed, 10 Mar 2010 14:51:14 +0100

Though widely used, Sturges' rule is a poor choice. See, for example, the 
following paper, which is freely available and easy to understand even for 
non-mathematicians like most of us here:

http://robjhyndman.com/papers/sturges.pdf

Essential excerpt:

----------
Alternative rules for constructing histograms include Scott's (1979) rule 
for the class width:

h = 3.5*s*n**(-1/3)

and Freedman and Diaconis's (1981) rule for the class width:

h = 2*(IQ)*n**(-1/3)

where s is the sample standard deviation and IQ is the sample interquartile 
range. [** obviously stands for power]
Either of these are just as simple to use as Sturges' rule, but are 
well-founded in statistical theory.

Sturges' rule has probably survived as long as it has because, for moderate 
n (less than
200), it gives similar results to the alternative rules above (see Scott, 
1992, p.56), and so
produces reasonable histograms. However, it does not work for large n.

The problem with Sturges' rule is that its derivation is wrong. It is a rule 
which no longer
deserves a place in statistics textbooks or as a default in statistical 
computer packages.
----------
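
To make the two class-width rules in the excerpt concrete, here is a minimal 
Python sketch. It is not PSPP code, and all function names are hypothetical. 
Two details are my own assumptions rather than part of the excerpt: Sturges' 
rule is taken as k = 1 + log2(n) rounded up, and a class width is turned into 
a bin count by dividing the data range by the width and rounding up.

```python
import math
import random
import statistics


def scott_width(data):
    """Scott (1979) class width: h = 3.5 * s * n**(-1/3)."""
    n = len(data)
    s = statistics.stdev(data)  # sample standard deviation
    return 3.5 * s * n ** (-1 / 3)


def fd_width(data):
    """Freedman-Diaconis (1981) class width: h = 2 * IQ * n**(-1/3)."""
    n = len(data)
    # Quartiles via the statistics module's default method (an assumption;
    # other quantile definitions give slightly different IQ values).
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) * n ** (-1 / 3)


def sturges_bins(n):
    """Sturges' rule, assumed here as k = 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(n))


def bins_from_width(data, h):
    """Assumed conversion: number of bins = ceil(range / width), at least 1."""
    return max(1, math.ceil((max(data) - min(data)) / h))


if __name__ == "__main__":
    random.seed(0)
    for n in (100, 1000, 100000):
        data = [random.gauss(0.0, 1.0) for _ in range(n)]
        print(
            n,
            "Sturges:", sturges_bins(n),
            "Scott:", bins_from_width(data, scott_width(data)),
            "FD:", bins_from_width(data, fd_width(data)),
        )
```

Run as a script, this prints the bin counts each rule suggests for normally 
distributed samples of increasing size, which is one way to see how the rules 
drift apart as n grows.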

So here's a good chance for PSPP to implement, with minimal effort, something 
better than what many commercial stats packages offer
(BTW, I admire several of those and have been using them daily for a looong 
time, while PSPP still has a looong way to go before ...).

Cordial regards and all the best with the further advancement of the PSPP project,
Gaj Vidmar
---
Assist. Prof. Gaj Vidmar, PhD
University Rehabilitation Institute, Republic of Slovenia
& Univ. of Ljubljana, Fac. of Medicine, Inst. for Biostatistics and Medical 
Informatics


"Ben Pfaff" <address@hidden> wrote in message 
news:address@hidden
> [cleaning out my old email]
>
> Erik Frebold <address@hidden> writes:
>
>> 2. Re: Histogram-- I was puzzled as to why this function would
>> assign only six bins to an n=1000 dataset. Upon investigation,
>> looks like the number of bins is assigned by gsl, right? (which
>> I assume would use something suitable like Sturges' formula,
>> though I wasn't able to find the actual stretch of code) So is
>> it likely just that the pspp part of the process might not be
>> quite ready for primetime yet? No problem if this is the case--
>> I can use matplotlib or similar for now.
>
> I took a look at this.  PSPP always uses exactly 11 bins for
> every histogram!  This seems less than optimal.
>
> I pushed out a commit that uses Sturges' formula, with a minimum
> of 5 buckets.
> -- 
> Ben Pfaff
> http://benpfaff.org 






