Re: good sample data sets for use in documentation
From: Jason Stover
Subject: Re: good sample data sets for use in documentation
Date: Tue, 28 Oct 2008 09:59:03 -0400
User-agent: Mutt/1.5.18 (2008-05-17)
On Mon, Oct 27, 2008 at 09:55:26AM -0700, Ben Pfaff wrote:
> Jason Stover <address@hidden> writes:
>
> > On Sun, Oct 26, 2008 at 10:49:52AM -0700, Ben Pfaff wrote:
> >> I would like to start including examples in the PSPP
> >> documentation that work with realistic, interesting data sets
> >> that we also include with PSPP. To do this, I need some freely
> >> distributable (ideally, public domain) data sets. I have found
> >> some of these on the web, but none seems really perfect, and I
> >> wonder whether any of you have data sets to suggest?
> >
> > Do you mean data sets posted by organizations that collected data as
> > part of a designed experiment or observational study, or just anything
> > we cobbled together?
> >
> > I have some of the latter.
>
> It's probably good to have a mix of both. Yesterday, I was
> looking around for the former. Based on my web searches, other
> things that are nice, but not entirely necessary, are:
>
> - Not too specific to any particular country or region,
> so that they will be more likely to be interesting to
> users throughout the world.
>
> - Formatted to be easily imported. Notably, Excel
> spreadsheets are not particularly easy at the moment,
> and there are lots of websites with HTML tables that
> don't provide any other format.
>
> - I find it at least mildly interesting, and I understand
> what it's about. (Obviously this is highly
> subjective.)
I have several text files containing data sets of different
types. I gathered them from online sources and did some reshuffling
to make them presentable. Here is a list of the data sets I know
I have:
- Text data scraped from 158 novels I downloaded from bibliomania.org.
Each row represents one sentence. Most columns represent the
frequency of a word used in that sentence. One column holds the
author's name. Another holds the title. This is a large data file,
with about 1.3 million cases and around 10 variables.
- Data on crashes that occurred on US highways in 2004, taken from the
National Highway Traffic Safety Administration. Each row represents
one vehicle that collided with something. Variables include the
estimated speed at which the vehicle was traveling when it collided,
the severity of injury to the occupants, and the cause of the
collision. There are around 25,000 cases (I think).
- Climate data. I took these data from an online database at some
university (I can find the source if you want to include it). It
includes about 600,000 cases, with the following variables: country,
weather station ID, year, month, (I think) day, and average
temperature. The data for some of the older stations go as far back
as about 1800. Most of the others have records as far back as the
1970s.
- Stock market data. I have some data involving price changes in the
New York Stock Exchange. I have a lot of this, for different
companies.
All files are tab- or comma-delimited text.
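Since the files mix two delimiters, a quick way to look at one before
importing it might be something like the Python sketch below. It is
only an illustration: the function name read_delimited and the sample
column names are made up here, and it simply assumes the delimiter is
whichever of tab or comma appears more often on the first line.

```python
import csv
import io


def read_delimited(text):
    """Parse tab- or comma-delimited text into a list of rows.

    Assumption: the delimiter is whichever of tab or comma occurs
    more often on the first line (ties fall back to comma).
    """
    first_line = text.splitlines()[0] if text else ""
    delimiter = "\t" if first_line.count("\t") > first_line.count(",") else ","
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sample in the shape of the climate data described above.
sample = "country\tstation\tyear\tavg_temp\nXX\t001\t1975\t12.5\n"
rows = read_delimited(sample)
header, cases = rows[0], rows[1:]
```

For a real file, the same function could be given the result of
open(path, newline="").read(), though for the larger files reading
line by line would be kinder to memory.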
I'll rummage around my data directories and look for some other files.
-Jason