pspp-users

Re: String variables combining files


From: Frans Houweling
Subject: Re: String variables combining files
Date: Thu, 26 Mar 2015 12:49:08 +0100
User-agent: Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

I don't think the widest variable should be saved; the same old rules should be followed (e.g., in MATCH, if the variable is already present, leave it alone). The point is that in 99% of cases the incompatible-width variables are completely irrelevant to the MATCH at hand, which fails only because a feeding file happens to contain the same variable somewhere with a different width.
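
A minimal Python sketch of that rule, not PSPP code: the function name `match_files` and the list-of-dicts representation are assumptions made for illustration only, and it keeps only the cases of the first file, which is a simplification. A variable already present in the first file is left alone, so its width in the second file never matters.

```python
def match_files(primary, secondary, key):
    """Join two lists of row dicts on `key` (hypothetical helper).

    Variables already present in `primary` are left untouched; only
    variables that are new in `secondary` are added to each case.
    """
    lookup = {row[key]: row for row in secondary}
    present = set(primary[0]) if primary else set()
    # Names that exist only in the second file (taken from its first row).
    added = [name for row in secondary[:1] for name in row if name not in present]

    merged = []
    for row in primary:
        out = dict(row)
        match = lookup.get(row[key])
        for name in added:
            # None stands in for a missing value when no case matches.
            out[name] = match[name] if match is not None else None
        merged.append(out)
    return merged


primary = [{"id": 1, "name": "Houweling"}, {"id": 2, "name": "Mead"}]
secondary = [{"id": 1, "name": "Houw", "score": 7}]  # same variable, narrower
result = match_files(primary, secondary, "id")
```

Here `name` keeps the first file's values and only `score` is taken from the second file.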
To ftr: people create .sav files from Excel data, or using import wizards. And even trained staff may need to cope with a name that is longer than expected.
Cheers
frans

On 26/03/2015 12:10, Alan Mead wrote:
On 3/26/2015 3:59 AM, ftr wrote:
So this means that the programs that produce the CSV files produce output with different string variable widths?
Is this due to the programs or to the people who use them?

In general, when you import text files you fix the variable width in the DATA LIST.
Or you use GET DATA/TYPE=
http://www.gnu.org/software/pspp/manual/pspp.html#GET-DATA

And why don't you set FORMAT on each of the separate files before you integrate them?
 
When I worked on a project that sounds similar to yours, we did serious pre-fieldwork training of the local data producers. It succeeded in making the local projects aware of what was at stake (motivation). It made the local heads check the consistency of the data to be sent (data control), something we could not do ourselves because we had no direct access to the local projects, the local heads had better knowledge of them, and it would have cost us too much. And it ensured that data were sent in a coherent format and on time.

Maybe you have to train your local people?

Just some ideas for local problem solving. I am happy that we have volunteers doing the programming work, so we should not burden them with work that we can do on our side.

ftr,

It sounds like you don't run into this problem, so maybe this discussion isn't relevant for you. 

But to repeat the reasons why this change is a good idea: (1) it would still be EASIER to have PSPP deal with this problem automatically, rather than forcing me to deal with it; and (2) it would be a simple way to create another point distinguishing PSPP as superior to SPSS.

I have given some thought to why SPSS has this limitation. One possibility is that it's simply an old limitation due to some original hardware or software issues. I speculate below that at the time of SPSS's inception, string data was neither particularly common nor important, and that varying lengths would have been rare. It could also be due to performance issues, but if so I'm sure it would be faster for PSPP to resolve this issue than for me to do so manually; I assume that fixing this issue wouldn't generally slow down merging/joining files?

I cannot imagine a situation where having this restriction on matching string length would be a feature.  But if PSPP solves the problem by truncating longer strings, then some data would be lost and sometimes that will be unacceptable so it would be good to issue a warning or force people to turn on this feature.  If the solution can be to change the final string length to the longest encountered string length (and, I assume, therefore truncate no data) then I cannot see a problem arising from this feature.
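
To make the two options concrete, here is a minimal Python sketch, not PSPP's actual behavior: the names `reconcile_width` and `fit` and the `policy` argument are hypothetical. Widening to the longest width loses nothing, while truncating to the shortest width issues a warning because data may be lost.

```python
import warnings


def reconcile_width(widths, policy="widen"):
    """Choose one width for a string variable seen with several widths."""
    if policy == "widen":
        # Longest width wins, so no value has to be cut.
        return max(widths)
    if policy == "truncate":
        target = min(widths)
        if target < max(widths):
            warnings.warn(
                f"truncating strings from width {max(widths)} to {target}"
            )
        return target
    raise ValueError(f"unknown policy: {policy!r}")


def fit(value, width):
    """Pad with trailing spaces, or cut, to exactly `width` characters."""
    return value[:width].ljust(width)


# Two files store the same variable as A6 and A12 (widths are made up).
width = reconcile_width([6, 12])   # widen: 12
padded = fit("female", width)      # "female" plus six trailing spaces
```

With `policy="truncate"` the same call would return 6 and emit a warning, which is the opt-in behavior described above.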

I also speculate that this problem is far more of an issue today than when SPSS was first created, because string data is easier (sometimes more natural) to collect today. SPSS would have originally (i.e., circa 1970) been fed punch cards, and most string data would have been generated either by the researcher (like a coding) or by something like a Scantron or a Scantron-like response grid. I'm sure someone had participants respond by writing something in, but it would have been keyed into the computer at a fixed width. Using a physical storage medium (cards) would have discouraged strings unless they were necessary and encouraged researchers to use the shortest possible length. Compare that to now: my web-based surveys often have variable-length strings like email, user agent and other string-based metadata, and often the survey includes fill-in-the-blank or short-answer questions. Often I get datasets where responses are strings rather than numeric codes (e.g., "male" and "female").  Even if they are the same data (e.g., email), it would be natural for these variables to have different lengths across different surveys. I don't foresee these conditions changing.

-Alan


-- 

Alan D. Mead, Ph.D.
President, Talent Algorithms Inc.

science + technology = better workers

+815.588.3846 (Office)
+267.334.4143 (Mobile)

http://www.alanmead.org

Announcing the Journal of Computerized Adaptive Testing (JCAT), a
peer-reviewed electronic journal designed to advance the science and
practice of computerized adaptive testing: http://www.iacat.org/jcat



