Re: Performance questions: workspace_size default value and temp file di


From: John Darrington
Subject: Re: Performance questions: workspace_size default value and temp file directory
Date: Sat, 16 Mar 2013 13:33:41 +0100
User-agent: Mutt/1.5.20 (2009-06-14)

On Fri, Mar 15, 2013 at 12:57:46PM +0100, Stefan Tzeggai wrote:
     Hi everybody and thanks for this powerful piece of free software.
     
     I use GNU pspp 0.7.9 (Fri Jun 29 19:31:48 UTC 2012) to batch convert CSV
     to SAV files. The script basically does
     GET DATA /TYPE=TXT
     VARIABLE LABELS
     VALUE LABELS
     SAVE OUTFILE /COMPRESSED
     
     My "bigger" CSV files are between 100MB and 1GB in filesize, 300
     columns, 3000000 rows, mostly numerics. PSPP performance is pretty bad
     on the big files. One single CPU core uses only 20%, top's wait flickers
     up to 20%wa.
     
     I started to investigate solutions and came up with these questions:
     
     SET WORKSPACE=workspace_size
     
         The maximum amount of memory that PSPP will use to store data being
         processed. If memory in excess of the workspace size is required,
         then PSPP will start to use temporary files to store the data.
         Setting a higher value will, in general, mean procedures will run
         faster, but may cause other applications to run slower. On platforms
         without virtual memory management, setting a very large workspace
         may cause PSPP to abort. 
     
     1. Question: Is this the amount in BYTES? Any more recommendations on
     this setting? Will the amount be reserved on demand (a bit more, a bit
     more, a bit more) while processing or fully as soon as the command is
     executed?
     What is the default value and how can I query the present setting? "SHOW
     workspace;" did not work.

The value is in bytes.  The default is 64 MB (64 * 1024 * 1024).  It is an
upper limit, so it will only be used if needed.  It is a little more complex
than that, because it is the maximum amount PER READER, and some operations
require multiple readers.
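
To illustrate the byte arithmetic, here is a minimal shell sketch that turns a
size in megabytes into the byte count for SET WORKSPACE.  It assumes, as stated
above, that the value is given in bytes.  The helper name mib_to_bytes is
hypothetical, and the generated line assumes the usual period as command
terminator.

```shell
#!/bin/sh
# Hypothetical helper: convert a size in MiB to bytes.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# The default mentioned in this thread is 64 MB.
default_ws=$(mib_to_bytes 64)

# The larger value tried in this thread is 256 MB.
bigger_ws=$(mib_to_bytes 256)

# Emit a line that could be placed at the top of a syntax file.
echo "SET WORKSPACE=${bigger_ws}."
```

Since the limit applies per reader, the total memory used may exceed the
figure you pass, so it seems sensible to leave some headroom.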

I don't know why SHOW WORKSPACE doesn't work.  Maybe that's a bug.
     
     When I set workspace=268435456 (256mb) the process uses 100% CPU and IO
     wait is down. So it is an approach for more performance.

That is what I would expect.  Basically, the bigger the workspace, the faster
the processing.  But clearly if the pspp engine is running at 100% CPU, then
there is nothing left for other processes.  This can be an issue for people
who are using the GUI and want it to remain responsive, or who want other
applications to keep working while they wait for results to be processed.
     
     When I provide a low WORKSPACE, the disk IO increases. Where are these
     files stored? I could not find any hints in the documentation and I
     could not see any files being created in /tmp. Is there an option to set
     this directory?


You can see this if you type SHOW TEMPDIR.  On my system it is indeed under
/tmp, but this varies according to operating system.  You can override it with
the TMPDIR environment variable, or some operating systems have their own ways
of defining a temporary directory.  You might see a performance advantage if
you set it to a directory which is mounted on a different physical disk from
the one you are working on.
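
A minimal sketch of the TMPDIR override, assuming a POSIX shell.  The
SCRATCH_DIR variable and the pspp-scratch path are hypothetical placeholders;
substitute a directory that really lives on a separate physical disk.  The
script name run.sps is also only an example.

```shell
#!/bin/sh
# Hypothetical scratch location; point this at a directory on another disk.
scratch_dir="${SCRATCH_DIR:-$PWD/pspp-scratch}"

# Make sure the directory exists before PSPP tries to use it.
mkdir -p "$scratch_dir"

# Programs that honour TMPDIR will now create temporary files there.
TMPDIR="$scratch_dir"
export TMPDIR

# Run the conversion only if pspp and the syntax file are both present.
if command -v pspp >/dev/null 2>&1 && [ -f run.sps ]; then
    pspp run.sps
fi
```

Running SHOW TEMPDIR from within the syntax file should then confirm which
directory is actually in use.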
     
     Any more ideas on performance? Can SAVE output be piped to zip-command
     directly, so some more disk IO could be saved?

I suppose you could use a fifo, like this:

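mkfifo myfifo
# Start the compressor in the background; it reads from the fifo.
gzip -c < myfifo > foo.sav.gz &
pspp run.sps
wait   # let gzip finish writing foo.sav.gz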

where run.sps contains the line SAVE OUTFILE='myfifo'.
But I am unsure that it would provide any speed benefit.

If ALL you are trying to do is convert text to a .sav file, then running PSPP
is probably not a good idea.  It will be much faster if you write a small Perl
script which uses the Perl modules which come with PSPP.

J'


-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://keys.gnupg.net or any PGP keyserver for public key.
