pspp-users


From: Stefan Tzeggai
Subject: Performance questions: workspace_size default value and temp file directory
Date: Fri, 15 Mar 2013 12:57:46 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130308 Thunderbird/17.0.4

Hi everybody and thanks for this powerful piece of free software.

I use GNU pspp 0.7.9 (Fri Jun 29 19:31:48 UTC 2012) to batch convert CSV
to SAV files. The script basically does
GET DATA /TYPE=TXT
VARIABLE LABELS
VALUE LABELS
SAVE OUTFILE /COMPRESSED

My "bigger" CSV files are between 100 MB and 1 GB in size, with 300
columns and 3,000,000 rows, mostly numeric. PSPP performance is pretty bad
on the big files: a single CPU core is only about 20% busy, while top's
IO wait flickers up to 20%wa.
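
For a sense of scale, here is a rough back-of-envelope estimate of how much data the largest file represents once parsed. It is only a sketch: the 8 bytes per value is an assumption (one double per numeric cell), not something confirmed for PSPP's internal storage, and the function name is made up for illustration.

```python
# Rough size of the largest file described above once parsed into numbers.
ROWS = 3_000_000
COLUMNS = 300
BYTES_PER_VALUE = 8  # assumption: each numeric cell is held as an 8-byte double


def uncompressed_bytes(rows, columns, bytes_per_value=BYTES_PER_VALUE):
    """Return the estimated number of bytes for a rows x columns numeric table."""
    return rows * columns * bytes_per_value


total = uncompressed_bytes(ROWS, COLUMNS)
print(f"{total:,} bytes = {total / 2**30:.1f} GiB")
# prints: 7,200,000,000 bytes = 6.7 GiB
```

Under that assumption the parsed data is several times larger than the 1 GB CSV file itself.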

I started to investigate solutions and came up with these questions:

SET WORKSPACE=workspace_size

    The maximum amount of memory that PSPP will use to store data being
    processed. If memory in excess of the workspace size is required,
    then PSPP will start to use temporary files to store the data.
    Setting a higher value will, in general, mean procedures will run
    faster, but may cause other applications to run slower. On platforms
    without virtual memory management, setting a very large workspace
    may cause PSPP to abort. 

1. Question: Is this amount given in BYTES? Are there any further
recommendations for this setting? Is the memory reserved on demand (a bit
more, a bit more, a bit more) while processing, or all at once as soon as
the command is executed?
What is the default value, and how can I query the current setting? "SHOW
workspace;" did not work.

When I set workspace=268435456 (256 MB), the process uses 100% CPU and IO
wait goes down. So this is one way to get more performance.
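
The value above was computed as follows. This is only a sketch of the arithmetic, and it assumes the setting is interpreted in bytes, which is exactly what question 1 asks; the helper name is made up for illustration.

```python
# Convert candidate workspace sizes from MiB to a plain number of bytes.
MIB = 1024 * 1024  # bytes in one mebibyte


def mib_to_bytes(mib):
    """Return the number of bytes in `mib` mebibytes."""
    return mib * MIB


# Candidate sizes to try; the unit PSPP actually expects is an open question.
for size_mib in (64, 256, 1024):
    print(f"{size_mib} MiB -> {mib_to_bytes(size_mib)}")
# prints:
# 64 MiB -> 67108864
# 256 MiB -> 268435456
# 1024 MiB -> 1073741824
```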

When I provide a low WORKSPACE, the disk IO increases. Where are these
temporary files stored? I could not find any hints in the documentation,
and I could not see any files being created in /tmp. Is there an option to
set this directory?

Any more ideas on performance? Can the SAVE output be piped directly to a
zip command, so that some more disk IO could be saved?

Many thanks in advance,
Steve


