Re: Import large field-delimited file with strings and numbers

From: Markus Bergholz
Date: Mon, 8 Sep 2014 21:27:57 +0200

On Mon, Sep 8, 2014 at 9:27 PM, Markus Bergholz <address@hidden> wrote:

On Mon, Sep 8, 2014 at 7:54 PM, João Rodrigues <address@hidden> wrote:

On 08-09-2014 17:49, Philip Nienhuis wrote:

<snip>
Yet, csv2cell is orders of magnitude faster. I will break the big file
into chunks (using fileread, strfind to determine newlines and fprintf)
and then apply csv2cell chunk-wise.

Why do you need to break it up when using csv2cell? AFAICS that reads the entire
file and directly translates the data into "values" in the output cell
array, using very little temporary storage (the latter quite unlike
textscan/strread).
It does read the entire file twice, once to assess the required dimensions
for the cell array, the second (more intensive) pass for actually reading
the data.

The file I want to read has around 35 million rows and 15 columns, and takes 200 MB of disk space: csv2cell simply ate up all the memory and the computer stopped responding.

I tried feeding it chunks of increasing size and found that it behaved well until it received a chunk of 500 million rows (at which point memory use went through the stratosphere).

So I opted for the clumsy solution of breaking the file into small pieces and spoon-feeding them to csv2cell.
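For readers who want to try the same workaround, here is a minimal sketch of the chunk-wise approach described above (fileread, strfind to locate the newlines, fprintf to write each piece, then csv2cell on the piece). It is not the code actually used in this thread. The function name csv_in_chunks and the reduce_fn callback are hypothetical, and the sketch assumes the io package is installed and that no quoted field contains an embedded newline.

```octave
1;            % script-file marker so the function can be defined inline
pkg load io   % csv2cell is provided by the io package

% Hypothetical helper: read FNAME in pieces of ROWS_PER_CHUNK lines, run
% csv2cell on each piece, and keep only what REDUCE_FN returns for it.
function out = csv_in_chunks (fname, rows_per_chunk, reduce_fn)
  out = {};
  txt = fileread (fname);            % whole file as one char array
  if (isempty (txt))
    return;
  endif
  nl = strfind (txt, "\n");          % index of every line end
  if (isempty (nl) || nl(end) != numel (txt))
    nl(end+1) = numel (txt);         % last line has no trailing newline
  endif
  n_chunks = ceil (numel (nl) / rows_per_chunk);
  out = cell (n_chunks, 1);
  tmp = [tempname() ".csv"];         % scratch file reused for every piece
  first = 1;
  for k = 1:n_chunks
    last = nl(min (k * rows_per_chunk, numel (nl)));
    fid = fopen (tmp, "w");
    fprintf (fid, "%s", txt(first:last));
    fclose (fid);
    out{k} = reduce_fn (csv2cell (tmp));   % keep only the reduced result
    first = last + 1;
  endfor
  delete (tmp);
endfunction
```

A call such as csv_in_chunks ("big.csv", 100000, @(c) c(:, 1:3)) would keep only the first three columns of each piece; the file name and chunk size here are placeholders, not values from this thread. Passing a reducing callback keeps the full 15-column cell array from ever being held in memory at once.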

But then I found out something interesting. If I saved a cell array with 35 million rows and only 3 columns in gzip format, it took very little disk space (20 MB or so), but when I tried to open it, it again took forever and ate up GBs of memory.

Bottom line: I think it has to do with the way Octave allocates memory to cells, which is not very efficient (as opposed to dense or sparse numerical data, which it handles very well).

I managed to solve the problem I had, thanks to the help of you guys.

However, I think it would probably be nice if in future versions of Octave there was something akin to ulimit installed by default to prevent a process from eating up all available memory.

If someone wants to check this issue, the data I am working with is public:

http://www.bls.gov/cew/data/files/*/csv/*_annual_singlefile.zip

where * = 1990:2013
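As a sketch, the pattern can be expanded into concrete URLs like this, assuming both * placeholders stand for the same year (the variable names are only illustrative):

```octave
years = 1990:2013;
pattern = "http://www.bls.gov/cew/data/files/%d/csv/%d_annual_singlefile.zip";
% One URL per year; the year fills both placeholders (an assumption).
urls = arrayfun (@(y) sprintf (pattern, y, y), years, "UniformOutput", false);
printf ("%s\n", urls{:});
```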

404 in all combinations I've tried

nvm, got it.

which columns do you need?

http://www.bls.gov/cew/datatoc.htm explains the content.

--
icq: 167498924
XMPP|Jabber: address@hidden
