Re: Import large field-delimited file with strings and numbers

From: João Rodrigues
Subject: Re: Import large field-delimited file with strings and numbers
Date: Mon, 08 Sep 2014 18:54:22 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

On 08-09-2014 17:49, Philip Nienhuis wrote:
>> Yet, csv2cell is orders of magnitude faster. I will break the big file
>> into chunks (using fileread, strfind to determine newlines and fprintf)
>> and then apply csv2cell chunk-wise.
> Why do you need to break it up for csv2cell? AFAICS that reads the entire
> file and directly translates the data into "values" in the output cell
> array, using very little temporary storage (the latter quite unlike
> It does read the entire file twice, once to assess the required dimensions
> for the cell array, the second (more intensive) pass for actually reading
> the data.
The file I want to read has around 35 million rows and 15 columns and takes about 200 MB of disk space: csv2cell would simply eat up all the memory and the computer would stop responding.

I tried feeding it small chunks of increasing size and found that it behaved well until it received a chunk of 500 million rows (at which point memory use went through the stratosphere).

So I opted for the clumsy solution of breaking the file into small pieces and spoon-feeding them to csv2cell.
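For reference, a minimal sketch of that chunk-wise approach, assuming the Octave-Forge io package (for csv2cell); the file name "bigfile.csv", the temporary file "chunk.csv", and the chunk size are all hypothetical and should be adapted:

```octave
pkg load io                          # csv2cell lives in the io package
txt = fileread ("bigfile.csv");      # whole file as one char array
nl  = strfind (txt, "\n");           # byte positions of every newline
rows_per_chunk = 100000;             # tune to available memory
data  = {};
start = 1;
stops = [nl(rows_per_chunk:rows_per_chunk:end), numel(txt)];
for stop = stops
  if (start > stop)                  # skip an empty trailing chunk
    continue;
  endif
  fid = fopen ("chunk.csv", "w");    # spoon-feed csv2cell one piece
  fwrite (fid, txt(start:stop));
  fclose (fid);
  data  = [data; csv2cell("chunk.csv")];
  start = stop + 1;
endfor
```

Growing `data` inside the loop reallocates on every iteration, so for very many chunks it is better to collect the pieces in a cell of cells and concatenate once at the end.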

But then I found out something interesting: if I saved a cell array with 35 million rows and only 3 columns in gzip format, it would take very little disk space (20 MB or so), but when I tried to load it back it would again take forever and eat up GBs of memory.

Bottom line: I think it has to do with the way Octave allocates memory for cell arrays, which is not very efficient (as opposed to dense or sparse numerical data, which it handles very well).
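The overhead is easy to see in a session. A rough illustration, comparing the same million doubles stored as a matrix and as a cell array (exact byte counts vary by platform and Octave version, so no figures are claimed here):

```octave
v = rand (1e6, 1);      # 1e6 doubles in one contiguous block (~8 MB)
c = num2cell (v);       # same values, but one separately allocated cell each
info = whos ("v", "c");
printf ("%s: %d bytes\n", info(1).name, info(1).bytes);
printf ("%s: %d bytes\n", info(2).name, info(2).bytes);
```

Each cell carries its own header on top of the payload, which is why a cell array of scalars costs several times what the equivalent matrix does.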

I managed to solve the problem I had, thanks to the help of you guys.

However, I think it would be nice if future versions of Octave shipped with something akin to ulimit enabled by default, to prevent a process from eating up all available memory.

If someone wants to check this issue, the data I am working with is public: */csv/*

where * = 1990:2013 explains the content.
