Subject: Re: Import large field-delimited file with strings and numbers
Date: Mon, 8 Sep 2014 21:27:01 +0200
On 08-09-2014 17:49, Philip Nienhuis wrote:
The file I want to read has around 35 million rows, 15 columns and takes 200 MB of disk space: csv2cell would simply eat up all memory and the computer stopped responding.
<snip>
Yet, csv2cell is orders of magnitude faster. I will break the big file into chunks (using fileread, strfind to determine newlines and fprintf) and then apply csv2cell chunk-wise.
Why do you need to break it up using csv2cell? AFAICS that reads the entire file and directly translates the data into "values" in the output cell array, using very little temporary storage (the latter quite unlike
It does read the entire file twice, once to assess the required dimensions for the cell array, the second (more intensive) pass for actually reading
I tried feeding it chunks of increasing size and found that it behaved well until it received a chunk of 500 million rows, at which point memory use went through the stratosphere.
So I opted for the clumsy solution of breaking the file into small pieces and spoon-feeding them to csv2cell.
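For anyone following along, the chunk-wise workaround (fileread, strfind on newlines, fprintf, then csv2cell per piece) could look roughly like the sketch below. It is only an illustration under assumptions, not the exact script used here: the function names split_into_chunks and process_csv_chunkwise, the handler callback and the chunk size are all made up for the example, and csv2cell needs the io package to be loaded first.

```octave
1;  % script marker, so the functions below can live in one script file

function parts = split_into_chunks (txt, rows_per_chunk)
  % Split TXT (the whole file as one string) into a cell array of strings,
  % each holding at most ROWS_PER_CHUNK complete lines.
  if (isempty (txt))
    parts = {};
    return;
  endif
  if (txt(end) != "\n")
    txt(end+1) = "\n";                 % make sure the last line is terminated
  endif
  nl = strfind (txt, "\n");            % positions of all line ends
  last = nl(rows_per_chunk:rows_per_chunk:end);  % end of each full chunk
  if (isempty (last) || last(end) != nl(end))
    last(end+1) = nl(end);             % trailing, shorter chunk
  endif
  first = [1, last(1:end-1) + 1];
  parts = cell (1, numel (last));
  for k = 1:numel (last)
    parts{k} = txt(first(k):last(k));
  endfor
endfunction

function process_csv_chunkwise (fname, rows_per_chunk, handler)
  % Write each chunk to a temporary file, read it with csv2cell (io package)
  % and pass the resulting cell array to HANDLER (a function handle).
  % Note: the text is held in memory twice (file contents plus chunks).
  % Any header row ends up in the first chunk only.
  parts = split_into_chunks (fileread (fname), rows_per_chunk);
  tmp = [tempname(), ".csv"];
  for k = 1:numel (parts)
    fid = fopen (tmp, "w");
    fprintf (fid, "%s", parts{k});
    fclose (fid);
    handler (csv2cell (tmp));          % only one chunk is a cell at a time
  endfor
  delete (tmp);
endfunction
```

A call might then look like this, with a hypothetical file name and a chunk size picked arbitrarily: pkg load io; process_csv_chunkwise ("big.csv", 500000, @(c) disp (size (c)));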
But then I found out something interesting. If I saved a cell array with 35 million rows and only 3 columns in gzip format, it would take very little disk space (20 MB or so), but when I tried to open it... it would again take forever and eat up GBs of memory.
Bottom line: I think it has to do with the way Octave allocates memory for cell arrays, which is not very efficient (as opposed to dense or sparse numerical data, which it handles very well).
I managed to solve the problem I had, thanks to the help of you guys.
However, I think it would be nice if future versions of Octave had something akin to ulimit enabled by default, to prevent a process from eating up all available memory.
If someone wants to check this issue, the data I am working with is public:
where * = 1990:2013
http://www.bls.gov/cew/datatoc.htm explains the content.
Help-octave mailing list