[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Import large field-delimited file with strings and numbers

From: Helios de Rosario
Subject: Re: Import large field-delimited file with strings and numbers
Date: Mon, 08 Sep 2014 15:23:00 +0200

>>> João Rodrigues<address@hidden> 06/09/2014 16:19 >>>

I need to import a large CSV file with multiple columns with mixed
string and number entries, such as:

field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,

and I want to pass this on to something like

cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
arr3 =[1 2 3]';
arr4 =[1.0 2.0 3.0]';

furthermore, some columns can be ignored, the total number of entries is
known and there is a header.

How can I perform the import within reasonable time and little memory
overhead? Below are a few of my attempts.
As others have mentioned, the problem is easy if in your table all columns have the same number of characters. In that case, you can read the whole text into a char matrix, and then use mat2cell to separate the strings by the width of the columns, or str2num to put the numeric values in a matrix of doubles.
If the file is too large, you cannot avoid looping, but you can work with "not-so-large" text chunks, e.g.:
cell1 = cell2 = cell(nrow, 1);
arr1 = arr2 = nan(nrow, 1);
strail = []; # Variable to store trailing characters in the loop
lastrow = 0;
fid = fopen("test.txt", "r");
  s = [strail, fscanf(fid, "%c", 1024^2)]; # Chunks of 1 MB
  # Use only "full" lines, and save trailing characters for later
  lastnl = find(s=="\n", 1, "last");
  strail = s(lastnl+1:end);
  s = s(1:lastnl);
  # Here manipulate the string "s"
  # write the number of written lines in variable "n"
  # and fill in cell1(lastrow + (1:n)), etc.
  lastrow += n;
If your file does not have columns of the same width, you might use other programs to import it and export it with fixed-width columns, but you can also transform the original ascii string within Octave, and avoid writing new files into disk. Let's say that the maximum width of your columns is 5 characters:
maxlength = 5;

# Find separation characters (commas and new line)
sep = strchr(s, ",\n");
nval = length(sep);
# Actual length of your "words"
sl = [sep(1), diff(sep)];
# String with fixed width slots
sfull = ' '(ones(1,nval*(maxlength+1)));
# Positions of separation characters in "sfull"
sepfull = (maxlength + 1) * (1:nval);
# Fill in sfull backwards
for c = 0:maxlength
  sfull(sepfull - c) = ...
    ifelse(c < sl, s(max(1, sep - c)), ' ');
# Reshape sfull
sfull = reshape(sfull', (maxlength+1)*ncols)';
In Windows, you might want to use strrep to replace EOL characters ("\n\r") to UNIX-like "\n" as a first step.
Combining these two tricks, I have read a sample file of 44 MB in 17 seconds (reading chunks of 1MB), 50 times faster than looping over fscanf or fgetl + strsplit.
Helios De Rosario


365 días de Innovación
Universidad Politécnica de Valencia • Edificio 9C
Camino de Vera s/n • 46022 VALENCIA (ESPAÑA)
Tel. +34 96 387 91 60 • Fax +34 96 387 91 69

Antes de imprimir este e-mail piense bien si es necesario hacerlo.
En cumplimiento de la Ley Orgánica 15/1999 reguladora de la Protección de Datos de Carácter Personal, le informamos de que el presente mensaje contiene información confidencial, siendo para uso exclusivo del destinatario arriba indicado. En caso de no ser usted el destinatario del mismo le informamos que su recepción no le autoriza a su divulgación o reproducción por cualquier medio, debiendo destruirlo de inmediato, rogándole lo notifique al remitente.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]