Re: Use R to manage results from GNU Parallel

Here are two versions for reading the results into a data.frame. The first actually reads into a data.table, using a data.table approach; you can convert the result to a data.frame with as.data.frame. The second uses plyr and returns a data.frame. The data.table approach was much faster than the plyr approach, at least on the data I tested. (In my experience this is always the case.)

Here are some tricky points that I've accounted for; there are likely still subtle things I'm not handling correctly:

1) For the data.table approach, I use fread rather than read.table, because it's much faster. fread tries to figure out whether you passed it a filename or a string to read from. To ensure that it knows it's reading a string, I append a newline to the end of every stdout.

2) When stdout is empty, I don't include any entries. Another possibility would be to include NAs, but that would take a few more lines of code.

3) For the data.table approach, fread uses heuristics to guess whether the input has a header, unless you tell it explicitly. For the sample data you generate, the heuristic doesn't work, so I hard-coded the fread call to assume that there is no header. read.table assumes, by default, that there is no header, but I included that parameter for symmetry between the two solutions.

4) For the data.table approach, we first need to convert the raw data to a data.table. This should be as easy as ddt = as.data.table(raw), but because of an oversight (I believe) by the author of data.table, it always converts character columns to factors, without an option to do otherwise. So we have to convert to a data.frame first. I believe eventually we won't have to do that.

5) One should be able to specify the separator character using the ... parameters, which are passed on to fread and read.table.

6) I still think the speediest solution would be to put all the data together outside of R, and then read it in with a single read.table from a pipe, or an fread from a temporary file.
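
To make point 6 concrete, here is a rough sketch of reading everything with a single read.table from a pipe. The temporary directory, the file names, and the use of cat are all assumptions made up for illustration (this is not the layout GNU Parallel writes), and it needs a Unix-like system. Note that it does not keep track of which job produced which lines.

```r
# Hypothetical layout: one plain-text output file per job in a temp directory.
results_dir <- tempfile("results")
dir.create(results_dir)
writeLines(c("1 2", "3 4"), file.path(results_dir, "job1.txt"))
writeLines("5 6", file.path(results_dir, "job2.txt"))

# Concatenate all files outside of R, then parse the stream once.
files <- list.files(results_dir, full.names = TRUE)
cmd <- paste("cat", paste(shQuote(files), collapse = " "))
combined <- read.table(pipe(cmd), header = FALSE)
print(combined)
```

If the argument values are needed as columns, they would have to be written into the files themselves, or prepended to each line, before concatenating.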

raw_to_data.table <- function(raw, ...) {
  require(data.table)
  varnames = setdiff(colnames(raw), c("stdout","stderr"))
  rownames(raw) = 1:nrow(raw)
  # Go through a data.frame so that character columns are not turned into
  # factors; after the data.table feature request, this will be much faster.
  ddt = as.data.table(as.data.frame(raw, stringsAsFactors=FALSE))
  ddt[, stdout := paste0(stdout,"\n")] # ensure fread knows stdout is a string and not a filename
  # Drop entries with empty stdout (only the appended newline is left)
  ddd = ddt[nchar(stdout)>1, fread(stdout, header=FALSE, ...), by=varnames]
  return(ddd)
}

raw_to_data.frame <- function(raw, ...) {
  require(plyr)
  raw = as.data.frame(raw, stringsAsFactors = FALSE)
  varnames = setdiff(colnames(raw), c("stdout","stderr"))
  dd = ddply(raw, .variables=varnames, function(row) {
    # Skip entries with empty stdout
    if (nchar(row[,"stdout"]) == 0) {
      return(NULL)
    }
    con <- textConnection(row[,"stdout"], open = "r")
    on.exit(close(con)) # read.table does not close a connection that is already open
    d = read.table(con, header=FALSE, ...)
    return(d)
  })
  return(dd)
}
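
For anyone who wants to see the expected input and output shape without installing data.table or plyr, here is a small base-R sketch of the same idea. It is not one of the two functions above, just a plain lapply/rbind stand-in, and the sample matrix (columns a, b, stdout, stderr) is made up for illustration on the assumption that raw has one row per job.

```r
# Hypothetical input: one row per job, argument columns plus stdout/stderr.
raw <- cbind(
  a      = c("1", "2", "3"),
  b      = c("x", "y", "z"),
  stdout = c("10 20\n30 40\n", "", "50 60\n"),
  stderr = c("", "", "")
)

raw_df   <- as.data.frame(raw, stringsAsFactors = FALSE)
varnames <- setdiff(colnames(raw_df), c("stdout", "stderr"))

# Parse each non-empty stdout and attach that job's argument values.
pieces <- lapply(seq_len(nrow(raw_df)), function(i) {
  out <- raw_df[i, "stdout"]
  if (nchar(out) == 0) return(NULL)   # skip jobs with no output
  parsed <- read.table(text = out, header = FALSE)
  job_args <- raw_df[rep(i, nrow(parsed)), varnames, drop = FALSE]
  cbind(job_args, parsed)
})

pieces <- Filter(Negate(is.null), pieces)
result <- do.call(rbind, pieces)
rownames(result) <- NULL
print(result)
```

Each line of a job's stdout becomes one row, with that job's argument values repeated alongside it; the job with empty stdout contributes no rows, as in point 2.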

From:	David Rosenberg
Subject:	Re: Use R to manage results from GNU Parallel
Date:	Sun, 5 Jan 2014 15:43:17 -0500