I previously posted a method I used for merging a set of files into a dataframe. It wasn’t long before I had some very helpful comments from the R-bloggers community suggesting better methods to achieve my goal. In this post, I compare the different methods and see which is the most efficient (i.e., fastest).
The Methods
My original method is outlined in my post. In the comments, you can see two further methods suggested. One by sayan involves the use of the do.call function and lapply. A second by dan involves the use of plyr‘s ldply function. Check out the comments for the full discussion.
I will therefore compare three methods:
- My original method
- sayan‘s lapply method
- dan’s plyr method
Testing
I ran each of the three methods 10 times (not hugely powerful I know, but it still took a while). For testing purposes, I merged two 16MB text files together, containing several thousand rows and several hundred columns. Having not done any real amount of timing in R before, I searched around a bit. In the end, I found two posts which I based my timings on (here and here). If I’ve done this incorrectly, let me know and I will run them again. Anyway, here are the results. The error bars are standard errors. The time taken is in seconds.
As you can see, my method is by far the slowest. Looks like I won’t be using it ever again!
The R Code
For maximum transparency, below is the R code I used to get these numbers.
# lapply method
lap = replicate(N, system.time(
full_data<- do.call(
"rbind",lapply(file_list,
FUN=function(files){read.table(files,
header=TRUE, sep="\t")})))[3])
lap
# plyr method
ply =
replicate(N, system.time(
dataset <- ldply(file_list, read.table, header=TRUE, sep="\t")
)[3])
ply
# original method
orig=
replicate(N, system.time(
for (file in file_list){
# if the merged dataset doesn't exist, create it
if (!exists("dataset")){
dataset <- read.table(file, header=TRUE, sep="\t")
}
# if the merged dataset does exist, append to it
if (exists("dataset")){
temp_dataset <-read.table(file, header=TRUE, sep="\t")
dataset<-rbind(dataset, temp_dataset)
rm(temp_dataset)
}
}
)[3])
orig

In the original code just before the “for” loop, you could set:
dataset = NULL
This means you can remove the “if” statements within your for loop. This should speed things up slightly.
Good idea – thanks for the tip, Colin!
Have you thought about using the “rbenchmark” package on CRAN to do timings? I find it usually makes the code look cleaner. A nice example is given here http://dirk.eddelbuettel.com/blog/2011/04/12/#the_new_r_compiler_package
Thanks, Tony – I hadn’t seen rbenchmark before, but will give it a go now (installing as I type!). Looking at the examples, it certainly seems cleaner than the method I’ve used above to time scripts.
hi,
sorry for putting here my question but I think it is quite pertinet to the prsent argument. I’m working on a really big ammount of data coming from a study using driver simulators, I have more tha 400 csv files with different lengths I need to merge but also to add some columns in order to keep some pieces of informations contained in the file’s titles. I tryed to do it with the method you calle “original”, it work but only with a part of the files. If applied to all the files, the process get killed after a while (I estimate the final databse in ten milion of row). Here the code:
csv_files <-list.files(pattern='csv')
Es_files <- list.files(pattern='Esercitazione')
file_list<-csv_files[!csv_files %in% Es_files]
file_list<-file_list[52:100]
for (file in file_list){
# if the merged dataset doesn't exist, create it
if (!exists("dataset")){
dataset <- read.table(file,sep=",",header=T)
#estraggo dal nome filevil parametro data
dataset$date<-rep(paste(strsplit(file,"-")[[1]][1],"/",paste(strsplit(strsplit(file,"-")[[1]][2],"")[[1]][1],
strsplit(strsplit(file,"-")[[1]][2],"")[[1]][2],sep=""),"/",
paste(strsplit(strsplit(file,"-")[[1]][2],"")[[1]][3],
strsplit(strsplit(file,"-")[[1]][2],"")[[1]][4],sep=""),sep=""),dim(dataset)[1])
#estrago dal nome file il parametro ora di inizio
dataset$test_time<-rep(strsplit(file,"-")[[1]][3],dim(dataset)[1])
#estraggo il tipo di percorso
dataset$path<-rep(strsplit(file,"-")[[1]][4],dim(dataset)[1])
#estraggo … non ricordo che roba sia ma è importante
dataset$c<-rep(paste(strsplit(strsplit(file,"-")[[1]][5],"")[[1]][1],
strsplit(strsplit(file,"-")[[1]][5],"")[[1]][2],
strsplit(strsplit(file,"-")[[1]][5],"")[[1]][3],sep=""),dim(dataset)[1])
}
# if the merged dataset does exist, append to it
if (exists("dataset")){
temp_dataset <-read.table(file,sep=",",header=T)
#estraggo dal nome filevil parametro data
temp_dataset$date<-rep(paste(strsplit(file,"-")[[1]][1],"/",paste(strsplit(strsplit(file,"-")[[1]][2],"")[[1]][1],
strsplit(strsplit(file,"-")[[1]][2],"")[[1]][2],sep=""),"/",
paste(strsplit(strsplit(file,"-")[[1]][2],"")[[1]][3],
strsplit(strsplit(file,"-")[[1]][2],"")[[1]][4],sep=""),sep=""),dim(temp_dataset)[1])
#estrago dal nome file il parametro ora di inizio
temp_dataset$test_time<-rep(strsplit(file,"-")[[1]][3],dim(temp_dataset)[1])
#estraggo il tipo di percorso
temp_dataset$path<-rep(strsplit(file,"-")[[1]][4],dim(temp_dataset)[1])
#estraggo … non ricordo che roba sia ma è importante
temp_dataset$c<-rep(paste(strsplit(strsplit(file,"-")[[1]][5],"")[[1]][1],
strsplit(strsplit(file,"-")[[1]][5],"")[[1]][2],
strsplit(strsplit(file,"-")[[1]][5],"")[[1]][3],sep=""),dim(temp_dataset)[1])
dataset<-rbind(dataset, temp_dataset)
rm(temp_dataset)
}
}
now I tried the method with ldply that is lighter, but I can't figure out how to implement the columns addition.
here is the not working code:
library(plyr)
csv_files <-list.files(pattern='csv')
Es_files <- list.files(pattern='Esercitazione')
file_list<-csv_files[!csv_files %in% Es_files]
file_list<-file_list[52:100]
getdf<-function(File){
read.table(File,sep=",",header=T)
#estraggo dal nome filevil parametro data
df$date<-rep(paste(strsplit(File,"-")[[1]][1],"/",paste(strsplit(strsplit(File,"-")[[1]][2],"")[[1]][1],
strsplit(strsplit(File,"-")[[1]][2],"")[[1]][2],sep=""),"/",
paste(strsplit(strsplit(File,"-")[[1]][2],"")[[1]][3],
strsplit(strsplit(File,"-")[[1]][2],"")[[1]][4],sep=""),sep=""),dim(df)[1])
#estrago dal nome File il parametro ora di inizio
df$test_time<-rep(strsplit(File,"-")[[1]][3],dim(df)[1])
#estraggo il tipo di percorso
df$path<-rep(strsplit(File,"-")[[1]][4],dim(df)[1])
#estraggo … non ricordo che roba sia ma è importante
df$c<-rep(paste(strsplit(strsplit(File,"-")[[1]][5],"")[[1]][1],
strsplit(strsplit(File,"-")[[1]][5],"")[[1]][2],
strsplit(strsplit(File,"-")[[1]][5],"")[[1]][3],sep=""),dim(df)[1])
#data.frame(df)
}
dataset <- ldply(file_list, FUN=getdf(File))
can you give me an hand ?
thank you in advance