
I previously posted a method I used for merging a set of files into a dataframe. It wasn’t long before I had some very helpful comments from the R-bloggers community suggesting better methods to achieve my goal. In this post, I compare the different methods and see which is the most efficient (i.e., fastest).

The Methods

My original method is outlined in my post. In the comments, you can see two further methods suggested: one by sayan uses the do.call function together with lapply, and a second by dan uses plyr's ldply function. Check out the comments for the full discussion.

I will therefore compare three methods:

  • My original method
  • sayan's lapply method
  • dan's plyr method


I ran each of the three methods 10 times (not hugely powerful I know, but it still took a while). For testing purposes, I merged two 16MB text files together, containing several thousand rows and several hundred columns. Having not done any real amount of timing in R before, I searched around a bit. In the end, I found two posts which I based my timings on (here and here). If I’ve done this incorrectly, let me know and I will run them again. Anyway, here are the results. The error bars are standard errors. The time taken is in seconds.
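To make the timing setup concrete, here is a minimal sketch of how repeated timings can be collected with replicate and system.time, then summarised as a mean with a standard error. The expensive_call function below is just a stand-in workload for illustration, not one of the three merge methods from the post.

```r
# Minimal timing sketch: time a stand-in task N times and
# summarise with mean and standard error (SE = sd / sqrt(N)).
N <- 10
expensive_call <- function() {
  x <- replicate(50, sort(runif(1e4)))  # stand-in workload
  invisible(x)
}

# system.time(...)[3] is the "elapsed" (wall-clock) time in seconds
times <- replicate(N, system.time(expensive_call())[3])

mean_time <- mean(times)
se_time   <- sd(times) / sqrt(N)
cat(sprintf("mean = %.3fs, SE = %.3fs\n", mean_time, se_time))
```

The same pattern applies to each merge method: wrap the merge expression in system.time, take element 3 (elapsed time), and replicate it N times.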
As you can see, my method is by far the slowest. This makes sense: growing a data frame by calling rbind inside a loop re-copies all of the accumulated data on every iteration. Looks like I won't be using it ever again!

The R Code

For maximum transparency, below is the R code I used to get these numbers. N is the number of repetitions (10 here), and file_list is a character vector of the paths to the files being merged.
# lapply method: read every file, then bind them in a single do.call(rbind, ...)
lap <- replicate(N, system.time(
  full_data <- do.call(rbind, lapply(file_list, function(x) {
    read.table(x, header=TRUE, sep="\t")})))[3])

# plyr method: ldply reads each file and returns one combined data frame
library(plyr)
ply <- replicate(N, system.time(
  dataset <- ldply(file_list, read.table, header=TRUE, sep="\t"))[3])

# original method: grow the data frame one file at a time with rbind
orig <- replicate(N, system.time({
  if (exists("dataset")) rm(dataset)  # start each repetition fresh
  for (file in file_list){
    # if the merged dataset doesn't exist, create it
    if (!exists("dataset")){
      dataset <- read.table(file, header=TRUE, sep="\t")
    } else {
      # if the merged dataset does exist, append to it
      temp_dataset <- read.table(file, header=TRUE, sep="\t")
      dataset <- rbind(dataset, temp_dataset)
      rm(temp_dataset)
    }
  }
})[3])