In this post, I provide a simple script for merging a set of files in a directory into a single, large dataset. I recently needed to do this, and it’s very straightforward.

Setting the Working Directory

Begin by setting the current working directory to the one containing all the files that need to be merged:

setwd("target_dir/")

Getting a List of Files in a Directory

Next, it’s just a case of getting a list of the files in the directory. For this, the list.files() function can be used. As I haven’t specified any target directory to list.files(), it just lists the files in the current working directory.

file_list <- list.files()

If you want to list the files in a different directory, just pass the path to list.files(). For example, to get the files in the folder C:/foo/, you could use the following code:

file_list <- list.files("C:/foo/")

Merging the Files into a Single Dataframe

The final step is to iterate through the list of files in the current working directory and put them together to form a dataframe.

When the script encounters the first file in file_list, it creates the main dataframe that everything else is merged into (called dataset here). This is handled with an !exists() check:

  • If dataset doesn’t exist yet (!exists("dataset") is true), it’s created by reading the current file.
  • If dataset already exists, the current file is read into a temporary dataframe called temp_dataset, which is appended to dataset with rbind() and then removed with rm(temp_dataset).

Note that rbind() assumes every file has the same column names.

Here’s the remainder of the code:

for (file in file_list){

  # if the merged dataset doesn't exist, create it from the current file
  if (!exists("dataset")){
    dataset <- read.table(file, header = TRUE, sep = "\t")

  # if the merged dataset does exist, append the current file to it
  } else {
    temp_dataset <- read.table(file, header = TRUE, sep = "\t")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }

}
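
Once the loop has finished, a quick look at the dimensions and the first few rows of dataset is a simple way to confirm the merge worked as expected:

# quick sanity check on the merged dataframe
dim(dataset)    # number of rows and columns
head(dataset)   # first few rows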

The Full Code

Here’s the code in its entirety, put together for ease of pasting. I assume there are more efficient ways to do this (I sketch one possibility after the listing), but it hasn’t taken long to merge 45 text files totalling about 400MB, with some 300,000 rows and 300 columns.

setwd("target_dir/")

file_list <- list.files()

for (file in file_list){

  # if the merged dataset doesn't exist, create it from the current file
  if (!exists("dataset")){
    dataset <- read.table(file, header = TRUE, sep = "\t")

  # if the merged dataset does exist, append the current file to it
  } else {
    temp_dataset <- read.table(file, header = TRUE, sep = "\t")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }

}
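
As mentioned above, there are probably more efficient ways to do this. One option (just a sketch, assuming the same tab-delimited files with headers as above) is to read every file into a list with lapply() and then bind them all at once with do.call(rbind, ...), which avoids growing the dataframe on every pass through a loop:

setwd("target_dir/")

file_list <- list.files()

# read every file into a list of dataframes, then bind them all in one step
data_list <- lapply(file_list, read.table, header = TRUE, sep = "\t")
dataset   <- do.call(rbind, data_list)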