In this post, I provide a simple script for merging a set of files in a directory into a single, large dataset. I recently needed to do this, and it’s very straightforward.
Set the Directory
Begin by setting the current working directory to the one containing all the files that need to be merged:
setwd("target_dir/")
Getting a List of Files in a Directory
Next, it’s just a case of getting a list of the files in the directory. For this, the list.files() function can be used. As I haven’t specified any target directory to list.files(), it just lists the files in the current working directory.
file_list <- list.files()
If you want it to list the files in a different directory, just specify the path to list.files. For example, if you want the files in the folder C:/foo/, you could use the following code:
file_list <- list.files("C:/foo/")
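One thing to watch: list.files() returns bare file names by default, so reading them only works when the working directory is the folder that was listed. Here is a minimal sketch of asking for full paths and filtering by extension. The temporary folder and file names are made up for the demo and are not from the post.

```r
# Hypothetical demo folder; it stands in for "C:/foo/" above
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("a\tb\n1\t2", file.path(demo_dir, "part1.txt"))
writeLines("a\tb\n3\t4", file.path(demo_dir, "part2.txt"))
writeLines("not data", file.path(demo_dir, "notes.md"))

# pattern keeps only names ending in .txt; full.names prepends the directory,
# so the paths can be read without calling setwd() first
file_list <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
print(basename(file_list))
```

With full.names = TRUE the resulting file_list can be passed straight to read.table() from any working directory.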
Merging the Files into a Single Dataframe
The final step is to iterate through the list of files in the current working directory and put them together to form a dataframe.
When the script encounters the first file in file_list, it creates the main dataframe to merge everything into (called dataset here). Whether that has happened yet is checked with the !exists("dataset") conditional:
- If dataset doesn’t exist (!exists("dataset") is TRUE), it is created from the current file.
- If dataset already exists, the current file is read into a temporary dataframe called temp_dataset and appended to dataset with rbind(). The temporary dataframe is then removed with rm(temp_dataset).
Here’s the remainder of the code:
for (file in file_list){
  # if the merged dataset doesn't exist, create it from the first file
  if (!exists("dataset")){
    dataset <- read.table(file, header=TRUE, sep="\t")
  } else {
    # otherwise append the file; 'else' rather than a second 'if',
    # so the first file is not read and appended twice
    temp_dataset <- read.table(file, header=TRUE, sep="\t")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }
}
The Full Code
Here’s the code in its entirety, put together for ease of pasting. I assume there are more efficient ways to do this, but it hasn’t taken long to merge 45 text files totalling about 400MB with some 300,000 rows and 300 columns.
setwd("target_dir/")
file_list <- list.files()
for (file in file_list){
  # if the merged dataset doesn't exist, create it from the first file
  if (!exists("dataset")){
    dataset <- read.table(file, header=TRUE, sep="\t")
  } else {
    # otherwise append the file; 'else' rather than a second 'if',
    # so the first file is not read and appended twice
    temp_dataset <- read.table(file, header=TRUE, sep="\t")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }
}
How About
dataset <- do.call("rbind",lapply(file_list,
FUN=function(files){read.table(files,
header=TRUE, sep="\t")}))
Great! Didn’t know about do.call(). It is certainly very concise, which is always a good thing. Thanks !
This is great. I didn’t know about the list.files() command. Is there a reason you used the rbind() command instead of merge()?
Good question. I went for rbind() so that it would throw an error in case any of my text files had different columns in. All the text files should have an identical column structure, and I’ve often found that it’s good to get any script to scream and shout as loudly as possible whenever there’s an error so that I can catch it and sort it out!
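To illustrate the point about rbind() failing loudly, here is a small sketch with made-up dataframes (the names a, b and c_bad are hypothetical). It assumes the mismatch is a differently named column. merge() is included only for contrast.

```r
a <- data.frame(id = 1:2, score = c(10, 20))
b <- data.frame(id = 3:4, score = c(30, 40))
c_bad <- data.frame(id = 5:6, rating = c(1, 2))  # different column name

# Same columns: the rows are simply stacked
ok <- rbind(a, b)

# Mismatched column names make rbind() stop with an error,
# which tryCatch() turns into a marker value here
result <- tryCatch(rbind(a, c_bad), error = function(e) "rbind failed")

# merge() joins on the shared column (id) instead and raises no error;
# with no ids in common it quietly returns an empty dataframe
joined <- merge(a, c_bad)
print(result)
print(nrow(joined))
```

So for files that are supposed to share one column structure, rbind() surfaces a bad file immediately, whereas merge() could hide it.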
Might come in handy someday, thanks!
Using lapply and do.call is more efficient (== faster). Try merging several files with hundreds of thousands of rows each and see the difference.
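For anyone who wants to try that comparison, here is a rough sketch using system.time(). The test files are generated on the spot and are far smaller than the data described in the post, so the timings it prints say little about real workloads. The folder and variable names are assumptions made for the demo.

```r
# Hypothetical test data: 20 small tab-separated files in a temp folder
demo_dir <- file.path(tempdir(), "merge_timing_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:20) {
  df <- data.frame(id = 1:500, value = rnorm(500))
  write.table(df, file.path(demo_dir, sprintf("f%02d.txt", i)),
              sep = "\t", row.names = FALSE, quote = FALSE)
}
file_list <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Approach 1: grow the dataframe one file at a time
loop_time <- system.time({
  looped <- NULL
  for (file in file_list) {
    looped <- rbind(looped, read.table(file, header = TRUE, sep = "\t"))
  }
})

# Approach 2: read every file into a list, then bind once
docall_time <- system.time({
  combined <- do.call("rbind", lapply(file_list, read.table,
                                      header = TRUE, sep = "\t"))
})

# Elapsed seconds for each approach
print(c(loop = loop_time[["elapsed"]], do_call = docall_time[["elapsed"]]))
```

Both approaches should produce the same rows; only the time taken differs, and the gap is likely to grow with the number and size of the files.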
One of my favorite ways to do this is with ldply from the plyr package:
library(plyr)
file_list <- list.files()
dataset <- ldply(file_list, read.table, header=TRUE, sep="\t")
ldply takes a list and a function as its first two inputs, plus arguments to be passed on to that function, and returns a dataframe.
Thanks to all of you for your helpful comments – I’ve written up a post comparing the different methods suggested in this comment thread here: http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/
Cheers!
I’d like to use this code but substitute read.fwf for read.table. Will you suggest syntax for that command in this context? I tried what seemed obvious:
for (file in file_list){
# if the merged dataset doesn't exist, create it
if (!exists("dataset")){
dataset <- read.fwf(file, width=c(3,8,10,9,100)
}
# if the merged dataset does exist, append to it
if (exists("dataset")){
temp_dataset <-read.fwf(file, width=c(3,8,10,9,100)
dataset<-rbind(dataset, temp_dataset)
rm(temp_dataset)
}
}
> if (exists("dataset")){
+ temp_dataset <-read.fwf(file, width=c(3,8,10,9,100)
+ dataset<-rbind(dataset, temp_dataset)
Error: unexpected symbol in:
" temp_dataset <-read.fwf(file, width=c(3,8,10,9,100)
dataset"
> rm(temp_dataset)
Warning message:
In rm(temp_dataset) : object 'temp_dataset' not found
> }
Error: unexpected '}' in " }"
>
> }
Thank you!
And:
> for (file in file_list){
+
+ # if the merged dataset doesn't exist, create it
+ if (!exists("dataset")){
+ dataset <- read.fwf(file, skip=1,width=c(3,8,10,9,100))
+ }
+
+ # if the merged dataset does exist, append to it
+ if (exists("dataset")){
+ temp_dataset <-read.fwf(file, skip=1,width=c(3,8,10,9,100))
+ dataset<-rbind(dataset, temp_dataset)
+ rm(temp_dataset)
+ }
+
+ }
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
when I get the parens right!
Hi Amanda,
Thanks for your comments – probably the best thing to do would be to have a go at using the methods suggested by the others commenting, or take a look at this post http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/ which tests the different methods in detail. The method I’ve presented in this post is definitely not the best way to achieve this!
I can’t say I’ve ever worked with fixed width files, but as a test, have you tried using two copies of the same file (i.e., two identical files with different names) ? It may be throwing the error you’re getting because the files don’t have the same column widths. The easiest way to test that would be to use two identical copies of the same file. That’s just a guess though!
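One way to act on that guess is to read every file into a list first and compare the column counts before binding anything. The sketch below uses two made-up fixed-width files and made-up widths, not Amanda's data. Building the result from a list also means it does not depend on whether an object called dataset is already sitting in the workspace from an earlier attempt. That is one possible source of a column mismatch, although it is only a guess here.

```r
# Hypothetical fixed-width files; the widths c(3, 8) are invented for the demo
demo_dir <- file.path(tempdir(), "fwf_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("AAA00000001", "BBB00000002"), file.path(demo_dir, "one.txt"))
writeLines(c("CCC00000003", "DDD00000004"), file.path(demo_dir, "two.txt"))
file_list <- list.files(demo_dir, full.names = TRUE)

# Read every file into a list first, without binding anything yet
pieces <- lapply(file_list, read.fwf, widths = c(3, 8))

# One column count per file; any odd one out is the file to inspect
col_counts <- setNames(sapply(pieces, ncol), basename(file_list))
print(col_counts)

# Bind only when every file agrees on the number of columns
if (length(unique(col_counts)) == 1) {
  dataset <- do.call("rbind", pieces)
}
```

If the counts all agree and the error persists in the original loop, removing any leftover object with rm(dataset) before rerunning is worth a try.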
Hi Godwin,
I have several txt files, each containing 3 columns (A, B, C). Column A is common to all the files. I want to combine the files so that column A appears only once, followed by columns B and C from each file. I used cbind, but the output repeats column A, which I don’t want. Here is my R code:
data1 <- read.delim(file.choose(), header=TRUE)
data2 <- read.delim(file.choose(), header=TRUE)
data3 <- cbind(data1, data2)
write.table(data3, file="sample.txt", sep="\t", col.names=NA)
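One possible approach to this question is merge() on column A rather than cbind(), since merge() keeps the key column once. The sketch below uses small made-up dataframes in place of the files picked with file.choose(). The "_file1" style suffixes are just an illustrative naming choice.

```r
# Hypothetical stand-ins for the files chosen with file.choose()
data1 <- data.frame(A = c("x", "y", "z"), B = 1:3, C = 4:6)
data2 <- data.frame(A = c("x", "y", "z"), B = 7:9, C = 10:12)
data3 <- data.frame(A = c("x", "y", "z"), B = 13:15, C = 16:18)
tables <- list(data1, data2, data3)

# Rename B and C per file so the merged columns stay distinguishable
for (i in seq_along(tables)) {
  names(tables[[i]])[-1] <- paste0(names(tables[[i]])[-1], "_file", i)
}

# Join on A one table at a time; A appears only once in the result
combined <- Reduce(function(x, y) merge(x, y, by = "A"), tables)
print(names(combined))
```

Note that merge() keeps only the values of A present in every table by default; passing all = TRUE to merge() would keep unmatched rows and fill the gaps with NA.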
Hi Hayward,
This is exactly what I was looking for – I’ve been generating .csv files daily for some years now and R is a great tool for processing them. I see you’ve found some more efficient methods, but using your original I noticed that both IF statements will trigger on the 1st pass, causing the 1st .csv to be duplicated. The 2nd ‘if’ can be changed to ‘else’ to avoid this problem. I note, however, that Colin’s suggestion on the new post of setting the dataset to NULL improves things further.
Keep up the good work!
Cheers,
Peter
Thanks Peter – glad it was helpful! It was interesting to see the number of different ways to perform what, on the surface, is a very simple function. The updated function in the follow-up post is definitely more efficient as well.
I have tried the code. Rows in the first file are repeated twice.
Yes, that’s a bug. It should be ‘else if’ instead of ‘if’ in
# if the merged dataset does exist, append to it
if (exists("dataset")){
Otherwise the first file gets written twice.
Amazing. Thank you!
I tried Sayan’s method, but I’d rather not have a warning if there are missing data elements as this occurs naturally in my data. Any suggestions for how to adjust
dataset <- do.call("rbind",lapply(file_list,
FUN=function(files){read.table(files,
header=TRUE, sep="\t")}))
So that it continues to append even if there are missing items?
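If "missing data elements" means rows with fewer fields than the header (an assumption, since the question doesn't say), one option is the fill argument of read.table(), which pads short rows with NA. Here is a minimal sketch with two made-up files:

```r
# Hypothetical demo files in a temp folder
demo_dir <- file.path(tempdir(), "fill_demo")
dir.create(demo_dir, showWarnings = FALSE)
# The second data row of the first file is missing its last field
writeLines(c("id\tx\ty", "1\t2\t3", "2\t5"), file.path(demo_dir, "a.txt"))
writeLines(c("id\tx\ty", "3\t6\t7"), file.path(demo_dir, "b.txt"))
file_list <- list.files(demo_dir, full.names = TRUE)

# fill = TRUE pads rows that have too few fields with NA
dataset <- do.call("rbind", lapply(file_list, function(f) {
  read.table(f, header = TRUE, sep = "\t", fill = TRUE)
}))
print(dataset)
```

The trade-off is the one discussed earlier in the thread: with fill = TRUE a genuinely malformed file no longer stops the script, so it is worth checking the NA counts afterwards.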