We’re going to walk through how to merge or combine data frames in R.
This article continues the examples started in our data frame tutorial . We’re using the ChickWeight data frame example which is included in the standard R distribution. You can easily get to this by typing: data(ChickWeight) in the R console. This data frame captures the weight of chickens that were fed different diets over a period of 21 days. If you can imagine someone walking around a research farm with a clipboard for an agricultural experiment, you’ve got the right idea….
This series has a couple of parts – feel free to skip ahead to the most relevant parts.
- Inspecting your data
- Ways to Select a Subset of Data From an R Data Frame
- Create an R Data Frame
- Sort an R Data Frame
- Add and Remove Columns
- Renaming Columns
- Add and Remove Rows
- Merge Two Data Frames
This section of our tutorial is going to deal with how to combine data frames in R. This allows us to take multiple data frames with a different character vector count and combine them into one merged data frame without even needing the dplyr package. There are many ways to combine multiple dataframes, from the rbind function to left outer join to logical vector combinations, but to minimize your work and ensure that every key column and variable from your multiple datasets are combined correctly, there are three main techniques we are going to look at:
- cbind() – combining the columns of two data frames side-by-side
- rbind() – stacking two data frames on top of each other, appending one to the other
- merge() – joining two data frames using a common column
Using cbind() to merge two R data frames
We will start with the cbind() R function . This a simple way to join multiple datasets in R where the rows are in the same order and the number of records are the same. This means we don’t have any remaining columns out of place after merging multiple data frames because the left data frame and the right data frame have the same number of key variable names and no na value in the specified columns. We can select each data file and not have to create a new column at all when merging datasets with this method.
Syntax is straightforward – we’re going to use two imaginary data frames here, chicken and eggs:
# combine two datasets in reverything <-cbind(chicken, eggs)
The final result of this operation is the two data frames appended side by side. It is recommended but not required that the two data frames have the same number of rows. In the event one data frame is shorter than the other, R will recycle the values of the smaller data frame to fill the missing space.
Now, if you need to do a more complicated merge, read below. We will discuss how tomerge data frames by multiple columns, set up complex joins to handle missing values, and merge using fields with different row names. Our base case assumes you’re dealing with the same variables with minimal drama from things such as missing values.
Using rbind() to merge two R data frames
We’ve encountered rbind() before, when appending rows to a data frame. This function stacks the two data frames on top of each other, appending the second data frame to the first.
For this function to operate, both data frames need to have the same number of columns and the same column names.
Using Merge to join Two Data Frames by A Common Field
This is one of the more common applications of merging two different but related data frames. We covered a simple version of this already in our example of setting buckets and flags,where we used R code to set the value of a flag. But we usually need to integrate a much larger amount of data.
For this example, assume we have a large data frame containing a detailed nutritional analysis of each diet formula, assembled via laboratory testing each sample for a variety of nutritional components, vitamins, and minerals. We want to append this to our weights data frame using the diet id as a common key.
# merge two data frames in r# r merge by rownamesjointdataset <- merge(ChickWeight, LabResults, by = 'Diet')
Implementing more complicated merges
The merge operation will return a data frame that contains all records which can be matched between the two datasets. If you wanted to join a data frame on two fields, perhaps based on a daily analysis of what the chicks are fed, you could set up something like the following:
# merge two data frames in r# r merge data frames by multiple columnsjointdataset <- merge(ChickWeight, LabResults, by = c('Diet','Time'))
This would match the records using the two fields.
When it comes to seeing what records are returned from the merge, you have options beyond the default criteria (the equivalent of an SQL inner join, returning only records which match both data frames). You can specify an additional parameter ‘all’ which controls which records are returned.
- Outer Join – returns all records including no-matches (in either direction)
- Left / Right Join – returns all records in on data frame and matching records in the other (specify left dataframe or right dataframe)
- Cross Join – returns all possible combination of the rows in the two data frames; this is the infamous Cartesian join in SQL
The R code implementation of these additional joins:
# join datasets in rOuter Join: jointdataset <- merge(ChickWeight, LabResults,by = 'Diet', all=TRUE)Left Join: jointdataset <- merge(ChickWeight, LabResults, by = 'Diet', all.x= TRUE)Right Join: jointdataset <- merge(ChickWeight, LabResults, by = 'Diet', all.y=TRUE)Cross Join: jointdataset <- merge(ChickWeight, LabResults, by = Null)
Finally, in the event the two columns you want to merge on have different names, this can be addressed by adjusting your ‘by’ parameter to handle each one separately. Sample code looks like:
# r merge by rownames - merge by different rownamesjointdataset <-merge (ChickWeight, LabResults, by.x = 'Diet', by.y ='Sample')
Collectively, these options for merging an R dataframe replicate the core of SQL’s join function. You can use subset selection and other operations to implement filters as needed.
For complicated merge efforts, we recommend you abstract the r code into a repeatable merge function and generate a new data frame. This makes your project easier to troubleshoot.
Common Errors & Missing Value Situations
Two common issues with managing a merge.
First, there is the problem of using a field with missing value records for the join column. This breaks any simple join algorithm (such as used by the merge command). You’re going to need to look at using alternative fields as your join key, potentially combining several items in the data set as a lookup table. For example, you may accept a matching row based on one of several specified columns in the dataset to do the merge.
The next issue is that of duplicate column names. This trips you up in syntax references; be sure to indicate which table you are referring to.
In the event you need to review another section of the tutorial…
- Inspecting your data
- Ways to Select a Subset of Data From an R Data Frame
- Create an R Data Frame
- Sort an R Data Frame
- Add and Remove Columns
- Renaming Columns
- Add and Remove Rows
- Merge Two Data Frames
Topic: r merge data frames, r merge by index