Skip to Content
Learn
Joining Tables in R
Join on Specific Columns II

In the previous exercise, we learned how to use rename() to join two data frames whose columns don’t match.

A better option, however, exists. We can add the by argument when calling inner_join() to specify which columns we want to join on. In the example below, the “left” table is the one that comes first (orders), and the “right” table is the one that comes second (customers). This syntax says that we should match the customer_id from orders to the id in customers.

orders %>% inner_join(customers, by = c('customer_id' = 'id'))

The resulting data frame will look like this:

id customer_id product_id quantity timestamp customer_name address phone_number
1 2 3 1 2017-01-01 00:00:00 Jane Doe 456 Park Ave 949-867-5309
2 2 2 3 2017-01-01 00:00:00 Jane Doe 456 Park Ave 949-867-5309
3 3 1 1 2017-01-01 00:00:00 Joe Schmo 789 Broadway 112-358-1321
4 3 2 2 2016-02-01 00:00:00 Joe Schmo 789 Broadway 112-358-1321
5 3 3 3 2017-02-01 00:00:00 Joe Schmo 789 Broadway 112-358-1321
6 1 4 2 2017-03-01 00:00:00 John Smith 123 Main St. 212-123-4567
7 1 1 1 2017-02-02 00:00:00 John Smith 123 Main St. 212-123-4567
8 1 4 1 2017-02-02 00:00:00 John Smith 123 Main St. 212-123-4567

If we use this syntax, we’ll end up with two columns called id, one from the first table and one from the second. R won’t let you have two columns with the same name, so it will change them to id_x and id_y.

It will look like this:

id_x customer_id product_id quantity timestamp id_y customer_name address phone_number
1 2 3 1 2017-01-01 00:00:00 2 Jane Doe 456 Park Ave 949-867-5309
2 2 2 3 2017-01-01 00:00:00 2 Jane Doe 456 Park Ave 949-867-5309
3 3 1 1 2017-01-01 00:00:00 3 Joe Schmo 789 Broadway 112-358-1321
4 3 2 2 2016-02-01 00:00:00 3 Joe Schmo 789 Broadway 112-358-1321
5 3 3 3 2017-02-01 00:00:00 3 Joe Schmo 789 Broadway 112-358-1321
6 1 4 2 2017-03-01 00:00:00 1 John Smith 123 Main St. 212-123-4567
7 1 1 1 2017-02-02 00:00:00 1 John Smith 123 Main St. 212-123-4567
8 1 4 1 2017-02-02 00:00:00 1 John Smith 123 Main St. 212-123-4567

The new column names id_x and id_y aren’t very helpful for us when we read the table. We can help make them more useful by using the keyword suffix. We can provide a vector of suffixes to use instead of “_x” and “_y”.

For example, we could use the following code to make the suffixes reflect the table names:

orders %>% inner_join(customers, by = c('customer_id' = 'id'), suffix = c('_order','_customer'))

The resulting table would look like this:

id_order customer_id product_id quantity timestamp id_customer customer_name address phone_number
1 2 3 1 2017-01-01 00:00:00 2 Jane Doe 456 Park Ave 949-867-5309
2 2 2 3 2017-01-01 00:00:00 2 Jane Doe 456 Park Ave 949-867-5309
3 3 1 1 2017-01-01 00:00:00 3 Joe Schmo 789 Broadway 112-358-1321
4 3 2 2 2016-02-01 00:00:00 3 Joe Schmo 789 Broadway 112-358-1321
5 3 3 3 2017-02-01 00:00:00 3 Joe Schmo 789 Broadway 112-358-1321
6 1 4 2 2017-03-01 00:00:00 1 John Smith 123 Main St. 212-123-4567
7 1 1 1 2017-02-02 00:00:00 1 John Smith 123 Main St. 212-123-4567
8 1 4 1 2017-02-02 00:00:00 1 John Smith 123 Main St. 212-123-4567

Instructions

1.

Join the orders and products data frames using an inner_join(), with orders as the first argument and products as the second argument. Also include the by argument to indicate which columns to join on. Save your results to the variable orders_products, and view it.

2.

Now join the products and orders data frames using an inner_join(), with products as the first argument and orders as the second argument. Also include the by argument to indicate which columns to join on, as well as a suffix argument c('_product','_order'). Save your results to the variable products_orders, and view it.

Folder Icon

Take this course for free

Already have an account?