Data Cleaning in R
Dealing with Duplicates

Often we see duplicated rows of data in the data frames we are working with. This could happen due to errors in data collection or in saving and loading the data.

To check for duplicates, we can use the base R function duplicated(), which will return a logical vector telling us which rows are duplicate rows.

Let’s say we have a data frame fruits that represents this table:

item          price    calories
"banana"      "$1"     105
"apple"       "$0.75"  95
"apple"       "$0.75"  95
"peach"       "$3"     55
"peach"       "$4"     55
"clementine"  "$2.5"   35

If we call fruits %>% duplicated(), we would get the following vector:

>> [1] FALSE FALSE TRUE FALSE FALSE FALSE

We can see that the third row, which represents an "apple" with price "$0.75" and 95 calories, is a duplicate row: every value in it matches the previous row. Note that duplicated() flags only the repeat. The first occurrence (the second row) is still FALSE.
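The check above can be reproduced with a small sketch. The fruits data frame here is rebuilt by hand from the table in this lesson, and dup_flags is just an illustrative name, not something the lesson defines.

```r
library(dplyr)

# Rebuild the example table by hand (an assumption: the lesson does not
# show how fruits was actually created).
fruits <- data.frame(
  item     = c("banana", "apple", "apple", "peach", "peach", "clementine"),
  price    = c("$1", "$0.75", "$0.75", "$3", "$4", "$2.5"),
  calories = c(105, 95, 95, 55, 55, 35)
)

# duplicated() returns one logical value per row; TRUE marks a row that
# repeats an earlier row in every column.
dup_flags <- fruits %>% duplicated()
dup_flags

# Use the logical vector to look at the flagged rows themselves.
fruits[dup_flags, ]
```

Indexing with the logical vector is a quick way to inspect what would be dropped before removing anything.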

We can use the dplyr distinct() function to remove duplicate rows from a data frame, keeping only the first occurrence of each repeated row.

If we call fruits %>% distinct(), we would get the table:

item          price    calories
"banana"      "$1"     105
"apple"       "$0.75"  95
"peach"       "$3"     55
"peach"       "$4"     55
"clementine"  "$2.5"   35

The second "apple" row was removed because it was exactly the same as another row. The two "peach" rows remain because they differ in the price column.
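As a rough sketch of the same step, assuming fruits is rebuilt by hand from the table in this lesson (unique_fruits is an illustrative name):

```r
library(dplyr)

# Hand-built copy of the example table (assumed, not from the lesson's files).
fruits <- data.frame(
  item     = c("banana", "apple", "apple", "peach", "peach", "clementine"),
  price    = c("$1", "$0.75", "$0.75", "$3", "$4", "$2.5"),
  calories = c(105, 95, 95, 55, 55, 35)
)

# With no columns named, distinct() compares whole rows and keeps the
# first copy of each repeated row.
unique_fruits <- fruits %>% distinct()

nrow(fruits)         # 6 rows before
nrow(unique_fruits)  # 5 rows after: one "apple" row is gone
unique_fruits
```

Both "peach" rows survive because distinct() only drops rows that match in every column.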

If we wanted to remove every row with a duplicate value in the item column, we could pass that column to distinct(), with .keep_all = TRUE so that the other columns are retained:

fruits %>% distinct(item, .keep_all = TRUE)

By default, this keeps the first occurrence of the duplicate:

item          price    calories
"banana"      "$1"     105
"apple"       "$0.75"  95
"peach"       "$3"     55
"clementine"  "$2.5"   35

Make sure that the columns you drop duplicates from are specifically the ones where duplicates don’t belong. You wouldn’t want to drop duplicates with the price column as a subset, for example, because it’s okay if multiple items cost the same amount!
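To illustrate what the column argument and .keep_all do, here is a minimal sketch. It assumes fruits is rebuilt by hand from the table in this lesson, and by_item and items_only are illustrative names.

```r
library(dplyr)

# Hand-built copy of the example table (assumed, not from the lesson's files).
fruits <- data.frame(
  item     = c("banana", "apple", "apple", "peach", "peach", "clementine"),
  price    = c("$1", "$0.75", "$0.75", "$3", "$4", "$2.5"),
  calories = c(105, 95, 95, 55, 55, 35)
)

# Deduplicate on item only; .keep_all = TRUE keeps price and calories
# from the first row seen for each item.
by_item <- fruits %>% distinct(item, .keep_all = TRUE)
by_item

# Without .keep_all, only the columns named in distinct() are returned.
items_only <- fruits %>% distinct(item)
names(items_only)
```

Note that the "$4" peach row is silently discarded here, which is why the choice of columns matters.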

Instructions

1.

The students data frame has a column id that is neither unique nor required for our analysis. Drop the id column from the data frame and save the result to students. View the head() of students.

2.

It seems like in the data collection process, some rows may have been recorded twice. Use the duplicated() function on the students data frame to make a vector object called duplicates.

3.

table() is a base R function that takes an object that can be interpreted as a factor, such as a logical vector, and returns a table with the counts of each unique value in the object.

Pipe the result from the previous checkpoint into table() to see how many rows are exact duplicates. Make sure to save the result to duplicates, and view duplicates.

4.

Update the value of students to be the students data frame with only unique/distinct rows.

5.

Use the duplicated() function again to make an object called updated_duplicates after dropping the duplicates. Pipe the result into table() to see if any duplicates remain, and view updated_duplicates. Are there any TRUEs left?
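The check, count, remove, and re-check workflow from these steps can be sketched on the fruits example from the lesson rather than on students, whose contents are not shown here. The fruits data frame is rebuilt by hand and is only a stand-in.

```r
library(dplyr)

# Stand-in data: the lesson's fruits table, rebuilt by hand (an assumption).
fruits <- data.frame(
  item     = c("banana", "apple", "apple", "peach", "peach", "clementine"),
  price    = c("$1", "$0.75", "$0.75", "$3", "$4", "$2.5"),
  calories = c(105, 95, 95, 55, 55, 35)
)

# Count how many rows are flagged as exact duplicates.
duplicates <- fruits %>% duplicated() %>% table()
duplicates

# Keep only distinct rows, then run the same check again.
fruits <- fruits %>% distinct()
updated_duplicates <- fruits %>% duplicated() %>% table()
updated_duplicates
```

If the second table has no TRUE entry, no exact duplicate rows remain.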
