# Data Cleaning in R
## Dealing with Duplicates

We often see duplicated rows in the data frames we are working with. This can happen due to errors in data collection, or in saving and loading the data.

To check for duplicates, we can use the base R function `duplicated()`, which will return a logical vector telling us which rows are duplicate rows.

Let’s say we have a data frame `fruits` that represents this table:

| item | price | calories |
| --- | --- | --- |
| "banana" | "$1" | 105 |
| "apple" | "$0.75" | 95 |
| "apple" | "$0.75" | 95 |
| "peach" | "$3" | 55 |
| "peach" | "$4" | 55 |
| "clementine" | "$2.5" | 35 |

If we call `fruits %>% duplicated()`, we would get the following vector:

```
>> [1] FALSE FALSE  TRUE FALSE FALSE FALSE
```

We can see that the third row, which represents an `"apple"` with price `"$0.75"` and `95` calories, is a duplicate row. Every value in this row is the same as in another row (the previous row).
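To make this concrete, here is a minimal sketch that rebuilds the `fruits` data frame and checks it for duplicates (the `%>%` pipe assumes dplyr, or magrittr, is loaded):

```r
library(dplyr)  # provides the %>% pipe

# Rebuild the fruits table from above
fruits <- data.frame(
  item = c("banana", "apple", "apple", "peach", "peach", "clementine"),
  price = c("$1", "$0.75", "$0.75", "$3", "$4", "$2.5"),
  calories = c(105, 95, 95, 55, 55, 35)
)

# duplicated() flags each row that repeats an earlier row
fruits %>% duplicated()
# >> [1] FALSE FALSE  TRUE FALSE FALSE FALSE
```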

We can use the dplyr `distinct()` function to remove all rows of a data frame that are duplicates of another row.

If we call `fruits %>% distinct()`, we would get the table:

| item | price | calories |
| --- | --- | --- |
| "banana" | "$1" | 105 |
| "apple" | "$0.75" | 95 |
| "peach" | "$3" | 55 |
| "peach" | "$4" | 55 |
| "clementine" | "$2.5" | 35 |

The `"apple"` row was deleted because it was exactly the same as another row. But the two `"peach"` rows remain because there is a difference in the price column.
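Assuming the `fruits` data frame defined as above, a quick sketch of calling `distinct()` with no arguments:

```r
library(dplyr)

fruits <- data.frame(
  item = c("banana", "apple", "apple", "peach", "peach", "clementine"),
  price = c("$1", "$0.75", "$0.75", "$3", "$4", "$2.5"),
  calories = c(105, 95, 95, 55, 55, 35)
)

# With no column arguments, distinct() compares entire rows,
# so only the exact-duplicate apple row is dropped
fruits %>% distinct()
```

The result has five rows: both `"peach"` rows survive because their prices differ.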

If we wanted to remove rows that have a duplicate value in just the `item` column, we could specify a `subset`:

```r
fruits %>%
  distinct(item, .keep_all = TRUE)
```

By default, this keeps the first occurrence of the duplicate:

| item | price | calories |
| --- | --- | --- |
| "banana" | "$1" | 105 |
| "apple" | "$0.75" | 95 |
| "peach" | "$3" | 55 |
| "clementine" | "$2.5" | 35 |

Make sure that the columns you drop duplicates from are specifically the ones where duplicates don’t belong. You wouldn’t want to drop duplicates with the `price` column as a subset, for example, because it’s okay if multiple items cost the same amount!
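As a hypothetical illustration of that pitfall, suppose a small price list (the `snacks` data frame here is made up for this example) where two different items legitimately cost the same amount:

```r
library(dplyr)

# Hypothetical data: two different items share a price
snacks <- data.frame(
  item = c("pretzel", "popcorn", "soda"),
  price = c("$2", "$2", "$3")
)

# Deduplicating on price alone silently drops popcorn,
# even though it is a genuinely different item
snacks %>% distinct(price, .keep_all = TRUE)
```

Only `"pretzel"` and `"soda"` remain, so real data has been lost.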

### Instructions

1.

The `students` data frame has a column `id` that is neither unique nor required for our analysis. Drop the `id` column from the data frame and save the result to `students`. View the `head()` of `students`.

2.

It seems like in the data collection process, some rows may have been recorded twice. Use the `duplicated()` function on the `students` data frame to make a vector object called `duplicates`.

3.

`table()` is a base R function that takes any R object as an argument and returns a table with the counts of each unique value in the object.

Pipe the result from the previous checkpoint into `table()` to see how many rows are exact duplicates. Make sure to save the result to `duplicates`, and view `duplicates`.
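For example, on a small logical vector (this `duplicates` vector is just illustrative, not the exercise's actual data), `table()` counts the `FALSE`s and `TRUE`s:

```r
# table() counts each unique value in a vector
duplicates <- c(FALSE, FALSE, TRUE, FALSE, TRUE)
table(duplicates)
# >> duplicates
# >> FALSE  TRUE
# >>     3     2
```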

4.

Update the value of `students` to be the `students` data frame with only unique/distinct rows.

5.

Use the `duplicated()` function again to make an object called `updated_duplicates` after dropping the duplicates. Pipe the result into `table()` to see if any duplicates remain, and view `updated_duplicates`. Are there any `TRUE`s left?
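The five checkpoints above could be sketched as follows, assuming a hypothetical `students` data frame with an `id` column (the columns and values in the actual exercise will differ):

```r
library(dplyr)

# Hypothetical stand-in for the exercise's students data frame
students <- data.frame(
  id = c(1, 2, 2),
  name = c("Ana", "Ben", "Ben"),
  score = c(88, 92, 92)
)

# 1. Drop the id column and view the result
students <- students %>% select(-id)
head(students)

# 2. & 3. Flag duplicate rows and count them with table()
duplicates <- students %>% duplicated() %>% table()
duplicates

# 4. Keep only distinct rows
students <- students %>% distinct()

# 5. Confirm no duplicates remain
updated_duplicates <- students %>% duplicated() %>% table()
updated_duplicates
```

If the cleaning worked, the final table should contain no `TRUE` entries.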