Data Cleaning in R
Splitting By Character

Let’s say we have a column called "type" with data entries in the format "admin_US" or "user_Kenya", as shown in the table below.

id type
1011 “user_Kenya”
1112 “admin_US”
1113 “moderator_UK”

Just like we saw before, this column actually contains two types of data. One seems to be the user type (with values like “admin” or “user”) and one seems to be the country this user is in (with values like “US” or “Kenya”).

We can no longer just split after the first 4 characters, because “admin” and “user” are of different lengths. Instead, we want to split on the "_". We can thus use the tidyr function separate() to split this column into two separate columns:

# Create the 'user_type' and 'country' columns
df %>% separate(type,c('user_type','country'),'_')
  • type is the column to split
  • c('user_type','country') is a vector with the names of the two new columns
  • '_' is the character to split on
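
To make the call above concrete, here is a minimal runnable sketch. The data frame df is a hypothetical stand-in built from the example table, and dplyr is loaded only to supply the %>% pipe; neither is part of the lesson's own setup.

```r
library(dplyr)  # supplies the %>% pipe
library(tidyr)  # supplies separate()

# Hypothetical data frame mirroring the example table
df <- data.frame(
  id = c(1011, 1112, 1113),
  type = c("user_Kenya", "admin_US", "moderator_UK"),
  stringsAsFactors = FALSE
)

# Split 'type' on the underscore into 'user_type' and 'country'
df_split <- df %>%
  separate(type, c("user_type", "country"), "_")

print(df_split)
```

Assigning the result to a new name (df_split here) leaves the original df unchanged, which makes it easy to compare the two.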

This would transform the table above into a table like:

id user_type country
1011 “user” “Kenya”
1112 “admin” “US”
1113 “moderator” “UK”
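
Note that separate() drops the column being split by default (remove = TRUE). If the original type column should stay alongside the new ones, one option is to pass remove = FALSE. The sketch below assumes a hypothetical df built from the example table.

```r
library(dplyr)  # supplies the %>% pipe
library(tidyr)  # supplies separate()

# Hypothetical data frame mirroring the example table
df <- data.frame(
  id = c(1011, 1112, 1113),
  type = c("user_Kenya", "admin_US", "moderator_UK"),
  stringsAsFactors = FALSE
)

# remove = FALSE keeps the original 'type' column next to the new columns
df_kept <- df %>%
  separate(type, c("user_type", "country"), sep = "_", remove = FALSE)

print(df_kept)
```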

Instructions

1. View the head() of students. Notice that the students’ names are stored in a column called full_name.

2. Separate the full_name column into two new columns, first_name and last_name, by splitting on the ' ' character.

Pass extra = 'merge' as an additional argument to separate(). This ensures that middle names or two-word last names all end up in the last_name column.

Save the result to students, and view the head().
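
The effect of extra = 'merge' can be sketched on a small made-up data frame. The name students_demo and its contents are hypothetical and are not the lesson's students data. With the default setting (extra = 'warn'), separate() would instead drop the surplus pieces and emit a warning.

```r
library(dplyr)  # supplies the %>% pipe
library(tidyr)  # supplies separate()

# Hypothetical stand-in for the students data
students_demo <- data.frame(
  full_name = c("Ada Lovelace", "Gabriel Garcia Marquez", "Mary Ann Evans"),
  stringsAsFactors = FALSE
)

# Split on the first space only; everything after it goes into last_name
students_split <- students_demo %>%
  separate(full_name, c("first_name", "last_name"), sep = " ", extra = "merge")

print(head(students_split))
```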
