Skip to Content
Learn
Data Cleaning in R
Splitting By Index

In trying to get clean data, we want to make sure each column represents one type of measurement. Often, multiple measurements are recorded in the same column, and we want to separate these out so that we can do individual analysis on each variable.

Let’s say we have a column “birthday” with data formatted in MMDDYYYY format. In other words, “11011993” represents a birthday of November 1, 1993. We want to split this data into day, month, and year so that we can use these columns as separate features.

In this case, we know the exact structure of these strings. The first two characters will always correspond to the month, the second two to the day, and the rest of the string will always correspond to year. We can easily break the data into three separate columns by splitting the strings into substrings using str_sub(), a helpful function from the stringr package:

# Create the 'month' column df %>% mutate(month = str_sub(birthday,1,2)) # Create the 'day' column df %>% mutate(day = str_sub(birthday,3,4)) # Create the 'year' column df %>% mutate(year = str_sub(birthday,5))
  • The first command takes the characters starting at index 1 and ending at index 2 of each value in the birthday column and puts it into a month column.
  • The second command takes the characters starting at index 3 and ending at index 4 of each value in the birthday column and puts it into a day column.
  • The third command takes the characters starting at index 5 and ending at the end of the value in the birthday column and puts it into a year column.

This would transform a table like:

id birthday
1011 “12241989”
1112 “10311966”
1113 “01052011”

into a table like:

id birthday month day year
1011 “12241989” “12” “24” “1989”
1112 “10311966” “10” “31” “1966”
1113 “01052011” “01” “05” “2011”

We will practice changing string columns into numerical columns (like converting "10" to 10) in a future exercise.

Instructions

1.

Print out the columns of the students data frame.

2.

The column gender_age sounds like it contains both gender and age!

View the head() of students to see what kind of data gender_age contains.

3.

It looks like the first character of the values in gender_age contains the gender, while the rest of the string contains the age. Let’s separate out the gender data into a new column called gender. Save the result to students, and view the head().

4.

Now, separate out the age data into a new column called age. Save the updated data frame to students, and view the head().

5.

Now, we don’t need that gender_age column anymore. Drop gender_age from students, and save the result to students. View the head() of students.

Folder Icon

Take this course for free

Already have an account?