Data Cleaning in R
Diagnose the Data

We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data?

For data to be tidy, it must have:

  • Each variable as a separate column
  • Each row as a separate observation

For example, we would want to reshape a table like:

Account      Checking   Savings
“12456543”   8500       8900
“12283942”   6410       8020
“12839485”   78000      92000

Into a table that looks more like:

Account      Account Type   Amount
“12456543”   “Checking”     8500
“12456543”   “Savings”      8900
“12283942”   “Checking”     6410
“12283942”   “Savings”      8020
“12839485”   “Checking”     78000
“12839485”   “Savings”      92000
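One way to perform this reshape is with pivot_longer() from the tidyr package. The following is a minimal sketch, assuming tidyr (version 1.0.0 or later) is installed. The lesson does not name a reshaping function at this point, so the choice of pivot_longer(), the data frame name accounts, and the column name Account_Type are all assumptions made for illustration.

```r
library(tidyr)  # assumption: tidyr >= 1.0.0 is installed

# The wide table from the example: one column per account type.
# The name `accounts` is hypothetical.
accounts <- data.frame(
  Account  = c("12456543", "12283942", "12839485"),
  Checking = c(8500, 6410, 78000),
  Savings  = c(8900, 8020, 92000),
  stringsAsFactors = FALSE
)

# Gather the Checking and Savings columns into two new columns:
# Account_Type holds the old column name, Amount holds the value.
tidy_accounts <- pivot_longer(
  accounts,
  cols      = c(Checking, Savings),
  names_to  = "Account_Type",
  values_to = "Amount"
)

print(tidy_accounts)
```

After the reshape, each row is a single observation (one account, one account type, one amount), and each variable has its own column.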

The first step in diagnosing whether a dataset is tidy is to explore and probe it with base R and dplyr functions.

You’ve seen most of the functions we often use to diagnose a dataset for cleaning. Some of the most useful ones are:

  • head() — display the first 6 rows of the table
  • summary() — display the summary statistics of the table
  • colnames() — display the column names of the table
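As a rough illustration of these three functions, the sketch below applies them to a small data frame built from the account table in the example above. The data frame name accounts and the variable column_names are hypothetical, and this is not the grocery data from the exercise.

```r
# Hypothetical data frame for illustration only (not grocery_1 or grocery_2).
accounts <- data.frame(
  Account  = c("12456543", "12283942", "12839485"),
  Checking = c(8500, 6410, 78000),
  Savings  = c(8900, 8020, 92000),
  stringsAsFactors = FALSE
)

# First rows of the table (6 by default; this table only has 3).
first_rows <- head(accounts)
print(first_rows)

# Per-column summary: min, quartiles, mean, and max for numeric columns,
# and length/class/mode for character columns.
print(summary(accounts))

# Column names as a character vector.
column_names <- colnames(accounts)
print(column_names)
```

Here the column names already hint at a problem: Checking and Savings are values of an account-type variable, not variables themselves, which suggests the table is not tidy.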



Provided in notebook.Rmd are two data frames, grocery_1 and grocery_2.

Begin by viewing the head() of both grocery_1 and grocery_2.


Explore the data frames using the other functions listed.

Which data frame is “clean”, tidy, and ready for analysis? Create a variable named clean_data_frame and assign it the value 1 if grocery_1 is a clean and tidy data frame or 2 if grocery_2 is a clean and tidy data frame.
