Skip to Content
Learn
Aggregates in R
Calculating Column Statistics

In this exercise, you will learn how to combine all of the values from a column for a single calculation. This can be done with the help of the dplyr function summarize(), which returns a new data frame containing the desired calculation.

Some examples of this type of calculation include:

  • The data frame customers contains the names and ages of all of your customers. You want to find the median age:
customers %>% select(age) # c(23, 25, 31, 35, 35, 46, 62) customers %>% summarize(median_age = median(age)) # 35
  • The data frame shipments contains address information for all shipments that you’ve sent out in the past year. You want to know how many different states you have shipped to.
shipments %>% select(states) # c('CA', 'CA', 'CA', 'CA', 'NY', 'NY', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ') shipments %>% summarize(n_distinct_states = n_distinct(states)) # 3
  • The data frame inventory contains a list of types of t-shirts that your company makes. You want to know the standard deviation of the prices of your inventory.
inventory %>% select(price) # c(31, 23, 30, 27, 30, 22, 27, 22, 39, 27, 36) inventory %>% summarize(sd_price = sd(price)) # 5.465595

The general syntax for these calculations is:

df %>% summarize(var_name = command(column_name))
  • df is the data frame you are working with
  • summarize is a dplyr function that reduces multiple values to a single value
  • var_name is the name you assign to the column that stores the results of the summary function in the returned data frame
  • command is the summary function that is applied to the column by summarize()
  • column_name is the name of the column of df that is being summarized

The following table includes common summary functions that can be given as an argument to summarize():

Command Description
mean() Average of all values in column
median() Median value of column
sd() Standard deviation of column
var() Variance of column
min() Minimum value in column
max() Maximum value in column
IQR() Interquartile range of column
n_distinct() Number of unique values in column
sum() Sum values of column

Instructions

1.

ShoeFly.com has a new batch of orders stored in the data frame orders. Inspect the first 10 rows if the data frame using head().

2.

Our finance department wants to know the price of the most expensive pair of shoes purchased. Save your answer to the variable most_expensive.

3.

Woah, wait a minute! Take a look at the output of the code you just ran. The result for the most expensive pair of shoes is coming back as NA. Why is this happening?

If you scroll up in the rendered notebook to where orders.csv is loaded, you can see a warning about row 99 of the file. There is a missing column of information! It appears that the price for row 99 was not in the file, and this is causing your maximum value calculation to return NA.

Add the following as an additional argument to max() so that it removes all missing values before computing the maximum value.

na.rm = TRUE
4.

Our fashion department wants to know how many different colors of shoes we are selling. Save your answer to the variable num_colors.

Folder Icon

Take this course for free

Already have an account?