Learn
Aggregates in R
Calculating Column Statistics

In this exercise, you will learn how to combine all of the values from a column for a single calculation. This can be done with the help of the dplyr function `summarize()`, which returns a new data frame containing the desired calculation.

Some examples of this type of calculation include:

• The data frame `customers` contains the names and ages of all of your customers. You want to find the median age:
``````customers %>%
select(age)
# c(23, 25, 31, 35, 35, 46, 62)
customers %>%
summarize(median_age = median(age))
# 35``````
• The data frame `shipments` contains address information for all shipments that you’ve sent out in the past year. You want to know how many different states you have shipped to.
``````shipments %>%
select(states)
# c('CA', 'CA', 'CA', 'CA', 'NY', 'NY', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ')
shipments %>%
summarize(n_distinct_states = n_distinct(states))
# 3``````
• The data frame `inventory` contains a list of types of t-shirts that your company makes. You want to know the standard deviation of the prices of your inventory.
``````inventory %>%
select(price)
# c(31, 23, 30, 27, 30, 22, 27, 22, 39, 27, 36)
inventory %>%
summarize(sd_price = sd(price))
# 5.465595``````

The general syntax for these calculations is:

``````df %>%
summarize(var_name = command(column_name))``````
• `df` is the data frame you are working with
• `summarize` is a dplyr function that reduces multiple values to a single value
• `var_name` is the name you assign to the column that stores the results of the summary function in the returned data frame
• `command` is the summary function that is applied to the column by `summarize()`
• `column_name` is the name of the column of `df` that is being summarized

The following table includes common summary functions that can be given as an argument to `summarize()`:

Command Description
`mean()` Average of all values in column
`median()` Median value of column
`sd()` Standard deviation of column
`var()` Variance of column
`min()` Minimum value in column
`max()` Maximum value in column
`IQR()` Interquartile range of column
`n_distinct()` Number of unique values in column
`sum()` Sum values of column

### Instructions

1.

ShoeFly.com has a new batch of orders stored in the data frame `orders`. Inspect the first `10` rows if the data frame using `head()`.

2.

Our finance department wants to know the price of the most expensive pair of shoes purchased. Save your answer to the variable `most_expensive`.

3.

Woah, wait a minute! Take a look at the output of the code you just ran. The result for the most expensive pair of shoes is coming back as `NA`. Why is this happening?

If you scroll up in the rendered notebook to where `orders.csv` is loaded, you can see a warning about row `99` of the file. There is a missing column of information! It appears that the `price` for row `99` was not in the file, and this is causing your maximum value calculation to return `NA`.

Add the following as an additional argument to `max()` so that it removes all missing values before computing the maximum value.

``na.rm = TRUE``
4.

Our fashion department wants to know how many different colors of shoes we are selling. Save your answer to the variable `num_colors`.