Linear Regression in R
Assumptions of Simple Linear Regression

While linear regression is perhaps the most widely applied method in data science, it relies on a strict set of assumptions about the relationship between predictor and outcome variables. The most obvious (but crucial!) assumption is a linear relationship between the predictor and the outcome. One key condition follows from this assumption, and it must be checked for any variables we want to include before building a model:

The expected value of the outcome variable must be a straight-line function of the predictor variable alone. The most straightforward check for this relationship is visual: we can plot the predictor and outcome variables as a scatterplot. A linear relationship will resemble a straight line with a nonzero slope, like the relationship between spending on TV ads and the overall sales volume of the related product found in our `advertising` dataset.
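As a rough sketch of this visual check, the snippet below draws a scatterplot with `ggplot2`. The `advertising` data is not reproduced here, so the code builds a small simulated stand-in called `ads_demo`. Its column names `TV` and `Sales` mirror the ones used in this lesson, but the values are made up purely for illustration.

```r
library(ggplot2)

# Simulated stand-in for the advertising data (values are invented).
set.seed(42)
ads_demo <- data.frame(TV = runif(100, min = 0, max = 300))
ads_demo$Sales <- 7 + 0.05 * ads_demo$TV + rnorm(100, sd = 2)

# Scatterplot of the predictor (x) against the outcome (y).
tv_sales_plot <- ggplot(ads_demo, aes(x = TV, y = Sales)) +
  geom_point() +
  labs(x = "TV ad spending", y = "Sales")

tv_sales_plot
```

Points that cluster around an upward- or downward-sloping line support the linearity assumption; a curved or shapeless cloud of points does not.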

We can also quantitatively test for a linear relationship by computing the correlation coefficient. The correlation coefficient always lies between `-1` and `1`. A coefficient close to `0` (roughly between `-0.20` and `0.20`) suggests a weak linear relationship between two variables. A coefficient closer to `1` or `-1` suggests a stronger linear relationship. In R, we can compute the correlation coefficient using the `cor.test()` function as follows:

```r
coefficient <- cor.test(advertising$TV, advertising$Sales)
coefficient$estimate
# Output:
# 0.837
```
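To get a feel for how the coefficient behaves, here is a minimal sketch using two small made-up vectors rather than the `advertising` data. The names `x`, `y_linear`, and `y_curved` are hypothetical and exist only for this illustration.

```r
# Made-up predictor values.
x <- 1:20

# Outcome that follows a straight line, with a small alternating wiggle.
y_linear <- 2 * x + rep(c(-1, 1), times = 10)

# Outcome that depends on x through a symmetric curve, not a line.
y_curved <- (x - mean(x))^2

strong <- cor.test(x, y_linear)
weak <- cor.test(x, y_curved)

strong$estimate  # close to 1: strong linear relationship
weak$estimate    # essentially 0: no linear relationship
```

The second case is a reminder that a coefficient near zero only rules out a linear pattern: `y_curved` is completely determined by `x`, yet the two are linearly uncorrelated. This is one reason to look at a scatterplot as well as the coefficient.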

### Instructions

1. Load `conversion.csv` into the working environment using `read.csv()`. Save the result to a variable called `conversion`, and don’t forget to set the `header` parameter to `TRUE`!

2. A good statistical workflow always involves a thorough understanding of the data available to model and a qualitative analysis of the relevant variables. Use `str()` to print the structure of the dataset, including the list of variable types. Which variables seem like possible predictors of purchase, or `total_convert`?

3. Use a combination of the base `ggplot()` function and `geom_bar()` to plot the distribution of the `clicks` variable, a measure of how many times a user clicked on an advertisement. Save the result to a variable called `clicks_dist`. Call `clicks_dist` to display the plot.

4. Take a closer look at the `clicks_dist` visualization. What is the approximate range of the `clicks` variable? What seems like the most common value (otherwise called the mode) of `clicks`? Set `clicks_mode` equal to the approximate value of the mode of `clicks`.

5. Assign the result of calling `cor.test()`, with `conversion$total_convert` and `conversion$clicks` as arguments, to a variable called `correlation`. Print out `correlation$estimate`. Does the coefficient value suggest that the variables have a linear relationship?
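The steps above could be sketched roughly as follows. Since `conversion.csv` is not included here, the snippet first writes a tiny made-up table to a temporary file so that it runs on its own. The values and the `demo_path` name are hypothetical; only the `clicks` and `total_convert` column names come from the instructions.

```r
library(ggplot2)

# Write a small invented dataset to a temporary CSV (stand-in for conversion.csv).
demo_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    clicks = c(1, 2, 2, 3, 3, 3, 4, 5, 6, 8),
    total_convert = c(0, 1, 0, 1, 1, 2, 1, 2, 3, 3)
  ),
  demo_path,
  row.names = FALSE
)

# Step 1: load the data, treating the first row as column names.
conversion <- read.csv(demo_path, header = TRUE)

# Step 2: inspect the structure and variable types.
str(conversion)

# Step 3: bar chart of how often each clicks value occurs.
clicks_dist <- ggplot(conversion, aes(x = clicks)) +
  geom_bar()
clicks_dist

# Step 4: the tallest bar marks the mode; table() gives the same counts.
clicks_mode <- as.numeric(names(which.max(table(conversion$clicks))))

# Step 5: correlation between purchases and clicks.
correlation <- cor.test(conversion$total_convert, conversion$clicks)
correlation$estimate
```

With the real file, only the `read.csv()` path changes; the mode and the coefficient will of course differ from those of this invented table.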