Math 430: Lecture 5a

Linear Regression Conditions

Professor Catalina Medina

We discussed how simple linear regression can be used for two different goals.

Prediction

  • We want the fitted line so we can use it on future data

Inference

  • Estimate the relationship between \(Y\) and \(X\) (confidence interval)
  • Test if there is a linear relationship between \(Y\) and \(X\) (hypothesis test)
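In R these two goals map onto familiar functions. A minimal sketch, assuming the elephants data from last week is loaded (the height value 250 is arbitrary, for illustration only):

library(ggplot2)

# Fit the simple linear regression
m <- lm(Mass ~ Height, data = elephants)

# Prediction: apply the fitted line to new data
predict(m, newdata = data.frame(Height = 250))

# Inference: confidence intervals for the coefficients,
# and a t-test for a linear relationship in the summary output
confint(m)
summary(m)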

Recall our linear regression from last week

library(ggplot2)

ggplot(elephants, aes(x = Height, y = Mass)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # overlay the least squares line
  labs(x = "Height (cm)", y = "Mass (kg)")

Was this model appropriate?

Why would we care if the model is “appropriate”?

  • It may be chance that we got a good fit for our sample; the model might still be a bad fit for the population
  • Our estimates rely on the model conditions, so if those do not hold, our estimates and inference are unreliable

Conditions

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]

In order of importance:

  1. Linearity: Linear relationship between \(X\) and \(Y\) (\(E[\epsilon_i] = 0\)).
  2. Independence: Independence of \(\epsilon_i\).
  3. Homoskedasticity: Constant variance of \(\epsilon_i\).
  4. Normality: Normally distributed \(\epsilon_i\).
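A minimal simulation sketch of this model, where all four conditions hold by construction (the values of \(\beta_0\), \(\beta_1\), and \(\sigma\) below are arbitrary):

set.seed(430)
n <- 100
x <- runif(n, 150, 300)             # predictor values
eps <- rnorm(n, mean = 0, sd = 10)  # iid Normal errors: conditions 2-4
y <- 2 + 0.5 * x + eps              # mean of y is linear in x: condition 1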

Checking model conditions

Visual checks versus statistical tests

People have developed statistical tests to assess some model conditions.

  • We want to avoid unnecessary tests because each test we perform carries its own chance of error

Alternatively, we will use visual assessments

  • There is some level of subjectivity
  • We will rely on experience, and keep our assessments reproducible, to apply them responsibly

Visual checks

# Fit the model, then produce the standard diagnostic plots
model_elephant <- lm(Mass ~ Height, data = elephants)

library(ggfortify)

autoplot(model_elephant)

Recall residuals: \(e_i = y_i - \hat{y}_i\)
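As a quick check of this definition, resid() matches computing \(y_i - \hat{y}_i\) by hand:

# Both lines return the same values
head(resid(model_elephant))
head(elephants$Mass - fitted(model_elephant))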


Diagnosing linearity

  • Examine a plot of residuals vs fitted values (or vs the predictor, for simple linear regression)
  • There should be no discernible pattern, with the residuals centered around 0

[Plots: example residual plots, one labeled “Linear” and one labeled “Non-linear”]
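The same check can be drawn by hand; a minimal sketch using the fitted model from above:

diag_df <- data.frame(
  fitted = fitted(model_elephant),
  resid  = resid(model_elephant)
)

ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +  # residuals should center on 0
  labs(x = "Fitted values", y = "Residuals")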

When linearity is violated

Implications:

  • Poor fit
  • Biased estimates and misleading inference

Remedy:

  • Consider transformations of \(X\) or other functional forms
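For instance, two candidate re-specifications in R (both are sketches, not recommendations; whether either is appropriate depends on the data):

model_logx <- lm(Mass ~ log(Height), data = elephants)     # transform X
model_quad <- lm(Mass ~ poly(Height, 2), data = elephants) # allow curvature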

Diagnosing independence

  • Consider how the data were collected (random sample? no repeat sampling?) and on whom the data were collected (would you expect independence?)
  • Dependence could be revealed by plotting residuals versus unmodeled variables (time, class, etc.)
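As a sketch of that second idea, supposing the data had a hypothetical measure_order column recording when each elephant was measured:

# measure_order is a hypothetical column, for illustration only
elephants$resid <- resid(model_elephant)

ggplot(elephants, aes(x = measure_order, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Measurement order", y = "Residuals")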

When independence is violated

Implications:

  • Estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are still unbiased
  • The estimated standard errors typically underestimate the true standard errors
  • Increased type I error rates (false positives)

Remedy:

  • Consider a different model that accounts for the correlation (e.g., for longitudinal data, a linear mixed effects model may be appropriate)
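A sketch of that remedy with the lme4 package, assuming a hypothetical herd column grouping related elephants:

library(lme4)

# Random intercept per herd accounts for within-herd correlation
# (herd is a hypothetical grouping column, for illustration only)
model_lmm <- lmer(Mass ~ Height + (1 | herd), data = elephants)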

Diagnosing homoskedasticity

  • Examine the plot of residuals vs fitted values and check that there is no “funnel” shape
  • Specifically, for each fitted value does the spread of the residuals look similar?

[Plots: example residual plots, one labeled “Constant variance” and one labeled “Non-constant variance”]

When homoskedasticity is violated

Implications:

  • Estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are still unbiased
  • The standard errors are incorrect

Remedies:

  • Consider transformations of \(Y\)
  • Consider weighted least squares
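Sketches of both remedies (the log transform and the particular weights are illustrative choices, not recommendations):

# Transform Y, e.g. a log transform when spread grows with the mean
model_logy <- lm(log(Mass) ~ Height, data = elephants)

# Weighted least squares via lm()'s weights argument
model_wls <- lm(Mass ~ Height, data = elephants, weights = 1 / Height)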

Diagnosing normality

  • Assess visually with a Q-Q plot
  • Outliers (points with extreme residuals) can produce violations of normality

[Plots: example Q-Q plots, one labeled “Non-normal residuals” and one labeled “Normal residuals”]
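autoplot() above already includes a Q-Q plot; a base-R sketch of the same check:

r <- resid(model_elephant)
qqnorm(r)  # sample quantiles vs theoretical Normal quantiles
qqline(r)  # reference line; the points should hug it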

When normality is violated

Implications:

  • With no outliers and a large enough sample, the Central Limit Theorem corrects for the lack of normality in inference about the coefficients
  • Prediction intervals for \(Y\) can be unreliable

Remedies:

  • Consider transformations of \(Y\)