When we perform multiple linear regression, we are usually interested in answering a few important questions.
Is at least one of the predictors \(X_1, X_2, \dots, X_p\) useful in predicting the response?
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Q1: Is There a Linear Relationship Between the Response and Predictors?
Q1 with simple linear regression
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
Hypothesis test for \(\beta_1\)!
\(H_0: \beta_1 = 0\)
\(H_A: \beta_1 \neq 0\)
Compute p-value.
What is the definition of p-value?
Judge if you (1) found evidence against the null hypothesis (in support of the alternative) or (2) failed to find evidence against the null hypothesis (failed to find evidence to support the alternative)
F-statistic
\(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\)
\(H_A\): At least one \(\beta_j\) is non-zero
Our test statistic for the hypothesis test is: \[F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)}\]
TSS = …
RSS = …
n = …
p = …
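As a rough sketch of how the pieces of the F-statistic fit together, the snippet below computes it by hand on synthetic data. The data, the variable names, and the use of NumPy are assumptions made for illustration only; this is not the California schools data, and it uses the standard sums-of-squares definitions.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the California schools data)
rng = np.random.default_rng(0)
n, p = 200, 3                                   # observations, predictors
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)    # only the first predictor matters

# Least-squares fit with an intercept column
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta_hat

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares

f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
r_squared = 1 - rss / tss
print(f"F = {f_stat:.2f}, R^2 = {r_squared:.3f}")
```

Under \(H_0\), and assuming normally distributed errors, this statistic follows an F distribution with \(p\) and \(n - p - 1\) degrees of freedom, which is where the p-value comes from.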
F-statistic (in practice)
The dataset contains data on test performance, school characteristics and student demographic backgrounds for school districts in California in 1999.
Notice we could also extract \(R^2\) and \(RSE = \hat{\sigma}\)!
F-statistic (in practice)
\(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\)
\(H_A\): At least one \(\beta_j\) is non-zero
Statistics conclusion: Our p-value is \(1.4216963 \times 10^{-132}\). Therefore there is very strong evidence against \(H_0\) in support of \(H_A\).
Conclusion in context: We found very strong evidence (p-value = \(1.4216963 \times 10^{-132}\)) that at least one of student-teacher ratio, student-computer ratio, or % reduced lunch is linearly related to score.
Partial F-test for nested models
From the plots it was pretty clear that lunch is linearly related to score. What if we only wanted to test whether student-teacher ratio and/or student-computer ratio are linearly related to score?
We can do a partial F-test for nested models!
\(H_0: \beta_1 = \beta_2 = 0\)
\(H_A\): At least one of \(\beta_1\) or \(\beta_2\) is non-zero
Statistics conclusion: Our p-value is \(9.1955202 \times 10^{-7}\). Therefore there is very strong evidence against \(H_0\) in support of \(H_A\).
Conclusion in context: We found very strong evidence (p-value = \(9.1955202 \times 10^{-7}\)) that at least one of student-teacher ratio or student-computer ratio is linearly related to score, in addition to lunch.
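The partial F-test compares the RSS of a reduced model against the RSS of the full model it is nested in. The following is a minimal sketch on synthetic data; the helper names `rss` and `partial_f`, the data, and the use of NumPy are hypothetical choices for illustration, not the code behind the results above.

```python
import numpy as np

def rss(design, y):
    """Residual sum of squares from a least-squares fit of y on design."""
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta_hat
    return float(resid @ resid)

def partial_f(design_reduced, design_full, y):
    """F-statistic comparing a reduced model nested inside a full model."""
    n = len(y)
    k_full = design_full.shape[1]             # coefficients incl. intercept (p + 1)
    q = k_full - design_reduced.shape[1]      # coefficients set to 0 under H0
    rss_reduced = rss(design_reduced, y)
    rss_full = rss(design_full, y)
    return ((rss_reduced - rss_full) / q) / (rss_full / (n - k_full))

# Synthetic stand-in data (made up; x3 plays the role of a clearly useful predictor)
rng = np.random.default_rng(1)
n = 300
x1, x2, x3 = rng.normal(size=(3, n))
y = 5.0 + 0.5 * x1 - 3.0 * x3 + rng.normal(size=n)

ones = np.ones(n)
full = np.column_stack([ones, x1, x2, x3])
reduced = np.column_stack([ones, x3])         # H0: beta_1 = beta_2 = 0
f_partial = partial_f(reduced, full, y)
print(f"partial F = {f_partial:.2f}")
```

When the reduced model contains only the intercept, its RSS equals TSS and the partial F-statistic reduces to the overall F-statistic.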
For instance, consider an example in which \(p = 100\) and \(H_0 : \beta_1 = \beta_2 = \dots = \beta_p = 0\) is true, so no variable is truly associated with the response.
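Imagine we collected many samples and fit the model to each one
With a boundary (significance level) of 0.05, about 5% of the individual predictors' p-values will fall below 0.05 by chance alone, so with \(p = 100\) we expect roughly five predictors to look significant even though none is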
Our significance level is actually the type I error rate, the probability of incorrectly rejecting a true null hypothesis (“false positive”)
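The F-statistic adjusts for the number of predictors \(p\): if \(H_0\) is true, there is only a 5% chance that the F-test gives a p-value below 0.05, no matter how many predictors there are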
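To put numbers on the \(p = 100\) example, the sketch below works out how often at least one individual p-value falls below 0.05 when no predictor matters. It relies on two simplifying assumptions: each p-value is uniform on [0, 1] under \(H_0\), and the 100 tests are treated as independent.

```python
import random

alpha = 0.05
p = 100  # number of predictors, none truly related to the response

# Expected number of individual p-values below alpha by chance alone
expected_false_positives = alpha * p

# Probability that at least one of the p tests rejects (independence assumed)
prob_at_least_one = 1 - (1 - alpha) ** p

# Check by simulation: draw p uniform p-values for each of many datasets
random.seed(0)
n_datasets = 10_000
hits = 0
for _ in range(n_datasets):
    p_values = [random.random() for _ in range(p)]
    if min(p_values) < alpha:
        hits += 1

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one p-value < {alpha}): {prob_at_least_one:.3f}")
print(f"simulated share of datasets: {hits / n_datasets:.3f}")
```

Under these assumptions, looking at the individual p-values almost always turns up at least one "significant" predictor, which is why a single overall F-test is used instead.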
Q2: Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
Statistical Inference
If you are doing statistical inference, you need to be careful not to inflate your type I error rate (e.g., in a pharmaceutical setting, testing whether Drug A lowers cholesterol relative to Drug B).
Therefore you should consider ahead of time what predictors should be included based on the scientific peer-reviewed literature
You should not fit a bunch of models, see what “fits best”, then do a hypothesis test (“p-hacking”)
After you answer your question of interest you can then do exploratory analyses
Prediction
Fit a bunch of models and see what “fits best”
We will cover this in week 12
Q3: How well does the model fit the data?
Metrics we already know
RSE for multiple linear regression
\[RSE = \sqrt{\frac{RSS}{n - p -1}}\]
Reduces to \(RSE = \sqrt{\frac{RSS}{n - 2}}\) for simple linear regression
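A small sketch of the RSE formula on synthetic data follows. The data, coefficients, and variable names are made up for illustration; the noise standard deviation is set to 1 so the result can be compared against a known value.

```python
import numpy as np

# Synthetic data with known noise standard deviation (an assumption for illustration)
rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=n)

# Least-squares fit with an intercept column
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
rss = np.sum((y - design @ beta_hat) ** 2)

# Residual standard error: divide RSS by the residual degrees of freedom
rse = np.sqrt(rss / (n - p - 1))
print(f"RSE = {rse:.3f}")
```

Because RSE estimates the standard deviation of the error term, the printed value should land close to 1 here.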
Q4: Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Same as with simple linear regression!
We can compute a prediction by plugging in values for all of our \(X_j\)’s into our fitted model
We can compute a prediction interval
What do we predict the average test score will be for a school with a student-teacher ratio of 22, student-computer ratio of 10, and 50% qualifying for reduced-price lunch?
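The mechanics of this kind of prediction can be sketched as follows. Everything here is an assumption for illustration: the data are simulated from a made-up model (so the numbers are not the answer for the real California data), the variable names are hypothetical, and the interval uses a normal quantile as a large-sample approximation to the t quantile.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n = 400
# Synthetic predictors on plausible scales (made-up values)
stratio = rng.uniform(14, 26, size=n)      # student-teacher ratio
compratio = rng.uniform(2, 20, size=n)     # student-computer ratio
lunch = rng.uniform(0, 100, size=n)        # % qualifying for reduced-price lunch
# Hypothetical data-generating model, chosen only for illustration
score = (700 - 1.0 * stratio - 0.5 * compratio - 0.6 * lunch
         + rng.normal(scale=9, size=n))

X = np.column_stack([np.ones(n), stratio, compratio, lunch])
beta_hat, *_ = np.linalg.lstsq(X, score, rcond=None)
resid = score - X @ beta_hat
p = X.shape[1] - 1
rse = np.sqrt(resid @ resid / (n - p - 1))

# Plug the predictor values into the fitted model
x0 = np.array([1.0, 22.0, 10.0, 50.0])     # intercept, then the three predictors
y0_hat = x0 @ beta_hat

# Standard error for predicting a single new observation
leverage = x0 @ np.linalg.solve(X.T @ X, x0)
se_pred = rse * np.sqrt(1 + leverage)

z = NormalDist().inv_cdf(0.975)            # normal approximation to the t quantile
lower, upper = y0_hat - z * se_pred, y0_hat + z * se_pred
print(f"prediction = {y0_hat:.1f}, approx. 95% PI = ({lower:.1f}, {upper:.1f})")
```

The `1 +` inside the square root accounts for the noise in a single new observation; dropping it would give a confidence interval for the average response instead.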