
Linear Regression Model Evaluations
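How we evaluate a fitted model depends on our purpose (inference or prediction).
Three quantities are commonly used: the residual standard error (RSE), \(R^2\), and the correlation.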
\[RSE = \hat{\sigma} = \sqrt{\frac{1}{n - 2} RSS} = \sqrt{\frac{1}{n - 2} \sum_{i = 1}^{n} (y_i - \hat{y}_i)^2}\]
Dividing by \(n - 2\) is required to make \(E[\hat{\sigma}^2] = \sigma^2\), but it is often replaced by just \(n\), giving the root mean squared error (RMSE), when the sole goal is prediction accuracy rather than inference.
What would you say is a “good” RSE/RMSE value? What would be a “bad” one?
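The difference between the two divisors can be seen in a minimal sketch. It assumes R's built-in cars data as a stand-in (it is not the data used in these notes), and the names fit, rss, rse, and rmse are illustrative.

```r
# Stand-in example: simple linear regression on R's built-in `cars` data
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)
rss <- sum(residuals(fit)^2)   # residual sum of squares

rse  <- sqrt(rss / (n - 2))    # divides by n - 2, as in the RSE formula
rmse <- sqrt(rss / n)          # divides by n, the prediction-focused variant

# sigma() extracts the residual standard error that summary(fit) reports
all.equal(rse, sigma(fit))
c(RSE = rse, RMSE = rmse)
```

Both values are in the units of the response, which is why whether a value is good or bad depends on the scale of \(Y\).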
\[R^2 = \frac{TSS - RSS}{TSS}\]
where \(TSS = \sum_{i = 1}^{n} (y_i - \bar{y})^2\).
What would \(R^2\) be for perfectly linear data?
What would \(R^2\) be for data with no linear relationship at all?
In practice, what constitutes a “reasonably good” \(R^2\) is context specific.
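As a sketch of the definition, \(R^2\) can be computed directly from TSS and RSS and compared with the value R reports. The built-in cars data is again an assumed stand-in, and the variable names are illustrative.

```r
# Stand-in example on R's built-in `cars` data
fit <- lm(dist ~ speed, data = cars)
y   <- cars$dist

rss <- sum((y - fitted(fit))^2)   # variation left unexplained by the line
tss <- sum((y - mean(y))^2)       # total variation around the mean of y

r2 <- (tss - rss) / tss

# Agrees with the "Multiple R-squared" value computed by summary()
all.equal(r2, summary(fit)$r.squared)
```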
\[Cor(X, Y) = \frac{\sum_{i = 1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n} (x_i - \bar{x})^2}\sqrt{\sum_{i = 1}^{n} (y_i - \bar{y})^2}}\]
Correlation between \(X\) and \(Y\) is a measure of the linear relationship between \(X\) and \(Y\)
For simple linear regression \(R^2 = [Cor(X, Y)]^2\)
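The identity for simple linear regression can be checked numerically. This is a small sketch on the built-in cars data (an assumed stand-in), with the correlation computed both from the formula above and with cor().

```r
# Stand-in example on R's built-in `cars` data
x <- cars$speed
y <- cars$dist

# Correlation from the formula, term by term
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

fit <- lm(dist ~ speed, data = cars)

all.equal(r_manual, cor(x, y))                  # formula matches cor()
all.equal(r_manual^2, summary(fit)$r.squared)   # squared correlation equals R^2
```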
A low \(R^2\) does not necessarily mean there is no (or only a weak) relationship; the relationship may simply not be linear.

[Figure: example scatterplot with \(R^2 = 0.11\)]
A high \(R^2\) does not mean a regression line fits the data well; another function may describe the data better.

[Figure: example scatterplot with \(R^2 = 0.74\)]
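Both cautions can be reproduced with simulated data. The sketch below is illustrative only: the quadratic relationships, noise levels, and seed are assumptions chosen for the example and are not the data behind the figures.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Case 1: strong but non-linear relationship (symmetric quadratic)
x1 <- seq(-3, 3, length.out = 101)
y1 <- x1^2 + rnorm(101, sd = 0.5)
r2_low <- summary(lm(y1 ~ x1))$r.squared            # low, despite a clear pattern

# Case 2: curved relationship over a range where it is increasing
x2 <- seq(0, 10, length.out = 101)
y2 <- x2^2 + rnorm(101, sd = 2)
r2_line <- summary(lm(y2 ~ x2))$r.squared           # high, but a line is the wrong shape
r2_quad <- summary(lm(y2 ~ poly(x2, 2)))$r.squared  # a quadratic describes it better

c(low = r2_low, line = r2_line, quadratic = r2_quad)
```

Plotting the residuals of the straight-line fits against x1 and x2 would show the curvature that \(R^2\) alone hides.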
We know \(\hat{y}_i\) is our model’s point estimate for the expectation of \(Y_i\).
What about an uncertainty estimate?
Let \(\mu_Y = E[Y_i | X_i = x_h]\), the mean response when the explanatory variable value is \(x_h\). The confidence interval for \(\mu_Y\) is:
\[\text{Point estimate} \pm t^* \times SE(\text{Point estimate})\]
\[\hat{y}_h \pm t_{(1 - \alpha / 2, n - 2)} RSE\sqrt{ \left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i = 1}^{n} (x_i - \bar{x})^2}\right)}\]
Let \(Y_{new}\) be the new response when the explanatory variable value is \(x_h\). The prediction interval for \(Y_{new}\) is:
\[\hat{y}_h \pm t_{(1 - \alpha / 2, n - 2)} RSE \sqrt{1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i = 1}^{n} (x_i - \bar{x})^2}}\]
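The two interval formulas above can be evaluated by hand and compared with predict(). This is a sketch under stated assumptions: the built-in cars data stands in for a real dataset, and x_h = 15 is an arbitrary illustrative value.

```r
# Stand-in example on R's built-in `cars` data
fit   <- lm(dist ~ speed, data = cars)
x     <- cars$speed
n     <- length(x)
x_h   <- 15      # hypothetical value of the explanatory variable
alpha <- 0.05

new_data <- data.frame(speed = x_h)
y_hat    <- unname(predict(fit, newdata = new_data))  # point estimate
rse      <- sigma(fit)                                # residual standard error
t_star   <- qt(1 - alpha / 2, df = n - 2)             # t quantile with n - 2 df
sxx      <- sum((x - mean(x))^2)

# Standard errors for the mean response and for a new response
se_mean <- rse * sqrt(1 / n + (x_h - mean(x))^2 / sxx)
se_new  <- rse * sqrt(1 + 1 / n + (x_h - mean(x))^2 / sxx)

conf_int <- y_hat + c(-1, 1) * t_star * se_mean
pred_int <- y_hat + c(-1, 1) * t_star * se_new

# Reference intervals from predict()
ci_ref <- predict(fit, newdata = new_data, interval = "confidence", level = 0.95)
pi_ref <- predict(fit, newdata = new_data, interval = "prediction", level = 0.95)

all.equal(conf_int, unname(ci_ref[1, c("lwr", "upr")]))
all.equal(pred_int, unname(pi_ref[1, c("lwr", "upr")]))
```

The only difference between the two standard errors is the extra 1 under the square root, which accounts for the variability of a single new observation around its mean.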
The hospital infection risk dataset consists of a sample of n = 58 hospitals in the east and north-central U.S. (Hospital Infection Data Region 1 and 2 data).
library(tidyverse)  # provides read_tsv() (readr) and glimpse() (dplyr)
hospital_infection <- read_tsv("https://online.stat.psu.edu/onlinecourses/sites/stat501/files/data/hospitalinfct_reg1and2.txt")
glimpse(hospital_infection)
Rows: 58
Columns: 12
$ ID         <dbl> 5, 10, 11, 13, 18, 23, 27, 28, 29, 31, 33, 34, 36, 51, 55, …
$ Stay       <dbl> 11.20, 8.84, 11.07, 12.78, 11.62, 9.78, 9.31, 8.19, 11.65, …
$ Age        <dbl> 56.5, 56.3, 53.2, 56.8, 53.9, 52.3, 47.2, 52.1, 54.5, 49.9,…
$ InfctRsk   <dbl> 5.7, 6.3, 4.9, 7.7, 6.4, 5.0, 4.5, 3.2, 4.4, 5.0, 5.3, 6.1,…
$ Culture    <dbl> 34.5, 29.6, 28.5, 46.0, 25.5, 17.6, 30.2, 10.8, 18.6, 19.7,…
$ Xray       <dbl> 88.9, 82.6, 122.0, 116.9, 99.2, 95.9, 101.3, 59.2, 96.1, 10…
$ Beds       <dbl> 180, 85, 768, 322, 133, 270, 170, 176, 248, 318, 196, 312, …
$ MedSchool  <dbl> 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
$ Region     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Census     <dbl> 134, 59, 591, 252, 113, 240, 124, 156, 217, 270, 164, 258, …
$ Nurses     <dbl> 151, 66, 656, 349, 101, 198, 173, 88, 189, 335, 165, 169, 1…
$ Facilities <dbl> 40.0, 40.0, 80.0, 57.1, 37.1, 57.1, 37.1, 37.1, 37.1, 57.1,…
Fitting the model
hospital_model <- lm(InfctRsk ~ Stay, data = hospital_infection)
summary(hospital_model)
Call:
lm(formula = InfctRsk ~ Stay, data = hospital_infection)

Residuals:
    Min      1Q  Median      3Q     Max
-2.6145 -0.4660  0.1388  0.4970  2.4310

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.15982    0.95580  -1.213     0.23
Stay         0.56887    0.09416   6.041  1.3e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.024 on 56 degrees of freedom
Multiple R-squared:  0.3946,	Adjusted R-squared:  0.3838
F-statistic:  36.5 on 1 and 56 DF,  p-value: 1.302e-07
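As a quick consistency check on this output (a worked calculation from the reported values, not additional output), the 56 degrees of freedom match \(n - 2 = 58 - 2\). Because this is a simple linear regression with a positive slope, the sample correlation between Stay and InfctRsk can also be recovered from \(R^2\):
\[Cor(X, Y) = +\sqrt{R^2} = \sqrt{0.3946} \approx 0.628\]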
new_hospital_data <- data.frame(Stay = 10)
predict(
  hospital_model,
  newdata = new_hospital_data,
  interval = "confidence",
  level = 0.95
)
       fit      lwr      upr
1 4.528846 4.259205 4.798486
We estimate the average infection risk for hospitals with an average stay of 10 days to be 4.53, and we are 95% confident that this average lies between 4.26 and 4.80.
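The point estimate can be reproduced by hand from the coefficients reported in the model summary. The symbols \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are introduced here for the estimated intercept and slope, and the coefficients are rounded as printed, so the last digits differ slightly from the predict() output:
\[\hat{y}_h = \hat{\beta}_0 + \hat{\beta}_1 x_h = -1.15982 + 0.56887 \times 10 \approx 4.529\]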
new_hospital_data <- data.frame(Stay = 10)
predict(
  hospital_model,
  newdata = new_hospital_data,
  interval = "prediction",
  level = 0.95
)
       fit     lwr      upr
1 4.528846 2.45891 6.598781
We predict the infection risk for a new hospital with an average stay of 10 days to be 4.53, and we are 95% confident that its infection risk lies between 2.46 and 6.60.
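Comparing the two intervals at Stay = 10 makes the effect of the extra 1 under the square root concrete. Their half-widths, taken from the two predict() outputs, are:
\[4.798486 - 4.528846 \approx 0.270 \qquad 6.598781 - 4.528846 \approx 2.070\]
The prediction interval is roughly 7.7 times wider. Most of its width comes from the variability of a single new observation, which on its own contributes about \(t^* \times RSE \approx 2.00 \times 1.024 \approx 2.05\) (taking \(t^*\) as approximately 2.00 for 56 degrees of freedom), rather than from uncertainty in the estimated mean.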
We can also plot confidence bounds for the average response by computing confidence intervals for the average response at many values of \(x\).
The red line is the 95% confidence interval for the average response.
The purple line is the 95% prediction interval for a new hospital.
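One way to draw such bands is sketched below with base R graphics. It is not necessarily how the original figure was produced: the built-in cars data is an assumed stand-in, and the grid size and line types are illustrative choices.

```r
# Stand-in example on R's built-in `cars` data
fit <- lm(dist ~ speed, data = cars)

# Evaluate both intervals on a fine grid of x values
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 100))
conf_band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred_band <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

# Scatterplot with the fitted line, confidence band (red), prediction band (purple)
plot(dist ~ speed, data = cars)
abline(fit)
matlines(grid$speed, conf_band[, c("lwr", "upr")], col = "red", lty = 1)
matlines(grid$speed, pred_band[, c("lwr", "upr")], col = "purple", lty = 2)
```

The prediction band always lies outside the confidence band, and both are narrowest near \(\bar{x}\), as the \((x_h - \bar{x})^2\) term in the interval formulas implies.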