Math 430: Lecture 5b

Linear Regression Model Evaluations

Professor Catalina Medina

How do we decide one model is better than another?

It depends on our purpose.

There are three values commonly used:

  • Residual Standard Error (RSE)
  • Root Mean Squared Deviation (RMSE)
  • \(R^2\)

Residual Standard Error (RSE)

\[RSE = \hat{\sigma} = \sqrt{\frac{1}{n - 2} RSS} = \sqrt{\frac{1}{n - 2} \sum_{i = 1}^{n} (y_i - \hat{y}_i)^2}\]

  • Measures lack of fit of the model to the data
  • Measured in units of \(y\)

\(n - 2\) is required to make \(E[\hat{\sigma}^2] = \sigma^2\), but if often replaced by just \(n\) when the sole goal is prediction accuracy instead of inference

  • using \(n\) instead of \(n - 2\) is called Root Mean Squared Error (RMSE)

What would you say is a “good” RSE/RMSE value? What would be a “bad” one?

  • You can evaluate it relative to RSE/RMSE of another model (a model with different \(X\) for the same \(Y\))
  • RMSE is “roughly” the average error, so you can think in context if the average error is acceptable or unacceptable

\(R^2\) statistic

\[R^2 = \frac{TSS - RSS}{TSS}\]

where \(TSS = \sum_{i = 1}^{n} (y_i - \bar{y})^2\).

  • \(TSS\) measures amount of variability in \(Y\)
  • \(RSS\) measures amount of variability in \(Y\) unexplained by model
  • \(R^2\) measures percent of variability in \(Y\) explained by model
  • \(TSS - RSS\) measures amount of variability explained by the model

\(R^2\) statistic

What \(R^2\) be for perfectly linear data?

What would \(R^2\) be for completely unlinear data?

In practice it is context specific for what constitutes a “reasonably good” \(R^2\)

  • consider physics versus marketing

Correlation and the \(R^2\) statistic

\(Cor(X, Y) = \frac{\sum_{i = 1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n} (x_i - \bar{x})^2 }\sqrt{\sum_{i = 1}^{n} (y_i - \bar{y})^2}}\)

Correlation between \(X\) and \(Y\) is a measure of the linear relationship between \(X\) and \(Y\)

For simple linear regression \(R^2 = [Cor(X, Y)]^2\)

Cautions with \(R^2\)statistic

Low \(R^2\) does not mean no/(or weak) relationship, simply not a linear relationship

\(R^2\) = 0.11

Cautions with \(R^2\)statistic

High \(R^2\) does not mean a regression line fits the data well. Another function may describe it better

\(R^2\) = 0.74

Estimating the response

We know \(\hat{y}_i\) is our model’s point estimate for expectation of \(Y_i\)

What about an uncertainty estimate?

Confidence interval for average \(Y_i\)

Let \(\mu_Y = E[Y_i | X_i = x_h]\), the mean response when the explanatory variable value is \(x_h\). The confidence interval for \(\mu_y\) is:

\[\text{Point estimate} \pm t^* \times SE(\text{Point estiamte})\]

\[\hat{y}_h \pm t_{(1 - \alpha / 2, n - 2)} RSE\sqrt{ \left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i = 1}^{n} (x_i - \bar{x})^2}\right)}\]

  • \(x_h\) must be within the range of the observed \(x_i\) values to avoid extrapolation, but need not be an observed \(x_i\)

Prediction interval for a new response

Let \(Y_{new}\) be the new response when the explanatory variable value is \(x_h\). The confidence interval for \(Y_{new}\) is:

\[\hat{y}_h \pm t_{(1 - \alpha / 2, n - 2)} RSE \sqrt{1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i = 1}^{n} (x_i - \bar{x})^2}}\]

  • \(x_h\) must be within the range of the observed \(x_i\) values to avoid extrapolation, but need not be an observed \(x_i\)
  • How does this compare the the CI for average \(Y_i\)?

Example: Hospital infection risk

The hospital infection risk dataset consists of a sample of n = 58 hospitals in the east and north-central U.S. (Hospital Infection Data Region 1 and 2 data).

  • The response variable is y = infection risk (percent of patients who get an infection).
  • The explanatory variable is x = average length of stay (in days).

Example: Hospital infection risk data

hospital_infection <- read_tsv("https://online.stat.psu.edu/onlinecourses/sites/stat501/files/data/hospitalinfct_reg1and2.txt")

glimpse(hospital_infection)
Rows: 58
Columns: 12
$ ID         <dbl> 5, 10, 11, 13, 18, 23, 27, 28, 29, 31, 33, 34, 36, 51, 55, …
$ Stay       <dbl> 11.20, 8.84, 11.07, 12.78, 11.62, 9.78, 9.31, 8.19, 11.65, …
$ Age        <dbl> 56.5, 56.3, 53.2, 56.8, 53.9, 52.3, 47.2, 52.1, 54.5, 49.9,…
$ InfctRsk   <dbl> 5.7, 6.3, 4.9, 7.7, 6.4, 5.0, 4.5, 3.2, 4.4, 5.0, 5.3, 6.1,…
$ Culture    <dbl> 34.5, 29.6, 28.5, 46.0, 25.5, 17.6, 30.2, 10.8, 18.6, 19.7,…
$ Xray       <dbl> 88.9, 82.6, 122.0, 116.9, 99.2, 95.9, 101.3, 59.2, 96.1, 10…
$ Beds       <dbl> 180, 85, 768, 322, 133, 270, 170, 176, 248, 318, 196, 312, …
$ MedSchool  <dbl> 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
$ Region     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Census     <dbl> 134, 59, 591, 252, 113, 240, 124, 156, 217, 270, 164, 258, …
$ Nurses     <dbl> 151, 66, 656, 349, 101, 198, 173, 88, 189, 335, 165, 169, 1…
$ Facilities <dbl> 40.0, 40.0, 80.0, 57.1, 37.1, 57.1, 37.1, 37.1, 37.1, 57.1,…

Example: Hospital infection risk model

Fitting the model

hospital_model <- lm(InfctRsk ~ Stay, data = hospital_infection)
summary(hospital_model)

Call:
lm(formula = InfctRsk ~ Stay, data = hospital_infection)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6145 -0.4660  0.1388  0.4970  2.4310 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.15982    0.95580  -1.213     0.23    
Stay         0.56887    0.09416   6.041  1.3e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.024 on 56 degrees of freedom
Multiple R-squared:  0.3946,    Adjusted R-squared:  0.3838 
F-statistic:  36.5 on 1 and 56 DF,  p-value: 1.302e-07

Example: CI for Average risk of infection for average stay of 10 days?

new_hospital_data <- data.frame(Stay = 10)

predict(
  hospital_model, 
  newdata = new_hospital_data, 
  interval = "confidence",
  level = 0.95
)
       fit      lwr      upr
1 4.528846 4.259205 4.798486

We estimate the average risk of infection associated with a hospital with an average stay of 10 days to be 4.53. We estimate it to be between 4.26 and 4.8 with 95% confidence.

Example: Prediction interval for risk of infection for average stay of 10 days?

new_hospital_data <- data.frame(Stay = 10)

predict(
  hospital_model, 
  newdata = new_hospital_data, 
  interval = "prediction",
  level = 0.95
)
       fit     lwr      upr
1 4.528846 2.45891 6.598781

We estimate the risk of infection associated with a new hospital with an average stay of 10 days to be 4.53. We estimate it to be between 2.46 and 6.6 with 95% confidence.

Confidence bands

We can easily plot confidence bounds for average response - confidence intervals for average response for many \(x_i\)’s

ggplot(hospital_infection, aes(x = Stay, y = InfctRsk)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

Confidence bands

The red line is our 95% confidence interval

Confidence bands

The purple line is the 95% prediction interval for a new hospital