BSTA 511/611
OHSU-PSU School of Public Health
2024-12-02
Simple Linear Regression Part 2
Data file: lifeexp_femlit_water_2011.csv
Data were downloaded from https://www.gapminder.org/data/
2011 is the most recent year with the most complete data
Life expectancy = the average number of years a newborn child would live if current mortality patterns were to stay the same. Source: https://www.gapminder.org/data/documentation/gd004/
Adult literacy rate is the percentage of people ages 15 and above who can, with understanding, read and write a short, simple statement on their everyday life. Source: http://data.uis.unesco.org/
At least basic water source (%) = the percentage of people using at least basic water services. This indicator encompasses both people using basic water services as well as those using safely managed water services. Basic drinking water services is defined as drinking water from an improved source, provided collection time is not more than 30 minutes for a round trip. Improved water sources include piped water, boreholes or tubewells, protected dug wells, protected springs, and packaged or delivered water.
Load data
Glimpse of the data
Rows: 194
Columns: 5
$ country <chr> "Afghanistan", "Albania", "Algeria", "Andor…
$ life_expectancy_years_2011 <dbl> 56.7, 76.7, 76.7, 82.6, 60.9, 76.9, 76.0, 7…
$ female_literacy_rate_2011 <dbl> 13.0, 95.7, NA, NA, 58.6, 99.4, 97.9, 99.5,…
$ water_basic_source_2011 <dbl> 52.6, 88.1, 92.6, 100.0, 40.3, 97.0, 99.5, …
$ water_2011_quart <chr> "Q1", "Q2", "Q2", "Q4", "Q1", "Q3", "Q4", "…
Note the missing values for our variables of interest
gapm_original %>% select(life_expectancy_years_2011, female_literacy_rate_2011) %>%
get_summary_stats()
# A tibble: 2 × 13
variable n min max median q1 q3 iqr mad mean sd se
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 life_expec… 187 47.5 82.9 72.7 64.3 76.9 12.6 9.04 70.7 8.44 0.617
2 female_lit… 80 13 99.8 91.6 71.0 98.0 27.0 11.4 81.7 22.0 2.45
# ℹ 1 more variable: ci <dbl>
Remove rows with missing data for life expectancy and female literacy rate
gapm <- gapm_original %>%
drop_na(life_expectancy_years_2011, female_literacy_rate_2011)
glimpse(gapm)
Rows: 80
Columns: 5
$ country <chr> "Afghanistan", "Albania", "Angola", "Antigu…
$ life_expectancy_years_2011 <dbl> 56.7, 76.7, 60.9, 76.9, 76.0, 73.8, 71.0, 7…
$ female_literacy_rate_2011 <dbl> 13.0, 95.7, 58.6, 99.4, 97.9, 99.5, 53.4, 9…
$ water_basic_source_2011 <dbl> 52.6, 88.1, 40.3, 97.0, 99.5, 97.8, 96.7, 9…
$ water_2011_quart <chr> "Q1", "Q2", "Q1", "Q3", "Q4", "Q3", "Q3", "…
No missing values now for our variables of interest
# A tibble: 2 × 13
variable n min max median q1 q3 iqr mad mean sd se
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 life_expec… 80 48 81.8 72.4 65.9 75.8 9.95 6.30 69.9 7.95 0.889
2 female_lit… 80 13 99.8 91.6 71.0 98.0 27.0 11.4 81.7 22.0 2.45
# ℹ 1 more variable: ci <dbl>
Important
\[\widehat{y} = b_0 + b_1 \cdot x \]
lm(), summary(), & tidy()
Call:
lm(formula = life_expectancy_years_2011 ~ female_literacy_rate_2011,
data = gapm)
Residuals:
Min 1Q Median 3Q Max
-22.299 -2.670 1.145 4.114 9.498
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.92790 2.66041 19.143 < 2e-16 ***
female_literacy_rate_2011 0.23220 0.03148 7.377 1.5e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.142 on 78 degrees of freedom
Multiple R-squared: 0.4109, Adjusted R-squared: 0.4034
F-statistic: 54.41 on 1 and 78 DF, p-value: 1.501e-10
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 |
female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 |
Regression equation for our model:
\[\widehat{\textrm{life expectancy}} = 50.9 + 0.232 \cdot \textrm{female literacy rate} \]
\[Y = \beta_0 + \beta_1 \cdot X + \epsilon\]
The line is the average (expected) value of \(Y\) given a value of \(x\): \(E(Y|x)\).
The point estimates for \(\beta_0\), \(\beta_1\), and \(\sigma^2\) based on a sample are denoted by \(b_0\), \(b_1\), and \(s_{residuals}^2\), respectively.
For “good” model fit and to be able to make inferences and predictions based on our models, 4 conditions need to be satisfied.
Briefly:
Is the association between the variables linear?
Are the data points independent of each other?
Examples of when they are not independent include repeated measurements on the same person over time and observations clustered within groups such as families or clinics
This condition is checked by reviewing the study design and not by inspecting the data
How to analyze data using regression models when the \(Y\)-values are not independent is covered in BSTA 519 (Longitudinal data)
augment()
function from the broom
package.
model1 <- lm(life_expectancy_years_2011 ~ female_literacy_rate_2011,
data = gapm)
aug1 <- augment(model1)
glimpse(aug1)
Rows: 80
Columns: 8
$ life_expectancy_years_2011 <dbl> 56.7, 76.7, 60.9, 76.9, 76.0, 73.8, 71.0, 7…
$ female_literacy_rate_2011 <dbl> 13.0, 95.7, 58.6, 99.4, 97.9, 99.5, 53.4, 9…
$ .fitted <dbl> 53.94643, 73.14897, 64.53453, 74.00809, 73.…
$ .resid <dbl> 2.7535654, 3.5510294, -3.6345319, 2.8919074…
$ .hat <dbl> 0.13628996, 0.01768176, 0.02645854, 0.02077…
$ .sigma <dbl> 6.172684, 6.168414, 6.167643, 6.172935, 6.1…
$ .cooksd <dbl> 1.835891e-02, 3.062372e-03, 4.887448e-03, 2…
$ .std.resid <dbl> 0.48238134, 0.58332052, -0.59972251, 0.4757…
Note that below I save each figure, and then combine them in one row of output using grid.arrange()
from the gridExtra
package.
See more info at https://data.library.virginia.edu/understanding-QQ-plots/
Skewed right distribution
Long tails in distribution
Bimodal distribution
model1
How “good” we can expect a QQ plot to look depends on the sample size.
The QQ plots on the next slides are randomly generated
Thus, all the points in the QQ plots should theoretically fall in a line
However, there is sampling variability…
stat_qq_line()
doesn’t work with randomly generated samples, and thus the code below manually creates the line that the points should be on (which is \(y=x\) in this case).
samplesize <- 100
rand_qq1 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
# line y=x
geom_abline(intercept = 0, slope = 1,
color = "blue")
rand_qq2 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
geom_abline(intercept = 0, slope = 1,
color = "blue")
rand_qq3 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
geom_abline(intercept = 0, slope = 1,
color = "blue")
rand_qq4 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
geom_abline(intercept = 0, slope = 1,
color = "blue")
With fewer data points, there is more sampling variability in the QQ plots:
samplesize <- 10 # only change made to code!
rand_qq1 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
# line y=x
geom_abline(intercept = 0, slope = 1,
color = "blue")
rand_qq2 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
geom_abline(intercept = 0, slope = 1,
color = "blue")
rand_qq3 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
geom_abline(intercept = 0, slope = 1,
color = "blue")
rand_qq4 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
geom_abline(intercept = 0, slope = 1,
color = "blue")
With more data points, the QQ plots look closer to a straight line:
samplesize <- 1000 # only change made to code!
rand_qq1 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
# line y=x
geom_abline(intercept = 0, slope = 1,
color = "blue")
rand_qq2 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
geom_abline(intercept = 0, slope = 1,
color = "blue")
rand_qq3 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
geom_abline(intercept = 0, slope = 1,
color = "blue")
rand_qq4 <- ggplot() +
stat_qq(aes(sample = rnorm(samplesize))) +
geom_abline(intercept = 0, slope = 1,
color = "blue")
Residuals from Life Expectancy vs. Female Literacy Rate Regression
Simulated QQ plot of Normal Residuals with n = 80
[1] "life_expectancy_years_2011" "female_literacy_rate_2011"
[3] ".fitted" ".resid"
[5] ".hat" ".sigma"
[7] ".cooksd" ".std.resid"
Another way to assess model fit
\[R^2 = \frac{\textrm{variance of predicted y-values}}
{\textrm{variance of observed y-values}} = \frac{\sum_{i=1}^n(\widehat{y}_i-\bar{y})^2}
{\sum_{i=1}^n(y_i-\bar{y})^2}
= \frac{s_y^2 - s_{\textrm{residuals}}^2}
{s_y^2}\] \[R^2 = 1- \frac{s_{\textrm{residuals}}^2}
{s_y^2}\] where \(\frac{s_{\textrm{residuals}}^2}{s_y^2}\) is the proportion of “unexplained” variability in the \(y\) values,
and thus \(R^2 = 1- \frac{s_{\textrm{residuals}}^2}{s_y^2}\) is the proportion of “explained” variability in the \(y\) values
Recall, \(-1 \le r \le 1\)
Thus, \(0 \le R^2 \le 1\)
In practice, we want “high” \(R^2\) values, i.e. \(R^2\) as close to 1 as possible.
Calculating \(R^2\) in R using glance()
from the broom
package:
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.411 0.403 6.14 54.4 1.50e-10 1 -258. 521. 529.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
[1] 0.4109366
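As a quick check of the identity \(R^2 = r^2\) in simple linear regression, here is a sketch using simulated data (the seed and coefficients are arbitrary illustrations, not from the gapminder model):

```r
set.seed(42)  # arbitrary seed for reproducibility
x <- runif(80, min = 10, max = 100)
y <- 50 + 0.25 * x + rnorm(80, sd = 6)

fit <- lm(y ~ x)

# R^2 reported by the model summary
r_sq <- summary(fit)$r.squared

# square of the correlation coefficient r
r_sq_cor <- cor(x, y)^2

all.equal(r_sq, r_sq_cor)  # identical up to floating-point error
```

This identity holds only for SLR with one predictor; with multiple predictors, \(R^2\) is instead the squared correlation between \(y\) and \(\widehat{y}\).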
Warning
summary()
R output
Call:
lm(formula = life_expectancy_years_2011 ~ female_literacy_rate_2011,
data = gapm)
Residuals:
Min 1Q Median 3Q Max
-22.299 -2.670 1.145 4.114 9.498
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.92790 2.66041 19.143 < 2e-16 ***
female_literacy_rate_2011 0.23220 0.03148 7.377 1.5e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.142 on 78 degrees of freedom
Multiple R-squared: 0.4109, Adjusted R-squared: 0.4034
F-statistic: 54.41 on 1 and 78 DF, p-value: 1.501e-10
Compare to the square of the correlation coefficient \(r\):
# Fit regression model:
model1 <- lm(life_expectancy_years_2011 ~ female_literacy_rate_2011,
data = gapm)
# Get regression table:
tidy(model1, conf.int = TRUE) %>% gt() # conf.int = TRUE part is new!
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 | 45.6314348 | 56.2243615 |
female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 | 0.1695284 | 0.2948619 |
\[\begin{align} \widehat{y} =& b_0 + b_1 \cdot x\\ \widehat{\text{life expectancy}} =& 50.9 + 0.232 \cdot \text{female literacy rate} \end{align}\]
Note
Population model
line + random “noise”
\[Y = \beta_0 + \beta_1 \cdot X + \varepsilon\] with \(\varepsilon \sim N(0,\sigma)\)
\(\sigma\) is the variability (SD) of the residuals
Sample best-fit (least-squares) line:
\[\widehat{y} = b_0 + b_1 \cdot x \]
Note: Some sources use \(\widehat{\beta}\) instead of \(b\).
\[\begin{align} H_0 &: \beta_1 = 0\\ \text{vs. } H_A&: \beta_1 \neq 0 \end{align}\]
Note: R reports p-values for 2-sided tests
Recall the general CI formula:
\[\textrm{Point Estimate} \pm t^*\cdot SE_{\textrm{Point Estimate}}\]
For the CI of the coefficient \(b_1\) this translates to
\[b_1 \pm t^*\cdot SE_{b_1}\] where \(t^*\) is the critical value from a \(t\)-distribution with \(df = n -2\).
How is \(\text{SE}_{b_1}\) calculated? See next slide.
\[\text{SE}_{b_1} = \frac{s_{\textrm{residuals}}}{s_x\sqrt{n-1}}\]
\(\text{SE}_{b_1}\) is the variability of the statistic \(b_1\)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.411 0.403 6.14 54.4 1.50e-10 1 -258. 521. 529.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# standard deviation of the residuals (Residual standard error in summary() output)
(s_resid <- glance(model1)$sigma)
[1] 6.142157
# standard deviation of x (female literacy rate)
(s_x <- sd(gapm$female_literacy_rate_2011))
[1] 21.95371
# sample size
(n <- nrow(gapm))
[1] 80
# standard error of b1
(SE_b1 <- s_resid / (s_x * sqrt(n - 1)))
[1] 0.03147744
\[b_1 \pm t^*\cdot SE_{b_1}\]
where \(t^*\) is the \(t\)-distribution critical value with \(df = n -2\).
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 | 45.6314348 | 56.2243615 |
female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 | 0.1695284 | 0.2948619 |
Save regression output for the row with the slope’s information:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 |
Save values needed for CI:
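A sketch of the by-hand CI calculation, plugging in the slope estimate and its standard error from the regression table above:

```r
b1    <- 0.2321951   # slope estimate from the regression table
SE_b1 <- 0.03147744  # standard error of the slope
n     <- 80          # number of countries in the model

# critical value from a t-distribution with df = n - 2
tstar <- qt(0.975, df = n - 2)

# 95% CI: b1 +/- t* x SE_b1
(CI_b1 <- b1 + c(-1, 1) * tstar * SE_b1)
```

The result matches the conf.low and conf.high columns that tidy(model1, conf.int = TRUE) reports.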
\[\begin{align} H_0 &: \beta_1 = 0\\ \text{vs. } H_A&: \beta_1 \neq 0 \end{align}\]
The test statistic for \(b_1\) is
\[t = \frac{ b_1 - \beta_1}{ \text{SE}_{b_1}} = \frac{ b_1}{ \text{SE}_{b_1}}\]
when we assume \(H_0: \beta_1 = 0\) is true.
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 | 45.6314348 | 56.2243615 |
female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 | 0.1695284 | 0.2948619 |
Calculate the test statistic using the values in the regression table:
# recall model1_b1 is regression table restricted to b1 row
(TestStat <- model1_b1$estimate / model1_b1$std.error)
[1] 7.376557
Compare this test statistic value to the one from the regression table above
As usual, the \(p\)-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true.
To calculate the \(p\)-value, we need to know the probability distribution of the test statistic (the null distribution) assuming \(H_0\) is true.
Statistical theory tells us that the test statistic \(t\) can be modeled by a \(t\)-distribution with \(df = n-2\).
Recall that this is a 2-sided test:
Compare the \(p\)-value to the one from the regression table below
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 | 45.6314348 | 56.2243615 |
female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 | 0.1695284 | 0.2948619 |
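A sketch of the p-value calculation from the test statistic in the regression table (t = 7.376557, df = 78):

```r
t_stat <- 7.376557   # test statistic for the slope (from the table)
df     <- 80 - 2     # n - 2 degrees of freedom

# two-sided p-value: P(|T| >= |t_stat|) under the null t-distribution
(p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE))
```

This reproduces the p.value column for the slope in the regression table.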
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 |
female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 |
\[\widehat{\textrm{life expectancy}} = 50.9 + 0.232 \cdot \textrm{female literacy rate} \]
What is the predicted life expectancy for a country with female literacy rate 60%?
\[\widehat{\textrm{life expectancy}} = 50.9 + 0.232 \cdot 60 = 64.82\]
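The same prediction using the full-precision coefficients from the regression table (the rounded equation gives 64.82; full precision gives the 64.8596 used on later slides):

```r
b0 <- 50.9278981  # intercept from the regression table
b1 <- 0.2321951   # slope from the regression table

# predicted life expectancy for a 60% female literacy rate
(y_hat <- b0 + b1 * 60)
# [1] 64.8596
```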
Recall the population model:
line + random “noise”
\[Y = \beta_0 + \beta_1 \cdot X + \varepsilon\] with \(\varepsilon \sim N(0,\sigma)\)
\(\sigma\) is the variability (SD) of the residuals
\[\widehat{E[Y|x^*]} = b_0 + b_1 x^*\]
\[\widehat{E[Y|x^*]} \pm t_{n-2}^* \cdot SE_{\widehat{E[Y|x^*]}}\]
\[SE_{\widehat{E[Y|x^*]}} = s_{residuals} \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)s_x^2}}\]
qt()
and depends on the confidence level.
Find the 95% CI for the mean life expectancy when the female literacy rate is 60%.
\[\begin{align} \widehat{E[Y|x^*]} &\pm t_{n-2}^* \cdot SE_{\widehat{E[Y|x^*]}}\\ 64.8596 &\pm 1.990847 \cdot s_{residuals} \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)s_x^2}}\\ 64.8596 &\pm 1.990847 \cdot 6.142157 \sqrt{\frac{1}{80} + \frac{(60 - 81.65375)^2}{(80-1)21.95371^2}}\\ 64.8596 &\pm 1.990847 \cdot 0.9675541\\ 64.8596 &\pm 1.926252\\ (62.93335 &, 66.78586) \end{align}\]
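The same calculation in R, using the summary values from earlier slides (s_resid = 6.142157, s_x = 21.95371, mean literacy rate = 81.65375):

```r
n       <- 80
x_star  <- 60         # value of x at which we estimate the mean
x_bar   <- 81.65375   # mean female literacy rate
s_x     <- 21.95371   # SD of female literacy rate
s_resid <- 6.142157   # SD of the residuals (glance(model1)$sigma)
y_hat   <- 64.8596    # predicted mean at x_star

tstar   <- qt(0.975, df = n - 2)
SE_mean <- s_resid * sqrt(1/n + (x_star - x_bar)^2 / ((n - 1) * s_x^2))

# 95% CI for the mean life expectancy at x* = 60
(CI_mean <- y_hat + c(-1, 1) * tstar * SE_mean)
```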
Find the 95% CI’s for the mean life expectancy when the female literacy rate is 40, 60, and 80.
Use the predict() function with a newdata data frame, where the newdata “value” is \(x^*\):
  female_literacy_rate_2011
1                        40
2                        60
3                        80
We are 95% confident that the average life expectancy for a country with a 60% female literacy rate will be between 62.9 and 66.8 years.
Confidence bands can be added to a scatterplot with se = TRUE within geom_smooth()
\[\begin{align} \widehat{E[Y|x^*]} &\pm t_{n-2}^* \cdot SE_{\widehat{E[Y|x^*]}}\\ \widehat{E[Y|x^*]} &\pm t_{n-2}^* \cdot s_{residuals} \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)s_x^2}} \end{align}\]
\[\widehat{Y|x^*} \pm t_{n-2}^* \cdot s_{residuals} \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)s_x^2}}\]
Note the additional 1+ under the square root compared to the CI for the mean.
Find the 95% prediction interval for the life expectancy when the female literacy rate is 60.
\[\begin{align} \widehat{Y|x^*} &\pm t_{n-2}^* \cdot s_{residuals} \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)s_x^2}}\\ 64.8596 &\pm 1.990847 \cdot 6.142157 \sqrt{1+\frac{1}{80} + \frac{(60 - 81.65375)^2}{(80-1)21.95371^2}}\\ (52.48072 &, 77.23849) \end{align}\]
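And the matching R sketch for the prediction interval, with the same plug-in values as before (note the extra 1 inside the square root):

```r
n       <- 80
x_star  <- 60         # value of x at which we predict a new observation
x_bar   <- 81.65375   # mean female literacy rate
s_x     <- 21.95371   # SD of female literacy rate
s_resid <- 6.142157   # SD of the residuals
y_hat   <- 64.8596    # predicted value at x_star

tstar   <- qt(0.975, df = n - 2)

# note the extra "1 +" compared to the CI for the mean
SE_pred <- s_resid * sqrt(1 + 1/n + (x_star - x_bar)^2 / ((n - 1) * s_x^2))

# 95% prediction interval at x* = 60
(PI <- y_hat + c(-1, 1) * tstar * SE_pred)
```

The "1 +" term accounts for the variability of an individual new observation around the mean, which is why prediction intervals are much wider than confidence intervals for the mean.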
Find the 95% prediction intervals for the life expectancy when the female literacy rate is 40, 60, and 80.
female_literacy_rate_2011
1 40
2 60
3 80
fit lwr upr
1 60.21570 47.63758 72.79382
2 64.85961 52.48072 77.23849
3 69.50351 57.19879 81.80823
We are 95% confident that a new selected country with a 60% female literacy rate will have a life expectancy between 52.5 and 77.2 years.
Create a scatterplot with the regression line, 95% confidence bands, and 95% prediction bands.
augment()
from the broom
package.
model1_pred_bands <- augment(model1, interval = "prediction")
# take a look at new object:
names(model1_pred_bands)
[1] "life_expectancy_years_2011" "female_literacy_rate_2011"
[3] ".fitted" ".lower"
[5] ".upper" ".resid"
[7] ".hat" ".sigma"
[9] ".cooksd" ".std.resid"
# glimpse of select variables of interest:
model1_pred_bands %>%
select(life_expectancy_years_2011, female_literacy_rate_2011,
.fitted:.upper) %>%
glimpse()
Rows: 80
Columns: 5
$ life_expectancy_years_2011 <dbl> 56.7, 76.7, 60.9, 76.9, 76.0, 73.8, 71.0, 7…
$ female_literacy_rate_2011 <dbl> 13.0, 95.7, 58.6, 99.4, 97.9, 99.5, 53.4, 9…
$ .fitted <dbl> 53.94643, 73.14897, 64.53453, 74.00809, 73.…
$ .lower <dbl> 40.91166, 60.81324, 52.14572, 61.65365, 61.…
$ .upper <dbl> 66.98121, 85.48470, 76.92334, 86.36253, 86.…
ggplot(model1_pred_bands,
aes(x=female_literacy_rate_2011, y=life_expectancy_years_2011)) +
geom_point() +
geom_ribbon(aes(ymin = .lower, ymax = .upper), # prediction bands
alpha = 0.2, fill = "red") +
geom_smooth(method=lm) + # confidence bands
labs(title = "SLR with Confidence & Prediction Bands")
This might seem obvious, but make sure not to write your analysis results in a way that implies causation if the study design doesn’t warrant it (such as an observational study).
Beware of spurious correlations: http://www.tylervigen.com/spurious-correlations
Caveat: there is a whole field of statistics/epidemiology on causal inference. https://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf