Day 9: Confidence intervals (4.2)

BSTA 511/611

Week 6
Author
Affiliation

Meike Niederhausen, PhD

OHSU-PSU School of Public Health

Published

October 30, 2023

Load packages

  • Packages need to be loaded every time you restart R or render an Qmd file
# run these every time you open Rstudio
library(tidyverse)    
library(oibiostat)
library(ggridges)
library(janitor)
library(rstatix)
library(knitr)
library(gtsummary)
library(moderndive) # NEW!!

set.seed(456)
  • You can check whether a package has been loaded or not
    • by looking at the Packages tab and
    • seeing whether it has been checked off or not

Last time -> Goals for today

Day 8: Section 4.1

  • Sampling from a population
    • population parameters vs. point estimates
    • sampling variation


  • Sampling distribution of a mean
  • Central Limit Theorem

Day 9: Section 4.2

What are Confidence Intervals?

  • How to calculate CI’s?
  • How to interpret & NOT interpret CI’s?


  • What if we don’t know \(\sigma\)?
  • Student’s t-distribution

Our hypothetical population: YRBSS

Youth Risk Behavior Surveillance System (YRBSS)

  • Yearly survey conducted by the US Centers for Disease Control (CDC)
  • Measures health-related activity in high-school aged youth
  • yrbss contains responses from n = 13,583 participants in 2013 for a subset of the variables included in the complete survey data
library(oibiostat)
data("yrbss")  #load the data
# ?yrbss
dim(yrbss)
[1] 13583    13
names(yrbss)
 [1] "age"                      "gender"                  
 [3] "grade"                    "hispanic"                
 [5] "race"                     "height"                  
 [7] "weight"                   "helmet.12m"              
 [9] "text.while.driving.30d"   "physically.active.7d"    
[11] "hours.tv.per.school.day"  "strength.training.7d"    
[13] "school.night.hours.sleep"

Transform height & weight from metric to to standard

Also, drop missing values and add a column of id values

yrbss2 <- yrbss %>%                 # save new dataset with new name
  mutate(                           # add variables for 
    height.ft = 3.28084*height,     #     height in feet
    weight.lb = 2.20462*weight      #     weight in pounds
  ) %>% 
  drop_na(height.ft, weight.lb) %>% # drop rows w/ missing height/weight values
  mutate(id = 1:nrow(.)) %>%        # add id column
  select(id, height.ft, weight.lb)  # restrict dataset to columns of interest

head(yrbss2)  
  id height.ft weight.lb
1  1  5.675853  186.0038
2  2  5.249344  122.9957
3  3  4.921260  102.9998
4  4  5.150919  147.9961
5  5  5.413386  289.9957
6  6  6.167979  157.0130
dim(yrbss2)
[1] 12579     3
# number of rows deleted that had missing values for height and/or weight:
nrow(yrbss) - nrow(yrbss2) 
[1] 1004

yrbss2: stats for height in feet

summary(yrbss2)
       id          height.ft       weight.lb     
 Min.   :    1   Min.   :4.167   Min.   : 66.01  
 1st Qu.: 3146   1st Qu.:5.249   1st Qu.:124.01  
 Median : 6290   Median :5.512   Median :142.00  
 Mean   : 6290   Mean   :5.549   Mean   :149.71  
 3rd Qu.: 9434   3rd Qu.:5.840   3rd Qu.:167.99  
 Max.   :12579   Max.   :6.923   Max.   :399.01  
(mean_height.ft <- mean(yrbss2$height.ft))
[1] 5.548691
(sd_height.ft <- sd(yrbss2$height.ft))
[1] 0.3434949

Take 10,000 samples of size n = 30 from yrbss2

samp_n30_rep10000 <- yrbss2 %>%
  rep_sample_n(size = 30, 
               reps = 10000,
               replace = FALSE)
samp_n30_rep10000
# A tibble: 300,000 × 4
# Groups:   replicate [10,000]
   replicate    id height.ft weight.lb
       <int> <int>     <dbl>     <dbl>
 1         1  5869      5.15      145.
 2         1  6694      5.41      127.
 3         1  2517      5.74      130.
 4         1  5372      6.07      180.
 5         1  5403      6.07      163.
 6         1  2329      6.07      182.
 7         1  8863      5.25      125.
 8         1  8058      5.84      135.
 9         1   335      6.17      235.
10         1  4698      5.58      124.
# ℹ 299,990 more rows
means_hght_samp_n30_rep10000 <- 
  samp_n30_rep10000 %>% 
  group_by(replicate) %>% 
  summarise(mean_height = 
            mean(height.ft))

means_hght_samp_n30_rep10000
# A tibble: 10,000 × 2
   replicate mean_height
       <int>       <dbl>
 1         1        5.59
 2         2        5.59
 3         3        5.51
 4         4        5.65
 5         5        5.64
 6         6        5.57
 7         7        5.61
 8         8        5.60
 9         9        5.52
10        10        5.64
# ℹ 9,990 more rows

Simulated sampling distribution for n = 30 using 10,000 sample mean heights

ggplot(
  means_hght_samp_n30_rep10000, 
  aes(x = mean_height)) + 
  geom_histogram()

CLT tells us that we can model the sampling distribution of mean heights using a normal distribution.

mu <- 5.55
SE <- 0.34/sqrt(30)
sig <- round(SE, 2)

ggplot(data.frame(x = c(mu-4*sig, mu+4*sig)), aes(x = x)) + 
  stat_function(fun = dnorm, 
                args = list(mean = mu, sd = sig)) + 
  stat_function(fun = dnorm, 
                args = list(mean = mu, sd = sig), 
                xlim = c(mu-4*sig, mu-2*sig), 
                geom = "area", fill = "darkblue") +
  stat_function(fun = dnorm, 
                args = list(mean = mu, sd = sig), 
                xlim = c(mu+2*sig, mu+4*sig), 
                geom = "area", fill = "darkblue") +
  stat_function(fun = dnorm, 
                args = list(mean = mu, sd = sig), 
                xlim = c(mu-2*sig, mu-1*sig), 
                geom = "area", fill = "darkgreen") +
  stat_function(fun = dnorm, 
                args = list(mean = mu, sd = sig), 
                xlim = c(mu+1*sig, mu+2*sig), 
                geom = "area", fill = "darkgreen")+
  stat_function(fun = dnorm, 
                args = list(mean = mu, sd = sig), 
                xlim = c(mu-1*sig, mu+1*sig), 
                geom = "area", fill = "orange") +
  scale_x_continuous(name ="mean height (ft)", 
                    breaks=c(mu-2*sig,mu-1*sig,mu, mu+1*sig, mu+2*sig))  +
  labs(title = "Sampling distribution", y = "") +
  scale_y_continuous(labels = NULL, breaks = NULL)

Confidence interval (CI) for the mean \(\mu\)

\[\overline{x}\ \pm\ z^*\times \text{SE}\]

where

  • \(SE = \frac{\sigma}{\sqrt{n}}\)
  • \(z^*\) depends on the confidence level
    • For a 95% CI, \(z^*\) is chosen such that 95% of the standard normal curve is between \(-z^*\) and \(z^*\)
qnorm(.975)
[1] 1.959964
qnorm(.995)
[1] 2.575829

When can this be applied?

Example: CI for mean height

  • A random sample of 30 high schoolers has mean height 5.6 ft.
  • Find the 95% confidence interval for the population mean, assuming that the population standard deviation is 0.34 ft.

How to interpret a CI?

Simulating Confidence Intervals: http://www.rossmanchance.com/applets/ConfSim.html

Actual interpretation:

  • If we were to
    • repeatedly take random samples from a population and
    • calculate a 95% CI for each random sample,
  • then we would expect 95% of our CI’s to contain the true population parameter \(\mu\).

What we typically write as “shorthand”:

  • We are 95% confident that (the 95% confidence interval) captures the value of the population parameter.

WRONG interpretation:

  • There is a 95% chance that (the 95% confidence interval) captures the value of the population parameter.
    • For one CI on its own, it either does or doesn’t contain the population parameter with probability 0 or 1. We just don’t know which!

Interpretation of our heights CI

Correct interpretation:

  • We are 95% confident that the mean height for high schoolers is between 5.43 and 5.67 feet.

WRONG:

  • There is a 95% chance that the mean height for high schoolers is between 5.43 and 5.67 feet.

What if we don’t know \(\sigma\) ?

  • In real life, we don’t know what the population sd is ( \(\sigma\) )

  • If we replace \(\sigma\) with \(s\) in the SE formula, we add in additional variability to the SE! \[\frac{\sigma}{\sqrt{n}} ~~~~\textrm{vs.} ~~~~ \frac{s}{\sqrt{n}}\]

  • Thus when using \(s\) instead of \(\sigma\) when calculating the SE, we need a different probability distribution with thicker tails than the normal distribution.

    • In practice this will mean using a different value than 1.96 when calculating the CI.

Student’s t-distribution

The Student’s t-distribution:

  • Is bell shaped and symmetric with mean = 0.
  • Its tails are a thicker than that of a normal distribution
    • The “thickness” depends on its degrees of freedom: \(df = n–1\) , where n = sample size.
  • As the degrees of freedom (sample size) increase,
    • the tails are less thick, and
    • the t-distribution is more like a normal distribution
    • in theory, with an infinite sample size the t-distribution is a normal distribution.

Calculating the CI for the population mean using \(s\)

CI for \(\mu\):

\[\bar{x} \pm t^*\cdot\frac{s}{\sqrt{n}}\]

where \(t^*\) is determined by the t-distribution and dependent on the
df = \(n-1\) and the confidence level

  • qt gives the quartiles for a t-distribution. Need to specify
    • the percent under the curve to the left of the quartile
    • the degrees of freedom = n-1
  • Note in the R output to the right that \(t^*\) gets closer to 1.96 as the sample size increases.
qt(.975, df=9)  # df = n-1
[1] 2.262157
qt(.975, df=49)
[1] 2.009575
qt(.975, df=99)
[1] 1.984217
qt(.975, df=999)
[1] 1.962341

Example: CI for mean height (revisited)

  • A random sample of 30 high schoolers has mean height 5.6 ft and standard deviation 0.34 ft.
  • Find the 95% confidence interval for the population mean.

\(z\) vs \(t\)??
(& important comment about Chapter 4 of textbook)

Textbook’s rule of thumb

  • (Ch 4) If \(n \geq 30\) and population distribution not strongly skewed:
    • Use normal distribution
    • No matter if using \(\sigma\) or \(s\) for the \(SE\)
    • If there is skew or some large outliers, then need \(n \geq 50\)
  • (Ch 5) If \(n < 30\) and data approximately symmetric with no large outliers:
    • Use Student’s t-distribution

BSTA 511 rule of thumb

  • Use normal distribution ONLY if know \(\sigma\)
    • If using \(s\) for the \(SE\), then use the Student’s t-distribution

For either case, can apply if either

  • \(n \geq 30\) and population distribution not strongly skewed
    • If there is skew or some large outliers, then \(n \geq 50\) gives better estimates
  • \(n < 30\) and data approximately symmetric with no large outliers

If do not know population distribution, then check the distribution of the data.