Adult literacy rate is the percentage of people ages 15 and above who can, with understanding, read and write a short, simple statement on their everyday life. Source: http://data.uis.unesco.org/
At least basic water source (%) = the percentage of people using at least basic water services. This indicator encompasses both people using basic water services and those using safely managed water services. Basic drinking water services are defined as drinking water from an improved source, provided collection time is not more than 30 minutes for a round trip. Improved water sources include piped water, boreholes or tubewells, protected dug wells, protected springs, and packaged or delivered water.
Rows: 194 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, water_2011_quart
dbl (3): life_expectancy_years_2011, female_literacy_rate_2011, water_basic_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 13
variable n min max median q1 q3 iqr mad mean sd se
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 life_expec… 80 48 81.8 72.4 65.9 75.8 9.95 6.30 69.9 7.95 0.889
2 female_lit… 80 13 99.8 91.6 71.0 98.0 27.0 11.4 81.7 22.0 2.45
# ℹ 1 more variable: ci <dbl>
Important
Removing the rows with missing data was not needed to run the regression model.
I did this step since later we will be calculating the standard deviations of the explanatory and response variables for just the values included in the regression model. It’ll be easier to do this if we remove the missing values now.
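A minimal sketch of that cleaning step, assuming the data were read into a data frame called `gapm` (the name is an assumption) and using `drop_na()` from tidyr:

```r
library(tidyverse)

# Assumption: `gapm` holds the full data set read in above.
# Keep only rows with non-missing values for the two model variables.
gapm_sub <- gapm %>%
  drop_na(life_expectancy_years_2011, female_literacy_rate_2011)
```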
Regression line = best-fit line
$$\hat{y} = b_0 + b_1 \cdot x$$
$\hat{y}$ is the predicted outcome for a specific value of $x$.
$b_0$ is the intercept.
$b_1$ is the slope of the line, i.e., the increase in $\hat{y}$ for every increase of one (unit increase) in $x$.
slope = rise over run
`geom_smooth()` using formula = 'y ~ x'
Intercept
The expected outcome for the $y$-variable when the $x$-variable is 0.
Slope
For every increase of 1 unit in the $x$-variable, there is an expected increase of, on average, $b_1$ units in the $y$-variable.
We only say that there is an expected increase and not necessarily a causal increase.
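For a concrete version of these interpretations, here is a hedged sketch of how the line could be fit in R; the data frame name `gapm_sub` is an assumption, while `model1` is the model object used throughout these slides:

```r
# Fit the least-squares line of life expectancy on female literacy rate
model1 <- lm(life_expectancy_years_2011 ~ female_literacy_rate_2011,
             data = gapm_sub)

coef(model1)  # b0 (intercept) and b1 (slope)
```

With the estimates reported later in these slides (b0 ≈ 50.93, b1 ≈ 0.232): a country with a female literacy rate of 0% has an expected life expectancy of about 50.9 years, and each 1 percentage point increase in female literacy rate is associated with an expected increase of about 0.23 years of life expectancy, on average.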
Body measurements from 507 physically active individuals
in their 20s or early 30s
within the normal weight range.
Examples of Normal QQ plots (2/5)
Skewed right distribution
Examples of Normal QQ plots (3/5)
Long tails in distribution
Examples of Normal QQ plots (4/5)
Bimodal distribution
Examples of Normal QQ plots (5/5)
QQ plot of residuals of model1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aug1, aes(sample = .resid)) +
  stat_qq() +       # points
  stat_qq_line()    # line
Compare to randomly generated Normal QQ plots
How “good” we can expect a QQ plot to look depends on the sample size.
The QQ plots on the next slides are randomly generated
using random samples from actual standard normal distributions, $N(0, 1)$.
Thus, all the points in the QQ plots should theoretically fall on a line
However, there is sampling variability…
Randomly generated Normal QQ plots: n=100
Note that `stat_qq_line()` doesn't work with randomly generated samples, and thus the code below manually creates the line that the points should be on (which is $y = x$ in this case).
samplesize <- 100

rand_qq1 <- ggplot() +
  stat_qq(aes(sample = rnorm(samplesize))) +
  geom_abline(intercept = 0, slope = 1, color = "blue")  # line y = x

rand_qq2 <- ggplot() +
  stat_qq(aes(sample = rnorm(samplesize))) +
  geom_abline(intercept = 0, slope = 1, color = "blue")

rand_qq3 <- ggplot() +
  stat_qq(aes(sample = rnorm(samplesize))) +
  geom_abline(intercept = 0, slope = 1, color = "blue")

rand_qq4 <- ggplot() +
  stat_qq(aes(sample = rnorm(samplesize))) +
  geom_abline(intercept = 0, slope = 1, color = "blue")
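To show the four plots in one grid, one option (an assumption, not part of the original code) is the patchwork package:

```r
library(patchwork)  # assumption: used only for arranging the plots

(rand_qq1 + rand_qq2) / (rand_qq3 + rand_qq4)  # 2 x 2 grid
```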
Compare CI bounds below with the ones in the regression table above.
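The bounds below use objects `b1`, `SE_b1`, and `tstar`; a sketch of how they could have been defined (the construction of `model1_b1` via `filter()` is an assumption, and the df of 80 − 2 = 78 matches the 80 complete pairs summarized above):

```r
library(broom)
library(dplyr)

# regression table restricted to the b1 (slope) row
model1_b1 <- tidy(model1) %>%
  filter(term == "female_literacy_rate_2011")

b1    <- model1_b1$estimate      # estimated slope
SE_b1 <- model1_b1$std.error     # standard error of the slope
tstar <- qt(0.975, df = 80 - 2)  # t* multiplier for a 95% CI
```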
(CI_LB <- b1 - tstar*SE_b1)
[1] 0.1695284
(CI_UB <- b1 + tstar*SE_b1)
[1] 0.2948619
Hypothesis test for population slope
The test statistic for $\beta_1$ is
$$t = \frac{b_1 - \beta_1}{SE_{b_1}} = \frac{b_1 - 0}{SE_{b_1}} = \frac{b_1}{SE_{b_1}}$$
when we assume $H_0: \beta_1 = 0$ is true.
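Plugging the slope estimate and its standard error from the regression table below into this formula gives
$$t = \frac{0.2321951}{0.0314774} \approx 7.38,$$
which matches the `statistic` column of the table.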
tidy(model1, conf.int = TRUE) %>% gt()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|------|----------|-----------|-----------|---------|----------|-----------|
| (Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 | 45.6314348 | 56.2243615 |
| female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 | 0.1695284 | 0.2948619 |
Calculate the test statistic using the values in the regression table:
# recall model1_b1 is the regression table restricted to the b1 row
(TestStat <- model1_b1$estimate / model1_b1$std.error)
[1] 7.376557
Compare this test statistic value to the one from the regression table above
$p$-value for testing population slope
As usual, the $p$-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic, assuming the null hypothesis $H_0$ is true.
To calculate the $p$-value, we need to know the probability distribution of the test statistic (the null distribution) assuming $H_0$ is true.
Statistical theory tells us that the test statistic $t$ can be modeled by a $t$-distribution with $df = n - 2$.
Recall that this is a 2-sided test:
(pv <- 2 * pt(TestStat, df = 80 - 2, lower.tail = FALSE))
[1] 1.501286e-10
Compare the $p$-value to the one from the regression table below
tidy(model1, conf.int = TRUE) %>% gt()  # compare the p-value calculated above to the p-value in the table
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|------|----------|-----------|-----------|---------|----------|-----------|
| (Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 | 45.6314348 | 56.2243615 |
| female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 | 0.1695284 | 0.2948619 |
Prediction (& inference)
Prediction for mean response
Prediction for new individual observation
Prediction with regression line
| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 |
| female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 |
What is the predicted life expectancy for a country with female literacy rate 60%?
(y_60 <- 50.9 + 0.232 * 60)
[1] 64.82
How do we interpret the predicted value?
How variable is it?
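The same prediction can be obtained from the fitted model without rounding the coefficients; a small sketch using base R's `predict()` on `model1`:

```r
new_country <- data.frame(female_literacy_rate_2011 = 60)
predict(model1, newdata = new_country)  # predicted life expectancy at 60% literacy
```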
Prediction with regression line
Recall the population model:
$$y = \beta_0 + \beta_1 \cdot x + \varepsilon$$
line + random "noise"
with $\varepsilon \sim N(0, \sigma)$, where $\sigma$ is the variability (SD) of the residuals.
When we take the expected value, at a given value $x^*$, we have that the predicted response is the average expected response at $x = x^*$:
$$E[Y \mid x = x^*] = \beta_0 + \beta_1 \cdot x^*, \quad \text{estimated by} \quad \hat{\mu}_{Y|x^*} = b_0 + b_1 \cdot x^*$$
`geom_smooth()` using formula = 'y ~ x'
These are the points on the regression line.
The mean response has variability, and we can calculate a CI for it, for every value of $x^*$.
CI for mean response
The CI for the mean response is calculated using
$$\hat{y} \pm t^*_{n-2} \cdot s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1) s_x^2}}$$
where
$\hat{y}$ is the predicted value at the specified point $x^*$ of the explanatory variable
$s_e$ is the SD of the residuals
$n$ is the sample size, or the number of (complete) pairs of points
$\bar{x}$ is the sample mean of the explanatory variable
$s_x$ is the sample SD of the explanatory variable
Recall that $t^*_{n-2}$ is calculated using `qt()` and depends on the confidence level.
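A sketch of this calculation in R for $x^* = 60$, assuming `model1` is the fitted model and `gapm_sub` is the data used to fit it (both names are assumptions); `sigma()` and `nobs()` extract the residual SD and sample size:

```r
x_star <- 60
n      <- nobs(model1)      # number of complete pairs
s_e    <- sigma(model1)     # SD of the residuals
x_bar  <- mean(gapm_sub$female_literacy_rate_2011)
s_x    <- sd(gapm_sub$female_literacy_rate_2011)

y_hat <- predict(model1,
                 newdata = data.frame(female_literacy_rate_2011 = x_star))
tstar <- qt(0.975, df = n - 2)
SE_mu <- s_e * sqrt(1 / n + (x_star - x_bar)^2 / ((n - 1) * s_x^2))

y_hat + c(-1, 1) * tstar * SE_mu  # 95% CI for the mean response
```

The built-in shortcut `predict(model1, newdata = ..., interval = "confidence")` returns the same interval directly.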
Example: CI for mean response
Find the 95% CI for the mean life expectancy when the female literacy rate is 60.
We are 95% confident that a newly selected country with a 60% female literacy rate will have a life expectancy between 52.5 and 77.2 years.
Prediction bands vs. confidence bands (1/2)
Create a scatterplot with the regression line, 95% confidence bands, and 95% prediction bands.
First create a data frame with the original data points (both x and y values), their respective predicted values, and their respective prediction intervals.
We can do this with `augment()` from the broom package.
model1_pred_bands <- augment(model1, interval = "prediction")

# take a look at the new object:
names(model1_pred_bands)
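A sketch of the plot itself, under the assumption that `model1_pred_bands` carries the original variable names plus the `.lower`/`.upper` prediction bounds from `augment()`; `geom_smooth()` supplies the regression line with its 95% confidence band:

```r
ggplot(model1_pred_bands,
       aes(x = female_literacy_rate_2011, y = life_expectancy_years_2011)) +
  geom_point() +
  # 95% prediction band from the augment() output
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.2) +
  # regression line with its 95% confidence band
  geom_smooth(method = "lm", se = TRUE)
```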
This might seem obvious, but make sure not to write your analysis results in a way that implies causation if the study design doesn't warrant it (such as in an observational study).