Rows: 20 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Group
dbl (1): Taps
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Question: Is there evidence to support that drinking caffeine increases the number of finger taps/min?
Null and alternative hypotheses in words
Include as much context as possible
$H_0$: The population difference in mean finger taps/min between the caffeine and control groups is …
$H_A$: The population difference in mean finger taps/min between the caffeine and control groups is …
Null and alternative hypotheses in symbols
Step 3: Test statistic (part 1)
Recall that in general the test statistic has the form:

$$\text{test statistic} = \frac{\text{point estimate} - \text{null value}}{SE(\text{point estimate})}$$

Thus, for a two-sample independent means test, we have:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_0}{SE_{\bar{x}_1 - \bar{x}_2}}$$

What is the formula for $SE_{\bar{x}_1 - \bar{x}_2}$?
What is the probability distribution of the test statistic?
What assumptions need to be satisfied?
What distribution does $\bar{X}_1 - \bar{X}_2$ have?
Let $\bar{X}_1$ and $\bar{X}_2$ be the means of random samples from two independent groups, with parameters shown in the table:

|             | Group 1    | Group 2    |
|-------------|------------|------------|
| sample size | $n_1$      | $n_2$      |
| pop mean    | $\mu_1$    | $\mu_2$    |
| pop sd      | $\sigma_1$ | $\sigma_2$ |

Some theoretical statistics:
If $X$ and $Y$ are independent normal r.v.'s, then $X - Y$ is also normal.
What is the mean of $\bar{X}_1 - \bar{X}_2$?
What is the standard deviation of $\bar{X}_1 - \bar{X}_2$?
Step 3: Test statistic (part 2)

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

$\bar{x}_1, \bar{x}_2$ are the sample means
$\mu_0$ is the mean value specified in $H_0$
$s_1, s_2$ are the sample SD's
$n_1, n_2$ are the sample sizes

Statistical theory tells us that $t$ follows a Student's t-distribution with
$df$ = smaller of $n_1 - 1$ and $n_2 - 1$
this is a conservative estimate (smaller than the actual $df$)
Assumptions:
Independent observations & samples
The observations were collected independently.
In particular, the observations from the two groups were not paired in any meaningful way.
Approximately normal samples or big n’s
The distributions of the samples should be approximately normal
or both their sample sizes should be at least 30.
Step 3: Test statistic (part 3)

| Group      | variable | n  | mean  | sd    |
|------------|----------|----|-------|-------|
| Caffeine   | Taps     | 10 | 248.3 | 2.214 |
| NoCaffeine | Taps     | 10 | 244.8 | 2.394 |

$$t = \frac{(248.3 - 244.8) - 0}{\sqrt{\dfrac{2.214^2}{10} + \dfrac{2.394^2}{10}}} = 3.394$$
Based on the value of the test statistic, do you think we are going to reject or fail to reject $H_0$?
Step “3b”: Assumptions satisfied?
Assumptions:
Independent observations & samples
The observations were collected independently.
In particular, the observations from the two groups were not paired in any meaningful way.
Approximately normal samples or big n’s
The distributions of the samples should be approximately normal
or both their sample sizes should be at least 30.
[Figure: histograms of Taps for each group]
Step 4: p-value
The p-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis is true.
Calculate the p-value:
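One way to do this by hand in R (a sketch, using the group summary statistics from the earlier table and the conservative $df$ = smaller of $n_1 - 1$ and $n_2 - 1$ = 9):

```r
# Summary statistics for the caffeine taps example
xbar1 <- 248.3; s1 <- 2.214; n1 <- 10  # Caffeine group
xbar2 <- 244.8; s2 <- 2.394; n2 <- 10  # NoCaffeine group

# Standard error of the difference in sample means
se <- sqrt(s1^2 / n1 + s2^2 / n2)

# Test statistic (null value mu_0 = 0)
t_stat <- (xbar1 - xbar2 - 0) / se
t_stat  # 3.394

# One-sided (greater) p-value with conservative df = min(n1, n2) - 1 = 9
pt(t_stat, df = min(n1, n2) - 1, lower.tail = FALSE)  # 0.00397
```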
Step 5: Conclusion to hypothesis test
Recall the $p$-value = 0.00397.
Use $\alpha$ = 0.05.
Do we reject or fail to reject $H_0$?
Conclusion statement:
Stats class conclusion
There is sufficient evidence that the (population) difference in mean finger taps/min with vs. without caffeine is greater than 0 ($p$-value = 0.004).
More realistic manuscript conclusion:
The mean finger taps/min were 244.8 (SD = 2.4) and 248.3 (SD = 2.2) for the control and caffeine groups, and the increase of 3.5 taps/min was statistically discernible ($p$-value = 0.004).
95% CI for the mean difference in finger taps/min
| Group      | variable | n  | mean  | sd    |
|------------|----------|----|-------|-------|
| Caffeine   | Taps     | 10 | 248.3 | 2.214 |
| NoCaffeine | Taps     | 10 | 244.8 | 2.394 |
CI for $\mu_1 - \mu_2$:

$$(\bar{x}_1 - \bar{x}_2) \pm t^*\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}} = 3.5 \pm 2.262\sqrt{\dfrac{2.214^2}{10} + \dfrac{2.394^2}{10}} = (1.167,\ 5.833)$$
Interpretation:
We are 95% confident that the (population) difference in mean finger taps/min between the caffeine and control groups is between 1.167 and 5.833 taps/min.
Based on the CI, is there evidence that drinking caffeine made a difference in finger taps/min? Why or why not?
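The interval in the interpretation above can be reproduced by hand in R (a sketch using the same summary statistics and the conservative $df = 9$):

```r
# Summary statistics for the caffeine taps example
xbar1 <- 248.3; s1 <- 2.214; n1 <- 10
xbar2 <- 244.8; s2 <- 2.394; n2 <- 10

se <- sqrt(s1^2 / n1 + s2^2 / n2)         # SE of the difference in means
tstar <- qt(0.975, df = min(n1, n2) - 1)  # critical value for a 95% CI, df = 9

ci <- (xbar1 - xbar2) + c(-1, 1) * tstar * se
ci  # approximately 1.167 and 5.833
```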
R: 2-sample t-test (with long data)
The CaffTaps data are in a long format, meaning that
all of the outcome values are in one column and
another column indicates which group the values are from
This is a common format for data from multiple samples, especially if the sample sizes are different.
(Taps_2ttest <- t.test(formula = Taps ~ Group, alternative = "greater", data = CaffTaps))
Welch Two Sample t-test
data: Taps by Group
t = 3.3942, df = 17.89, p-value = 0.001628
alternative hypothesis: true difference in means between group Caffeine and group NoCaffeine is greater than 0
95 percent confidence interval:
1.711272 Inf
sample estimates:
mean in group Caffeine mean in group NoCaffeine
248.3 244.8
tidy the t.test output
# use tidy command from broom package for briefer output that's a tibble
tidy(Taps_2ttest) %>% gt()
| estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| 3.5 | 248.3 | 244.8 | 3.394168 | 0.001627703 | 17.89012 | 1.711272 | Inf | Welch Two Sample t-test | greater |
Pull the p-value:
tidy(Taps_2ttest)$p.value # we can pull specific values from the tidy output
[1] 0.001627703
R: 2-sample t-test (with wide data)
# make CaffTaps data wide: pivot_wider needs an ID column so that it
# knows how to "match" values from the Caffeine and NoCaffeine groups
CaffTaps_wide <- CaffTaps %>%
  mutate(id = rep(1:10, 2)) %>%  # "fake" IDs for pivot_wider step
  pivot_wider(names_from = "Group", values_from = "Taps")
glimpse(CaffTaps_wide)
t.test(x = CaffTaps_wide$Caffeine, y = CaffTaps_wide$NoCaffeine, alternative = "greater") %>%
  tidy() %>%
  gt()
| estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| 3.5 | 248.3 | 244.8 | 3.394168 | 0.001627703 | 17.89012 | 1.711272 | Inf | Welch Two Sample t-test | greater |
Why are the df’s in the R output different?
From many slides ago:
Statistical theory tells us that $t$ follows a Student's t-distribution with
$df$ = smaller of $n_1 - 1$ and $n_2 - 1$
this is a conservative estimate (smaller than the actual $df$)
The actual degrees of freedom are calculated using Satterthwaite's method:

$$\nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$
Verify the p-value in the R output using $df$ = 17.89012:
pt(3.3942, df = 17.89012, lower.tail = FALSE)
[1] 0.001627588
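As a cross-check, Satterthwaite's degrees of freedom can be computed directly from the group summary statistics (a sketch):

```r
s1 <- 2.214; n1 <- 10  # Caffeine group
s2 <- 2.394; n2 <- 10  # NoCaffeine group

v1 <- s1^2 / n1  # estimated variance of the group 1 sample mean
v2 <- s2^2 / n2  # estimated variance of the group 2 sample mean

# Satterthwaite (Welch) degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
df_welch  # approximately 17.89, matching the t.test() output
```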
Pooled standard deviation estimate
Sometimes we have reasons to believe that the population SD’s from the two groups are equal, such as when randomizing participants to two groups
In this case we can use a pooled SD:

$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

$n_1$, $n_2$ are the sample sizes, and
$s_1$, $s_2$ are the sample standard deviations
of the two groups

We use the pooled SD instead of $s_1$ and $s_2$ when calculating the standard error.

Test statistic with pooled SD:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_0}{s_{pooled}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

CI with pooled SD:

$$(\bar{x}_1 - \bar{x}_2) \pm t^* \cdot s_{pooled}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

The $t$ distribution degrees of freedom are now $df = n_1 + n_2 - 2$.
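These pooled formulas can be sketched by hand in R with the summary statistics from earlier; note that with equal group sizes the test statistic matches the unpooled one, but $df = n_1 + n_2 - 2 = 18$:

```r
xbar1 <- 248.3; s1 <- 2.214; n1 <- 10
xbar2 <- 244.8; s2 <- 2.394; n2 <- 10

# Pooled standard deviation
s_pool <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Test statistic and one-sided p-value with df = n1 + n2 - 2
t_stat <- (xbar1 - xbar2) / (s_pool * sqrt(1 / n1 + 1 / n2))
p_pool <- pt(t_stat, df = n1 + n2 - 2, lower.tail = FALSE)
p_pool  # approximately 0.0016, matching t.test(..., var.equal = TRUE)
```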
R: 2-sample t-test with pooled SD
# t-test with pooled SD
t.test(formula = Taps ~ Group,
       alternative = "greater",
       var.equal = TRUE,  # pooled SD
       data = CaffTaps) %>%
  tidy() %>%
  gt()
| estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| 3.5 | 248.3 | 244.8 | 3.394168 | 0.001616497 | 18 | 1.711867 | Inf | Two Sample t-test | greater |
# t-test without pooled SD
t.test(formula = Taps ~ Group,
       alternative = "greater",
       var.equal = FALSE,  # default, NOT pooled SD
       data = CaffTaps) %>%
  tidy() %>%
  gt()
| estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| 3.5 | 248.3 | 244.8 | 3.394168 | 0.001627703 | 17.89012 | 1.711272 | Inf | Welch Two Sample t-test | greater |
Similar output in this case - why??
What’s next?
CI’s and hypothesis tests for different scenarios:
| Day | Book | Population parameter | Symbol | Point estimate | Symbol | SE |
|---|---|---|---|---|---|---|
| 10 | 5.1 | Pop mean | $\mu$ | Sample mean | $\bar{x}$ | $\frac{s}{\sqrt{n}}$ |
| 10 | 5.2 | Pop mean of paired diff | $\mu_d$ or $\delta$ | Sample mean of paired diff | $\bar{x}_d$ | $\frac{s_d}{\sqrt{n}}$ |
| 11 | 5.3 | Diff in pop means | $\mu_1 - \mu_2$ | Diff in sample means | $\bar{x}_1 - \bar{x}_2$ | $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ or pooled |
| 12 | 8.1 | Pop proportion | $p$ | Sample prop | $\hat{p}$ | ??? |
| 12 | 8.2 | Diff in pop proportions | $p_1 - p_2$ | Diff in sample proportions | $\hat{p}_1 - \hat{p}_2$ | ??? |
Power and sample size calculations
Critical values & rejection region
Type I & II errors
Power
How to calculate sample size needed for a study?
Materials are from
Section 4.3.4 Decision errors
Section 5.4 Power calculations for a difference of means
plus notes
Critical values
Critical values are the cutoff values that determine whether a test statistic is statistically significant or not.
If a test statistic is greater in absolute value than the critical value, we reject $H_0$.
Critical values are determined by
the significance level ,
whether a test is 1- or 2-sided, &
the probability distribution being used to calculate the p-value (such as normal or t-distribution).
The critical values in the figure should look very familiar!
Where have we used these before?
How can we calculate the critical values using R?
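One way (a sketch): critical values are quantiles of the null distribution, so `qnorm()` or `qt()` give them directly.

```r
# Two-sided test with alpha = 0.05: cut off 2.5% in each tail
qnorm(c(0.025, 0.975))       # standard normal: -1.96 and 1.96

# Same idea with a t-distribution, e.g. df = 9
qt(c(0.025, 0.975), df = 9)  # -2.26 and 2.26

# One-sided (greater) test with alpha = 0.05
qnorm(0.95)                  # 1.64
```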
Rejection region
If the absolute value of the test statistic is greater than the critical value, we reject $H_0$.
In this case the test statistic is in the rejection region.
Power = P(correctly rejecting the null hypothesis)
Power is also called the
true positive rate,
probability of detection, or
the sensitivity of a test
Power vs. Type II error
Power = 1 - P(Type II error) = 1 - $\beta$
Thus as $\beta$ = P(Type II error) decreases, the power increases.
P(Type II error) decreases as the mean of the alternative population shifts further away from the mean of the null population (effect size gets bigger).
Typically want at least 80% power; 90% power is good
Example calculating power
Suppose the mean of the null population is 0 ($\mu_0 = 0$) with standard error 1.
Find the power of a 2-sided test if the actual $\mu = 3$, assuming the SE doesn't change.
Power = P(reject $H_0$ when the alternative population with $\mu = 3$ is true)
When $\alpha$ = 0.05, we reject $H_0$ when the test statistic z is at least 1.96 in absolute value.
Thus for $\mu = 3$ we need to calculate $P(Z \le -1.96) + P(Z \ge 1.96)$, where $Z \sim N(3, 1)$:
# left tail + right tail:
pnorm(-1.96, mean = 3, sd = 1, lower.tail = TRUE) +
  pnorm(1.96, mean = 3, sd = 1, lower.tail = FALSE)
[1] 0.8508304
The left tail probability pnorm(-1.96, mean=3, sd=1, lower.tail=TRUE) is essentially 0 in this case.
Note that this power calculation specified the value of the SE instead of the standard deviation and sample size individually.
Sample size calculation for testing one mean
Recall in our body temperature example that $\bar{x} = 98.25$°F and $s = 0.73$°F.
The p-value from the hypothesis test was highly significant (very small).
What would the sample size need to be for 80% power?
Calculate $n$,
given $\alpha$, power ($1 - \beta$), "true" alternative mean $\mu_A = 98.25$, and null $\mu_0 = 98.6$,
assuming the test statistic is normal (instead of t-distribution):
pwr.t.test(n = NULL, d = NULL, sig.level = 0.05, power = NULL,
           type = c("two.sample", "one.sample", "paired"),
           alternative = c("two.sided", "less", "greater"))
d is Cohen's d effect size: $d = \dfrac{\mu_0 - \mu_A}{\sigma}$
Specify all parameters except for the sample size:
library(pwr)
t.n <- pwr.t.test(d = (98.6 - 98.25) / 0.73,
                  sig.level = 0.05,
                  power = 0.80,
                  type = "one.sample")
t.n
One-sample t test power calculation
n = 36.11196
d = 0.4794521
sig.level = 0.05
power = 0.8
alternative = two.sided
plot(t.n)
pwr: power for one mean test
pwr.t.test(n = NULL, d = NULL, sig.level = 0.05, power = NULL,
           type = c("two.sample", "one.sample", "paired"),
           alternative = c("two.sided", "less", "greater"))
d is Cohen's d effect size: $d = \dfrac{\mu_0 - \mu_A}{\sigma}$
Specify all parameters except for the power:
t.power <- pwr.t.test(d = (98.6 - 98.25) / 0.73,
                      sig.level = 0.05,
                      # power = 0.80,
                      n = 130,
                      type = "one.sample")
t.power
One-sample t test power calculation
n = 130
d = 0.4794521
sig.level = 0.05
power = 0.9997354
alternative = two.sided
plot(t.power)
pwr: Two-sample t-test: sample size
pwr.t.test(n = NULL, d = NULL, sig.level = 0.05, power = NULL,
           type = c("two.sample", "one.sample", "paired"),
           alternative = c("two.sided", "less", "greater"))
d is Cohen's d effect size: $d = \dfrac{\mu_1 - \mu_2}{\sigma}$
Example: Suppose the data collected for the caffeine taps study were pilot data for a larger study. Investigators want to know what sample size they would need to detect a 2 point difference between the two groups. Assume the SD in both groups is 2.3.
Specify all parameters except for the sample size:
t2.n <- pwr.t.test(d = 2 / 2.3,
                   sig.level = 0.05,
                   power = 0.80,
                   type = "two.sample")
t2.n
Two-sample t test power calculation
n = 21.76365
d = 0.8695652
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
plot(t2.n)
pwr: Two-sample t-test: power
pwr.t.test(n = NULL, d = NULL, sig.level = 0.05, power = NULL,
           type = c("two.sample", "one.sample", "paired"),
           alternative = c("two.sided", "less", "greater"))
d is Cohen's d effect size: $d = \dfrac{\mu_1 - \mu_2}{\sigma}$
Example: Suppose the data collected for the caffeine taps study were pilot data for a larger study. Investigators want to know what sample size they would need to detect a 2 point difference between the two groups. Assume the SD in both groups is 2.3.
Specify all parameters except for the power:
t2.power <- pwr.t.test(d = 2 / 2.3,
                       sig.level = 0.05,
                       # power = 0.80,
                       n = 22,
                       type = "two.sample")
t2.power
Two-sample t test power calculation
n = 22
d = 0.8695652
sig.level = 0.05
power = 0.8044288
alternative = two.sided
NOTE: n is number in *each* group
plot(t2.power)
What information do we need for a power (or sample size) calculation?
There are 4 pieces of information:
Level of significance $\alpha$
Usually fixed to 0.05
Power
Ideally at least 0.80
Sample size
Effect size (expected change)
Given any 3 pieces of information, we can solve for the 4th.
pwr.t.test(d = (98.6 - 98.25) / 0.73,
           sig.level = 0.05,
           # power = 0.80,
           n = 130,
           type = "one.sample")
One-sample t test power calculation
n = 130
d = 0.4794521
sig.level = 0.05
power = 0.9997354
alternative = two.sided
More software for power and sample size calculations: PASS
PASS is a very powerful (& expensive) software that does power and sample size calculations for many advanced statistical modeling techniques.
Even if you don’t have access to PASS, their documentation is very good and free online.