HW 7: BSTA 511/611 F24

Author

Your name here - update this!!!!

Published

December 7, 2024

Due 12/7/24

Download the .qmd file for this assignment from https://github.com/niederhausen/BSTA_511_F24/blob/main/homework/HW_7_F24_bsta511.qmd

Graded exercises

The exercises listed below will be graded for this assignment. You are strongly encouraged to complete the entire assignment. You will receive feedback on exercises you turn in that are not being graded.

Book exercises
- 6.10, 6.28, 6.32
R exercises
- R1: Palmer Penguins SLR
- R2: Life expectancy vs. CO2 emissions, parts (a)-(j)
Nonparametric exercises
- NPT 1, NPT 2, NPT 3

Directions

Important

Complete all exercises in this assignment using Quarto.
I highly recommend using LaTeX to format equations.
- See the .qmd files from class notes for LaTeX code to make it easier to show your work in computations.
- For instructions on creating equations in the Visual editor, check out https://quarto.org/docs/get-started/authoring/rstudio.html#equations.
- Also check out examples of LaTeX formatting for statistics created by recent biostats alum Ariel Weingarten.
- If you have difficulty rendering the LaTeX equations, I recommend installing and running the R package tinytex. See this website for instructions.

Please upload your homework to Sakai. Upload both your .qmd code file and the rendered .html file.
- Use the assignment .qmd file linked to above as a template for your own assignment.
Please always use the following naming convention for submitting your files:
- Lastname_Firstname_HWx.qmd, such as Niederhausen_Meike_HW2.qmd
- Lastname_Firstname_HWx.html, such as Niederhausen_Meike_HW2.html
For each question, make sure to show all of your work.
- This includes all code and resulting output in the html file to support your answers for exercises requiring work done in R (including any arithmetic calculations).
- For non-calculation questions, this includes an explanation of your answer (why did you choose your answer?).
For each question, include a sentence summarizing the answer for that question in the context of the research question.

Tip

It is a good idea to try rendering your document from time to time as you go along! Rendering automatically saves your .qmd file, and rendering frequently helps you catch errors more quickly.

Book exercises

6.2 Identify relationships, Part II

6.6 Over-under, Part II

6.10 Guppies, Part I

6.12 Trends in the residuals

6.20 Guppies, Part IV

6.26 (a, e) Helmets and lunches

Skip parts (b)-(d). To complete (e), use that the slope from part (b) is −0.537, and the intercept is 55.34.
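With the slope and intercept given above, the regression line can be written as follows, where \(x\) and \(y\) are generic placeholders for the explanatory and response variables in the exercise:

\[
\widehat{y} = 55.34 - 0.537\,x
\]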

Note: if you have time, it would be good practice to calculate the regression line as well. This was covered on the previous assignment.

6.28 Guppies, Part V

6.32 (a, b) Guppies, Part VI

Skip part (c).

R exercises

Load packages

Load all the packages you need in the first code chunk of the file that starts with #| label: "setup".

R1: Palmer Penguins SLR

Important

Below I frequently use the terminology variable1 vs. variable2. When we write this, the first variable is \(y\) (vertical axis) and the second is \(x\) (horizontal axis). Thus it’s always \(y\) vs. \(x\) (NOT \(x\) vs. \(y\)).

(a) Scatterplots

For each of the following pairs of variables, make a scatterplot showing the best fit line and describe the relationship between the variables.
In particular address
- whether the association is linear,
- how strong it is (based purely on the plot), and
- what direction (positive, negative, or neither).

body mass vs. flipper length
bill depth vs. flipper length
bill depth vs. bill length
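As a minimal sketch of one way to draw a scatterplot with a best fit line, the code below uses base R graphics and the built-in `cars` dataset as a stand-in, with `dist` as \(y\) and `speed` as \(x\). These are not the penguin variables, and the class notes may use a different plotting approach.

```r
# Stand-in example: built-in `cars` data, NOT the penguins data
# y = dist (vertical axis), x = speed (horizontal axis)
fit <- lm(dist ~ speed, data = cars)

plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "dist vs. speed")
abline(fit, col = "blue", lwd = 2)  # add the least-squares line
```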

(b) Correlations

For each of the following pairs of variables, find the correlation coefficient \(r\).

body mass vs. flipper length
bill depth vs. flipper length
bill depth vs. bill length
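A correlation coefficient can be computed with `cor()`. The sketch below again uses the built-in `cars` data as a stand-in and also shows what happens when a value is missing; the vector `y_with_na` is made up for illustration only.

```r
# Stand-in example: built-in `cars` data, NOT the penguins data
r <- cor(cars$dist, cars$speed)
r

# If either variable has missing values (NA), cor() returns NA
# unless incomplete pairs are dropped.
y_with_na <- replace(cars$dist, 1, NA)  # hypothetical: make one value missing
cor(y_with_na, cars$speed)              # NA
r_complete <- cor(y_with_na, cars$speed, use = "complete.obs")
r_complete
```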

(c) Compare associations

Which pair of variables has the strongest association? Which has the weakest? Explain how you determined this.

(d) Body mass vs. flipper length SLR

Run the simple linear regression model for body mass vs. flipper length, and display the regression table output.
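A minimal sketch of fitting a simple linear regression and showing its coefficient table in base R, using the built-in `cars` data as a stand-in. The class notes may use other functions to format the table.

```r
# Stand-in example: built-in `cars` data, NOT the penguins data
fit <- lm(dist ~ speed, data = cars)   # y ~ x

summary(fit)                             # full model summary
reg_table <- summary(fit)$coefficients   # just the coefficient table
reg_table
```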

(e) Regression equation

Write out the regression equation for this model, using the variable names instead of the generic \(x\) and \(y\), and inserting the regression coefficient values.

(f) \(b_1\) calculation

Verify that the formula \(b_1 = r \cdot \frac{s_y}{s_x}\) holds for this example using the values of the statistics.
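One possible way to check this kind of identity numerically is sketched below with the built-in `cars` data as a stand-in; the object names are arbitrary.

```r
# Stand-in example: built-in `cars` data, NOT the penguins data
x <- cars$speed
y <- cars$dist

r  <- cor(x, y)
b1_formula <- r * sd(y) / sd(x)           # slope from r, s_y, and s_x
b1_lm <- unname(coef(lm(y ~ x))["x"])     # slope from the fitted model

c(formula = b1_formula, lm = b1_lm)
all.equal(b1_formula, b1_lm)              # TRUE if they agree up to rounding
```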

(g) Interpret intercept

Write a sentence interpreting the intercept for this example. Is it meaningful in this example?

(h) Interpret slope

Write a sentence interpreting the slope for this example.

(i) Prediction

What is the expected body mass of a penguin with flipper length 210 mm based on the model?
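A prediction from a fitted model can be obtained with `predict()` or by plugging into the regression equation by hand. The sketch below uses the built-in `cars` data and a made-up \(x\) value of 15 as stand-ins.

```r
# Stand-in example: built-in `cars` data, NOT the penguins data
fit <- lm(dist ~ speed, data = cars)

new_obs <- data.frame(speed = 15)       # hypothetical x value
pred <- predict(fit, newdata = new_obs)
pred

# Same prediction by hand: b0 + b1 * x
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["speed"] * 15)
by_hand
```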

R2: Life expectancy vs. CO2 emissions

Use the dataset Gapminder_2011_LifeExp_CO2.csv

Data were downloaded from https://www.gapminder.org/data/.

Life expectancy = the average number of years a newborn child would live if current mortality patterns were to stay the same. Source: https://www.gapminder.org/data/documentation/gd004/
CO2 emissions (tons per person) = Carbon dioxide emissions from the burning of fossil fuels (metric tons of CO2 per person). Source: https://cdiac.ess-dive.lbl.gov/
2011 is the most recent year with the most complete data

(a) Load data

Load the dataset Gapminder_2011_LifeExp_CO2.csv and do a quick inspection of it. What are the dimensions? Variable names?
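As a rough sketch of loading and inspecting a CSV file in base R, the code below first writes the built-in `cars` data to a temporary file so that the example runs on its own. For the homework you would instead point `read.csv()` at your copy of the dataset; the class notes may use a different import function.

```r
# Write stand-in data to a temporary CSV so the example is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(cars, tmp, row.names = FALSE)

dat <- read.csv(tmp)   # replace `tmp` with the path to your own file
dim(dat)               # number of rows and columns
names(dat)             # variable names
str(dat)               # variable types and first few values
```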

(b) Linear association?

Make a scatterplot of life expectancy vs. CO2 emissions per person showing the best fit line and describe the relationship between the variables. In addition comment on whether the relationship looks linear or not.

(c) SLR

Run the simple linear regression of life expectancy vs. CO2 emissions per person, and write out the corresponding regression equation.

(d) Prediction

Using the regression equation, what is the expected life expectancy for a country with \(CO_2\) emissions 20 metric tons per person?

(e) Independent data points?

Explain whether you think the data points are independent of each other or not.

(f) Normality of residuals?

Make a histogram, density plot, and boxplot of the model’s residuals. What is the distribution shape of the residuals? What shape do we want them to have?
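A minimal base R sketch of extracting residuals and plotting their distribution, using the built-in `cars` data as a stand-in for the model:

```r
# Stand-in example: built-in `cars` data, NOT the Gapminder data
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

hist(res, main = "Histogram of residuals", xlab = "Residual")
plot(density(res), main = "Density of residuals")
boxplot(res, horizontal = TRUE, main = "Boxplot of residuals")
```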

(g) QQ plot

Make a QQ plot of the model’s residuals. Explain whether or not the distribution of the residuals deviates from normality and how you made that conclusion based on the QQ plot.
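One way to draw a normal QQ plot in base R is sketched below, again with the built-in `cars` data standing in for the model.

```r
# Stand-in example: built-in `cars` data, NOT the Gapminder data
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

qq <- qqnorm(res, main = "Normal QQ plot of residuals")  # sample vs. theoretical quantiles
qqline(res)  # reference line through the first and third quartiles
```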

(h) Random Normal QQ plots

Make 5 QQ plots with points randomly generated from a normal distribution, where the number of points matches the sample size used in the linear model.
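A sketch of generating several QQ plots from simulated normal data. The sample size `n` below comes from the stand-in `cars` data and the seed is arbitrary; both are assumptions to replace with your own values.

```r
set.seed(511)        # arbitrary seed so the plots are reproducible
n <- nrow(cars)      # stand-in sample size; use your model's sample size

op <- par(mfrow = c(2, 3))   # lay the plots out in a 2 x 3 grid
for (i in 1:5) {
  z <- rnorm(n)              # n draws from a standard normal distribution
  qqnorm(z, main = paste("Random normal sample", i))
  qqline(z)
}
par(op)                      # restore the previous plot layout
```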

(i) QQ plot comparison

Compare the QQ plot of the model’s residuals to the randomly generated QQ plots. Is the QQ plot of the residuals similar to the randomly generated plots or different? Based on these QQ plots, do you think it’s possible that the residuals could be normally distributed?

(j) Equality of variance of the residuals?

Make a residual plot. Describe what the plot looks like, whether there are any patterns in the residuals, and whether the equality of variance assumption for the residuals seems to be satisfied.
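A residual plot can be sketched in base R as residuals against fitted values, shown here for the stand-in `cars` model. Plotting residuals against \(x\) is a common alternative.

```r
# Stand-in example: built-in `cars` data, NOT the Gapminder data
fit <- lm(dist ~ speed, data = cars)
fv  <- fitted(fit)
res <- residuals(fit)

plot(fv, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)   # horizontal reference line at zero
```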

(k) Transformation: log(x)

Add a new variable to the dataset for the natural logarithm (log()) of the CO2 emissions per person values.
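A minimal sketch of adding a log-transformed column in base R, with the built-in `cars` data as a stand-in. The column name `log_speed` is made up for illustration.

```r
dat <- cars                       # stand-in data frame
dat$log_speed <- log(dat$speed)   # log() is the natural logarithm in R
head(dat)
```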

(l) Linear association (with transformation)?

Make a scatterplot of life expectancy vs. log of CO2 emissions per person showing the best fit line and describe the relationship between the variables. In addition comment on whether the relationship looks linear or not.

(m) SLR (with transformation)

Run the simple linear regression of life expectancy vs. log of CO2 emissions per person, and write out the corresponding regression equation.

(n) Prediction (with transformation)

Using the regression equation, what is the expected life expectancy for a country with \(CO_2\) emissions 20 metric tons per person?
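When a model is fit with a log-transformed predictor, the \(x\) value has to be transformed before it is plugged in. As a sketch, writing \(b_0\) and \(b_1\) for the intercept and slope of the transformed model (generic symbols, not values from this dataset):

\[
\widehat{y} = b_0 + b_1 \cdot \ln(x), \qquad \text{so at } x = 20: \quad \widehat{y} = b_0 + b_1 \cdot \ln(20) \approx b_0 + 2.996\, b_1
\]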

(o) Compare predictions from without and with transformation

Compare the predicted values from the models with and without the transformation. Which is bigger and why? Explain in terms of visually comparing the respective regression lines on the scatterplots.

(p) Normality of residuals (with transformation)?

Make a histogram, density plot, and boxplot of the model’s residuals. What is the distribution shape of the residuals? What shape do we want them to have?

(q) QQ plot (with transformation)

Make a QQ plot of the model’s residuals. Explain whether or not the distribution of the residuals deviates from normality and how you made that conclusion based on the QQ plot.

(r) Random Normal QQ plots (with transformation)

Compare the QQ plot of the model’s residuals to the randomly generated QQ plots (use the ones you generated above). Is the QQ plot of the residuals similar to the randomly generated plots or different? Based on these QQ plots, do you think it’s possible that the residuals could be normally distributed?

(s) Equality of variance of the residuals (with transformation)?

(t) Comparison of models without and with transformation

Which of the models do you think has a better fit? Make sure your explanation comments on each of the LINE assumptions, and also compare the \(R^2\) values from the models.
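The \(R^2\) value of a fitted model can be pulled out of `summary()`. The sketch below compares two stand-in models fit to the built-in `cars` data, one with the raw predictor and one with its log.

```r
# Stand-in example: built-in `cars` data, NOT the Gapminder data
dat <- cars
fit_raw <- lm(dist ~ speed, data = dat)
fit_log <- lm(dist ~ log(speed), data = dat)

r2 <- c(raw   = summary(fit_raw)$r.squared,
        log_x = summary(fit_log)$r.squared)
r2
```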

Nonparametric Tests

NPT 1: Sign test

Vegetarian diet and cholesterol levels

When covering paired t-tests on Day 10 Part 2, the class notes used the example of testing whether a vegetarian diet changed cholesterol levels. The data are in the file chol213.csv at https://niederhausen.github.io/BSTA_511_F24/data/chol213.csv. In this exercise we will use non-parametric tests to test for a change and compare the results to the paired t-test.

(a) Hypotheses

What are the hypotheses for the sign test (2-sided) in the context of the problem?

(b) \(D^+\) and \(D^-\)

Calculate \(D^+\) and \(D^-\), the number of positive and negative differences when the differences are calculated as After - Before.

(c) Probability distribution

What probability distribution can be used to model the number of positive differences? Make sure to specify its parameters.

(d) Exact probability

Find the exact probability that there were at most 3 positive differences.

(e) Sign test in R

Run the sign test in R. What is the sign test p-value and how does it compare to the p-value of the paired t-test (check the class notes for this)?
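As a rough sketch, a sign test can be carried out in base R with `binom.test()` on the counts of positive and negative differences. The data below are made up for illustration, and the class notes may use a dedicated sign test function, so the output will not look the same as the output referred to in part (f).

```r
# Hypothetical toy data (NOT the cholesterol data)
before <- c(195, 210, 180, 220, 205, 190, 215, 200)
after  <- c(185, 200, 182, 205, 205, 180, 200, 195)
d <- after - before

d_pos <- sum(d > 0)   # number of positive differences
d_neg <- sum(d < 0)   # number of negative differences (zeros are dropped)

sign_res <- binom.test(d_pos, d_pos + d_neg, p = 0.5,
                       alternative = "two.sided")
sign_res$p.value
```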

(f) S

The sign test output includes the value for S. What is S and what does it measure?

(g) p-value

Does the probability that there were at most 3 positive differences match the p-value from the R output? Why or why not?

(h) Normal approximation

Would it be appropriate to use a normal approximation to calculate the p-value for this test? Why or why not?

NPT 2: (Wilcoxon) Signed-rank test

Vegetarian diet and cholesterol levels

(a) Hypotheses

What are the hypotheses for the (Wilcoxon) Signed-rank test (2-sided) in the context of the problem?

(b) Signed ranks

Find the signed ranks. Make sure to account for ties when doing so.

Hint: See the extra code from the code file to help with this: https://niederhausen.github.io/BSTA_511_F24/slides_code/Day17_bsta511_code.html#new-calculate-rankss-using-rank.
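A minimal sketch of computing signed ranks with `rank()`, using made-up differences and assuming zero differences are dropped before ranking. Check the class notes for the convention used in this course.

```r
# Hypothetical differences (NOT the cholesterol data)
d <- c(-10, -10, 2, -15, 0, -10, -15, -5)

d_nz  <- d[d != 0]          # drop zero differences before ranking
ranks <- rank(abs(d_nz))    # tied values get the average rank by default
signed_ranks <- sign(d_nz) * ranks
signed_ranks

t_plus  <- sum(ranks[d_nz > 0])   # sum of ranks of positive differences
t_minus <- sum(ranks[d_nz < 0])   # sum of ranks of negative differences
c(t_plus = t_plus, t_minus = t_minus)
```

A quick check is that the two sums add up to \(n(n+1)/2\), where \(n\) is the number of nonzero differences.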

(c) \(T^+\)

Calculate the sum of the positive ranks (\(T^+\)).

(d) Exact p-value

Can an exact p-value for the (Wilcoxon) Signed-rank test be calculated? Why or why not?

(e) Normal approximation

Is it appropriate to use the normal approximation method in this case?

(f) Test in R

Run the (Wilcoxon) Signed-rank test in R. What is the p-value and how does it compare to the p-value of the sign test and the paired t-test (check the class notes for this)?
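A sketch of the signed-rank test with base R's `wilcox.test()`, using made-up paired data. Here `exact = FALSE` requests the normal approximation, which is an assumption for this toy example and not a recommendation for the homework data.

```r
# Hypothetical toy data (NOT the cholesterol data)
before <- c(195, 210, 180, 220, 205, 190, 215, 200)
after  <- c(185, 200, 182, 205, 205, 180, 200, 195)

sr_res <- wilcox.test(after, before, paired = TRUE,
                      alternative = "two.sided", exact = FALSE)
sr_res$statistic   # V = sum of the ranks of the positive differences
sr_res$p.value
```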

(g) Condition

There’s one more condition that should be satisfied to use the (Wilcoxon) Signed-rank test that has not been asked about yet in these questions. What is it and do you think it’s satisfied?

NPT 3: Wilcoxon rank-sum test

Does caffeine increase finger taps/min?

When covering 2-sample t-tests on Day 11, the class notes used the example of testing whether caffeine increases finger taps/min. The data are in the file CaffeineTaps.csv at https://niederhausen.github.io/BSTA_511_F24/data/CaffeineTaps.csv. In this exercise we will use a non-parametric test and compare the results to the 2-sample t-test.

(a) Condition

What condition needs to be satisfied to apply the Wilcoxon rank-sum test and is it satisfied for these data?

Answer the following questions using the Wilcoxon rank-sum test whether you think the condition has been satisfied or not.

(b) Why Wilcoxon rank-sum test?

How would we know to use the Wilcoxon rank-sum test instead of the sign test or (Wilcoxon) Signed-rank test?

(c) Hypotheses

What are the hypotheses for the Wilcoxon rank-sum test (1-sided) in the context of the problem?

(d) Exact test in R

Run the exact Wilcoxon rank-sum test in R. What warning(s) does it give you?

(e) Normal approximation test in R

Run the Wilcoxon rank-sum test in R with the normal approximation. Comment on whether it is appropriate to use the normal approximation or not in this case.
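A sketch of the rank-sum test with the normal approximation in base R, using two made-up independent samples. The vector names and values are hypothetical; `exact = FALSE` selects the normal approximation and `correct = TRUE` applies a continuity correction.

```r
# Hypothetical toy data (NOT the CaffeineTaps data)
caff    <- c(246, 248, 250, 252, 248)
no_caff <- c(242, 245, 244, 248, 247)

rs_res <- wilcox.test(caff, no_caff, alternative = "greater",
                      exact = FALSE, correct = TRUE)
rs_res$statistic   # W
rs_res$p.value
```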

(f) p-value

What is the p-value and how does it compare to the p-value of the 2-sample t-test (check the class notes for this)?

(g) Conclusion

Write a conclusion to the test in the context of the problem.