HW 9: BSTA 511-611 F23

Author

Your name here - update this!!!!

Published

December 4, 2023

Due Monday, 12/4/23

Download the .qmd file for this assignment from https://github.com/niederhausen/BSTA_511_F23/blob/main/homework/HW_9_F23_bsta511.qmd

Directions

Please upload your homework to Sakai. Upload both your .qmd code file and the rendered .html file.
For each question, make sure to include all code and resulting output in the html file to support your answers.

R & LaTeX code

See the .qmd files with the code from class notes for LaTeX and R code.
The LaTeX code will make it easier to show your work in computations.

Tip

It is a good idea to render your document from time to time as you go along! Rendering automatically saves your .qmd file, and rendering frequently helps you catch errors more quickly.

Book exercises

6.12 Trends in the residuals

6.20 Guppies, Part IV

6.26 (a, e) Helmets and lunches

Skip parts (b)-(d). To complete (e), use that the slope from part (b) is −0.537, and the intercept is 55.34.

Note: if you have time, it would be good practice to calculate the regression line as well. This was covered on the previous assignment.

6.28 Guppies, Part V

6.32 (a, b) Guppies, Part VI

Skip part (c).

1 R exercises

1.1 Load all the packages you need below here.

1.2 R1: Life expectancy vs. CO2 emissions

Use the dataset Gapminder_2011_LifeExp_CO2.csv

Data were downloaded from https://www.gapminder.org/data/.

Life expectancy = the average number of years a newborn child would live if current mortality patterns were to stay the same. Source: https://www.gapminder.org/data/documentation/gd004/
CO2 emissions (tons per person) = Carbon dioxide emissions from the burning of fossil fuels (metric tons of CO2 per person). Source: https://cdiac.ess-dive.lbl.gov/
2011 is the most recent year with the most complete data

1.2.1 Load data

Load the dataset Gapminder_2011_LifeExp_CO2.csv and do a quick inspection of it. What are the dimensions? Variable names?
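
Hint: a minimal sketch of loading and inspecting a CSV file in base R. The toy data frame, its column names, and the temporary file are all hypothetical stand-ins so that the sketch runs anywhere; for the homework, point `read.csv()` at the actual dataset. The class notes may use other functions for the same steps.

```r
# Hypothetical toy data written to a temporary CSV file.
toy <- data.frame(country = c("A", "B", "C"),
                  x = c(1.2, 3.4, 5.6),
                  y = c(60, 70, 75))
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

dat <- read.csv(tmp)   # read the file into a data frame
dim(dat)               # number of rows, number of columns
names(dat)             # variable names
str(dat)               # variable types and the first few values
```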

1.2.2 Linear association?

Make a scatterplot of life expectancy vs. CO2 emissions per person showing the best fit line, and describe the relationship between the variables. In addition, comment on whether the relationship looks linear or not.

1.2.3 SLR

Run the simple linear regression of life expectancy vs. CO2 emissions per person, and write out the corresponding regression equation.
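
Hint: a minimal sketch of fitting a simple linear regression with `lm()` and reading off the coefficients, using hypothetical toy vectors `x` and `y` rather than the homework data. The hand calculation is included only as a cross-check of the least-squares formulas.

```r
# Hypothetical toy data; x stands in for the predictor, y for the response.
x <- c(1, 2, 4, 7, 10)
y <- c(62, 66, 69, 75, 78)

fit <- lm(y ~ x)   # least-squares fit of y on x
b <- coef(fit)     # b[1] is the intercept, b[2] is the slope
b

# The same estimates from the least-squares formulas, as a cross-check.
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Assemble the fitted equation as text.
sprintf("y-hat = %.2f + %.2f * x", b[1], b[2])
```

In LaTeX, the general form of the fitted line can be written as `\widehat{y} = b_0 + b_1 x`.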

1.2.4 Prediction

Using the regression equation, what is the expected life expectancy for a country with \(CO_2\) emissions of 20 metric tons per person?
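
Hint: a minimal sketch of getting a predicted value from a fitted model, again on hypothetical toy data. It shows that `predict()` and plugging the value into the regression equation by hand give the same number; the value 8 is arbitrary.

```r
# Hypothetical toy data and model.
x <- c(1, 2, 4, 7, 10)
y <- c(62, 66, 69, 75, 78)
fit <- lm(y ~ x)

# The column name in newdata must match the predictor name in the formula.
new_obs <- data.frame(x = 8)
pred <- predict(fit, newdata = new_obs)

# Same value from the regression equation: intercept + slope * x
by_hand <- coef(fit)[1] + coef(fit)[2] * 8

pred
by_hand
```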

1.2.5 Independent data points?

Explain whether you think the data points are independent of each other or not.

1.2.6 Normality of residuals?

Make a histogram, density plot, and boxplot of the model’s residuals. What is the distribution shape of the residuals? What shape do we want them to have?

1.2.7 QQ plot

Make a QQ plot of the model’s residuals. Explain whether or not the distribution of the residuals deviates from normality and how you made that conclusion based on the QQ plot.

1.2.8 Random Normal QQ plots

Make 5 QQ plots with points randomly generated from a normal distribution, where the number of points matches the sample size used in the linear model.
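
Hint: one possible way to do this in base R is sketched below. The sample size `n` and the seed are hypothetical; in the homework, `n` should be the number of observations used in your model (for example, `nobs()` applied to the fitted model). The class notes may use a different plotting approach.

```r
set.seed(511)   # arbitrary seed so the random samples are reproducible
n <- 30         # hypothetical sample size; replace with the model's sample size

# Five independent samples of size n from a standard normal distribution.
samples <- replicate(5, rnorm(n), simplify = FALSE)

op <- par(mfrow = c(2, 3))   # 2 x 3 grid of panels; save the old settings
for (i in seq_along(samples)) {
  qqnorm(samples[[i]], main = paste("Random normal sample", i))
  qqline(samples[[i]])       # reference line through the quartiles
}
par(op)                      # restore the previous plotting settings
```

Since these samples really are normal, the spread among the five plots gives a sense of how much deviation from the line is plausible by chance alone at this sample size.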

1.2.9 QQ plot comparison

Compare the QQ plot of the model’s residuals to the randomly generated QQ plots. Is the QQ plot of the residuals similar to the randomly generated plots or different? Based on these QQ plots, do you think it’s possible that the residuals could be normally distributed?

1.2.10 Equality of variance of the residuals?

Make a residual plot. Describe what the plot looks like, whether there are any patterns in the residuals, and whether the equality of variance assumption for the residuals seems to be satisfied or not.
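
Hint: a minimal base R sketch of a residual plot (residuals vs. fitted values) on simulated, hypothetical data. The simulated data are generated with constant variance, so this plot shows roughly what a pattern-free residual plot looks like.

```r
set.seed(1)                             # arbitrary seed
x <- runif(40, min = 0, max = 10)       # hypothetical predictor
y <- 50 + 2 * x + rnorm(40, sd = 3)     # linear trend plus constant-variance noise
fit <- lm(y ~ x)

# Residuals against fitted values, with a horizontal reference line at 0.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```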

1.2.11 Transformation: log(x)

Add a new variable to the dataset for the natural logarithm (log()) of the CO2 emissions per person values.
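
Hint: a minimal sketch of adding a log-transformed column in base R. The data frame and the column names `x` and `log_x` are hypothetical.

```r
# Hypothetical data frame with one positive-valued column.
dat <- data.frame(x = c(0.5, 1, exp(1), 10))

# log() is the natural logarithm in R (log10() and log2() are separate functions).
dat$log_x <- log(dat$x)
dat
```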

1.2.12 Linear association (with transformation)?

Make a scatterplot of life expectancy vs. log of CO2 emissions per person showing the best fit line, and describe the relationship between the variables. In addition, comment on whether the relationship looks linear or not.

1.2.13 SLR (with transformation)

Run the simple linear regression of life expectancy vs. log of CO2 emissions per person, and write out the corresponding regression equation.

1.2.14 Prediction (with transformation)

Using the regression equation, what is the expected life expectancy for a country with \(CO_2\) emissions of 20 metric tons per person?
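
Hint: with a log-transformed predictor, the value plugged into the regression equation is the log of the original value, not the original value itself. The sketch below illustrates this on hypothetical toy data, with the transformation written inside the model formula; the value 8 is arbitrary.

```r
# Hypothetical toy data.
x <- c(0.5, 1, 3, 8, 15)
y <- c(60, 65, 71, 76, 79)

# Transformation written inside the formula.
fit_log <- lm(y ~ log(x))

# newdata is given on the original scale; predict() applies log() itself.
pred <- predict(fit_log, newdata = data.frame(x = 8))

# By hand, the equation uses log(8), not 8.
by_hand <- coef(fit_log)[1] + coef(fit_log)[2] * log(8)

pred
by_hand
```

If the model is instead fit on a separately created log column, then `newdata` needs to supply the already-logged value under that column's name.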

1.2.15 Compare predictions from without and with transformation

Compare the predicted values from the models with and without the transformation. Which is bigger and why? Explain in terms of visually comparing the respective regression lines on the scatterplots.

1.2.16 Normality of residuals (with transformation)?

Make a histogram, density plot, and boxplot of the model’s residuals. What is the distribution shape of the residuals? What shape do we want them to have?

1.2.17 QQ plot (with transformation)

Make a QQ plot of the model’s residuals. Explain whether or not the distribution of the residuals deviates from normality and how you made that conclusion based on the QQ plot.

1.2.18 Random Normal QQ plots (with transformation)

Compare the QQ plot of the model’s residuals to the randomly generated QQ plots (use the ones you generated above). Is the QQ plot of the residuals similar to the randomly generated plots or different? Based on these QQ plots, do you think it’s possible that the residuals could be normally distributed?

1.2.19 Equality of variance of the residuals (with transformation)?

Make a residual plot for the model with the transformation. Describe what the plot looks like, whether there are any patterns in the residuals, and whether the equality of variance assumption for the residuals seems to be satisfied or not.

1.2.20 Comparison of models without and with transformation

Which of the models do you think has a better fit? Make sure your explanation comments on each of the LINE assumptions, and also compare the \(R^2\) values from the models.

2 Nonparametric-Tests

2.1 NPT 1: Sign test

Vegetarian diet and cholesterol levels

When covering paired t-tests on Day 10 Part 2, the class notes used the example of testing whether a vegetarian diet changed cholesterol levels. The data are in the file chol213.csv at https://niederhausen.github.io/BSTA_511_F23/data/chol213.csv. In this exercise we will use non-parametric tests to test for a change and compare the results to the paired t-test.

2.1.1 Hypotheses

What are the hypotheses for the sign test (2-sided) in the context of the problem?

2.1.2 \(D^+\) and \(D^-\)

Calculate \(D^+\) and \(D^-\), the number of positive and negative differences when the differences are calculated as After - Before.

2.1.3 Probability distribution

What probability distribution can be used to model the number of positive differences? Make sure to specify its parameters.

2.1.4 Exact probability

Find the exact probability that there were at most 3 positive differences.
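
Hint: if the number of positive differences is modeled with a binomial distribution, a cumulative probability of the form "at most k" can be computed with `pbinom()`. The sketch below uses hypothetical values of `n` and `k` that are not taken from the cholesterol data, and a success probability of 0.5 as an assumption for illustration.

```r
n <- 10   # hypothetical number of nonzero differences
k <- 2    # hypothetical cutoff for "at most k"

# P(X <= k) for X ~ Binomial(n, 0.5)
p_at_most_k <- pbinom(k, size = n, prob = 0.5)

# Same probability as a sum of the individual binomial probabilities.
p_by_sum <- sum(dbinom(0:k, size = n, prob = 0.5))

p_at_most_k
p_by_sum
```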

2.1.5 Sign test in R

Run the sign test in R. What is the sign test p-value and how does it compare to the p-value of the paired t-test (check the class notes for this)?
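
Hint: the function used in the class notes may differ, but a sign test can be sketched in base R as an exact binomial test on the number of positive differences. The differences below are hypothetical, and `binom.test()` is used here only as a stand-in; its output is laid out differently from a dedicated sign test function.

```r
# Hypothetical paired differences (After - Before), including one zero.
d <- c(-3, -1, -4, 2, -5, -2, -6, 1, -7, -8, 0)

d_nz  <- d[d != 0]        # zero differences carry no sign and are dropped
n_pos <- sum(d_nz > 0)    # number of positive differences
n_neg <- sum(d_nz < 0)    # number of negative differences

# Two-sided exact binomial test of H0: P(positive difference) = 0.5
res <- binom.test(n_pos, n_pos + n_neg, p = 0.5, alternative = "two.sided")
res$p.value
```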

2.1.6 S

The sign test output includes the value for S. What is S and what does it measure?

2.1.7 p-value

Does the probability that there were at most 3 positive differences match the p-value from the R output? Why or why not?

2.1.8 Normal approximation

Would it be appropriate to use a normal approximation to calculate the p-value for this test? Why or why not?

2.2 NPT 2: (Wilcoxon) Signed-rank test

Vegetarian diet and cholesterol levels

2.2.1 Hypotheses

What are the hypotheses for the (Wilcoxon) Signed-rank test (2-sided) in the context of the problem?

2.2.2 Signed ranks

Find the signed ranks. Make sure to account for ties when doing so.

2.2.3 \(T^+\)

Calculate the sum of the positive ranks (\(T^+\)).
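
Hint: a minimal sketch of computing signed ranks with ties and the rank sums, using hypothetical differences rather than the cholesterol data. Tied absolute differences receive the average of the ranks they would otherwise occupy, which is the default behavior of `rank()`.

```r
# Hypothetical paired differences, including ties in absolute value and one zero.
d <- c(-3, 1, -1, 4, -2, 2, 0)

d_nz <- d[d != 0]              # zero differences are dropped before ranking
r <- rank(abs(d_nz))           # ranks of |d|; ties get the average rank
signed_r <- sign(d_nz) * r     # attach the sign of each difference to its rank

t_plus  <- sum(r[d_nz > 0])    # sum of the ranks of the positive differences
t_minus <- sum(r[d_nz < 0])    # sum of the ranks of the negative differences

signed_r
t_plus
t_minus
```

A quick check: the two sums should add up to n(n+1)/2, where n is the number of nonzero differences.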

2.2.4 Exact p-value

Can an exact p-value for the (Wilcoxon) Signed-rank test be calculated? Why or why not?

2.2.5 Normal approximation

Is it appropriate to use the normal approximation method in this case?

2.2.6 Test in R

Run the (Wilcoxon) Signed-rank test in R. What is the p-value and how does it compare to the p-value of the sign test and the paired t-test (check the class notes for this)?
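
Hint: a minimal sketch of the signed-rank test with base R's `wilcox.test()` on hypothetical paired data. The argument choices (`exact = FALSE`, `correct = TRUE`) are assumptions for illustration and request the normal approximation with a continuity correction; whether that is appropriate for the homework data is part of the questions above.

```r
# Hypothetical paired measurements.
before <- c(20, 20, 20, 20, 20, 20)
after  <- before + c(-3, 1, -1, 4, -2, 2)

# Differences are computed as first argument minus second (after - before).
res <- wilcox.test(after, before, paired = TRUE,
                   alternative = "two.sided",
                   exact = FALSE, correct = TRUE)

res$statistic   # V: sum of the ranks of the positive differences
res$p.value
```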

2.2.7 Condition

There’s one more condition that should be satisfied to use the (Wilcoxon) Signed-rank test that has not been asked about yet in these questions. What is it and do you think it’s satisfied?

2.3 NPT 3: Wilcoxon rank-sum test

Does caffeine increase finger taps/min?

When covering 2-sample t-tests on Day 11, the class notes used the example of testing whether caffeine increases finger taps/min. The data are in the file CaffeineTaps.csv at https://niederhausen.github.io/BSTA_511_F23/data/CaffeineTaps.csv. In this exercise we will use a non-parametric test and compare the results to the 2-sample t-test.

2.3.1 Condition

What condition needs to be satisfied to apply the Wilcoxon rank-sum test and is it satisfied for these data?

Answer the following questions using the Wilcoxon rank-sum test, regardless of whether you think the condition has been satisfied.

2.3.2 Why Wilcoxon rank-sum test?

How would we know to use the Wilcoxon rank-sum test instead of the sign test or (Wilcoxon) Signed-rank test?

2.3.3 Hypotheses

What are the hypotheses for the Wilcoxon rank-sum test (1-sided) in the context of the problem?

2.3.4 Exact test in R

Run the exact Wilcoxon rank-sum test in R. What warning(s) does it give you?

2.3.5 Normal approximation test in R

Run the Wilcoxon rank-sum test in R with the normal approximation. Comment on whether it is appropriate to use the normal approximation or not in this case.
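
Hint: a minimal sketch of a one-sided rank-sum test with the normal approximation in base R, on two hypothetical independent groups. The group names and values are made up, and the hand calculation is included only to show what the reported statistic `W` measures.

```r
# Hypothetical independent samples (not the caffeine data).
group_a <- c(12, 15, 15, 18, 20)
group_b <- c(10, 11, 15, 13, 14)

# H_A: values in group_a tend to be larger than values in group_b.
res <- wilcox.test(group_a, group_b, alternative = "greater",
                   exact = FALSE, correct = TRUE)
res$statistic   # W
res$p.value

# W by hand: rank sum of the first group in the pooled sample,
# minus the smallest value that rank sum could take.
n_a <- length(group_a)
pooled_ranks <- rank(c(group_a, group_b))          # ties get average ranks
w_by_hand <- sum(pooled_ranks[seq_len(n_a)]) - n_a * (n_a + 1) / 2
w_by_hand
```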

2.3.6 p-value

What is the p-value and how does it compare to the p-value of the 2-sample t-test (check the class notes for this)?

2.3.7 Conclusion

Write a conclusion to the test in the context of the problem.