install.packages("NHANES")
HW 2: BSTA 511/611 F24
Due 10/19/24
Download the .qmd file for this assignment from https://github.com/niederhausen/BSTA_511_F24/blob/main/homework/HW_2_F24_bsta511.qmd
Graded exercises
The exercises listed below will be graded for this assignment. You are strongly encouraged to complete the entire assignment. You will receive feedback on exercises you turn in that are not being graded.
- Book exercises
- 2.24
- Non-book exercise
- R exercises
- R1: NHANES - part 1
- R2: NHANES - part 2
Directions
*
Starred exercises in the sectionBook exercises
may be completed by hand (such as on paper or using a tablet) not using Quarto.- If you complete this part of the assignment not using Quarto, you will be uploading 3 files on Sakai for this HW: qmd & html files for your R work, and a pdf with your written work.
- If you are completing the homework on paper, you can use a scanning app, such as Adobe Scan, to create a pdf of your assignment.
- If you decided to type the solutions to these exercises in this Quarto file, I highly recommend using LaTeX to format the equations.
- For instrctions on creating equations in the Visual editor, check out https://quarto.org/docs/get-started/authoring/rstudio.html#equations.
- You can see examples of LaTeX formatting in the solutions to HW 1.
- Also check out examples of LaTeX formatting for statistics created by recent biostats alumn Ariel Weingarten.
- If you have difficulty rendering the LaTeX equations, I recommend installing and running the R package
tinytex
. See this website for instructions.
- Please upload your homework to Sakai. Upload both your .qmd code file and the rendered .html file.
- Use the assignment .qmd file linked to above as a template for your own assignment.
- Please always use the following naming convention for submitting your files:
- Lastname_Firstname_HWx.qmd, such as Niederhausen_Meike_HW2.qmd
- Lastname_Firstname_HWx.html, such as Niederhausen_Meike_HW2.html
- For each question, make sure to show all of your work. This includes all code and resulting output in the html file to support your answers for exercises requiring work done in R (including any arithmetic calculations).
- For each question, include a sentence summarizing the answer for that question in the context of the research question.
It is a good idea to try rendering your document from time to time as you go along! Note that rendering automatically saves your Qmd file and rendering frequently helps you catch your errors more quickly.
Book exercises
1.24 Income and education in US counties
1.28 Mix-and-match
1.36 Associations
* 2.18 Predisposition for thrombosis
* 2.24 Breast cancer and age
* Non-book exercise
Suppose a patient has abdominal pain and their clinician suspects that they either have disease 1, disease 2, or no disease, where the probability of having abdominal pain if they have disease 1 is 0.80, the probability of having abdominal pain if they have disease 2 is 0.90, and the probability of having abdominal pain if they have no disease is 0.01. Based on the patient’s medical history, the probability of having disease 1 is 0.009, having disease 2 is 0.001, and having no disease is 0.99. What is the probability the patient has disease 2 given that they have abdominal pain?
R exercises
!!! Load all the packages you need below here.
R1: NHANES - part 1
- Below you will be using the dataset called NHANES from the
NHANES
R package. - Install and load the NHANES package using the code below.
- This loads the dataset also called NHANES that is within the NHANES package.
library(NHANES)
data("NHANES")
The National Health and Nutrition Examination Survey (NHANES) is a survey conducted annually by the US National Center for Health Statistics (NCHS). While the original data uses a survey design that oversamples certain subpopulations, the data have been reweighted to undo oversampling effects and can be treated as if it were a simple random sample from the American population.
- To view the complete list of study variables and their descriptions, access the NHANES documentation page with
?NHANES
.- You must first install the
NHANES
package to see the help files.
- You must first install the
a) Dimensions and columns
What are the dimensions and column names of the dataset?
b) Unique ID identifiers
How many unique ID identifiers are in the dataset? Compare this to the number of rows in the dataset. What is the explanation for these two different numbers?
c) Age distributions
Using numerical and graphical summaries, describe the distribution of ages among study participants.
d) Height distributions
Using numerical and graphical summaries, describe the distribution of heights among study participants.
e) Adult height age
Investigate at which age people generally reach their adult height. Is it possible to do the same for weight; why or why not?
Use whatever EDA tools you think are appropriate to answer this question.
f) Poverty variable
Calculate the median and interquartile range of the distribution of the variable Poverty
. Write a sentence explaining the median and IQR in the context of these data.
R2: NHANES - part 2
Continue using the same NHANES data as for the previous problem.
- For most of the summary statistic base R commands (such as
mean()
,sd()
,median()
, etc.), you will getNA
as the result if there are missing values. - In order for R to compute the statistic using the values in the data set, you need to tell R to remove the missing values using the
na.rm = TRUE
option within the parentheses of the command:mean(dataset$variablename, na.rm = TRUE)
.
a) Heights in inches
Use the mutate()
command explained on Slide 23 (pdf pg 22) of the Day 3 Part 3 notes to create a new column (variable) in the NHANES dataset for the heights in inches. Then find the mean and standard deviation of the heights in inches. Note that 1 centimeter is approximately 0.3937 inches.
- Column names cannot start with a number or symbol
- Names are case sensitive
- It’s easier to work with column names that don’t have spaces in them. I usually use _ or . instead of spaces.
b) College graduates
What proportion of Americans at least 25 years of age are college graduates? Hint: First create a new dataset that is restricted to Americans at least 25 years old.
c) College graduates v2
What proportion of Americans at least 25 years of age with a high school degree are college graduates?
d) Two-way table
Construct a two-way table (contingency table), with PhysActive
as the row variable and Diabetes
as the column variable. Among participants who are not physically active, what proportion have diabetes? What proportion of physically active participants have diabetes?
e) Relative Risk
In this context, relative risk is the ratio of the proportion of participants who have diabetes among those who are not physically active to the proportion of participants with diabetes among those physically active. Relative risks greater than 1 indicate that people who are not physically active seem to be at a higher risk for diabetes than physically active people. Calculate the relative risk of diabetes for the participants. From these calculations, is it possible to conclude that being physically active reduces one’s chance of becoming diabetic?