install.packages("NHANES")
HW 2: BSTA 511-611 F23
Due 10/14/23
Download the .qmd file for this assignment from https://github.com/niederhausen/BSTA_511_F23/blob/main/homework/HW_2_F23_bsta511.qmd
Directions
- The non-R exercises (see sections
Book exercises
andNon-book exercise
) may be completed not using Quarto. I especially recommend writing out by hand the chapter 2 probability questions, whether on paper or a tablet. - If you complete part of the assignment not using Quarto, you will be uploading 3 files on Sakai for HW 2: qmd & html files for your R work, and a pdf with your written work.
- If you are completing the homework on paper, you can use a scanning app, such as Adobe Scan, to create a pdf of your assignment.
- Please upload your homework to Sakai. Upload both your .qmd code file and the rendered .html file.
- For each question, make sure to include all code and resulting output in the html file to support your answers.
It is a good idea to try rendering your document from time to time as you go along! Note that rendering automatically saves your Qmd file and rendering frequently helps you catch your errors more quickly.
Book exercises
1.24 Income and education in US counties
1.28 Mix-and-match
1.36 Associations
1.38 Smoking and stenosis
See Section1.6.2 for more on how the relative risk is calculated.
2.6 Poverty and language
Part (b) asks you to create a Venn Diagram. If you are submitting this question in R, you do not need to turn this part in. If you want an R challenge though, you can use the VennDiagram
or other package to create one. See https://www.geeksforgeeks.org/how-to-create-a-venn-diagram-in-r/ for some examples.
2.8 School absences
Part (b) asks you to create a Venn Diagram. If you are submitting this question in R, you do not need to turn this part in. If you want an R challenge though, you can use the VennDiagram
or other package to create one. See https://www.geeksforgeeks.org/how-to-create-a-venn-diagram-in-r/ for some examples.
2.10 Health coverage, frequencies
2.14 Health coverage, relative frequencies
2.18 Predisposition for thrombosis
2.24 Breast cancer and age
1 Non-book exercise
Suppose a patient has abdominal pain and their clinician suspects that they either have disease 1, disease 2, or no disease, where the probability of having abdominal pain if they have disease 1 is 0.80, the probability of having abdominal pain if they have disease 2 is 0.90, and the probability of having abdominal pain if they have no disease is 0.01. Based on the patient’s medical history, the probability of having disease 1 is 0.009, having disease 2 is 0.001, and having no disease is 0.99. What is the probability the patient has disease 2 given that they have abdominal pain?
2 R exercises
2.1 Load all the packages you need below here.
3 R1: NHANES
- Below you will be using the dataset called NHANES from the
NHANES
R package. - Install and load the NHANES package using the code below.
- This loads the dataset also called NHANES that is within the NHANES package.
library(NHANES)
data("NHANES")
The National Health and Nutrition Examination Survey (NHANES) is a survey conducted annually by the US National Center for Health Statistics (NCHS). While the original data uses a survey design that oversamples certain subpopulations, the data have been reweighted to undo oversampling effects and can be treated as if it were a simple random sample from the American population.
- To view the complete list of study variables and their descriptions, access the NHANES documentation page with
?NHANES
.- You must first install the
NHANES
package to see the help files.
- You must first install the
3.1 What are the dimensions and column names of the dataset?
3.2 How many unique ID identifiers are in the dataset? Compare this to the number of rows in the dataset. What is the explanation for these two different numbers?
3.3 Using numerical and graphical summaries, describe the distribution of ages among study participants.
3.4 Using numerical and graphical summaries, describe the distribution of heights among study participants.
3.5 Investigate at which age people generally reach their adult height. Is it possible to do the same for weight; why or why not?
Use whatever EDA tools you think are appropriate to answer this question.
3.6 Calculate the median and interquartile range of the distribution of the variable Poverty
. Write a sentence explaining the median and IQR in the context of these data.
4 R2: NHANES
Continue using the same NHANES data as for the previous problem.
- For most of the summary statistic base R commands (such as
mean()
,sd()
,median()
, etc.), you will getNA
as the result if there are missing values. - In order for R to compute the statistic using the values in the data set, you need to tell R to remove the missing values using the
na.rm = TRUE
option within the parentheses of the command:mean(dataset$variablename, na.rm = TRUE)
.
4.1 Use the mutate()
command explained on Slide 23 (pdf pg 22) of the Day 3 Part 3 notes to create a new column (variable) in the NHANES dataset for the heights in inches. Then find the mean and standard deviation of the heights in inches. Note that 1 centimeter is approximately 0.3937 inches.
- Column names cannot start with a number or symbol
- Names are case sensitive
- It’s easier to work with column names that don’t have spaces in them. I usually use _ or . instead of spaces.
4.2 What proportion of Americans at least 25 years of age are college graduates?
Hint: First create a new dataset that is restricted to Americans at least 25 years old.
4.3 What proportion of Americans at least 25 years of age with a high school degree are college graduates?
4.4 Two-way table
Construct a two-way table (contingency table), with PhysActive
as the row variable and Diabetes
as the column variable. Among participants who are not physically active, what proportion have diabetes? What proportion of physically active participants have diabetes?
4.5 Relative Risk
In this context, relative risk is the ratio of the proportion of participants who have diabetes among those who are not physically active to the proportion of participants with diabetes among those physically active. Relative risks greater than 1 indicate that people who are not physically active seem to be at a higher risk for diabetes than physically active people. Calculate the relative risk of diabetes for the participants. From these calculations, is it possible to conclude that being physically active reduces one’s chance of becoming diabetic?