BSTA 511/611 Fall 2024 Day 5, OHSU
2024-10-16
dds.discr dataset from oibiostat package
The textbook’s datasets are in the R package oibiostat
Make sure the oibiostat package is installed before running the code below.
Load the oibiostat package and the dataset dds.discr
The code below needs to be run every time you restart R or render a Qmd file
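The loading step described above can be sketched as follows. This is a minimal sketch, not necessarily the course's exact code: the availability check and the name oibiostat_available are additions for illustration, so the block still runs on a machine where oibiostat is not installed.

```r
# Check whether oibiostat is installed (oibiostat_available is a
# hypothetical helper name, not part of the course code)
oibiostat_available <- requireNamespace("oibiostat", quietly = TRUE)

if (oibiostat_available) {
  library(oibiostat)     # attach the package
  data("dds.discr")      # load the dataset into the environment
  print(dim(dds.discr))  # number of rows and columns
} else {
  message("Install the oibiostat package before loading dds.discr.")
}
```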
After loading dds.discr using data("dds.discr"), you will see dds.discr in the Data list of the Environment window.
glimpse()
New: glimpse()
Use glimpse() from the tidyverse package (technically it’s from the dplyr package) to get information about variable types.
glimpse() tends to have nicer output for tibbles than str()
Rows: 1,000
Columns: 6
$ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ gender <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ ethnicity <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
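To compare glimpse() with str() without needing the full dataset, here is a small sketch. The tibble dds_head is a hypothetical name; its values are the first three entries of each column retyped from the glimpse() output above, with ethnicity left out for brevity.

```r
library(dplyr)

# First three rows of dds.discr, retyped from the glimpse() output above
# (dds_head is a hypothetical name used only for this illustration)
dds_head <- tibble::tibble(
  id           = c(10210L, 10409L, 10486L),
  age.cohort   = factor(c("13-17", "22-50", "0-5")),
  age          = c(17L, 37L, 3L),
  gender       = factor(c("Female", "Male", "Male")),
  expenditures = c(2113L, 41924L, 1454L)
)

glimpse(dds_head)  # one line per column: name, type, first values
str(dds_head)      # base R alternative for comparison
```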
Plot: ethnicity, age, and expenditures (figure not shown)
Code for the plot
library(tidyverse)

# Keep only the two ethnicity groups being compared
dds.discr_Hips_WhnH <- dds.discr %>%
  filter(ethnicity == "White not Hispanic" | ethnicity == "Hispanic") %>%
  droplevels() # remove empty factor levels

# Boxplots of expenditures by age cohort, one panel per ethnicity
ggplot(data = dds.discr_Hips_WhnH,
       aes(x = expenditures,
           y = age.cohort)) +
  geom_boxplot(color = "darkgrey") +
  facet_grid(rows = vars(ethnicity)) +
  labs(x = "Annual Expenditures ($)",
       y = "Age cohort") +
  geom_jitter(
    aes(color = ethnicity),
    alpha = 0.3,
    show.legend = FALSE,
    position = position_jitter(height = 0.4))
# A tibble: 12 × 3
# Groups:   ethnicity [2]
   ethnicity          age.cohort    ave
   <fct>              <fct>       <dbl>
 1 Hispanic           0-5         1393.
 2 Hispanic           6-12        2312.
 3 Hispanic           13-17       3955.
 4 Hispanic           18-21       9960.
 5 Hispanic           22-50      40924.
 6 Hispanic           51+        55585
 7 White not Hispanic 0-5         1367.
 8 White not Hispanic 6-12        2052.
 9 White not Hispanic 13-17       3904.
10 White not Hispanic 18-21      10133.
11 White not Hispanic 22-50      40188.
12 White not Hispanic 51+        52670.
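The step that reshapes these means into one row per age cohort is not shown in the text. One way to do it, sketched here as an assumption, is pivot_wider() from tidyr. The tibble mean_expend is a hypothetical name, and its values are retyped from the table above as printed (rounded), so differences computed from it will be slightly less precise than those from the raw data.

```r
library(dplyr)
library(tidyr)

# Group means retyped from the table above, rounded as printed
# (mean_expend is a hypothetical name for this long-format table)
mean_expend <- tibble::tribble(
  ~ethnicity,           ~age.cohort, ~ave,
  "Hispanic",           "0-5",        1393,
  "Hispanic",           "6-12",       2312,
  "Hispanic",           "13-17",      3955,
  "Hispanic",           "18-21",      9960,
  "Hispanic",           "22-50",     40924,
  "Hispanic",           "51+",       55585,
  "White not Hispanic", "0-5",        1367,
  "White not Hispanic", "6-12",       2052,
  "White not Hispanic", "13-17",      3904,
  "White not Hispanic", "18-21",     10133,
  "White not Hispanic", "22-50",     40188,
  "White not Hispanic", "51+",       52670
)

# Long to wide: one row per age cohort, one column per ethnicity
mean_expend_wide <- mean_expend %>%
  pivot_wider(names_from = ethnicity, values_from = ave)

mean_expend_wide
```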
mean_expend_wide <- mean_expend_wide %>%
  mutate(diff_mean = `White not Hispanic` - Hispanic)
mean_expend_wide
# A tibble: 6 × 4
  age.cohort Hispanic `White not Hispanic` diff_mean
  <fct>         <dbl>                <dbl>     <dbl>
1 0-5           1393.                1367.     -26.3
2 6-12          2312.                2052.    -260.
3 13-17         3955.                3904.     -50.9
4 18-21         9960.               10133.     173.
5 22-50        40924.               40188.    -736.
6 51+          55585                52670.   -2915.
Question: Are the data sufficient evidence of ethnic discrimination in DDS expenditures when comparing Hispanics with White non-Hispanics?
This case study is an example of a type of confounding known as Simpson’s paradox.
Simpson’s paradox happens when an association observed in several groups disappears or reverses direction when the groups are combined.
In other words, an association between two variables \(X\) and \(Y\) may disappear or reverse direction once data are partitioned into subpopulations based on a third variable \(Z\) (i.e., a confounding variable).
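The idea can be sketched with a small made-up example. All numbers here are hypothetical and are not DDS data: the groups A and B, the cohorts, and the spend values are invented so that the arithmetic is easy to follow. Group B has more people in the high-spending cohort, so its overall mean is higher even though the two groups have identical means within each cohort.

```r
library(dplyr)

# Hypothetical data: group A is mostly "young", group B is mostly "old"
toy <- tibble::tibble(
  group  = c("A", "A", "A", "A", "B", "B", "B", "B"),
  cohort = c("young", "young", "young", "old",
             "young", "old", "old", "old"),
  spend  = c(1000, 1000, 1000, 5000,
             1000, 5000, 5000, 5000)
)

# Combined data: group means differ (A = 2000, B = 4000)
overall <- toy %>%
  group_by(group) %>%
  summarize(ave = mean(spend))

# Partitioned by cohort: group means are equal within each cohort
by_cohort <- toy %>%
  group_by(group, cohort) %>%
  summarize(ave = mean(spend), .groups = "drop")

overall
by_cohort
```

Here cohort plays the role of \(Z\): it is related to both group membership and spending, and the apparent association between group and spending disappears once the data are partitioned by it.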
tidyverse functions
tidyverse is a suite of packages that implement tidy methods for data importing, cleaning, wrangling, and visualizing
Load the tidyverse packages by running the code library(tidyverse)
tidyverse!
%>%
%>% to string together commands in sequence
mutate() to add a new variable to a dataset
select() to select columns (or deselect columns with -variable)
filter() to select specific rows
pivot_wider() to reshape a dataset from a long to a wide format
Summarizing data
tabyl() from the janitor package to make frequency tables of categorical variables
summarize() to get summary statistics of variables
group_by() to group data by categorical variables before finding summaries
tidyverse?
Core packages
These automatically load when loading the tidyverse package
List of all packages:
[1] "broom" "conflicted" "cli" "dbplyr"
[5] "dplyr" "dtplyr" "forcats" "ggplot2"
[9] "googledrive" "googlesheets4" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "ragg"
[21] "readr" "readxl" "reprex" "rlang"
[25] "rstudioapi" "rvest" "stringr" "tibble"
[29] "tidyr" "xml2" "tidyverse"
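A list like the one above can be produced with tidyverse_packages(). The second step is a sketch of one way to see which members are attached in the current session; the names all_pkgs, attached, and core_attached are hypothetical, and the exact package list depends on the installed tidyverse version.

```r
library(tidyverse)

# All packages that belong to the tidyverse (includes "tidyverse" itself)
all_pkgs <- tidyverse_packages()
all_pkgs

# Packages currently attached in this session
attached <- sub("^package:", "", search())

# Tidyverse members that are attached, i.e. usable without another
# library() call (the meta-package itself is dropped from the result)
core_attached <- setdiff(intersect(all_pkgs, attached), "tidyverse")
core_attached
```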