library(oibiostat)
data("dds.discr")
Day 3 code: Data visualization
BSTA 511/611, OHSU
Week 2
Back to research question
Case study: discrimination in developmental disability support (1.7.1)
- Previous research
- Researchers examined DDS expenditures for developmentally disabled residents by ethnicity
- Found that the mean annual expenditures on Hispanics was less than that on White non-Hispanics.
- Result: an allegation of ethnic discrimination was brought against the California DDS.
- Question: Are the data sufficient evidence of ethnic discrimination?
Load dds.discr
dataset from oibiostat
package
The textbook’s datasets are in the R package
oibiostat
Make sure the
oibiostat
package is installed before running the code below.Load the
oibiostat
package and the datasetdds.discr
the code below needs to be run every time you restart R or render a Qmd file
- After loading the dataset
dds.discr
usingdata("dds.discr")
, you will seedds.discr
in the Data list of the Environment window.
glimpse()
New: glimpse()
- Use
glimpse()
from thetidyverse
package (technically it’s from thedplyr
package) to get information about variable types. glimpse()
tends to have nicer output fortibbles
thanstr()
library(tidyverse)
glimpse(dds.discr) # from tidyverse package (dplyr)
Rows: 1,000
Columns: 6
$ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ gender <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ ethnicity <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
Recall previous data viz
ggplot(data = dds.discr,
aes(x = expenditures,
y = ethnicity)) +
geom_boxplot(color="darkgrey") +
labs(x = "Annual Expenditures ($)",
y = "Race and ethnicity") +
geom_jitter(
aes(color = ethnicity),
alpha = 0.3,
show.legend = FALSE,
position = position_jitter(height = 0.4))
ggplot(data = dds.discr,
aes(x = age,
y = expenditures)) +
geom_point() +
labs(x = "Age",
y = "Annual Expenditures ($)")
Visualize in more detail: ethnicity, age, and expenditures
<- dds.discr %>%
dds.discr_Hips_WhnH filter(ethnicity == "White not Hispanic" | ethnicity == "Hispanic" ) %>%
droplevels() # remove empty factor levels
ggplot(data = dds.discr_Hips_WhnH,
aes(x = expenditures,
y = age.cohort)) +
geom_boxplot(color="darkgrey") +
facet_grid(rows = "ethnicity") +
labs(x = "Annual Expenditures ($)",
y = "Race and ethnicity") +
geom_jitter(
aes(color = ethnicity),
alpha = 0.3,
show.legend = FALSE,
position = position_jitter(
height = 0.4))
Mean annual DDS expenditures by race/ethnicity
default long format
<- dds.discr_Hips_WhnH %>%
mean_expend group_by(ethnicity, age.cohort)%>%
summarize(ave = mean(expenditures))
mean_expend
# A tibble: 12 × 3
# Groups: ethnicity [2]
ethnicity age.cohort ave
<fct> <fct> <dbl>
1 Hispanic 0-5 1393.
2 Hispanic 6-12 2312.
3 Hispanic 13-17 3955.
4 Hispanic 18-21 9960.
5 Hispanic 22-50 40924.
6 Hispanic 51+ 55585
7 White not Hispanic 0-5 1367.
8 White not Hispanic 6-12 2052.
9 White not Hispanic 13-17 3904.
10 White not Hispanic 18-21 10133.
11 White not Hispanic 22-50 40188.
12 White not Hispanic 51+ 52670.
wide format
<- mean_expend %>%
mean_expend_wide pivot_wider(names_from = ethnicity,
values_from = ave)
mean_expend_wide
# A tibble: 6 × 3
age.cohort Hispanic `White not Hispanic`
<fct> <dbl> <dbl>
1 0-5 1393. 1367.
2 6-12 2312. 2052.
3 13-17 3955. 3904.
4 18-21 9960. 10133.
5 22-50 40924. 40188.
6 51+ 55585 52670.
Differences in mean annual DDS expenditures by age cohort and race/ethnicity
<- mean_expend_wide %>%
mean_expend_wide mutate(diff_mean = `White not Hispanic` - Hispanic)
mean_expend_wide
# A tibble: 6 × 4
age.cohort Hispanic `White not Hispanic` diff_mean
<fct> <dbl> <dbl> <dbl>
1 0-5 1393. 1367. -26.3
2 6-12 2312. 2052. -260.
3 13-17 3955. 3904. -50.9
4 18-21 9960. 10133. 173.
5 22-50 40924. 40188. -736.
6 51+ 55585 52670. -2915.
Question: Are the data sufficient evidence of ethnic discrimination in DDS expenditures when comparing Hispanics with White non-Hispanics?
Summary of data wrangling so far
- The pipe
%>%
to string together commands in sequence mutate()
to add a new variable to a datasetselect()
to select columns (or deselect columns with -variable)filter()
to select specific rowspivot_wider()
to reshape a dataset from a long to a wide format
Summarizing data
tabyl()
fromjanitor
package to make frequency tables of categorical variablessummarize()
to get summary statistics of variablesgroup_by()
to group data by categorical variables before finding summaries