Day 3 code: Data visualization

BSTA 511/611 Fall 2024 Day 5, OHSU

Week 3

Author

Meike Niederhausen, PhD

Published

October 16, 2024

Load packages

Packages need to be loaded every time you restart R or render an Qmd file

# run these every time you open Rstudio
library(tidyverse)    
library(oibiostat)
library(ggridges)
library(janitor)
library(rstatix)
library(knitr)
library(gtsummary) # NEW!!

You can check whether a package has been loaded or not
- by looking at the Packages tab and
- seeing whether it has been checked off or not

Case study: discrimination in developmental disability support (1.7.1)

In the US, individuals with developmental disabilities typically receive services and support from state governments.
- California allocates funds to developmentally disabled residents through the Department of Developmental Services (DDS)
- Recipients of DDS funds are referred to as “consumers.”
Dataset dds.discr
- sample of 1,000 DDS consumers (out of a total of ~ 250,000)
- data include age, gender, race/ethnicity, and annual DDS financial support per consumer
Previous research
- Researchers examined expenditures on consumers by ethnicity
- Found that the mean annual expenditure on Hispanics was less than that on White non-Hispanics.
Result: an allegation of ethnic discrimination was brought against the California DDS.
Question: Are the data sufficient evidence of ethnic discrimination?
See Section 1.7.1 for more details

Load `dds.discr` dataset from `oibiostat` package

The textbook’s datasets are in the R package oibiostat
Make sure the oibiostat package is installed before running the code below.
Load the oibiostat package and the dataset dds.discr

the code below needs to be run every time you restart R or render a Qmd file

library(oibiostat)
data("dds.discr")

After loading the dataset dds.discr using data("dds.discr"), you will see dds.discr in the Data list of the Environment window.

Getting to know the dataset

dim(dds.discr)

[1] 1000    6

names(dds.discr)

[1] "id"           "age.cohort"   "age"          "gender"       "expenditures"
[6] "ethnicity"

length(unique(dds.discr$id)) # How many unique id's are there?

[1] 1000

`str()`

We previously used the base R structure command str() to get information about variable types in a dataset.
Note this dataset is a tibble instead of a data.frame

str(dds.discr)      # base R

tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
 $ id          : int [1:1000] 10210 10409 10486 10538 10568 10690 10711 10778 10820 10823 ...
 $ age.cohort  : Factor w/ 6 levels "0-5","6-12","13-17",..: 3 5 1 4 3 3 3 3 3 3 ...
 $ age         : int [1:1000] 17 37 3 19 13 15 13 17 14 13 ...
 $ gender      : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 1 1 2 1 2 ...
 $ expenditures: int [1:1000] 2113 41924 1454 6400 4412 4566 3915 3873 5021 2887 ...
 $ ethnicity   : Factor w/ 8 levels "American Indian",..: 8 8 4 4 8 4 8 3 8 4 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_integer(),
  ..   `Age Cohort` = col_character(),
  ..   Age = col_integer(),
  ..   Gender = col_character(),
  ..   Expenditures = col_integer(),
  ..   Ethnicity = col_character()
  .. )

`glimpse()`

New: glimpse()

Use glimpse() from the tidyverse package (technically it’s from the dplyr package) to get information about variable types.
glimpse() tends to have nicer output for tibbles than str()

library(tidyverse)
glimpse(dds.discr)  # from tidyverse package (dplyr)

Rows: 1,000
Columns: 6
$ id           <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort   <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age          <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ gender       <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ ethnicity    <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…

`summary()`

We previously used the base R structure command summary() to get summary information about variables

summary(dds.discr)      # base R

       id        age.cohort       age          gender     expenditures  
 Min.   :10210   0-5  : 82   Min.   : 0.0   Female:503   Min.   :  222  
 1st Qu.:31809   6-12 :175   1st Qu.:12.0   Male  :497   1st Qu.: 2899  
 Median :55384   13-17:212   Median :18.0                Median : 7026  
 Mean   :54663   18-21:199   Mean   :22.8                Mean   :18066  
 3rd Qu.:76135   22-50:226   3rd Qu.:26.0                3rd Qu.:37713  
 Max.   :99898   51+  :106   Max.   :95.0                Max.   :75098  
                                                                        
              ethnicity  
 White not Hispanic:401  
 Hispanic          :376  
 Asian             :129  
 Black             : 59  
 Multi Race        : 26  
 American Indian   :  4  
 (Other)           :  5

Getting to know the dataset: `tbl_summary()`

New: Use tbl_summary() from the gtsummary package to get summary information

# library(gtsummary)
tbl_summary(dds.discr)  # from package gtsummary

Characteristic	N = 1,000 ¹
id	55,385 (31,759, 76,205)
age.cohort
0-5	82 (8.2%)
6-12	175 (18%)
13-17	212 (21%)
18-21	199 (20%)
22-50	226 (23%)
51+	106 (11%)
age	18 (12, 26)
gender
Female	503 (50%)
Male	497 (50%)
expenditures	7,026 (2,898, 37,718)
ethnicity
American Indian	4 (0.4%)
Asian	129 (13%)
Black	59 (5.9%)
Hispanic	376 (38%)
Multi Race	26 (2.6%)
Native Hawaiian	3 (0.3%)
Other	2 (0.2%)
White not Hispanic	401 (40%)
¹ Median (Q1, Q3); n (%)

Visualize numerical variables with `ggplot`

Histograms

What is being measured on the vertical axes?

ggplot(data = dds.discr, aes(x = age)) +
  geom_histogram()

ggplot(data = dds.discr, aes(x = expenditures)) +
  geom_histogram()

Histograms showing proportions

ggplot(data = dds.discr, aes(x = age)) +
  geom_histogram(aes(y = stat(density)))

Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

ggplot(data = dds.discr, aes(x = age)) +
  geom_histogram(aes(y = stat(density))) +  
  scale_y_continuous(labels = scales::percent_format())

Density plots

What is being measured on the vertical axes?

ggplot(data = dds.discr, aes(x = age)) +
  geom_density()

ggplot(data = dds.discr, aes(x = age)) +
  geom_histogram()

Dot plots (better for smaller samples)

What is being measured on the vertical axes?

ggplot(data = dds.discr, aes(x = age)) +
  geom_dotplot(binwidth =1)

ggplot(data = dds.discr, aes(x = age)) +
  geom_histogram(binwidth =1)

Boxplots

ggplot(data = dds.discr, aes(x = age)) + 
  geom_boxplot()

ggplot(data = dds.discr, aes(y = age)) + 
  geom_boxplot()

Visualizing relationships between numerical and categorical variables (1.6.3)

Side-by-side boxplots

ggplot(data = dds.discr, 
       aes(x = expenditures, y = ethnicity)) + 
  geom_boxplot() + 
  labs(x = "Annual Expenditures ($)", 
       y = "Race and ethnicity")

Can you determine the following using boxplots?

distribution shape
sample size

Side-by-side boxplots with data points

ggplot(data = dds.discr, 
       aes(x = expenditures, y = ethnicity)) + 
  geom_boxplot(color="darkgrey") + 
  labs(x = "Annual Expenditures ($)",
       y = "Race and ethnicity") +
  geom_jitter(aes(color = ethnicity),
              alpha = 0.3,
              show.legend = FALSE,
              position = position_jitter(height = 0.4)
              )

Can you determine the following using boxplots?

distribution shape
sample size

Density plots by group

ggplot(data = dds.discr, 
       aes(x = expenditures,
           color = ethnicity)) + 
  geom_density() + 
  labs(x = "Annual Expenditures ($)")

Ridgeline plot

# library(ggridges)
ggplot(data = dds.discr,
       aes(x = expenditures, y = ethnicity,      
           fill = ethnicity)) + 
  geom_density_ridges(
    alpha = 0.3,      
    show.legend = FALSE) +
  labs(x = "Annual Expenditures ($)",
       y = "Race and ethnicity",
       title = "Expenditures by race and ethnicity")

Transforming data (1.4.5)

We sometimes apply a transformation to highly skewed data to make it more symmetric
Log transformations are often used for skewed right data

x = expenditures

ggplot(data = dds.discr, 
       aes(x = expenditures)) +  
  geom_density()

x = log(expenditures)

ggplot(data = dds.discr, 
       aes(x = log(expenditures))) +  
  geom_density()

Relationships between two numerical variables (1.6.1)

Scatterplots

ggplot(data = dds.discr, 
       aes(x = age, y = expenditures)) + 
  geom_point() +       
  labs(x = "Age",
       y = "Annual Expenditures ($)")

Response vs. explanatory variables (Section 1.2.3) - A response variable measures the outcome of interest in a study - A study will typically examine whether the values of a response variable differ as values of an explanatory variable change

Describe the association between the variables

(Pearson) Correlation coefficient \(r\)

The (Peasron) correlation coefficient of variables \(x\) and \(y\) can be computed using the formula \[r = \frac{1}{n-1}\sum_{i=1}^{n}\Big(\frac{x_i - \bar{x}}{s_x}\Big)\Big(\frac{y_i - \bar{y}}{s_y}\Big)\] where * \((x_1,y_1),(x_2,y_2),...,(x_n,y_n)\) are the \(n\) paired values of the variables \(x\) and \(y\) * \(s_x\) and \(s_y\) are the sample standard deviations of the variables \(x\) and \(y\), respectively

cor(dds.discr$age, dds.discr$expenditures)

[1] 0.8432422

Scatterplots with color-coded dots

ggplot(data = dds.discr, 
       aes(x = age, y = expenditures,
           color = ethnicity)) +   
  geom_point(alpha = .5) +       
  labs(x = "Age",
       y = "Annual Expenditures ($)")

Categorical data (1.5) and Relationships between two categorical variables (1.6.2)

Barplots

Counts

ggplot(data = dds.discr, 
       aes(x = ethnicity)) +
  geom_bar()

percentages

ggplot(data = dds.discr, 
       aes(x = ethnicity)) +
  geom_bar(aes(y = stat(prop), group = 1)) + 
  scale_y_continuous(labels = scales::percent_format())

Barplots with 2 variables: segmented bar plots

ggplot(data = dds.discr, 
       aes(x = ethnicity,
           fill = age.cohort)) + 
  geom_bar()

ggplot(data = dds.discr, 
       aes(x = ethnicity,
           fill = age.cohort)) + 
  geom_bar(position = "fill")

Summarizing categorical data and some data wrangling

Frequency tables: `count()`

count is from the dplyr package
the output is a long tibble, and not a “nice” table

dds.discr %>% count(ethnicity)

# A tibble: 8 × 2
  ethnicity              n
  <fct>              <int>
1 American Indian        4
2 Asian                129
3 Black                 59
4 Hispanic             376
5 Multi Race            26
6 Native Hawaiian        3
7 Other                  2
8 White not Hispanic   401

dds.discr %>% 
  count(ethnicity, age.cohort)

# A tibble: 35 × 3
   ethnicity       age.cohort     n
   <fct>           <fct>      <int>
 1 American Indian 13-17          1
 2 American Indian 22-50          1
 3 American Indian 51+            2
 4 Asian           0-5            8
 5 Asian           6-12          18
 6 Asian           13-17         20
 7 Asian           18-21         41
 8 Asian           22-50         29
 9 Asian           51+           13
10 Black           0-5            3
# ℹ 25 more rows

How to use the pipe `%>%`

The pipe operator %>% strings together commands to be performed sequentially

dds.discr %>% head(n=3)      # pronounce %>% as "then"

# A tibble: 3 × 6
     id age.cohort   age gender expenditures ethnicity         
  <int> <fct>      <int> <fct>         <int> <fct>             
1 10210 13-17         17 Female         2113 White not Hispanic
2 10409 22-50         37 Male          41924 White not Hispanic
3 10486 0-5            3 Male           1454 Hispanic

Always first list the tibble that the commands are being applied to
Can use multiple pipes to run multiple commands in sequence
- What does the following code do?

dds.discr %>% head(n=3) %>% summary()

       id        age.cohort      age        gender   expenditures  
 Min.   :10210   0-5  :1    Min.   : 3   Female:1   Min.   : 1454  
 1st Qu.:10310   6-12 :0    1st Qu.:10   Male  :2   1st Qu.: 1784  
 Median :10409   13-17:1    Median :17              Median : 2113  
 Mean   :10368   18-21:0    Mean   :19              Mean   :15164  
 3rd Qu.:10448   22-50:1    3rd Qu.:27              3rd Qu.:22018  
 Max.   :10486   51+  :0    Max.   :37              Max.   :41924  
                                                                   
              ethnicity
 White not Hispanic:2  
 Hispanic          :1  
 American Indian   :0  
 Asian             :0  
 Black             :0  
 Multi Race        :0  
 (Other)           :0

Frequency tables: `janitor` package’s `tabyl` function

# default table
dds.discr %>% 
  tabyl(ethnicity)

          ethnicity   n percent
    American Indian   4   0.004
              Asian 129   0.129
              Black  59   0.059
           Hispanic 376   0.376
         Multi Race  26   0.026
    Native Hawaiian   3   0.003
              Other   2   0.002
 White not Hispanic 401   0.401

adorn_ your table!

dds.discr %>% 
  tabyl(ethnicity) %>%
  adorn_totals("row") %>% 
  adorn_pct_formatting(digits=2)

          ethnicity    n percent
    American Indian    4   0.40%
              Asian  129  12.90%
              Black   59   5.90%
           Hispanic  376  37.60%
         Multi Race   26   2.60%
    Native Hawaiian    3   0.30%
              Other    2   0.20%
 White not Hispanic  401  40.10%
              Total 1000 100.00%

Relative frequency table

A relative frequency table shows proportions (or percentages) instead of counts
Below I removed (deselected) the counts column (n) to create a relative frequency table

dds.discr %>% 
  tabyl(ethnicity) %>%
  adorn_totals("row") %>% 
  adorn_pct_formatting(digits=2) %>%   
  select(-n) # remove column with variable name n

          ethnicity percent
    American Indian   0.40%
              Asian  12.90%
              Black   5.90%
           Hispanic  37.60%
         Multi Race   2.60%
    Native Hawaiian   0.30%
              Other   0.20%
 White not Hispanic  40.10%
              Total 100.00%

Contingency tables (two-way tables)

Contingency tables summarize data for two categorical variables
- with each value in the table representing the number of times
  a particular combination of outcomes occurs
Row & column totals
are sometimes called marginal totals

dds.discr %>% 
  tabyl(ethnicity, gender) %>%    
  adorn_totals(c("row", "col"))

          ethnicity Female Male Total
    American Indian      3    1     4
              Asian     61   68   129
              Black     26   33    59
           Hispanic    192  184   376
         Multi Race     13   13    26
    Native Hawaiian      2    1     3
              Other      1    1     2
 White not Hispanic    205  196   401
              Total    503  497  1000

Contingency tables with percentages

dds.discr %>% 
  tabyl(ethnicity, age.cohort) %>%
  adorn_totals(c("row")) %>%
  adorn_percentages("row") %>%   
  adorn_pct_formatting(digits=0) %>%    
  adorn_ns()

          ethnicity      0-5      6-12      13-17     18-21     22-50       51+
    American Indian  0%  (0)  0%   (0)  25%   (1)  0%   (0) 25%   (1) 50%   (2)
              Asian  6%  (8) 14%  (18)  16%  (20) 32%  (41) 22%  (29) 10%  (13)
              Black  5%  (3) 19%  (11)  20%  (12) 15%   (9) 29%  (17) 12%   (7)
           Hispanic 12% (44) 24%  (91)  27% (103) 21%  (78) 11%  (43)  5%  (17)
         Multi Race 27%  (7) 35%   (9)  27%   (7)  8%   (2)  4%   (1)  0%   (0)
    Native Hawaiian  0%  (0)  0%   (0)   0%   (0)  0%   (0) 67%   (2) 33%   (1)
              Other  0%  (0)  0%   (0) 100%   (2)  0%   (0)  0%   (0)  0%   (0)
 White not Hispanic  5% (20) 11%  (46)  17%  (67) 17%  (69) 33% (133) 16%  (66)
              Total  8% (82) 18% (175)  21% (212) 20% (199) 23% (226) 11% (106)

Summarizing numeric data

Mean annual DDS expenditures by race/ethnicity

mean(dds.discr$expenditures)

[1] 18065.79

dds.discr %>% 
  summarize(
    ave = mean(expenditures),
    SD = sd(expenditures),
    med = median(expenditures))

# A tibble: 1 × 3
     ave     SD   med
   <dbl>  <dbl> <dbl>
1 18066. 19543.  7026

dds.discr %>% 
  group_by(ethnicity) %>% 
  summarize(
    ave = mean(expenditures),
    SD = sd(expenditures),
    med = median(expenditures))

# A tibble: 8 × 4
  ethnicity             ave     SD    med
  <fct>               <dbl>  <dbl>  <dbl>
1 American Indian    36438. 25694. 41818.
2 Asian              18392. 19209.  9369 
3 Black              20885. 20549.  8687 
4 Hispanic           11066. 15630.  3952 
5 Multi Race          4457.  7332.  2622 
6 Native Hawaiian    42782.  6576. 40727 
7 Other               3316.  1836.  3316.
8 White not Hispanic 24698. 20604. 15718

`get_summary_stats()` from `rstatix` package

dds.discr %>% get_summary_stats()

# A tibble: 3 × 13
  variable         n   min   max median     q1     q3   iqr    mad   mean     sd
  <fct>        <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 id            1000 10210 99898 55384. 31809. 76135. 44326 3.27e4 5.47e4 2.56e4
2 age           1000     0    95    18     12     26     14 1.04e1 2.28e1 1.85e1
3 expenditures  1000   222 75098  7026   2899. 37713. 34814 7.76e3 1.81e4 1.95e4
# ℹ 2 more variables: se <dbl>, ci <dbl>

dds.discr %>% 
  group_by(ethnicity) %>%
  get_summary_stats(expenditures, type = "common")

# A tibble: 8 × 11
  ethnicity variable     n   min   max median    iqr   mean     sd     se     ci
  <fct>     <fct>    <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 American… expendi…     4  3726 58392 41818. 34085. 36438. 25694. 12847. 40885.
2 Asian     expendi…   129   374 75098  9369  30892  18392. 19209.  1691.  3346.
3 Black     expendi…    59   240 60808  8687  37987  20885. 20549.  2675.  5355.
4 Hispanic  expendi…   376   222 65581  3952   7961. 11066. 15630.   806.  1585.
5 Multi Ra… expendi…    26   669 38619  2622   2060.  4457.  7332.  1438.  2962.
6 Native H… expendi…     3 37479 50141 40727   6331  42782.  6576.  3797. 16337.
7 Other     expendi…     2  2018  4615  3316.  1298.  3316.  1836.  1298. 16499.
8 White no… expendi…   401   340 68890 15718  39157  24698. 20604.  1029.  2023.

How to force all output to be shown?

Use kable() from the knitr package.

dds.discr %>% get_summary_stats() %>% kable()

variable	n	min	max	median	q1	q3	iqr	mad	mean	sd	se	ci
id	1000	10210	99898	55384.5	31808.75	76134.75	44326	32734.325	54662.85	25643.673	810.924	1591.310
age	1000	0	95	18.0	12.00	26.00	14	10.378	22.80	18.462	0.584	1.146
expenditures	1000	222	75098	7026.0	2898.75	37712.75	34814	7760.670	18065.79	19542.831	617.999	1212.724

dds.discr %>% 
  group_by(ethnicity) %>%
  get_summary_stats(expenditures, type = "common") %>% 
  kable()

ethnicity	variable	n	min	max	median	iqr	mean	sd	se	ci
American Indian	expenditures	4	3726	58392	41817.5	34085.25	36438.250	25693.912	12846.956	40884.748
Asian	expenditures	129	374	75098	9369.0	30892.00	18392.372	19209.225	1691.278	3346.482
Black	expenditures	59	240	60808	8687.0	37987.00	20884.593	20549.274	2675.288	5355.170
Hispanic	expenditures	376	222	65581	3952.0	7961.25	11065.569	15629.847	806.048	1584.940
Multi Race	expenditures	26	669	38619	2622.0	2059.75	4456.731	7332.135	1437.950	2961.514
Native Hawaiian	expenditures	3	37479	50141	40727.0	6331.00	42782.333	6576.462	3796.922	16336.838
Other	expenditures	2	2018	4615	3316.5	1298.50	3316.500	1836.356	1298.500	16499.007
White not Hispanic	expenditures	401	340	68890	15718.0	39157.00	24697.549	20604.376	1028.933	2022.793

Back to research question

Case study: discrimination in developmental disability support (1.7.1)

Previous research
- Researchers examined DDS expenditures for developmentally disabled residents by ethnicity
- Found that the mean annual expenditures on Hispanics was less than that on White non-Hispanics.
Result: an allegation of ethnic discrimination was brought against the California DDS.
Question: Are the data sufficient evidence of ethnic discrimination?

Load packages

Case study: discrimination in developmental disability support (1.7.1)

Load dds.discr dataset from oibiostat package

Getting to know the dataset

str()

glimpse()

summary()

Getting to know the dataset: tbl_summary()

Visualize numerical variables with ggplot

Histograms

Histograms showing proportions

Density plots

Dot plots (better for smaller samples)

Boxplots

Visualizing relationships between numerical and categorical variables (1.6.3)

Side-by-side boxplots

Side-by-side boxplots with data points

Density plots by group

Ridgeline plot

Transforming data (1.4.5)

Relationships between two numerical variables (1.6.1)

Scatterplots

(Pearson) Correlation coefficient \(r\)

Scatterplots with color-coded dots

Categorical data (1.5) and Relationships between two categorical variables (1.6.2)

Barplots

Barplots with 2 variables: segmented bar plots

Barplots with 2 variables: side-by-side bar plots

Summarizing categorical data and some data wrangling

Frequency tables: count()

How to use the pipe %>%

Frequency tables: janitor package’s tabyl function

Relative frequency table

Contingency tables (two-way tables)

Contingency tables with percentages

Summarizing numeric data

Mean annual DDS expenditures by race/ethnicity

get_summary_stats() from rstatix package

How to force all output to be shown?

Back to research question

Load `dds.discr` dataset from `oibiostat` package

`str()`

`glimpse()`

`summary()`

Getting to know the dataset: `tbl_summary()`

Visualize numerical variables with `ggplot`

Frequency tables: `count()`

How to use the pipe `%>%`

Frequency tables: `janitor` package’s `tabyl` function

`get_summary_stats()` from `rstatix` package