5: Data summarizing, reshaping, and wrangling with multiple tables

BSTA 526: R Programming for Health Data Science

Author
Affiliation

Meike Niederhausen, PhD & Jessica Minnier, PhD

OHSU-PSU School of Public Health

Published

February 5, 2026

Modified

February 12, 2026

1 Welcome to R Programming: Part 5!

Today we will cover creating new variables and summarizing existing variables.


Before you get started:

Remember to save this notebook under a new name, such as part_05_b526_YOURNAME.qmd.


  • Load the packages & smoke_complete.xlsx data in the setup code chunk

1.1 Learning Objectives

  • Learn and apply loading comma separated and tab separated datasets using the readr package
  • Learn different techniques for cleaning data using functions from the tidyr, forcats, and stringrpackages
    • Practice cleaning data with a real dataset
  • Learn and apply bind_rows() to combine rows from two or more datasets
  • Learn about ways to merge columns from different datasets
    • Apply inner_join() and left_join() to merge columns from different datasets
  • Learn about wide vs long data and how to reshape data
    • apply pivot_longer() to make a wide dataset long

2 Overview

  • §0: Read in data files
    • questionnaire data (1 file)
    • demographics data (3 files for different time periods)
  • §1 Wrangle questionnaire data
    • Goal 1: group cigarettes_per_day into 3 categories
      • case_when(), fct_collapse(), fct_other()
    • Goal 2: remove rows with NAs with drop_na()
  • §2 Wrangle demographics data
    • Merge together (stack) the 3 files with bind_rows()
    • Clean age_and_grade
      • separate into two columns
      • make values numeric (stringr package)
  • §3 Join questionnaire and demographics data
  • Making data long or wide

3 (§0) Reading in csv and other deliminated files

3.1 About the data YRBSS

  • Today we are moving away from the smoking data to use a new somewhat “messier” data set from the CDC’s Youth Risk Behavior Surveillance System (YRBSS).
    • From webpage: “measures health-related behaviors and experiences that can lead to death and disability among youth and adults”
  • The data we are working with are a subset of data that are available in the R package yrbss.
    • This subset of the package’s data have intentionally been made messy and broken into pieces for us to practice common data cleaning steps and joining different data tables together.
    • Note this sub-sample is actually from survey data, and we are ignoring the sampling weights since we aren’t really focused on analysis right now. So if you try to look at any associations between variables, don’t believe what the data is telling you!

3.2 Loading plain text files

  • Previously: Excel files
    • We’ve usually been reading in Excel files, which is a very common way to save data.
  • It’s actually preferred to save data as plain text files so that you don’t need a proprietary software (i.e. Excel) to open them, and they are more universal.
  • Common plain text files include
    • .csv (columns are comma separated values)
    • .tsv (columns are tab separated values)
    • Check out the csv and tsv files in the data folder.
  • We’ll be using functions from the readr package:
    • read_csv(), read_tsv(), read_delim()

3.3 Load the 3 demographic data files

  • yrbss_demo_1991.tsv
  • yrbss_demo_1993to2007.csv
  • yrbss_demo_2009to2013.csv
# data from one year, 1991, tsv (tab separated) file
yrbss_demo_1991 <- read_tsv(
  file = here("part5", "data","yrbss_demo_1991.tsv"))
Rows: 1000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): sex, age_and_grade, race4, race7
dbl (2): record_id, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# we can also use read_delim and specify the delimiter that separates fields
yrbss_demo_1991 <- read_delim(
  file = here("part5", "data","yrbss_demo_1991.tsv"),
  delim = "\t" # tab delimiter
  )
Rows: 1000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): sex, age_and_grade, race4, race7
dbl (2): record_id, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# data from 1993-2007, csv file
yrbss_demo_1993to2007 <- read_csv(
  file = here("part5", "data","yrbss_demo_1993to2007.csv"))
Rows: 8000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): sex, age_and_grade, race4, race7
dbl (4): record_id, year, bmi, weight_kg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# data from 2009-2013, csv file
yrbss_demo_2009to2013 <- read_csv(
  file = here("part5", "data","yrbss_demo_2009to2013.csv"))
Rows: 3000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): age and grade, sex, Race 4 cat, Race 7 cat
dbl (4): Record id, year, BMI (kg/m^2), Weight (kg)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3.4 Challenge 1 (~5 minutes)

  1. Load the data file “yrbss_questions.csv” using here() and save it as the object yrbss_questions.
  1. Use the clean_names() function to clean up the names of yrbss_questions, and then rename the columns so that they are as follows:
    • record, year, bicycle_helmet, text_driving, ever_smoked, age_first_smoked, days_smoked, cigarettes_per_day, ever_been_bullied
  1. For yrbss_questions, make a cross table (2-way table) of cigarettes_per_day and ever_smoked. Do you see anything unusual or unexpected?

4 (§1) Wrangle questionnaire data yrbss_questions


  • Goal 1: group cigarettes_per_day into 3 categories
  • Goal 2: remove rows with NAs with drop_na()

4.1 Load the questionnaire data

The code below is loading the data from Challenge 1.

yrbss_questions <- 
  read_csv(file = here("part5", "data", "yrbss_questions.csv")) %>% 
  clean_names() %>% 
  rename(
    bicycle_helmet = how_often_wear_bicycle_helmet,
    text_driving = ever_text_while_driving_in_past_30_days,
    age_first_smoked = how_old_when_first_smoked,
    days_smoked = how_many_days_smoked_in_past_30_days,
    cigarettes_per_day = how_many_cigarettes_day_in_past_30_days
  )
Rows: 10000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): How often wear bicycle helmet, Ever text while driving in past 30 d...
dbl (2): record, year
lgl (1): Ever been bullied

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4.2 Goal 1: group cigarettes_per_day into 3 categories

  • Let’s group the cigarettes per day column into three categories:
    • “None”, “0-10”, “>10” cigarettes per day.
  • Below we do this in different ways:
    • Using case_when()
    • Using factor functions from the forcats package
# what are the current categories?
yrbss_questions %>% tabyl(cigarettes_per_day)
       cigarettes_per_day    n percent valid_percent
              1 cigarette  363  0.0363   0.038686987
      11 to 20 cigarettes  139  0.0139   0.014814025
        2 to 5 cigarettes  676  0.0676   0.072045188
       6 to 10 cigarettes  221  0.0221   0.023553235
 Did not smoke cigarettes 7441  0.7441   0.793029948
    Less than 1 cigarette  451  0.0451   0.048065651
  More than 20 cigarettes   92  0.0092   0.009804966
                     <NA>  617  0.0617            NA

4.3 Goal 1 with case_when()

  • Use case_when() to group the cigarettes per day column into three categories:
    • “None”, “0-10”, “>10” cigarettes per day.

Why does this code not work as intended?

# note that we're not saving this attempt
yrbss_questions %>%      
  mutate(
    cigarettes10 = case_when(
      cigarettes_per_day == "Did not smoke cigarettes" ~ "None",
      cigarettes_per_day %in% 
        c("11 to 20 cigarettes", "More than 20 cigarettes") ~ ">10",
      .default = "0-10"
      ),
    cigarettes10 = factor(
      cigarettes10,
      levels = c("None","0-10",">10")) # make sure levels are in order
    ) %>%
  tabyl(cigarettes_per_day, cigarettes10) %>%
  adorn_title()
                          cigarettes10         
       cigarettes_per_day         None 0-10 >10
              1 cigarette            0  363   0
      11 to 20 cigarettes            0    0 139
        2 to 5 cigarettes            0  676   0
       6 to 10 cigarettes            0  221   0
 Did not smoke cigarettes         7441    0   0
    Less than 1 cigarette            0  451   0
  More than 20 cigarettes            0    0  92
                     <NA>            0  617   0
# this time we're saving our changes:

yrbss_questions <- yrbss_questions %>% mutate(
  cigarettes10 = case_when(
    cigarettes_per_day == "Did not smoke cigarettes" ~ "None",
    cigarettes_per_day %in% 
      c("11 to 20 cigarettes", "More than 20 cigarettes") ~ ">10",
    is.na(cigarettes_per_day) ~ NA, # need to specify NA type
    # is.na(cigarettes_per_day) ~ as.character(NA), # also would work
    TRUE ~ "0-10"
  ),
  cigarettes10 = factor(
    cigarettes10,
    levels = c("None","0-10",">10")) # make sure levels are in order
) 

yrbss_questions %>%
  tabyl(cigarettes_per_day, cigarettes10) %>%
  adorn_title()
                          cigarettes10             
       cigarettes_per_day         None 0-10 >10 NA_
              1 cigarette            0  363   0   0
      11 to 20 cigarettes            0    0 139   0
        2 to 5 cigarettes            0  676   0   0
       6 to 10 cigarettes            0  221   0   0
 Did not smoke cigarettes         7441    0   0   0
    Less than 1 cigarette            0  451   0   0
  More than 20 cigarettes            0    0  92   0
                     <NA>            0    0   0 617
  • Starting with dplyr 1.1.0 we no longer need to specify the type of NA
    • such as NA_character_, NA_real_, or NA_integer_
  • See presentation on the new easier to use case_when() from posit::conference 2023, starting at 0:31.


Back in the old days we had to be careful… and you will likely still see case_when() examples on the internet using NA_character_, NA_real_, or NA_integer_

  • See this webpage for discussion of NAtypes when usingcase_when().

4.4 Goal 1 with factors using forcats

  • Use factor functions from the forcats package to group the cigarettes per day column into three categories:

    • “None”, “0-10”, “>10” cigarettes per day.
  • I encourage you to look at the intro to forcats vignette and cheatsheet here.

  • Instead of using case_when(),

    • we could instead collapse the cigarettes per day column into three categories by creating a factor, and
    • using the handy forcats functions fct_collapse() and fct_other().

Note: This is a somewhat contrived demonstration of how to use fct_collapse and fct_other together so you can see what each of these functions does, but you could also just use fct_collapse() by itself.

yrbss_questions %>% tabyl(cigarettes_per_day)
       cigarettes_per_day    n percent valid_percent
              1 cigarette  363  0.0363   0.038686987
      11 to 20 cigarettes  139  0.0139   0.014814025
        2 to 5 cigarettes  676  0.0676   0.072045188
       6 to 10 cigarettes  221  0.0221   0.023553235
 Did not smoke cigarettes 7441  0.7441   0.793029948
    Less than 1 cigarette  451  0.0451   0.048065651
  More than 20 cigarettes   92  0.0092   0.009804966
                     <NA>  617  0.0617            NA

4.5 Steps 1 & 2: fct_other() to create the 0-10 factor level

yrbss_questions <- yrbss_questions %>%
  mutate(
    # step 1: make cigarettes_per_day a factor
    cigarettes_per_day = factor(cigarettes_per_day),
    #
    # step 2: create an other group called "0-10" for the many 0-10 cig/day levels
    cigarettes10fac =
      fct_other(
        cigarettes_per_day,
        # specify levels to keep (there's also a drop option)
        keep = c(
          "Did not smoke cigarettes",
          "11 to 20 cigarettes",
          "More than 20 cigarettes"
        ),
        # specify name for collapsed remaining levels
        other_level = "0-10"  # default is "Other" if not specified
      )
  )

# check what happened
yrbss_questions %>%
  tabyl(cigarettes_per_day, cigarettes10fac) %>%
  adorn_title()
                              cigarettes10fac                         
       cigarettes_per_day 11 to 20 cigarettes Did not smoke cigarettes
              1 cigarette                   0                        0
      11 to 20 cigarettes                 139                        0
        2 to 5 cigarettes                   0                        0
       6 to 10 cigarettes                   0                        0
 Did not smoke cigarettes                   0                     7441
    Less than 1 cigarette                   0                        0
  More than 20 cigarettes                   0                        0
                     <NA>                   0                        0
                                 
 More than 20 cigarettes 0-10 NA_
                       0  363   0
                       0    0   0
                       0  676   0
                       0  221   0
                       0    0   0
                       0  451   0
                      92    0   0
                       0    0 617

4.6 Step 3: fct_collapse() to create the None and >10 factor levels

yrbss_questions <- yrbss_questions %>%
  mutate(
    # step 3: create "None" level and collapse the two "top" levels to one ">10" category
    cigarettes10fac =
      fct_collapse(
        cigarettes10fac,
        "None" = "Did not smoke cigarettes",
        ">10" = c("11 to 20 cigarettes", "More than 20 cigarettes")
      ))

# check what happened
yrbss_questions %>%
  tabyl(cigarettes_per_day, cigarettes10fac) %>%
  adorn_title()
                          cigarettes10fac              
       cigarettes_per_day             >10 None 0-10 NA_
              1 cigarette               0    0  363   0
      11 to 20 cigarettes             139    0    0   0
        2 to 5 cigarettes               0    0  676   0
       6 to 10 cigarettes               0    0  221   0
 Did not smoke cigarettes               0 7441    0   0
    Less than 1 cigarette               0    0  451   0
  More than 20 cigarettes              92    0    0   0
                     <NA>               0    0    0 617
# we could have done all of this with fct_collapse, with a tiny bit more typing

4.7 Steps 1-3 Visual representation

4.8 Goal 2: remove rows with NAs with drop_na()

Note that below we do not remove any rows with missing data from yrbss_questions. All changes are saved to a different tibble.

  • If we want to remove rows that have missingness, we can use the drop_na() function.
  • We can choose to
    • remove rows that have missingness only in a certain column or columns, or
    • we can remove rows that have missingness in any column.
  • See the drop_na() reference for examples.

4.9 drop_na() to remove rows with NA

nrow(yrbss_questions)
[1] 10000
# from the naniar package:
miss_var_summary(yrbss_questions)
# A tibble: 11 × 3
   variable           n_miss pct_miss
   <chr>               <int>    <num>
 1 text_driving         9206    92.1 
 2 ever_been_bullied    7608    76.1 
 3 cigarettes_per_day    617     6.17
 4 cigarettes10          617     6.17
 5 cigarettes10fac       617     6.17
 6 ever_smoked           584     5.84
 7 days_smoked           495     4.95
 8 bicycle_helmet        486     4.86
 9 age_first_smoked      457     4.57
10 record                  0     0   
11 year                    0     0   
# remove rows with missing bicycle_helmet:
tmpdata <- yrbss_questions %>% 
  drop_na(bicycle_helmet) 
nrow(tmpdata)             # how many rows left?
[1] 9514
miss_var_summary(tmpdata) # which rows do not have NA?
# A tibble: 11 × 3
   variable           n_miss pct_miss
   <chr>               <int>    <num>
 1 text_driving         8736    91.8 
 2 ever_been_bullied    7192    75.6 
 3 cigarettes_per_day    578     6.08
 4 cigarettes10          578     6.08
 5 cigarettes10fac       578     6.08
 6 ever_smoked           505     5.31
 7 days_smoked           457     4.80
 8 age_first_smoked      424     4.46
 9 record                  0     0   
10 year                    0     0   
11 bicycle_helmet          0     0   
# remove rows with missing smoking data
tmpdata <- yrbss_questions %>% 
  drop_na(ends_with("smoked")) 
nrow(tmpdata)             # how many rows left?
[1] 8931
miss_var_summary(tmpdata) # which rows do not have NA?
# A tibble: 11 × 3
   variable           n_miss pct_miss
   <chr>               <int>    <num>
 1 text_driving         8198   91.8  
 2 ever_been_bullied    6813   76.3  
 3 bicycle_helmet        375    4.20 
 4 cigarettes_per_day     34    0.381
 5 cigarettes10           34    0.381
 6 cigarettes10fac        34    0.381
 7 record                  0    0    
 8 year                    0    0    
 9 ever_smoked             0    0    
10 age_first_smoked        0    0    
11 days_smoked             0    0    
# remove rows with any missing data
tmpdata <- yrbss_questions %>% 
  drop_na() 
nrow(tmpdata)             # how many rows left?
[1] 711
miss_var_summary(tmpdata) # which rows do not have NA?
# A tibble: 11 × 3
   variable           n_miss pct_miss
   <chr>               <int>    <num>
 1 record                  0        0
 2 year                    0        0
 3 bicycle_helmet          0        0
 4 text_driving            0        0
 5 ever_smoked             0        0
 6 age_first_smoked        0        0
 7 days_smoked             0        0
 8 cigarettes_per_day      0        0
 9 ever_been_bullied       0        0
10 cigarettes10            0        0
11 cigarettes10fac         0        0

5 (§2) Wrangle demographics data

  • Merge together (stack) the 3 files with bind_rows()
  • Clean age_and_grade
    • separate into two columns
    • make values numeric (stringr package)

6 Stack multiple datasets with bind_rows()

6.1 When do we stack data?

  • Often we have data from multiple sources that we need to combine into one data frame.
    • In the most simple scenario, the data are measuring the same or similar things and are from different and distinct populations, i.e. from different labs, different clinics, etc.
  • In this case, we have YRBSS demographic data from 3 different ranges of sampling years.
    • They represent data from 3 different cohorts of students across different times (no overlapping student record ids! This is an important detail).
  • Goal: stack the data on top of each other into one long data set that contains every year’s data.
  • To do this we can use the dplyr function bind_rows().

6.2 Our 3 datasets to stack

Let’s look at our three data sets:

glimpse(yrbss_demo_1991)
Rows: 1,000
Columns: 6
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
glimpse(yrbss_demo_1993to2007)
Rows: 8,000
Columns: 8
$ record_id     <dbl> 121641, 107323, 116829, 116696, 109658, 118948, 107150, …
$ year          <dbl> 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 19…
$ sex           <chr> "Female", "Male", "Female", "Female", "Male", "Female", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ age_and_grade <chr> "17 years old;10th", "14 years old;9th", "15 years old;1…
$ race4         <chr> "Hispanic/Latino", "Black or African American", "White",…
$ race7         <chr> "Hispanic/Latino", "Black or African American", "White",…
glimpse(yrbss_demo_2009to2013)
Rows: 3,000
Columns: 8
$ `Record id`     <dbl> 931939, 927037, 924426, 923751, 929473, 938296, 931286…
$ year            <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, …
$ `age and grade` <chr> "15 years old;9th", "17 years old;11th", "15 years old…
$ sex             <chr> "Female", "Male", "Male", "Male", "Male", "Female", "F…
$ `Race 4 cat`    <chr> "White", "White", NA, "White", "White", "All other rac…
$ `Race 7 cat`    <chr> "White", "White", NA, "White", "White", "Asian", "Mult…
$ `BMI (kg/m^2)`  <dbl> 18.2076, 22.7338, 24.4376, 21.4746, 25.0709, 20.8936, …
$ `Weight (kg)`   <dbl> 52.62, 68.04, 74.84, 68.04, 70.76, 58.97, 58.97, 70.76…

6.3 Stack 1991 and 1993-2007 data

  • We’ll start by binding together/stacking the data from 1991 and 1993-2007.
  • First, let’s check those column names again:
colnames(yrbss_demo_1991)
[1] "record_id"     "year"          "sex"           "age_and_grade"
[5] "race4"         "race7"        
dim(yrbss_demo_1991)
[1] 1000    6
colnames(yrbss_demo_1993to2007)
[1] "record_id"     "year"          "sex"           "bmi"          
[5] "weight_kg"     "age_and_grade" "race4"         "race7"        
dim(yrbss_demo_1993to2007)
[1] 8000    8
  • We can see that BMI and weight are missing from the 1991 data.
  • But bind_rows() still works as intended, filling in NA values in those columns.
yrbss_demo_1991to2007 <- bind_rows(yrbss_demo_1991, 
                                   yrbss_demo_1993to2007)

# note we now have 9000 rows, which is the sum of the 1000 and 8000 rows above
glimpse(yrbss_demo_1991to2007)
Rows: 9,000
Columns: 8
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Let’s check we have data from the years we want:

yrbss_demo_1991to2007 %>% 
  tabyl(year)
 year    n   percent
 1991 1000 0.1111111
 1993 1000 0.1111111
 1995 1000 0.1111111
 1997 1000 0.1111111
 1999 1000 0.1111111
 2001 1000 0.1111111
 2003 1000 0.1111111
 2005 1000 0.1111111
 2007 1000 0.1111111

6.4 Stack 1991-2007 with 2009-2013 data

  • Now we want to combine this dataset and the 2009-2013 data.
  • Let’s check out their info first:
colnames(yrbss_demo_1991to2007)
[1] "record_id"     "year"          "sex"           "age_and_grade"
[5] "race4"         "race7"         "bmi"           "weight_kg"    
dim(yrbss_demo_1991to2007)
[1] 9000    8
colnames(yrbss_demo_2009to2013)
[1] "Record id"     "year"          "age and grade" "sex"          
[5] "Race 4 cat"    "Race 7 cat"    "BMI (kg/m^2)"  "Weight (kg)"  
dim(yrbss_demo_2009to2013)
[1] 3000    8
  • Uh oh, these datasets have different names for the same thing.
  • Also, the columns are all out of order!
  • If we try to bind them together, what happens?
bind_rows(yrbss_demo_1991to2007, yrbss_demo_2009to2013) %>% glimpse()
Rows: 12,000
Columns: 14
$ record_id       <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873…
$ year            <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, …
$ sex             <chr> "Male", "Female", "Male", "Female", "Female", "Male", …
$ age_and_grade   <chr> "17 years old;12th", "17 years old;12th", "16 years ol…
$ race4           <chr> "White", "Black or African American", "White", "White"…
$ race7           <chr> "White", "Black or African American", "White", "White"…
$ bmi             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ weight_kg       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Record id`     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `age and grade` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Race 4 cat`    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Race 7 cat`    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `BMI (kg/m^2)`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Weight (kg)`   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
  • Yikes.
    • What extra columns do we have?
    • What matched and what didn’t?

We need to first do some data cleaning!

  • Let’s make sure our names in the 2009-2013 data match the first dataset’s column names.
    • First use janitor::clean_names to make things a little cleaner:
yrbss_demo_2009to2013 <- yrbss_demo_2009to2013 %>% 
  clean_names()

colnames(yrbss_demo_2009to2013)
[1] "record_id"     "year"          "age_and_grade" "sex"          
[5] "race_4_cat"    "race_7_cat"    "bmi_kg_m_2"    "weight_kg"    
yrbss_demo_2009to2013 <- yrbss_demo_2009to2013 %>%
  rename(race4 = race_4_cat,
         race7 = race_7_cat,
         bmi = bmi_kg_m_2)


# look at the column names now:
colnames(yrbss_demo_1991to2007)
[1] "record_id"     "year"          "sex"           "age_and_grade"
[5] "race4"         "race7"         "bmi"           "weight_kg"    
colnames(yrbss_demo_2009to2013)
[1] "record_id"     "year"          "age_and_grade" "sex"          
[5] "race4"         "race7"         "bmi"           "weight_kg"    
# or, for a side-by-side comparison using cbind (column bind)
cbind(names(yrbss_demo_1991to2007), names(yrbss_demo_2009to2013))
     [,1]            [,2]           
[1,] "record_id"     "record_id"    
[2,] "year"          "year"         
[3,] "sex"           "age_and_grade"
[4,] "age_and_grade" "sex"          
[5,] "race4"         "race4"        
[6,] "race7"         "race7"        
[7,] "bmi"           "bmi"          
[8,] "weight_kg"     "weight_kg"    
# we can check that we changed all the names to match names in the other data
colnames(yrbss_demo_2009to2013) %in% colnames(yrbss_demo_1991to2007)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
sum(colnames(yrbss_demo_2009to2013) %in% colnames(yrbss_demo_1991to2007))
[1] 8
  • Looks like all the names are the same, but they are still out of order.
  • This is ok! bind_rows() will use the ordering from the first data set, and knows how to match them up since they have identical names.

Now let’s bind the data:

yrbss_demo <- bind_rows(yrbss_demo_1991to2007, 
                        yrbss_demo_2009to2013)

# we could also do this with the pipe
yrbss_demo <- yrbss_demo_1991to2007 %>%
  bind_rows(yrbss_demo_2009to2013)

glimpse(yrbss_demo)
Rows: 12,000
Columns: 8
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# check the years in the data:
yrbss_demo %>% tabyl(year)
 year    n    percent
 1991 1000 0.08333333
 1993 1000 0.08333333
 1995 1000 0.08333333
 1997 1000 0.08333333
 1999 1000 0.08333333
 2001 1000 0.08333333
 2003 1000 0.08333333
 2005 1000 0.08333333
 2007 1000 0.08333333
 2009 1000 0.08333333
 2011 1000 0.08333333
 2013 1000 0.08333333

😃 Success!

6.5 Examine the stacked data

  • Before we start with the data cleaning, we need to acquaint ourselves with the data:
glimpse(yrbss_demo)
Rows: 12,000
Columns: 8
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# quick summary of the data
gtsummary::tbl_summary(yrbss_demo)
Characteristic N = 12,0001
record_id 454,489 (223,487, 848,375)
year 2,002 (1,996, 2,008)
sex
    Female 5,918 (50%)
    Male 6,012 (50%)
    Unknown 70
age_and_grade
    12 years old or younger;10th 4 (<0.1%)
    12 years old or younger;11th 4 (<0.1%)
    12 years old or younger;12th 6 (<0.1%)
    12 years old or younger;9th 6 (<0.1%)
    12 years old or younger;NA 9 (<0.1%)
    13 years old;10th 2 (<0.1%)
    13 years old;12th 1 (<0.1%)
    13 years old;9th 14 (0.1%)
    13 years old;NA 1 (<0.1%)
    14 years old;10th 19 (0.2%)
    14 years old;11th 2 (<0.1%)
    14 years old;12th 1 (<0.1%)
    14 years old;9th 1,258 (10%)
    14 years old;NA 7 (<0.1%)
    15 years old;10th 1,182 (9.9%)
    15 years old;11th 28 (0.2%)
    15 years old;12th 3 (<0.1%)
    15 years old;9th 1,457 (12%)
    15 years old;NA 9 (<0.1%)
    16 years old;10th 1,376 (11%)
    16 years old;11th 1,165 (9.7%)
    16 years old;12th 36 (0.3%)
    16 years old;9th 442 (3.7%)
    16 years old;NA 14 (0.1%)
    17 years old;10th 364 (3.0%)
    17 years old;11th 1,327 (11%)
    17 years old;12th 1,248 (10%)
    17 years old;9th 55 (0.5%)
    17 years old;NA 14 (0.1%)
    18 years old or older;10th 46 (0.4%)
    18 years old or older;11th 389 (3.2%)
    18 years old or older;12th 1,430 (12%)
    18 years old or older;9th 17 (0.1%)
    18 years old or older;NA 14 (0.1%)
    NA;10th 4 (<0.1%)
    NA;11th 3 (<0.1%)
    NA;12th 7 (<0.1%)
    NA;9th 2 (<0.1%)
    NA;NA 34 (0.3%)
race4
    All other races 1,697 (14%)
    Black or African American 2,905 (25%)
    Hispanic/Latino 3,102 (26%)
    White 4,072 (35%)
    Unknown 224
race7
    Am Indian / Alaska Native 230 (2.0%)
    Asian 696 (6.0%)
    Black or African American 2,905 (25%)
    Hispanic/Latino 3,102 (27%)
    Multiple - Non-Hispanic 415 (3.6%)
    Native Hawaiian/other PI 110 (1.0%)
    White 4,072 (35%)
    Unknown 470
bmi 22.3 (20.1, 25.6)
    Unknown 4,565
weight_kg 64 (56, 75)
    Unknown 4,629
1 Median (Q1, Q3); n (%)

All of these columns look pretty standard except age and grade. What even is?

7 Create age and grade columns

  • Goal: create two separate numeric columns with just that ages and grades
yrbss_demo %>% tabyl(age_and_grade)
                age_and_grade    n      percent
 12 years old or younger;10th    4 3.333333e-04
 12 years old or younger;11th    4 3.333333e-04
 12 years old or younger;12th    6 5.000000e-04
  12 years old or younger;9th    6 5.000000e-04
   12 years old or younger;NA    9 7.500000e-04
            13 years old;10th    2 1.666667e-04
            13 years old;12th    1 8.333333e-05
             13 years old;9th   14 1.166667e-03
              13 years old;NA    1 8.333333e-05
            14 years old;10th   19 1.583333e-03
            14 years old;11th    2 1.666667e-04
            14 years old;12th    1 8.333333e-05
             14 years old;9th 1258 1.048333e-01
              14 years old;NA    7 5.833333e-04
            15 years old;10th 1182 9.850000e-02
            15 years old;11th   28 2.333333e-03
            15 years old;12th    3 2.500000e-04
             15 years old;9th 1457 1.214167e-01
              15 years old;NA    9 7.500000e-04
            16 years old;10th 1376 1.146667e-01
            16 years old;11th 1165 9.708333e-02
            16 years old;12th   36 3.000000e-03
             16 years old;9th  442 3.683333e-02
              16 years old;NA   14 1.166667e-03
            17 years old;10th  364 3.033333e-02
            17 years old;11th 1327 1.105833e-01
            17 years old;12th 1248 1.040000e-01
             17 years old;9th   55 4.583333e-03
              17 years old;NA   14 1.166667e-03
   18 years old or older;10th   46 3.833333e-03
   18 years old or older;11th  389 3.241667e-02
   18 years old or older;12th 1430 1.191667e-01
    18 years old or older;9th   17 1.416667e-03
     18 years old or older;NA   14 1.166667e-03
                      NA;10th    4 3.333333e-04
                      NA;11th    3 2.500000e-04
                      NA;12th    7 5.833333e-04
                       NA;9th    2 1.666667e-04
                        NA;NA   34 2.833333e-03

7.1 Separate age_and_grade into age and grade

  • Ok, yes, this combined column was combined on purpose, so that we can introduce the very useful function separate_wider_delim().
  • Check out ?separate_wider_delim.
    • Until recently this function was known as just separate(), which has been superseded with separate_wider_delim() andseparate_wider_position().

separate_wider_delim(): Split a string into columns by delimiter

  • In our case, age and grade were merged together with a semicolon (;)
    • which makes it easy to split into two columns
  • Note that these data are NOT “tidy” because they contain two pieces of information in one column. Let’s make them tidy.

(Side note: the opposite of separate_wider_delim() is unite(), look at some examples of this.)

7.2 separate_wider_delim()

yrbss_demo <- yrbss_demo %>%
  separate_wider_delim(
    # the column we are separating
    cols = age_and_grade,
    
    # names of new columns, we know we need two columns, but this can be variable!
    names = c("age", "grade"),
    
    # delimiter to separate into two columns
    delim = ";",
    
    # keep the original column
    cols_remove = FALSE 
    ) %>%
  
  # make sure "NA" turns into R's missing NA value
  mutate(age = na_if(age, "NA"),
         grade = na_if(grade, "NA"))
  • That’s a start. We now have tidy data in the sense that each cell has one piece of information in it.
  • However, both age and grade are character vectors, but we want them to be numeric.
glimpse(yrbss_demo)
Rows: 12,000
Columns: 10
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
yrbss_demo %>% tabyl(age_and_grade, age) %>% adorn_title()
                                                  age                          
                age_and_grade 12 years old or younger 13 years old 14 years old
 12 years old or younger;10th                       4            0            0
 12 years old or younger;11th                       4            0            0
 12 years old or younger;12th                       6            0            0
  12 years old or younger;9th                       6            0            0
   12 years old or younger;NA                       9            0            0
            13 years old;10th                       0            2            0
            13 years old;12th                       0            1            0
             13 years old;9th                       0           14            0
              13 years old;NA                       0            1            0
            14 years old;10th                       0            0           19
            14 years old;11th                       0            0            2
            14 years old;12th                       0            0            1
             14 years old;9th                       0            0         1258
              14 years old;NA                       0            0            7
            15 years old;10th                       0            0            0
            15 years old;11th                       0            0            0
            15 years old;12th                       0            0            0
             15 years old;9th                       0            0            0
              15 years old;NA                       0            0            0
            16 years old;10th                       0            0            0
            16 years old;11th                       0            0            0
            16 years old;12th                       0            0            0
             16 years old;9th                       0            0            0
              16 years old;NA                       0            0            0
            17 years old;10th                       0            0            0
            17 years old;11th                       0            0            0
            17 years old;12th                       0            0            0
             17 years old;9th                       0            0            0
              17 years old;NA                       0            0            0
   18 years old or older;10th                       0            0            0
   18 years old or older;11th                       0            0            0
   18 years old or older;12th                       0            0            0
    18 years old or older;9th                       0            0            0
     18 years old or older;NA                       0            0            0
                      NA;10th                       0            0            0
                      NA;11th                       0            0            0
                      NA;12th                       0            0            0
                       NA;9th                       0            0            0
                        NA;NA                       0            0            0
                                                                 
 15 years old 16 years old 17 years old 18 years old or older NA_
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
         1182            0            0                     0   0
           28            0            0                     0   0
            3            0            0                     0   0
         1457            0            0                     0   0
            9            0            0                     0   0
            0         1376            0                     0   0
            0         1165            0                     0   0
            0           36            0                     0   0
            0          442            0                     0   0
            0           14            0                     0   0
            0            0          364                     0   0
            0            0         1327                     0   0
            0            0         1248                     0   0
            0            0           55                     0   0
            0            0           14                     0   0
            0            0            0                    46   0
            0            0            0                   389   0
            0            0            0                  1430   0
            0            0            0                    17   0
            0            0            0                    14   0
            0            0            0                     0   4
            0            0            0                     0   3
            0            0            0                     0   7
            0            0            0                     0   2
            0            0            0                     0  34
yrbss_demo %>% tabyl(age_and_grade, grade) %>% adorn_title()
                              grade                   
                age_and_grade  10th 11th 12th  9th NA_
 12 years old or younger;10th     4    0    0    0   0
 12 years old or younger;11th     0    4    0    0   0
 12 years old or younger;12th     0    0    6    0   0
  12 years old or younger;9th     0    0    0    6   0
   12 years old or younger;NA     0    0    0    0   9
            13 years old;10th     2    0    0    0   0
            13 years old;12th     0    0    1    0   0
             13 years old;9th     0    0    0   14   0
              13 years old;NA     0    0    0    0   1
            14 years old;10th    19    0    0    0   0
            14 years old;11th     0    2    0    0   0
            14 years old;12th     0    0    1    0   0
             14 years old;9th     0    0    0 1258   0
              14 years old;NA     0    0    0    0   7
            15 years old;10th  1182    0    0    0   0
            15 years old;11th     0   28    0    0   0
            15 years old;12th     0    0    3    0   0
             15 years old;9th     0    0    0 1457   0
              15 years old;NA     0    0    0    0   9
            16 years old;10th  1376    0    0    0   0
            16 years old;11th     0 1165    0    0   0
            16 years old;12th     0    0   36    0   0
             16 years old;9th     0    0    0  442   0
              16 years old;NA     0    0    0    0  14
            17 years old;10th   364    0    0    0   0
            17 years old;11th     0 1327    0    0   0
            17 years old;12th     0    0 1248    0   0
             17 years old;9th     0    0    0   55   0
              17 years old;NA     0    0    0    0  14
   18 years old or older;10th    46    0    0    0   0
   18 years old or older;11th     0  389    0    0   0
   18 years old or older;12th     0    0 1430    0   0
    18 years old or older;9th     0    0    0   17   0
     18 years old or older;NA     0    0    0    0  14
                      NA;10th     4    0    0    0   0
                      NA;11th     0    3    0    0   0
                      NA;12th     0    0    7    0   0
                       NA;9th     0    0    0    2   0
                        NA;NA     0    0    0    0  34

8 Working with text: stringr package

  • Both of the age and grade columns are character vectors with “extra text”, and there’s a tidyverse package that is ALL about working with characters or “strings”, called stringr.
    • It’s automatically loaded when you load the tidyverse package (library(tidyverse) or pacman::p_load(tidyverse)).
    • I highly recommend that you read the introduction to stringr vignette and look at the cheatsheet.

8.1 Goal 1: clean the grade variable

For the column grade, we can see that the values are numbers followed by “th”:

yrbss_demo %>% tabyl(grade)
 grade    n   percent valid_percent
  10th 2997 0.2497500     0.2518911
  11th 2918 0.2431667     0.2452513
  12th 2732 0.2276667     0.2296184
   9th 3251 0.2709167     0.2732392
  <NA>  102 0.0085000            NA

We can remove that string with the function str_remove_all():

yrbss_demo <- yrbss_demo %>%
  mutate(grade_num = str_remove_all(grade, "th") )

yrbss_demo %>% tabyl(grade_num, grade) %>% adorn_title()
           grade                   
 grade_num  10th 11th 12th  9th NA_
        10  2997    0    0    0   0
        11     0 2918    0    0   0
        12     0    0 2732    0   0
         9     0    0    0 3251   0
      <NA>     0    0    0    0 102
  • We successfully removed th from those values,
    • however, grade_num is still character type:
glimpse(yrbss_demo)
Rows: 12,000
Columns: 11
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <chr> "12", "12", "9", "10", "11", "11", "11", "12", "9", "11"…
  • We want grade to be a numeric variable, and we can use as.numeric to change the character vector of numbers into a numeric vector of numbers.

  • See how as.numeric() works:

c("1", "2", "3", NA)
[1] "1" "2" "3" NA 
as.numeric(c("1", "2", "3", NA))
[1]  1  2  3 NA
  • Careful, though, we will get a warning if there are NA values “introduced by coercion” which means some of our character values are not obviously numbers!
    • See example below.
    • (Side note: this is how I realized I needed the na_if() statements above to convert “NA” to NA in age and grade).
c("1", "2", "3", NA, "a")
[1] "1" "2" "3" NA  "a"
as.numeric(c("1", "2", "3", NA, "a"))
[1]  1  2  3 NA NA
  • It’s a good idea to use tabyl() on your character string first before converting to numeric, to make sure there are no surprises,
    • and a cross-tabyl after to double check it worked properly.
yrbss_demo <- yrbss_demo %>%
  mutate(grade_num = as.numeric(grade_num))

yrbss_demo %>% tabyl(grade, grade_num) %>% adorn_title()
       grade_num                   
 grade         9   10   11   12 NA_
  10th         0 2997    0    0   0
  11th         0    0 2918    0   0
  12th         0    0    0 2732   0
   9th      3251    0    0    0   0
  <NA>         0    0    0    0 102
glimpse(yrbss_demo)
Rows: 12,000
Columns: 11
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…

8.2 Goal 2: clean the age variable

yrbss_demo %>% tabyl(age)
                     age    n     percent valid_percent
 12 years old or younger   29 0.002416667   0.002426778
            13 years old   18 0.001500000   0.001506276
            14 years old 1287 0.107250000   0.107698745
            15 years old 2679 0.223250000   0.224184100
            16 years old 3033 0.252750000   0.253807531
            17 years old 3008 0.250666667   0.251715481
   18 years old or older 1896 0.158000000   0.158661088
                    <NA>   50 0.004166667            NA
  • We see that there are some complications:
    • ≤ 12 are grouped together and
    • ≥ 18 are grouped together.
  • Let’s pretend that we want the categories to just be the numeric values 12-18,
    • so group ≤ 12 as 12, and group ≥ 18 as 18.
    • Obviously this isn’t ideal, but for learning purposes let’s do this.
  • There are several ways to do that (including using case_when() for example, or even separate_wider_delim()), but let’s use the stringr package first.

8.3 G2 Option 1: str_remove_all() to clean age

# try removing the string " years old"
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, " years old")
    ) %>%
  tabyl(age_num) # almost there
       age_num    n     percent valid_percent
 12 or younger   29 0.002416667   0.002426778
            13   18 0.001500000   0.001506276
            14 1287 0.107250000   0.107698745
            15 2679 0.223250000   0.224184100
            16 3033 0.252750000   0.253807531
            17 3008 0.250666667   0.251715481
   18 or older 1896 0.158000000   0.158661088
          <NA>   50 0.004166667            NA
# remove all the strings manually
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, " years old"),
    age_num = str_remove_all(age_num, " or younger"),
    age_num = str_remove_all(age_num, " or older")
    ) %>%
  tabyl(age_num) # looks good, now we just need to make it numeric
 age_num    n     percent valid_percent
      12   29 0.002416667   0.002426778
      13   18 0.001500000   0.001506276
      14 1287 0.107250000   0.107698745
      15 2679 0.223250000   0.224184100
      16 3033 0.252750000   0.253807531
      17 3008 0.250666667   0.251715481
      18 1896 0.158000000   0.158661088
    <NA>   50 0.004166667            NA
# remove all the strings manually & make age numeric
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, " years old"),
    age_num = str_remove_all(age_num, " or younger"),
    age_num = str_remove_all(age_num, " or older"),
    age_num = as.numeric(age_num)
    ) %>%
  tabyl(age_num) # perfect
 age_num    n     percent valid_percent
      12   29 0.002416667   0.002426778
      13   18 0.001500000   0.001506276
      14 1287 0.107250000   0.107698745
      15 2679 0.223250000   0.224184100
      16 3033 0.252750000   0.253807531
      17 3008 0.250666667   0.251715481
      18 1896 0.158000000   0.158661088
      NA   50 0.004166667            NA

8.4 G2 Option 2: str_remove_all() with regex

  • Regex (short for regular expressions) is a way to search strings using special syntax.
  • We could have removed all of the alphabet letters in one step with regex:
# remove all the alphabet using regex
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, "[a-z]"),
    age_num = as.numeric(age_num)
    ) %>%
  tabyl(age_num) 
 age_num    n     percent valid_percent
      12   29 0.002416667   0.002426778
      13   18 0.001500000   0.001506276
      14 1287 0.107250000   0.107698745
      15 2679 0.223250000   0.224184100
      16 3033 0.252750000   0.253807531
      17 3008 0.250666667   0.251715481
      18 1896 0.158000000   0.158661088
      NA   50 0.004166667            NA
  • Or, we can remove specific words with regex (recall | means or):
# remove a set of words with regex |
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, "years|old|or|younger|older"),
    age_num = as.numeric(age_num)
    ) %>%
  tabyl(age_num) 
 age_num    n     percent valid_percent
      12   29 0.002416667   0.002884424
      13   18 0.001500000   0.001790332
      14 1287 0.107250000   0.128008753
      15 2679 0.223250000   0.266461110
      16 3033 0.252750000   0.301670977
      17 3008 0.250666667   0.299184404
      NA 1946 0.162166667            NA
  • What went wrong with the code above?
    • How do we fix it?

8.5 G2 Option 3: Clever separate for age

We could have used separate_wider_delim by being a bit clever!

What to use as the “separator” to keep the numbers in their own column?

yrbss_demo %>% tabyl(age)
                     age    n     percent valid_percent
 12 years old or younger   29 0.002416667   0.002426778
            13 years old   18 0.001500000   0.001506276
            14 years old 1287 0.107250000   0.107698745
            15 years old 2679 0.223250000   0.224184100
            16 years old 3033 0.252750000   0.253807531
            17 years old 3008 0.250666667   0.251715481
   18 years old or older 1896 0.158000000   0.158661088
                    <NA>   50 0.004166667            NA
yrbss_demo %>% 
  separate_wider_delim(
    cols = age, 
    names = c("age_num","old"),
    delim = " years",  # our data has this phrase after the number
    cols_remove = FALSE # keep the age column
    ) %>%
  glimpse()
Rows: 12,000
Columns: 13
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_num       <chr> "17", "17", "16", "16", "18", "15", "17", "18", "15", "1…
$ old           <chr> " old", " old", " old", " old", " old or older", " old",…
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…
# separate, and let's reassign it to yrbss_demo this time
yrbss_demo <- yrbss_demo %>% 
  separate_wider_delim(
    cols = age, 
    names = c("age_num","old"),
    delim = " years", 
    cols_remove = FALSE # keep the age column
    ) %>%
  mutate(age_num = as.numeric(age_num)) %>%
  select(-old) # remove the old column that we do not need
yrbss_demo %>% tabyl(age, age_num) # yep!
                     age 12 13   14   15   16   17   18 NA_
 12 years old or younger 29  0    0    0    0    0    0   0
            13 years old  0 18    0    0    0    0    0   0
            14 years old  0  0 1287    0    0    0    0   0
            15 years old  0  0    0 2679    0    0    0   0
            16 years old  0  0    0    0 3033    0    0   0
            17 years old  0  0    0    0    0 3008    0   0
   18 years old or older  0  0    0    0    0    0 1896   0
                    <NA>  0  0    0    0    0    0    0  50
glimpse(yrbss_demo)
Rows: 12,000
Columns: 12
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_num       <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18, 15, …
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…

8.6 G2 Option 4: parse_number() to clean age

  • parse_number() “parses the first number it finds, dropping any non-numeric characters before the first number and all characters after the first number.”
  • parse_number() is from the reader package, which gets loaded as a one of the core tidyverse packages when tidyverse is loaded.
glimpse(yrbss_demo)
Rows: 12,000
Columns: 12
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_num       <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18, 15, …
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…
yrbss_demo <- yrbss_demo %>% 
  mutate(age_parse = parse_number(age))
# check:
yrbss_demo %>% tabyl(age, age_parse) %>% adorn_title()
                         age_parse                                
                     age        12 13   14   15   16   17   18 NA_
 12 years old or younger        29  0    0    0    0    0    0   0
            13 years old         0 18    0    0    0    0    0   0
            14 years old         0  0 1287    0    0    0    0   0
            15 years old         0  0    0 2679    0    0    0   0
            16 years old         0  0    0    0 3033    0    0   0
            17 years old         0  0    0    0    0 3008    0   0
   18 years old or older         0  0    0    0    0    0 1896   0
                    <NA>         0  0    0    0    0    0    0  50
glimpse(yrbss_demo) # age_parse data type?
Rows: 12,000
Columns: 13
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_num       <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18, 15, …
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…
$ age_parse     <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18, 15, …
  • Note that age_parse is already a numeric variable!

8.7 Challenge 2 (~10 minutes)

  1. For yrbss_demo, use separate_wider_delim() to split the column grade into two columns by the letter t, that is, “12th” would become “12” and “h”, then remove that second column and convert the grade number column from a character column into a numeric column. Check your work with tabyl().
  1. For yrbss_demo, use str_replace_all() to replace the appropriate characters to change
    • “Am” to “American” in the category “Am Indian / Alaska Native”, and
    • “PI” to “Pacific Islander” in the “Native Hawaiian/other PI” category.
    • Check your work with tabyl().

9 (§3) Join questionnaire and demographics data

Now that we’ve wrangled the questionnaire and demographics datasets, we’re ready to merge (join) them together.

  • We have data on patients from two sources:
    • one contains demographics,
    • the other contains answers to survey questionnaires.
  • How do we combine these two data sets into one table?

9.1 Naming conventions

  • Whenever we join two tables, we will have - a left table (in this case, the x table) and - a right table (in this case, the y table).
  • key:
    • In order to join the two tables, we have to somehow map the rows of our left table (x) with the rows of our other table (y).
    • We do this by joining together rows with a common variable.
    • In our case, we need to join rows based on the first column in both tables.
    • The column that we join on is called a key.
Note

In the above example, we see that there is a row in x whose key is 2, and there is a row in y whose key is 2 as well. So it makes sense to join these two rows together into one row.

So, by matching the rows with identical keys, we can put together a table that incorporates information from both tables.

9.2 Types of joins

  • There are several different types of joins:
    • inner join
    • left join
    • right join
    • full join
    • anti join

The most common usages require an inner join, a left join, or a full join.

9.3 Inner Joins

  • With an inner join, we only keep rows that have matching keys in both tables.


  • If there is no match in both tables, we don’t include the row with that key.
    • In the example to the left, we don’t keep row 3 in table x or row 4 in table y.

9.4 Joining Syntax

inner_join(
  table_x, 
  table_y,
  join_by(key_column_x == key_column_y)
  )
  • We first specify the left table (table_x) and then the right table (table_y).
  • Note the join_by argument.
    • This is the argument where we specify which column in each table contains our key.


Older R code syntax

inner_join(
  table_x, 
  table_y,
  by = c("key_column_x" = "key_column_y")
  )

9.5 The more dplyr way to do joins

The more dplyr way to do joins is below:

#start with left table
table_x %>% 

  #join left table with right table
  inner_join(
    y = table_y, 
    
  #point out the key column in each table
    join_by(key_column_x == key_column_y)
  )
  • We start with our left table, table_x.
  • The main difference is that we don’t need to specify the x argument in inner_join(), because we are piping table_x as the first argument of inner_join().

9.6 Full Joins

  • A full join keeps all the data from both tables.

  • You could use this as your default join, and then filter your data how you want it later. However, I tend to use left_join the most.

  • When there is no information from the left or right table, these rows will have an NA. That is, the information is missing for the columns that come from the other table.

9.7 Left Joins

  • In a left join, we keep all the rows in the left table regardless of whether there is a match in the right table.

  • In the example to the left, we keep row 3 in table x even though it doesn’t have a match in table y.

    • Because there is no information from the right table, these rows will have an NA. That is, the information is missing for the columns that come from the right table.
  • I usually think of my left table as defining my study cohort. I only want to keep data on subjects in my left table, and the rest are not useful.

9.8 Toy example

  • Say we have a set of patients for whom we want to join their demographic information with their lab data, in particular their white blood cell count (WBC).
  • We also need to know whether there are patients in our set who haven’t gotten a WBC.
  • Because the lab system isn’t part of the electronic health record, we’ll need to make a join on the patient table to the lab table.
patient_table <- read_excel(here("part5", "data", "patient_example.xlsx"), sheet = 1)

wbc_table <- read_excel(here("part5", "data", "patient_example.xlsx"), sheet = 2)

Let’s look at patient_table, which we’ll use as our left table:

patient_table
# A tibble: 4 × 5
  patient_id first_name last_name   age date_of_birth      
       <dbl> <chr>      <chr>     <dbl> <dttm>             
1       2333 Scarlet    O'Hara       75 1945-11-14 00:00:00
2       7359 Neil       O'Hara       30 1990-11-20 00:00:00
3       1414 Ivy        Lee          43 1977-05-02 00:00:00
4       8424 May        Lynne        14 2006-02-08 00:00:00

Here is wbc_table, which we’ll use as our right table:

wbc_table
# A tibble: 4 × 2
  patient_id wbc_value
       <dbl>     <dbl>
1       2333      5000
2       7359     12000
3       1414      6000
4       4409      4000
  • Below we join the data with an inner_join().
    • Why was an inner_join() chosen instead of the other join options?
patient_table %>%
  inner_join(y = wbc_table,
             join_by(patient_id == patient_id)
             )
# A tibble: 3 × 6
  patient_id first_name last_name   age date_of_birth       wbc_value
       <dbl> <chr>      <chr>     <dbl> <dttm>                  <dbl>
1       2333 Scarlet    O'Hara       75 1945-11-14 00:00:00      5000
2       7359 Neil       O'Hara       30 1990-11-20 00:00:00     12000
3       1414 Ivy        Lee          43 1977-05-02 00:00:00      6000

9.9 When do we use which join?

See this two-table verbs vignette and this cheatsheet for some extra explanations.

These joins are created to match SQL joins for databases.

  • inner_join = You only want rows that are in both tables
  • left_join = You only want rows that are in one table.
    • Often the right table is a subset of the left table, so it’s easy to use this to keep everything in the bigger table, and join on the smaller table
    • If the left table contains a cohort of interest, i.e. everyone that has been given a specific treatment, and you want to get their lab values from another table, use left_join() to add those lab values in the cohort defined by the left table
  • right_join = maybe never
    • right_join does the same thing as left_join but backwards, I find left join easier to think about (personal preference)
  • full_join = does not remove any rows, you might want to use this as your default and filter later
  • anti_join and semi_join = filtering joins, probably use rarely,
    • anti_join: use right table as an exclusion criteria and find unmatched keys in left table
    • semi_join: filter left table based on keys in right table, and keep only columns from left table

9.10 Warning

Watch out for duplicated “keys” when you join (multiple observations), or you will end up with duplicated data rows.

9.11 Your Turn

  • Modify the code below to do a left_join() on patient_table and wbc_table.
    • Which patient in patient_table didn’t get a WBC?
patient_table %>%
  inner_join(y = wbc_table,
            join_by(patient_id == patient_id)
            )

9.12 Challenge 3: Join the YRBSS data (~5 minutes)

  1. Use left_join() to join yrbss_questions and yrbss_demo into a dataset called yrbss_all so that we retain all data combined into one table.
    • Remember, yrbss_questions is a random sample of the demographic data,
    • so think about which is your left table and your right table.


  1. Double check your join by looking at glimpse(), did you select the right variable to join by?

10 Reshaping data

10.1 Motivation

The last thing we’ll talk about today is the difference between the wide data format and the long data format.

  • A really common format in bioinformatics is the expression matrix.
  • In an expression matrix,
    • the rows correspond to different genes and
    • the columns correspond to different samples.
expression_example <- read_csv(here("part5", "data", "expression_example.csv"))

expression_example
# A tibble: 4 × 5
  gene_symbol tcga_110 tcga_332 tcga_548 tcga_523
  <chr>          <dbl>    <dbl>    <dbl>    <dbl>
1 PTEN             4.1      5.6      2.2      2.1
2 MTOR             5.6      6.6      3.1      3.2
3 RKIP             1        1.4      1.2      1.2
4 KCNJ9            2.3      1.1      0.5      0.2

Note: TCGA is short for The Cancer Genome Atlas.

10.2 Why make our data long?

  • This format isn’t super useful for a lot of data manipulations.

  • It’s because the columns correspond to samples, and this prevents us from plotting things like PTEN expression across the samples.

  • This kind of data is in a format that is called wide format. What we need to do is collapse the numerical values into a single column.

  • Wide data are typical for longitudinal data, where columns correspond to time points.

    • Often we need to reshape this data into long format in order to analyze it or plot it, where each row corresponds to an observation of a subject at one time point.
    • We would have a column “time” that tells us what time point the observation was taken (i.e. baseline, 6 months, 12 months, etc).

10.3 Wide vs. long data

10.4 The magic of pivot_longer()

pivot_longer() is the function that will let us make our wide data long.

It takes three arguments:

  • cols - the columns to collapse.

    • We talked about tidyselect verbs (the language we use for select() and across()), and we will use those here to help select columns based on naming criteria. Below we’re using the starts_with() selector to grab the columns that begin with tcga
  • names_to - the name of the column where the labels will be consolidated to

  • values_to - the name of the column where the values in the columns in the wide data will be consolidated to.

10.5 R code for pivot_longer()

expression_example  # wide data
# A tibble: 4 × 5
  gene_symbol tcga_110 tcga_332 tcga_548 tcga_523
  <chr>          <dbl>    <dbl>    <dbl>    <dbl>
1 PTEN             4.1      5.6      2.2      2.1
2 MTOR             5.6      6.6      3.1      3.2
3 RKIP             1        1.4      1.2      1.2
4 KCNJ9            2.3      1.1      0.5      0.2
expression_long <- 
  expression_example %>%
    pivot_longer(
      cols= starts_with("tcga"), 
      # column names are moved to a column that we are choosing to call "sample"
      names_to = "sample", 
      # the data points from the "body" of the table are moved to a column 
      # that we are choosing to call "expression"
      values_to = "expression" 
    )

expression_long
# A tibble: 16 × 3
   gene_symbol sample   expression
   <chr>       <chr>         <dbl>
 1 PTEN        tcga_110        4.1
 2 PTEN        tcga_332        5.6
 3 PTEN        tcga_548        2.2
 4 PTEN        tcga_523        2.1
 5 MTOR        tcga_110        5.6
 6 MTOR        tcga_332        6.6
 7 MTOR        tcga_548        3.1
 8 MTOR        tcga_523        3.2
 9 RKIP        tcga_110        1  
10 RKIP        tcga_332        1.4
11 RKIP        tcga_548        1.2
12 RKIP        tcga_523        1.2
13 KCNJ9       tcga_110        2.3
14 KCNJ9       tcga_332        1.1
15 KCNJ9       tcga_548        0.5
16 KCNJ9       tcga_523        0.2

Below is a different way of selecting the columns:

expression_long <- 
  expression_example %>%
    pivot_longer(
      cols= -gene_symbol,  # everything except gene_symbol
      names_to = "sample", # column names are moved to a column called sample
      values_to = "expression" # the data points go to a column called expression
    )

expression_long
# A tibble: 16 × 3
   gene_symbol sample   expression
   <chr>       <chr>         <dbl>
 1 PTEN        tcga_110        4.1
 2 PTEN        tcga_332        5.6
 3 PTEN        tcga_548        2.2
 4 PTEN        tcga_523        2.1
 5 MTOR        tcga_110        5.6
 6 MTOR        tcga_332        6.6
 7 MTOR        tcga_548        3.1
 8 MTOR        tcga_523        3.2
 9 RKIP        tcga_110        1  
10 RKIP        tcga_332        1.4
11 RKIP        tcga_548        1.2
12 RKIP        tcga_523        1.2
13 KCNJ9       tcga_110        2.3
14 KCNJ9       tcga_332        1.1
15 KCNJ9       tcga_548        0.5
16 KCNJ9       tcga_523        0.2

10.6 ggplot loves long data

  • Creating figures with ggplot works best with long data, so that we can make use of different variables for different specifications in figures.
ggplot(expression_long) +
  aes(x = gene_symbol, 
      y = expression, 
      fill = gene_symbol) +
  geom_boxplot()

expression_long <- expression_long %>%
  mutate(
    gene_symbol = factor(
      gene_symbol, 
      levels = c("PTEN", "MTOR", "RKIP", "KCNJ9")
      ),
    # use forcats::fct_rev to reverse levels for plotting
    gene_symbol = fct_rev(gene_symbol)
    )

ggplot(expression_long) +
  aes(x = sample, 
      y = gene_symbol, 
      fill = log2(expression)) +
  geom_tile() +
  # new scale! this is a 3 color gradient (low-mid-high)
  scale_fill_gradient2(
    low = "yellow", 
    mid = "white", 
    high = "darkorchid4")

10.7 facet_wrap() - another reason to make long data!

Goal: Make a barplot of every categorical column in yrbss_all.

# from Challenge above, but, making sure we have this joined data
yrbss_all <- left_join(
  yrbss_demo, 
  yrbss_questions, 
  join_by(record_id == record, 
          year == year)
  ) 

glimpse(yrbss_all)
Rows: 12,000
Columns: 22
$ record_id          <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41…
$ year               <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 199…
$ sex                <chr> "Male", "Female", "Male", "Female", "Female", "Male…
$ age_num            <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18,…
$ age                <chr> "17 years old", "17 years old", "16 years old", "16…
$ grade              <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11t…
$ age_and_grade      <chr> "17 years old;12th", "17 years old;12th", "16 years…
$ race4              <chr> "White", "Black or African American", "White", "Whi…
$ race7              <chr> "White", "Black or African American", "White", "Whi…
$ bmi                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ weight_kg          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ grade_num          <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9,…
$ age_parse          <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18,…
$ bicycle_helmet     <chr> "Most of the time wore a helmet", "Never wore a hel…
$ text_driving       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ever_smoked        <chr> "Yes", "No", NA, NA, "Yes", "Yes", "Yes", "Yes", "N…
$ age_first_smoked   <chr> "13 or 14 years old", "Never smoked a cigarette", N…
$ days_smoked        <chr> "0 days", "0 days", NA, NA, "0 days", "1 or 2 days"…
$ cigarettes_per_day <fct> Did not smoke cigarettes, Did not smoke cigarettes,…
$ ever_been_bullied  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ cigarettes10       <fct> None, None, NA, NA, None, 0-10, NA, 0-10, None, Non…
$ cigarettes10fac    <fct> None, None, NA, NA, None, 0-10, NA, 0-10, None, Non…

We can make an individual barplot of just one variable with geom_bar():

ggplot(yrbss_all, aes(x=race4)) + 
  geom_bar()

10.8 Make the data long

  • If we make the data long(er) we can create many barplots at once for many different variables in the dataset.
  • We will use pivot_longer() combined with facet_wrap():
yrbss_long <- yrbss_all %>%
  mutate(across(where(is.factor), as.character)) %>%
  pivot_longer(
    cols = where(is.character), # all our character vectors
    names_to = "variable_name", # column names are moved to a column = variable_name
    values_to = "value" # the data points are in a column = value
  )

# check:
glimpse(yrbss_long) # note the number or rows!
Rows: 168,000
Columns: 10
$ record_id         <dbl> 34784, 34784, 34784, 34784, 34784, 34784, 34784, 347…
$ year              <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991…
$ age_num           <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, …
$ bmi               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num         <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
$ age_parse         <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, …
$ ever_been_bullied <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ variable_name     <chr> "sex", "age", "grade", "age_and_grade", "race4", "ra…
$ value             <chr> "Male", "17 years old", "12th", "17 years old;12th",…
yrbss_long %>% tabyl(variable_name)
      variable_name     n    percent
                age 12000 0.07142857
      age_and_grade 12000 0.07142857
   age_first_smoked 12000 0.07142857
     bicycle_helmet 12000 0.07142857
       cigarettes10 12000 0.07142857
    cigarettes10fac 12000 0.07142857
 cigarettes_per_day 12000 0.07142857
        days_smoked 12000 0.07142857
        ever_smoked 12000 0.07142857
              grade 12000 0.07142857
              race4 12000 0.07142857
              race7 12000 0.07142857
                sex 12000 0.07142857
       text_driving 12000 0.07142857
  • This may seem a strange thing to do since we now have duplicated numeric data. But that’s ok.

10.9 Create many barplots with long data

What if we just make a simple bar plot with these data?

ggplot(yrbss_long, 
       aes(x = value)) +
  geom_bar()

Oops, that’s not what we want! We want each variable to have it’s own separate barplot, so we facet by the column variable_name.

ggplot(yrbss_long, 
       aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(variable_name))

  • Whoa. What’s going on here?
  • It’s hard to see, but for every barplot, every possible “value” is on the x-axis.
    • But for sex, for example, there are only two options; for example we probably wouldn’t choose “17 years old” as our sex.
  • So we want to change the scale of the facet to only fit the data.
ggplot(yrbss_long, aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(variable_name), 
             scales = "free_x")

  • The labels are still a bit cramped.
  • We could rotate the x axis labels perhaps or use a method to dodge them to see them better,
    • but I actually would probably flip the axes, so we can use y = value instead of x = value.
ggplot(yrbss_long, 
       aes(y = value)) +
  geom_bar() +
  facet_wrap(vars(variable_name), 
             scales = "free_y",
             ncol = 3) +
  labs(y= "", 
       x= "", 
       title = "Barplots of all categorical variables in YRBSS") +
  theme_bw()

Remove missing values:

ggplot(
  # we can use tidyverse inside the data argument
  data = yrbss_long %>% filter(!is.na(value)), 
       aes(y = value)) +
  geom_bar() +
  facet_wrap(vars(variable_name), 
             scales = "free_y",
             ncol = 3) +
  labs(y= "", 
       x= "", 
       title = "Barplots of all categorical variables in YRBSS",
       subtitle = "Missing values removed") +
  theme_bw()

Ok, let’s get even more advanced. What if we wanted to show this barplot, but comparing male and female?

Let’s re-make our long data first.

yrbss_long <- yrbss_all %>%
  mutate(across(where(is.factor), as.character)) %>%
  pivot_longer(
    cols = c(where(is.character), -sex), # all our character vectors EXCEPT sex
    names_to = "variable_name", # column names are moved to a column = variable_name
    values_to = "value" # the data points are in a column = value
  )

# check: 
glimpse(yrbss_long)
Rows: 156,000
Columns: 11
$ record_id         <dbl> 34784, 34784, 34784, 34784, 34784, 34784, 34784, 347…
$ year              <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991…
$ sex               <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
$ age_num           <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, …
$ bmi               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num         <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
$ age_parse         <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, …
$ ever_been_bullied <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ variable_name     <chr> "age", "grade", "age_and_grade", "race4", "race7", "…
$ value             <chr> "17 years old", "12th", "17 years old;12th", "White"…
yrbss_long %>% tabyl(variable_name)
      variable_name     n    percent
                age 12000 0.07692308
      age_and_grade 12000 0.07692308
   age_first_smoked 12000 0.07692308
     bicycle_helmet 12000 0.07692308
       cigarettes10 12000 0.07692308
    cigarettes10fac 12000 0.07692308
 cigarettes_per_day 12000 0.07692308
        days_smoked 12000 0.07692308
        ever_smoked 12000 0.07692308
              grade 12000 0.07692308
              race4 12000 0.07692308
              race7 12000 0.07692308
       text_driving 12000 0.07692308

Now create the facetted bargraphs stratified by sex:

ggplot(
  # remove NA
  data = yrbss_long %>% filter(!is.na(value)), 
       aes(y = value,
           # use sex to fill colors
           fill = sex)) +
  # default is stacked bars, so to show side by side include position = "dodge"
  geom_bar(position = "dodge") + 
  facet_wrap(vars(variable_name), 
             scales = "free_y",
             ncol = 3) +
  labs(y= "", 
       x= "", 
       title = "Barplots of all categorical variables in YRBSS by sex",
       subtitle = "Missing values removed") +
  theme_bw()

(A reminder that this is a subsample of survey data and we are ignoring all the sampling weights, including the way I resampled it, so we can’t learn much from a comparison like this without actually doing the real stats work!)

10.10 pivot_wider() exists too!

… and part of your assignment will be to figure out how to use it.

11 Post Class Survey

Please fill out the post-class survey.

Your responses are anonymous in that I separate your names from the survey answers before compiling/reading.

You may want to review previous years’ feedback here.

12 Acknowledgements

  • Part 5 is based on the BSTA 505 Winter 2023 course, taught by Jessica Minnier.
    • I made modifications to update the material from RMarkdown to Quarto, and streamlined/edited content for slides.
    • Also made updates for new R functions/ specifications, such as separate_wider_delim()
  • Minnier’s Acknowledgements:
    • Originally written by Jessica Minnier, Aaron Coyner and Ted Laderas.
    • The example data using YRBSS surveys were partially created when developing the OHSU BERD workshop on data wrangling with Meike Niederhausen.