5: Data summarizing, reshaping, and wrangling with multiple tables

BSTA 526: R Programming for Health Data Science

Author

Affiliation

Meike Niederhausen, PhD & Jessica Minnier, PhD

OHSU-PSU School of Public Health

Published

February 5, 2026

Modified

February 12, 2026

1 Welcome to R Programming: Part 5!

Today we will cover creating new variables and summarizing existing variables.

Before you get started:

Remember to save this notebook under a new name, such as part_05_b526_YOURNAME.qmd.

Load the packages & smoke_complete.xlsx data in the setup code chunk

1.1 Learning Objectives

Learn and apply loading comma separated and tab separated datasets using the readr package
Learn different techniques for cleaning data using functions from the tidyr, forcats, and stringrpackages
- Practice cleaning data with a real dataset
Learn and apply bind_rows() to combine rows from two or more datasets
Learn about ways to merge columns from different datasets
- Apply inner_join() and left_join() to merge columns from different datasets
Learn about wide vs long data and how to reshape data
- apply pivot_longer() to make a wide dataset long

2 Overview

§0: Read in data files
- questionnaire data (1 file)
- demographics data (3 files for different time periods)
§1 Wrangle questionnaire data
- Goal 1: group cigarettes_per_day into 3 categories
  - case_when(), fct_collapse(), fct_other()
- Goal 2: remove rows with NAs with drop_na()

§2 Wrangle demographics data
- Merge together (stack) the 3 files with bind_rows()
- Clean age_and_grade
  - separate into two columns
  - make values numeric (stringr package)
§3 Join questionnaire and demographics data
Making data long or wide

3 (§0) Reading in csv and other deliminated files

3.1 About the data YRBSS

Today we are moving away from the smoking data to use a new somewhat “messier” data set from the CDC’s Youth Risk Behavior Surveillance System (YRBSS).
- From webpage: “measures health-related behaviors and experiences that can lead to death and disability among youth and adults”
The data we are working with are a subset of data that are available in the R package yrbss.
- This subset of the package’s data have intentionally been made messy and broken into pieces for us to practice common data cleaning steps and joining different data tables together.
- Note this sub-sample is actually from survey data, and we are ignoring the sampling weights since we aren’t really focused on analysis right now. So if you try to look at any associations between variables, don’t believe what the data is telling you!

3.2 Loading plain text files

Previously: Excel files
- We’ve usually been reading in Excel files, which is a very common way to save data.
It’s actually preferred to save data as plain text files so that you don’t need a proprietary software (i.e. Excel) to open them, and they are more universal.
Common plain text files include
- .csv (columns are comma separated values)
- .tsv (columns are tab separated values)
- Check out the csv and tsv files in the data folder.
We’ll be using functions from the readr package:
- read_csv(), read_tsv(), read_delim()

3.3 Load the 3 demographic data files

yrbss_demo_1991.tsv
yrbss_demo_1993to2007.csv
yrbss_demo_2009to2013.csv

tsv: read_tsv() or read_delim()
csv: read_csv()

# data from one year, 1991, tsv (tab separated) file
yrbss_demo_1991 <- read_tsv(
  file = here("part5", "data","yrbss_demo_1991.tsv"))

Rows: 1000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): sex, age_and_grade, race4, race7
dbl (2): record_id, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# we can also use read_delim and specify the delimiter that separates fields
yrbss_demo_1991 <- read_delim(
  file = here("part5", "data","yrbss_demo_1991.tsv"),
  delim = "\t" # tab delimiter
  )

Rows: 1000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): sex, age_and_grade, race4, race7
dbl (2): record_id, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# data from 1993-2007, csv file
yrbss_demo_1993to2007 <- read_csv(
  file = here("part5", "data","yrbss_demo_1993to2007.csv"))

Rows: 8000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): sex, age_and_grade, race4, race7
dbl (4): record_id, year, bmi, weight_kg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# data from 2009-2013, csv file
yrbss_demo_2009to2013 <- read_csv(
  file = here("part5", "data","yrbss_demo_2009to2013.csv"))

Rows: 3000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): age and grade, sex, Race 4 cat, Race 7 cat
dbl (4): Record id, year, BMI (kg/m^2), Weight (kg)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3.4 Challenge 1 (~5 minutes)

Load the data file “yrbss_questions.csv” using here() and save it as the object yrbss_questions.

Use the clean_names() function to clean up the names of yrbss_questions, and then rename the columns so that they are as follows:
- record, year, bicycle_helmet, text_driving, ever_smoked, age_first_smoked, days_smoked, cigarettes_per_day, ever_been_bullied

For yrbss_questions, make a cross table (2-way table) of cigarettes_per_day and ever_smoked. Do you see anything unusual or unexpected?

4 (§1) Wrangle questionnaire data `yrbss_questions`

Goal 1: group cigarettes_per_day into 3 categories
Goal 2: remove rows with NAs with drop_na()

4.1 Load the questionnaire data

The code below is loading the data from Challenge 1.

yrbss_questions <- 
  read_csv(file = here("part5", "data", "yrbss_questions.csv")) %>% 
  clean_names() %>% 
  rename(
    bicycle_helmet = how_often_wear_bicycle_helmet,
    text_driving = ever_text_while_driving_in_past_30_days,
    age_first_smoked = how_old_when_first_smoked,
    days_smoked = how_many_days_smoked_in_past_30_days,
    cigarettes_per_day = how_many_cigarettes_day_in_past_30_days
  )

Rows: 10000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): How often wear bicycle helmet, Ever text while driving in past 30 d...
dbl (2): record, year
lgl (1): Ever been bullied

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4.2 Goal 1: group `cigarettes_per_day` into 3 categories

Let’s group the cigarettes per day column into three categories:
- “None”, “0-10”, “>10” cigarettes per day.
Below we do this in different ways:
- Using case_when()
- Using factor functions from the forcats package

# what are the current categories?
yrbss_questions %>% tabyl(cigarettes_per_day)

       cigarettes_per_day    n percent valid_percent
              1 cigarette  363  0.0363   0.038686987
      11 to 20 cigarettes  139  0.0139   0.014814025
        2 to 5 cigarettes  676  0.0676   0.072045188
       6 to 10 cigarettes  221  0.0221   0.023553235
 Did not smoke cigarettes 7441  0.7441   0.793029948
    Less than 1 cigarette  451  0.0451   0.048065651
  More than 20 cigarettes   92  0.0092   0.009804966
                     <NA>  617  0.0617            NA

4.3 Goal 1 with `case_when()`

Use case_when() to group the cigarettes per day column into three categories:
- “None”, “0-10”, “>10” cigarettes per day.

Why does this code not work as intended?

# note that we're not saving this attempt
yrbss_questions %>%      
  mutate(
    cigarettes10 = case_when(
      cigarettes_per_day == "Did not smoke cigarettes" ~ "None",
      cigarettes_per_day %in% 
        c("11 to 20 cigarettes", "More than 20 cigarettes") ~ ">10",
      .default = "0-10"
      ),
    cigarettes10 = factor(
      cigarettes10,
      levels = c("None","0-10",">10")) # make sure levels are in order
    ) %>%
  tabyl(cigarettes_per_day, cigarettes10) %>%
  adorn_title()

                          cigarettes10         
       cigarettes_per_day         None 0-10 >10
              1 cigarette            0  363   0
      11 to 20 cigarettes            0    0 139
        2 to 5 cigarettes            0  676   0
       6 to 10 cigarettes            0  221   0
 Did not smoke cigarettes         7441    0   0
    Less than 1 cigarette            0  451   0
  More than 20 cigarettes            0    0  92
                     <NA>            0  617   0

# this time we're saving our changes:

yrbss_questions <- yrbss_questions %>% mutate(
  cigarettes10 = case_when(
    cigarettes_per_day == "Did not smoke cigarettes" ~ "None",
    cigarettes_per_day %in% 
      c("11 to 20 cigarettes", "More than 20 cigarettes") ~ ">10",
    is.na(cigarettes_per_day) ~ NA, # need to specify NA type
    # is.na(cigarettes_per_day) ~ as.character(NA), # also would work
    TRUE ~ "0-10"
  ),
  cigarettes10 = factor(
    cigarettes10,
    levels = c("None","0-10",">10")) # make sure levels are in order
) 

yrbss_questions %>%
  tabyl(cigarettes_per_day, cigarettes10) %>%
  adorn_title()

                          cigarettes10             
       cigarettes_per_day         None 0-10 >10 NA_
              1 cigarette            0  363   0   0
      11 to 20 cigarettes            0    0 139   0
        2 to 5 cigarettes            0  676   0   0
       6 to 10 cigarettes            0  221   0   0
 Did not smoke cigarettes         7441    0   0   0
    Less than 1 cigarette            0  451   0   0
  More than 20 cigarettes            0    0  92   0
                     <NA>            0    0   0 617

Starting with dplyr 1.1.0 we no longer need to specify the type of NA
- such as NA_character_, NA_real_, or NA_integer_
See presentation on the new easier to use case_when() from posit::conference 2023, starting at 0:31.

Back in the old days we had to be careful… and you will likely still see case_when() examples on the internet using NA_character_, NA_real_, or NA_integer_

See this webpage for discussion of NAtypes when usingcase_when().

4.4 Goal 1 with factors using `forcats`

Use factor functions from the forcats package to group the cigarettes per day column into three categories:
- “None”, “0-10”, “>10” cigarettes per day.
I encourage you to look at the intro to forcats vignette and cheatsheet here.
Instead of using case_when(),
- we could instead collapse the cigarettes per day column into three categories by creating a factor, and
- using the handy forcats functions fct_collapse() and fct_other().

Note: This is a somewhat contrived demonstration of how to use fct_collapse and fct_other together so you can see what each of these functions does, but you could also just use fct_collapse() by itself.

yrbss_questions %>% tabyl(cigarettes_per_day)

       cigarettes_per_day    n percent valid_percent
              1 cigarette  363  0.0363   0.038686987
      11 to 20 cigarettes  139  0.0139   0.014814025
        2 to 5 cigarettes  676  0.0676   0.072045188
       6 to 10 cigarettes  221  0.0221   0.023553235
 Did not smoke cigarettes 7441  0.7441   0.793029948
    Less than 1 cigarette  451  0.0451   0.048065651
  More than 20 cigarettes   92  0.0092   0.009804966
                     <NA>  617  0.0617            NA

4.5 Steps 1 & 2: `fct_other()` to create the `0-10` factor level

yrbss_questions <- yrbss_questions %>%
  mutate(
    # step 1: make cigarettes_per_day a factor
    cigarettes_per_day = factor(cigarettes_per_day),
    #
    # step 2: create an other group called "0-10" for the many 0-10 cig/day levels
    cigarettes10fac =
      fct_other(
        cigarettes_per_day,
        # specify levels to keep (there's also a drop option)
        keep = c(
          "Did not smoke cigarettes",
          "11 to 20 cigarettes",
          "More than 20 cigarettes"
        ),
        # specify name for collapsed remaining levels
        other_level = "0-10"  # default is "Other" if not specified
      )
  )

# check what happened
yrbss_questions %>%
  tabyl(cigarettes_per_day, cigarettes10fac) %>%
  adorn_title()

                              cigarettes10fac                         
       cigarettes_per_day 11 to 20 cigarettes Did not smoke cigarettes
              1 cigarette                   0                        0
      11 to 20 cigarettes                 139                        0
        2 to 5 cigarettes                   0                        0
       6 to 10 cigarettes                   0                        0
 Did not smoke cigarettes                   0                     7441
    Less than 1 cigarette                   0                        0
  More than 20 cigarettes                   0                        0
                     <NA>                   0                        0
                                 
 More than 20 cigarettes 0-10 NA_
                       0  363   0
                       0    0   0
                       0  676   0
                       0  221   0
                       0    0   0
                       0  451   0
                      92    0   0
                       0    0 617

4.6 Step 3: `fct_collapse()` to create the `None` and `>10` factor levels

yrbss_questions <- yrbss_questions %>%
  mutate(
    # step 3: create "None" level and collapse the two "top" levels to one ">10" category
    cigarettes10fac =
      fct_collapse(
        cigarettes10fac,
        "None" = "Did not smoke cigarettes",
        ">10" = c("11 to 20 cigarettes", "More than 20 cigarettes")
      ))

# check what happened
yrbss_questions %>%
  tabyl(cigarettes_per_day, cigarettes10fac) %>%
  adorn_title()

                          cigarettes10fac              
       cigarettes_per_day             >10 None 0-10 NA_
              1 cigarette               0    0  363   0
      11 to 20 cigarettes             139    0    0   0
        2 to 5 cigarettes               0    0  676   0
       6 to 10 cigarettes               0    0  221   0
 Did not smoke cigarettes               0 7441    0   0
    Less than 1 cigarette               0    0  451   0
  More than 20 cigarettes              92    0    0   0
                     <NA>               0    0    0 617

# we could have done all of this with fct_collapse, with a tiny bit more typing

4.7 Steps 1-3 Visual representation

4.8 Goal 2: remove rows with `NA`s with `drop_na()`

Note that below we do not remove any rows with missing data from yrbss_questions. All changes are saved to a different tibble.

If we want to remove rows that have missingness, we can use the drop_na() function.
We can choose to
- remove rows that have missingness only in a certain column or columns, or
- we can remove rows that have missingness in any column.
See the drop_na() reference for examples.

4.9 `drop_na()` to remove rows with NA

nrow(yrbss_questions)

[1] 10000

# from the naniar package:
miss_var_summary(yrbss_questions)

# A tibble: 11 × 3
   variable           n_miss pct_miss
   <chr>               <int>    <num>
 1 text_driving         9206    92.1 
 2 ever_been_bullied    7608    76.1 
 3 cigarettes_per_day    617     6.17
 4 cigarettes10          617     6.17
 5 cigarettes10fac       617     6.17
 6 ever_smoked           584     5.84
 7 days_smoked           495     4.95
 8 bicycle_helmet        486     4.86
 9 age_first_smoked      457     4.57
10 record                  0     0   
11 year                    0     0

# remove rows with missing bicycle_helmet:
tmpdata <- yrbss_questions %>% 
  drop_na(bicycle_helmet) 
nrow(tmpdata)             # how many rows left?

[1] 9514

miss_var_summary(tmpdata) # which rows do not have NA?

# A tibble: 11 × 3
   variable           n_miss pct_miss
   <chr>               <int>    <num>
 1 text_driving         8736    91.8 
 2 ever_been_bullied    7192    75.6 
 3 cigarettes_per_day    578     6.08
 4 cigarettes10          578     6.08
 5 cigarettes10fac       578     6.08
 6 ever_smoked           505     5.31
 7 days_smoked           457     4.80
 8 age_first_smoked      424     4.46
 9 record                  0     0   
10 year                    0     0   
11 bicycle_helmet          0     0

# remove rows with missing smoking data
tmpdata <- yrbss_questions %>% 
  drop_na(ends_with("smoked")) 
nrow(tmpdata)             # how many rows left?

[1] 8931

miss_var_summary(tmpdata) # which rows do not have NA?

# A tibble: 11 × 3
   variable           n_miss pct_miss
   <chr>               <int>    <num>
 1 text_driving         8198   91.8  
 2 ever_been_bullied    6813   76.3  
 3 bicycle_helmet        375    4.20 
 4 cigarettes_per_day     34    0.381
 5 cigarettes10           34    0.381
 6 cigarettes10fac        34    0.381
 7 record                  0    0    
 8 year                    0    0    
 9 ever_smoked             0    0    
10 age_first_smoked        0    0    
11 days_smoked             0    0

# remove rows with any missing data
tmpdata <- yrbss_questions %>% 
  drop_na() 
nrow(tmpdata)             # how many rows left?

[1] 711

miss_var_summary(tmpdata) # which rows do not have NA?

# A tibble: 11 × 3
   variable           n_miss pct_miss
   <chr>               <int>    <num>
 1 record                  0        0
 2 year                    0        0
 3 bicycle_helmet          0        0
 4 text_driving            0        0
 5 ever_smoked             0        0
 6 age_first_smoked        0        0
 7 days_smoked             0        0
 8 cigarettes_per_day      0        0
 9 ever_been_bullied       0        0
10 cigarettes10            0        0
11 cigarettes10fac         0        0

5 (§2) Wrangle demographics data

Merge together (stack) the 3 files with bind_rows()
Clean age_and_grade
- separate into two columns
- make values numeric (stringr package)

6 Stack multiple datasets with `bind_rows()`

6.1 When do we stack data?

Often we have data from multiple sources that we need to combine into one data frame.
- In the most simple scenario, the data are measuring the same or similar things and are from different and distinct populations, i.e. from different labs, different clinics, etc.
In this case, we have YRBSS demographic data from 3 different ranges of sampling years.
- They represent data from 3 different cohorts of students across different times (no overlapping student record ids! This is an important detail).
Goal: stack the data on top of each other into one long data set that contains every year’s data.
To do this we can use the dplyr function bind_rows().

6.2 Our 3 datasets to stack

Let’s look at our three data sets:

glimpse(yrbss_demo_1991)

Rows: 1,000
Columns: 6
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …

glimpse(yrbss_demo_1993to2007)

Rows: 8,000
Columns: 8
$ record_id     <dbl> 121641, 107323, 116829, 116696, 109658, 118948, 107150, …
$ year          <dbl> 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 19…
$ sex           <chr> "Female", "Male", "Female", "Female", "Male", "Female", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ age_and_grade <chr> "17 years old;10th", "14 years old;9th", "15 years old;1…
$ race4         <chr> "Hispanic/Latino", "Black or African American", "White",…
$ race7         <chr> "Hispanic/Latino", "Black or African American", "White",…

glimpse(yrbss_demo_2009to2013)

Rows: 3,000
Columns: 8
$ `Record id`     <dbl> 931939, 927037, 924426, 923751, 929473, 938296, 931286…
$ year            <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, …
$ `age and grade` <chr> "15 years old;9th", "17 years old;11th", "15 years old…
$ sex             <chr> "Female", "Male", "Male", "Male", "Male", "Female", "F…
$ `Race 4 cat`    <chr> "White", "White", NA, "White", "White", "All other rac…
$ `Race 7 cat`    <chr> "White", "White", NA, "White", "White", "Asian", "Mult…
$ `BMI (kg/m^2)`  <dbl> 18.2076, 22.7338, 24.4376, 21.4746, 25.0709, 20.8936, …
$ `Weight (kg)`   <dbl> 52.62, 68.04, 74.84, 68.04, 70.76, 58.97, 58.97, 70.76…

6.3 Stack 1991 and 1993-2007 data

We’ll start by binding together/stacking the data from 1991 and 1993-2007.
First, let’s check those column names again:

colnames(yrbss_demo_1991)

[1] "record_id"     "year"          "sex"           "age_and_grade"
[5] "race4"         "race7"

dim(yrbss_demo_1991)

[1] 1000    6

colnames(yrbss_demo_1993to2007)

[1] "record_id"     "year"          "sex"           "bmi"          
[5] "weight_kg"     "age_and_grade" "race4"         "race7"

dim(yrbss_demo_1993to2007)

[1] 8000    8

We can see that BMI and weight are missing from the 1991 data.
But bind_rows() still works as intended, filling in NA values in those columns.

yrbss_demo_1991to2007 <- bind_rows(yrbss_demo_1991, 
                                   yrbss_demo_1993to2007)

# note we now have 9000 rows, which is the sum of the 1000 and 8000 rows above
glimpse(yrbss_demo_1991to2007)

Rows: 9,000
Columns: 8
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Let’s check we have data from the years we want:

yrbss_demo_1991to2007 %>% 
  tabyl(year)

 year    n   percent
 1991 1000 0.1111111
 1993 1000 0.1111111
 1995 1000 0.1111111
 1997 1000 0.1111111
 1999 1000 0.1111111
 2001 1000 0.1111111
 2003 1000 0.1111111
 2005 1000 0.1111111
 2007 1000 0.1111111

6.4 Stack 1991-2007 with 2009-2013 data

Now we want to combine this dataset and the 2009-2013 data.

Let’s check out their info first:

colnames(yrbss_demo_1991to2007)

[1] "record_id"     "year"          "sex"           "age_and_grade"
[5] "race4"         "race7"         "bmi"           "weight_kg"

dim(yrbss_demo_1991to2007)

[1] 9000    8

colnames(yrbss_demo_2009to2013)

[1] "Record id"     "year"          "age and grade" "sex"          
[5] "Race 4 cat"    "Race 7 cat"    "BMI (kg/m^2)"  "Weight (kg)"

dim(yrbss_demo_2009to2013)

[1] 3000    8

Uh oh, these datasets have different names for the same thing.
Also, the columns are all out of order!

If we try to bind them together, what happens?

bind_rows(yrbss_demo_1991to2007, yrbss_demo_2009to2013) %>% glimpse()

Rows: 12,000
Columns: 14
$ record_id       <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873…
$ year            <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, …
$ sex             <chr> "Male", "Female", "Male", "Female", "Female", "Male", …
$ age_and_grade   <chr> "17 years old;12th", "17 years old;12th", "16 years ol…
$ race4           <chr> "White", "Black or African American", "White", "White"…
$ race7           <chr> "White", "Black or African American", "White", "White"…
$ bmi             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ weight_kg       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Record id`     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `age and grade` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Race 4 cat`    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Race 7 cat`    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `BMI (kg/m^2)`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Weight (kg)`   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Yikes.
- What extra columns do we have?
- What matched and what didn’t?

We need to first do some data cleaning!

Let’s make sure our names in the 2009-2013 data match the first dataset’s column names.
- First use janitor::clean_names to make things a little cleaner:

yrbss_demo_2009to2013 <- yrbss_demo_2009to2013 %>% 
  clean_names()

colnames(yrbss_demo_2009to2013)

[1] "record_id"     "year"          "age_and_grade" "sex"          
[5] "race_4_cat"    "race_7_cat"    "bmi_kg_m_2"    "weight_kg"

yrbss_demo_2009to2013 <- yrbss_demo_2009to2013 %>%
  rename(race4 = race_4_cat,
         race7 = race_7_cat,
         bmi = bmi_kg_m_2)


# look at the column names now:
colnames(yrbss_demo_1991to2007)

[1] "record_id"     "year"          "sex"           "age_and_grade"
[5] "race4"         "race7"         "bmi"           "weight_kg"

colnames(yrbss_demo_2009to2013)

[1] "record_id"     "year"          "age_and_grade" "sex"          
[5] "race4"         "race7"         "bmi"           "weight_kg"

# or, for a side-by-side comparison using cbind (column bind)
cbind(names(yrbss_demo_1991to2007), names(yrbss_demo_2009to2013))

     [,1]            [,2]           
[1,] "record_id"     "record_id"    
[2,] "year"          "year"         
[3,] "sex"           "age_and_grade"
[4,] "age_and_grade" "sex"          
[5,] "race4"         "race4"        
[6,] "race7"         "race7"        
[7,] "bmi"           "bmi"          
[8,] "weight_kg"     "weight_kg"

# we can check that we changed all the names to match names in the other data
colnames(yrbss_demo_2009to2013) %in% colnames(yrbss_demo_1991to2007)

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

sum(colnames(yrbss_demo_2009to2013) %in% colnames(yrbss_demo_1991to2007))

[1] 8

Looks like all the names are the same, but they are still out of order.
This is ok! bind_rows() will use the ordering from the first data set, and knows how to match them up since they have identical names.

Now let’s bind the data:

yrbss_demo <- bind_rows(yrbss_demo_1991to2007, 
                        yrbss_demo_2009to2013)

# we could also do this with the pipe
yrbss_demo <- yrbss_demo_1991to2007 %>%
  bind_rows(yrbss_demo_2009to2013)

glimpse(yrbss_demo)

Rows: 12,000
Columns: 8
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

# check the years in the data:
yrbss_demo %>% tabyl(year)

 year    n    percent
 1991 1000 0.08333333
 1993 1000 0.08333333
 1995 1000 0.08333333
 1997 1000 0.08333333
 1999 1000 0.08333333
 2001 1000 0.08333333
 2003 1000 0.08333333
 2005 1000 0.08333333
 2007 1000 0.08333333
 2009 1000 0.08333333
 2011 1000 0.08333333
 2013 1000 0.08333333

😃 Success!

6.5 Examine the stacked data

Before we start with the data cleaning, we need to acquaint ourselves with the data:

glimpse(yrbss_demo)

Rows: 12,000
Columns: 8
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

# quick summary of the data
gtsummary::tbl_summary(yrbss_demo)

Characteristic	N = 12,000¹
record_id	454,489 (223,487, 848,375)
year	2,002 (1,996, 2,008)
sex
Female	5,918 (50%)
Male	6,012 (50%)
Unknown	70
age_and_grade
12 years old or younger;10th	4 (<0.1%)
12 years old or younger;11th	4 (<0.1%)
12 years old or younger;12th	6 (<0.1%)
12 years old or younger;9th	6 (<0.1%)
12 years old or younger;NA	9 (<0.1%)
13 years old;10th	2 (<0.1%)
13 years old;12th	1 (<0.1%)
13 years old;9th	14 (0.1%)
13 years old;NA	1 (<0.1%)
14 years old;10th	19 (0.2%)
14 years old;11th	2 (<0.1%)
14 years old;12th	1 (<0.1%)
14 years old;9th	1,258 (10%)
14 years old;NA	7 (<0.1%)
15 years old;10th	1,182 (9.9%)
15 years old;11th	28 (0.2%)
15 years old;12th	3 (<0.1%)
15 years old;9th	1,457 (12%)
15 years old;NA	9 (<0.1%)
16 years old;10th	1,376 (11%)
16 years old;11th	1,165 (9.7%)
16 years old;12th	36 (0.3%)
16 years old;9th	442 (3.7%)
16 years old;NA	14 (0.1%)
17 years old;10th	364 (3.0%)
17 years old;11th	1,327 (11%)
17 years old;12th	1,248 (10%)
17 years old;9th	55 (0.5%)
17 years old;NA	14 (0.1%)
18 years old or older;10th	46 (0.4%)
18 years old or older;11th	389 (3.2%)
18 years old or older;12th	1,430 (12%)
18 years old or older;9th	17 (0.1%)
18 years old or older;NA	14 (0.1%)
NA;10th	4 (<0.1%)
NA;11th	3 (<0.1%)
NA;12th	7 (<0.1%)
NA;9th	2 (<0.1%)
NA;NA	34 (0.3%)
race4
All other races	1,697 (14%)
Black or African American	2,905 (25%)
Hispanic/Latino	3,102 (26%)
White	4,072 (35%)
Unknown	224
race7
Am Indian / Alaska Native	230 (2.0%)
Asian	696 (6.0%)
Black or African American	2,905 (25%)
Hispanic/Latino	3,102 (27%)
Multiple - Non-Hispanic	415 (3.6%)
Native Hawaiian/other PI	110 (1.0%)
White	4,072 (35%)
Unknown	470
bmi	22.3 (20.1, 25.6)
Unknown	4,565
weight_kg	64 (56, 75)
Unknown	4,629
¹ Median (Q1, Q3); n (%)

All of these columns look pretty standard except age and grade. What even is?

7 Create `age` and `grade` columns

Goal: create two separate numeric columns with just that ages and grades

yrbss_demo %>% tabyl(age_and_grade)

                age_and_grade    n      percent
 12 years old or younger;10th    4 3.333333e-04
 12 years old or younger;11th    4 3.333333e-04
 12 years old or younger;12th    6 5.000000e-04
  12 years old or younger;9th    6 5.000000e-04
   12 years old or younger;NA    9 7.500000e-04
            13 years old;10th    2 1.666667e-04
            13 years old;12th    1 8.333333e-05
             13 years old;9th   14 1.166667e-03
              13 years old;NA    1 8.333333e-05
            14 years old;10th   19 1.583333e-03
            14 years old;11th    2 1.666667e-04
            14 years old;12th    1 8.333333e-05
             14 years old;9th 1258 1.048333e-01
              14 years old;NA    7 5.833333e-04
            15 years old;10th 1182 9.850000e-02
            15 years old;11th   28 2.333333e-03
            15 years old;12th    3 2.500000e-04
             15 years old;9th 1457 1.214167e-01
              15 years old;NA    9 7.500000e-04
            16 years old;10th 1376 1.146667e-01
            16 years old;11th 1165 9.708333e-02
            16 years old;12th   36 3.000000e-03
             16 years old;9th  442 3.683333e-02
              16 years old;NA   14 1.166667e-03
            17 years old;10th  364 3.033333e-02
            17 years old;11th 1327 1.105833e-01
            17 years old;12th 1248 1.040000e-01
             17 years old;9th   55 4.583333e-03
              17 years old;NA   14 1.166667e-03
   18 years old or older;10th   46 3.833333e-03
   18 years old or older;11th  389 3.241667e-02
   18 years old or older;12th 1430 1.191667e-01
    18 years old or older;9th   17 1.416667e-03
     18 years old or older;NA   14 1.166667e-03
                      NA;10th    4 3.333333e-04
                      NA;11th    3 2.500000e-04
                      NA;12th    7 5.833333e-04
                       NA;9th    2 1.666667e-04
                        NA;NA   34 2.833333e-03

7.1 Separate `age_and_grade` into `age` and `grade`

Ok, yes, this combined column was combined on purpose, so that we can introduce the very useful function separate_wider_delim().
Check out ?separate_wider_delim.
- Until recently this function was known as just separate(), which has been superseded with separate_wider_delim() andseparate_wider_position().

separate_wider_delim(): Split a string into columns by delimiter

In our case, age and grade were merged together with a semicolon (;)
- which makes it easy to split into two columns
Note that these data are NOT “tidy” because they contain two pieces of information in one column. Let’s make them tidy.

(Side note: the opposite of separate_wider_delim() is unite(), look at some examples of this.)

7.2 `separate_wider_delim()`

yrbss_demo <- yrbss_demo %>%
  separate_wider_delim(
    # the column we are separating
    cols = age_and_grade,
    
    # names of new columns, we know we need two columns, but this can be variable!
    names = c("age", "grade"),
    
    # delimiter to separate into two columns
    delim = ";",
    
    # keep the original column
    cols_remove = FALSE 
    ) %>%
  
  # make sure "NA" turns into R's missing NA value
  mutate(age = na_if(age, "NA"),
         grade = na_if(grade, "NA"))

That’s a start. We now have tidy data in the sense that each cell has one piece of information in it.
However, both age and grade are character vectors, but we want them to be numeric.

glimpse(yrbss_demo)

Rows: 12,000
Columns: 10
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

yrbss_demo %>% tabyl(age_and_grade, age) %>% adorn_title()

                                                  age                          
                age_and_grade 12 years old or younger 13 years old 14 years old
 12 years old or younger;10th                       4            0            0
 12 years old or younger;11th                       4            0            0
 12 years old or younger;12th                       6            0            0
  12 years old or younger;9th                       6            0            0
   12 years old or younger;NA                       9            0            0
            13 years old;10th                       0            2            0
            13 years old;12th                       0            1            0
             13 years old;9th                       0           14            0
              13 years old;NA                       0            1            0
            14 years old;10th                       0            0           19
            14 years old;11th                       0            0            2
            14 years old;12th                       0            0            1
             14 years old;9th                       0            0         1258
              14 years old;NA                       0            0            7
            15 years old;10th                       0            0            0
            15 years old;11th                       0            0            0
            15 years old;12th                       0            0            0
             15 years old;9th                       0            0            0
              15 years old;NA                       0            0            0
            16 years old;10th                       0            0            0
            16 years old;11th                       0            0            0
            16 years old;12th                       0            0            0
             16 years old;9th                       0            0            0
              16 years old;NA                       0            0            0
            17 years old;10th                       0            0            0
            17 years old;11th                       0            0            0
            17 years old;12th                       0            0            0
             17 years old;9th                       0            0            0
              17 years old;NA                       0            0            0
   18 years old or older;10th                       0            0            0
   18 years old or older;11th                       0            0            0
   18 years old or older;12th                       0            0            0
    18 years old or older;9th                       0            0            0
     18 years old or older;NA                       0            0            0
                      NA;10th                       0            0            0
                      NA;11th                       0            0            0
                      NA;12th                       0            0            0
                       NA;9th                       0            0            0
                        NA;NA                       0            0            0
                                                                 
 15 years old 16 years old 17 years old 18 years old or older NA_
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
            0            0            0                     0   0
         1182            0            0                     0   0
           28            0            0                     0   0
            3            0            0                     0   0
         1457            0            0                     0   0
            9            0            0                     0   0
            0         1376            0                     0   0
            0         1165            0                     0   0
            0           36            0                     0   0
            0          442            0                     0   0
            0           14            0                     0   0
            0            0          364                     0   0
            0            0         1327                     0   0
            0            0         1248                     0   0
            0            0           55                     0   0
            0            0           14                     0   0
            0            0            0                    46   0
            0            0            0                   389   0
            0            0            0                  1430   0
            0            0            0                    17   0
            0            0            0                    14   0
            0            0            0                     0   4
            0            0            0                     0   3
            0            0            0                     0   7
            0            0            0                     0   2
            0            0            0                     0  34

yrbss_demo %>% tabyl(age_and_grade, grade) %>% adorn_title()

                              grade                   
                age_and_grade  10th 11th 12th  9th NA_
 12 years old or younger;10th     4    0    0    0   0
 12 years old or younger;11th     0    4    0    0   0
 12 years old or younger;12th     0    0    6    0   0
  12 years old or younger;9th     0    0    0    6   0
   12 years old or younger;NA     0    0    0    0   9
            13 years old;10th     2    0    0    0   0
            13 years old;12th     0    0    1    0   0
             13 years old;9th     0    0    0   14   0
              13 years old;NA     0    0    0    0   1
            14 years old;10th    19    0    0    0   0
            14 years old;11th     0    2    0    0   0
            14 years old;12th     0    0    1    0   0
             14 years old;9th     0    0    0 1258   0
              14 years old;NA     0    0    0    0   7
            15 years old;10th  1182    0    0    0   0
            15 years old;11th     0   28    0    0   0
            15 years old;12th     0    0    3    0   0
             15 years old;9th     0    0    0 1457   0
              15 years old;NA     0    0    0    0   9
            16 years old;10th  1376    0    0    0   0
            16 years old;11th     0 1165    0    0   0
            16 years old;12th     0    0   36    0   0
             16 years old;9th     0    0    0  442   0
              16 years old;NA     0    0    0    0  14
            17 years old;10th   364    0    0    0   0
            17 years old;11th     0 1327    0    0   0
            17 years old;12th     0    0 1248    0   0
             17 years old;9th     0    0    0   55   0
              17 years old;NA     0    0    0    0  14
   18 years old or older;10th    46    0    0    0   0
   18 years old or older;11th     0  389    0    0   0
   18 years old or older;12th     0    0 1430    0   0
    18 years old or older;9th     0    0    0   17   0
     18 years old or older;NA     0    0    0    0  14
                      NA;10th     4    0    0    0   0
                      NA;11th     0    3    0    0   0
                      NA;12th     0    0    7    0   0
                       NA;9th     0    0    0    2   0
                        NA;NA     0    0    0    0  34

8 Working with text: `stringr` package

Both of the age and grade columns are character vectors with “extra text”, and there’s a tidyverse package that is ALL about working with characters or “strings”, called stringr.
- It’s automatically loaded when you load the tidyverse package (library(tidyverse) or pacman::p_load(tidyverse)).
- I highly recommend that you read the introduction to stringr vignette and look at the cheatsheet.

8.1 Goal 1: clean the `grade` variable

For the column grade, we can see that the values are numbers followed by “th”:

yrbss_demo %>% tabyl(grade)

 grade    n   percent valid_percent
  10th 2997 0.2497500     0.2518911
  11th 2918 0.2431667     0.2452513
  12th 2732 0.2276667     0.2296184
   9th 3251 0.2709167     0.2732392
  <NA>  102 0.0085000            NA

We can remove that string with the function str_remove_all():

yrbss_demo <- yrbss_demo %>%
  mutate(grade_num = str_remove_all(grade, "th") )

yrbss_demo %>% tabyl(grade_num, grade) %>% adorn_title()

           grade                   
 grade_num  10th 11th 12th  9th NA_
        10  2997    0    0    0   0
        11     0 2918    0    0   0
        12     0    0 2732    0   0
         9     0    0    0 3251   0
      <NA>     0    0    0    0 102

We successfully removed th from those values,
- however, grade_num is still character type:

glimpse(yrbss_demo)

Rows: 12,000
Columns: 11
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <chr> "12", "12", "9", "10", "11", "11", "11", "12", "9", "11"…

We want grade to be a numeric variable, and we can use as.numeric to change the character vector of numbers into a numeric vector of numbers.
See how as.numeric() works:

c("1", "2", "3", NA)

[1] "1" "2" "3" NA

as.numeric(c("1", "2", "3", NA))

[1]  1  2  3 NA

Careful, though, we will get a warning if there are NA values “introduced by coercion” which means some of our character values are not obviously numbers!
- See example below.
- (Side note: this is how I realized I needed the na_if() statements above to convert “NA” to NA in age and grade).

c("1", "2", "3", NA, "a")

[1] "1" "2" "3" NA  "a"

as.numeric(c("1", "2", "3", NA, "a"))

[1]  1  2  3 NA NA

It’s a good idea to use tabyl() on your character string first before converting to numeric, to make sure there are no surprises,
- and a cross-tabyl after to double check it worked properly.

yrbss_demo <- yrbss_demo %>%
  mutate(grade_num = as.numeric(grade_num))

yrbss_demo %>% tabyl(grade, grade_num) %>% adorn_title()

       grade_num                   
 grade         9   10   11   12 NA_
  10th         0 2997    0    0   0
  11th         0    0 2918    0   0
  12th         0    0    0 2732   0
   9th      3251    0    0    0   0
  <NA>         0    0    0    0 102

glimpse(yrbss_demo)

Rows: 12,000
Columns: 11
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…

8.2 Goal 2: clean the `age` variable

yrbss_demo %>% tabyl(age)

                     age    n     percent valid_percent
 12 years old or younger   29 0.002416667   0.002426778
            13 years old   18 0.001500000   0.001506276
            14 years old 1287 0.107250000   0.107698745
            15 years old 2679 0.223250000   0.224184100
            16 years old 3033 0.252750000   0.253807531
            17 years old 3008 0.250666667   0.251715481
   18 years old or older 1896 0.158000000   0.158661088
                    <NA>   50 0.004166667            NA

We see that there are some complications:
- ≤ 12 are grouped together and
- ≥ 18 are grouped together.
Let’s pretend that we want the categories to just be the numeric values 12-18,
- so group ≤ 12 as 12, and group ≥ 18 as 18.
- Obviously this isn’t ideal, but for learning purposes let’s do this.
There are several ways to do that (including using case_when() for example, or even separate_wider_delim()), but let’s use the stringr package first.

8.3 G2 Option 1: `str_remove_all()` to clean `age`

# try removing the string " years old"
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, " years old")
    ) %>%
  tabyl(age_num) # almost there

       age_num    n     percent valid_percent
 12 or younger   29 0.002416667   0.002426778
            13   18 0.001500000   0.001506276
            14 1287 0.107250000   0.107698745
            15 2679 0.223250000   0.224184100
            16 3033 0.252750000   0.253807531
            17 3008 0.250666667   0.251715481
   18 or older 1896 0.158000000   0.158661088
          <NA>   50 0.004166667            NA

# remove all the strings manually
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, " years old"),
    age_num = str_remove_all(age_num, " or younger"),
    age_num = str_remove_all(age_num, " or older")
    ) %>%
  tabyl(age_num) # looks good, now we just need to make it numeric

 age_num    n     percent valid_percent
      12   29 0.002416667   0.002426778
      13   18 0.001500000   0.001506276
      14 1287 0.107250000   0.107698745
      15 2679 0.223250000   0.224184100
      16 3033 0.252750000   0.253807531
      17 3008 0.250666667   0.251715481
      18 1896 0.158000000   0.158661088
    <NA>   50 0.004166667            NA

# remove all the strings manually & make age numeric
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, " years old"),
    age_num = str_remove_all(age_num, " or younger"),
    age_num = str_remove_all(age_num, " or older"),
    age_num = as.numeric(age_num)
    ) %>%
  tabyl(age_num) # perfect

 age_num    n     percent valid_percent
      12   29 0.002416667   0.002426778
      13   18 0.001500000   0.001506276
      14 1287 0.107250000   0.107698745
      15 2679 0.223250000   0.224184100
      16 3033 0.252750000   0.253807531
      17 3008 0.250666667   0.251715481
      18 1896 0.158000000   0.158661088
      NA   50 0.004166667            NA

8.4 G2 Option 2: `str_remove_all()` with regex

Regex (short for regular expressions) is a way to search strings using special syntax.

Remove all letters "[a-z]"
Remove specified words

We could have removed all of the alphabet letters in one step with regex:

# remove all the alphabet using regex
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, "[a-z]"),
    age_num = as.numeric(age_num)
    ) %>%
  tabyl(age_num)

 age_num    n     percent valid_percent
      12   29 0.002416667   0.002426778
      13   18 0.001500000   0.001506276
      14 1287 0.107250000   0.107698745
      15 2679 0.223250000   0.224184100
      16 3033 0.252750000   0.253807531
      17 3008 0.250666667   0.251715481
      18 1896 0.158000000   0.158661088
      NA   50 0.004166667            NA

Or, we can remove specific words with regex (recall | means or):

# remove a set of words with regex |
yrbss_demo %>% 
  mutate(
    age_num = str_remove_all(age, "years|old|or|younger|older"),
    age_num = as.numeric(age_num)
    ) %>%
  tabyl(age_num)

 age_num    n     percent valid_percent
      12   29 0.002416667   0.002884424
      13   18 0.001500000   0.001790332
      14 1287 0.107250000   0.128008753
      15 2679 0.223250000   0.266461110
      16 3033 0.252750000   0.301670977
      17 3008 0.250666667   0.299184404
      NA 1946 0.162166667            NA

What went wrong with the code above?
- How do we fix it?

8.5 G2 Option 3: Clever separate for `age`

We could have used separate_wider_delim by being a bit clever!

What to use as the “separator” to keep the numbers in their own column?

yrbss_demo %>% tabyl(age)

                     age    n     percent valid_percent
 12 years old or younger   29 0.002416667   0.002426778
            13 years old   18 0.001500000   0.001506276
            14 years old 1287 0.107250000   0.107698745
            15 years old 2679 0.223250000   0.224184100
            16 years old 3033 0.252750000   0.253807531
            17 years old 3008 0.250666667   0.251715481
   18 years old or older 1896 0.158000000   0.158661088
                    <NA>   50 0.004166667            NA

yrbss_demo %>% 
  separate_wider_delim(
    cols = age, 
    names = c("age_num","old"),
    delim = " years",  # our data has this phrase after the number
    cols_remove = FALSE # keep the age column
    ) %>%
  glimpse()

Rows: 12,000
Columns: 13
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_num       <chr> "17", "17", "16", "16", "18", "15", "17", "18", "15", "1…
$ old           <chr> " old", " old", " old", " old", " old or older", " old",…
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…

# separate, and let's reassign it to yrbss_demo this time
yrbss_demo <- yrbss_demo %>% 
  separate_wider_delim(
    cols = age, 
    names = c("age_num","old"),
    delim = " years", 
    cols_remove = FALSE # keep the age column
    ) %>%
  mutate(age_num = as.numeric(age_num)) %>%
  select(-old) # remove the old column that we do not need

yrbss_demo %>% tabyl(age, age_num) # yep!

                     age 12 13   14   15   16   17   18 NA_
 12 years old or younger 29  0    0    0    0    0    0   0
            13 years old  0 18    0    0    0    0    0   0
            14 years old  0  0 1287    0    0    0    0   0
            15 years old  0  0    0 2679    0    0    0   0
            16 years old  0  0    0    0 3033    0    0   0
            17 years old  0  0    0    0    0 3008    0   0
   18 years old or older  0  0    0    0    0    0 1896   0
                    <NA>  0  0    0    0    0    0    0  50

glimpse(yrbss_demo)

Rows: 12,000
Columns: 12
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_num       <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18, 15, …
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…

8.6 G2 Option 4: `parse_number()` to clean `age`

parse_number() “parses the first number it finds, dropping any non-numeric characters before the first number and all characters after the first number.”
- Copied from help file.
parse_number() is from the reader package, which gets loaded as a one of the core tidyverse packages when tidyverse is loaded.

R code
check

glimpse(yrbss_demo)

Rows: 12,000
Columns: 12
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_num       <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18, 15, …
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…

yrbss_demo <- yrbss_demo %>% 
  mutate(age_parse = parse_number(age))

# check:
yrbss_demo %>% tabyl(age, age_parse) %>% adorn_title()

                         age_parse                                
                     age        12 13   14   15   16   17   18 NA_
 12 years old or younger        29  0    0    0    0    0    0   0
            13 years old         0 18    0    0    0    0    0   0
            14 years old         0  0 1287    0    0    0    0   0
            15 years old         0  0    0 2679    0    0    0   0
            16 years old         0  0    0    0 3033    0    0   0
            17 years old         0  0    0    0    0 3008    0   0
   18 years old or older         0  0    0    0    0    0 1896   0
                    <NA>         0  0    0    0    0    0    0  50

glimpse(yrbss_demo) # age_parse data type?

Rows: 12,000
Columns: 13
$ record_id     <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41873, …
$ year          <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ sex           <chr> "Male", "Female", "Male", "Female", "Female", "Male", "F…
$ age_num       <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18, 15, …
$ age           <chr> "17 years old", "17 years old", "16 years old", "16 year…
$ grade         <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11th", "…
$ age_and_grade <chr> "17 years old;12th", "17 years old;12th", "16 years old;…
$ race4         <chr> "White", "Black or African American", "White", "White", …
$ race7         <chr> "White", "Black or African American", "White", "White", …
$ bmi           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num     <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9, 9, 1…
$ age_parse     <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18, 15, …

Note that age_parse is already a numeric variable!

8.7 Challenge 2 (~10 minutes)

For yrbss_demo, use separate_wider_delim() to split the column grade into two columns by the letter t, that is, “12th” would become “12” and “h”, then remove that second column and convert the grade number column from a character column into a numeric column. Check your work with tabyl().

For yrbss_demo, use str_replace_all() to replace the appropriate characters to change
- “Am” to “American” in the category “Am Indian / Alaska Native”, and
- “PI” to “Pacific Islander” in the “Native Hawaiian/other PI” category.
- Check your work with tabyl().

9 (§3) Join questionnaire and demographics data

Now that we’ve wrangled the questionnaire and demographics datasets, we’re ready to merge (join) them together.

We have data on patients from two sources:
- one contains demographics,
- the other contains answers to survey questionnaires.
How do we combine these two data sets into one table?

9.1 Naming conventions

Whenever we join two tables, we will have - a left table (in this case, the x table) and - a right table (in this case, the y table).
key:
- In order to join the two tables, we have to somehow map the rows of our left table (x) with the rows of our other table (y).
- We do this by joining together rows with a common variable.
- In our case, we need to join rows based on the first column in both tables.
- The column that we join on is called a key.

Note

In the above example, we see that there is a row in x whose key is 2, and there is a row in y whose key is 2 as well. So it makes sense to join these two rows together into one row.

So, by matching the rows with identical keys, we can put together a table that incorporates information from both tables.

9.2 Types of joins

There are several different types of joins:
- inner join
- left join
- right join
- full join
- anti join

The most common usages require an inner join, a left join, or a full join.

9.3 Inner Joins

With an inner join, we only keep rows that have matching keys in both tables.

If there is no match in both tables, we don’t include the row with that key.
- In the example to the left, we don’t keep row 3 in table x or row 4 in table y.

9.4 Joining Syntax

inner_join(
  table_x, 
  table_y,
  join_by(key_column_x == key_column_y)
  )

We first specify the left table (table_x) and then the right table (table_y).
Note the join_by argument.
- This is the argument where we specify which column in each table contains our key.

Older R code syntax

inner_join(
  table_x, 
  table_y,
  by = c("key_column_x" = "key_column_y")
  )

9.5 The more `dplyr` way to do joins

The more dplyr way to do joins is below:

#start with left table
table_x %>% 

  #join left table with right table
  inner_join(
    y = table_y, 
    
  #point out the key column in each table
    join_by(key_column_x == key_column_y)
  )

We start with our left table, table_x.
The main difference is that we don’t need to specify the x argument in inner_join(), because we are piping table_x as the first argument of inner_join().

9.6 Full Joins

A full join keeps all the data from both tables.
You could use this as your default join, and then filter your data how you want it later. However, I tend to use left_join the most.
When there is no information from the left or right table, these rows will have an NA. That is, the information is missing for the columns that come from the other table.

9.7 Left Joins

In a left join, we keep all the rows in the left table regardless of whether there is a match in the right table.
In the example to the left, we keep row 3 in table x even though it doesn’t have a match in table y.
- Because there is no information from the right table, these rows will have an NA. That is, the information is missing for the columns that come from the right table.
I usually think of my left table as defining my study cohort. I only want to keep data on subjects in my left table, and the rest are not useful.

9.8 Toy example

Say we have a set of patients for whom we want to join their demographic information with their lab data, in particular their white blood cell count (WBC).
We also need to know whether there are patients in our set who haven’t gotten a WBC.
Because the lab system isn’t part of the electronic health record, we’ll need to make a join on the patient table to the lab table.

patient_table <- read_excel(here("part5", "data", "patient_example.xlsx"), sheet = 1)

wbc_table <- read_excel(here("part5", "data", "patient_example.xlsx"), sheet = 2)

Let’s look at patient_table, which we’ll use as our left table:

patient_table

# A tibble: 4 × 5
  patient_id first_name last_name   age date_of_birth      
       <dbl> <chr>      <chr>     <dbl> <dttm>             
1       2333 Scarlet    O'Hara       75 1945-11-14 00:00:00
2       7359 Neil       O'Hara       30 1990-11-20 00:00:00
3       1414 Ivy        Lee          43 1977-05-02 00:00:00
4       8424 May        Lynne        14 2006-02-08 00:00:00

Here is wbc_table, which we’ll use as our right table:

wbc_table

# A tibble: 4 × 2
  patient_id wbc_value
       <dbl>     <dbl>
1       2333      5000
2       7359     12000
3       1414      6000
4       4409      4000

Below we join the data with an inner_join().
- Why was an inner_join() chosen instead of the other join options?

patient_table %>%
  inner_join(y = wbc_table,
             join_by(patient_id == patient_id)
             )

# A tibble: 3 × 6
  patient_id first_name last_name   age date_of_birth       wbc_value
       <dbl> <chr>      <chr>     <dbl> <dttm>                  <dbl>
1       2333 Scarlet    O'Hara       75 1945-11-14 00:00:00      5000
2       7359 Neil       O'Hara       30 1990-11-20 00:00:00     12000
3       1414 Ivy        Lee          43 1977-05-02 00:00:00      6000

9.9 When do we use which join?

See this two-table verbs vignette and this cheatsheet for some extra explanations.

These joins are created to match SQL joins for databases.

inner_join = You only want rows that are in both tables
left_join = You only want rows that are in one table.
- Often the right table is a subset of the left table, so it’s easy to use this to keep everything in the bigger table, and join on the smaller table
- If the left table contains a cohort of interest, i.e. everyone that has been given a specific treatment, and you want to get their lab values from another table, use left_join() to add those lab values in the cohort defined by the left table
right_join = maybe never
- right_join does the same thing as left_join but backwards, I find left join easier to think about (personal preference)
full_join = does not remove any rows, you might want to use this as your default and filter later
anti_join and semi_join = filtering joins, probably use rarely,
- anti_join: use right table as an exclusion criteria and find unmatched keys in left table
- semi_join: filter left table based on keys in right table, and keep only columns from left table

9.10 Warning

Watch out for duplicated “keys” when you join (multiple observations), or you will end up with duplicated data rows.

See R for Data Science 1st Ed for a section on duplicated keys.

9.11 Your Turn

Modify the code below to do a left_join() on patient_table and wbc_table.
- Which patient in patient_table didn’t get a WBC?

patient_table %>%
  inner_join(y = wbc_table,
            join_by(patient_id == patient_id)
            )

9.12 Challenge 3: Join the YRBSS data (~5 minutes)

Use left_join() to join yrbss_questions and yrbss_demo into a dataset called yrbss_all so that we retain all data combined into one table.
- Remember, yrbss_questions is a random sample of the demographic data,
- so think about which is your left table and your right table.

Double check your join by looking at glimpse(), did you select the right variable to join by?

10 Reshaping data

10.1 Motivation

The last thing we’ll talk about today is the difference between the wide data format and the long data format.

A really common format in bioinformatics is the expression matrix.
In an expression matrix,
- the rows correspond to different genes and
- the columns correspond to different samples.

expression_example <- read_csv(here("part5", "data", "expression_example.csv"))

expression_example

# A tibble: 4 × 5
  gene_symbol tcga_110 tcga_332 tcga_548 tcga_523
  <chr>          <dbl>    <dbl>    <dbl>    <dbl>
1 PTEN             4.1      5.6      2.2      2.1
2 MTOR             5.6      6.6      3.1      3.2
3 RKIP             1        1.4      1.2      1.2
4 KCNJ9            2.3      1.1      0.5      0.2

Note: TCGA is short for The Cancer Genome Atlas.

10.2 Why make our data long?

This format isn’t super useful for a lot of data manipulations.
It’s because the columns correspond to samples, and this prevents us from plotting things like PTEN expression across the samples.
This kind of data is in a format that is called wide format. What we need to do is collapse the numerical values into a single column.
Wide data are typical for longitudinal data, where columns correspond to time points.
- Often we need to reshape this data into long format in order to analyze it or plot it, where each row corresponds to an observation of a subject at one time point.
- We would have a column “time” that tells us what time point the observation was taken (i.e. baseline, 6 months, 12 months, etc).

10.3 Wide vs. long data

10.4 The magic of `pivot_longer()`

pivot_longer() is the function that will let us make our wide data long.

It takes three arguments:

cols - the columns to collapse.
- We talked about tidyselect verbs (the language we use for select() and across()), and we will use those here to help select columns based on naming criteria. Below we’re using the starts_with() selector to grab the columns that begin with tcga
names_to - the name of the column where the labels will be consolidated to
values_to - the name of the column where the values in the columns in the wide data will be consolidated to.

10.5 R code for `pivot_longer()`

expression_example  # wide data

# A tibble: 4 × 5
  gene_symbol tcga_110 tcga_332 tcga_548 tcga_523
  <chr>          <dbl>    <dbl>    <dbl>    <dbl>
1 PTEN             4.1      5.6      2.2      2.1
2 MTOR             5.6      6.6      3.1      3.2
3 RKIP             1        1.4      1.2      1.2
4 KCNJ9            2.3      1.1      0.5      0.2

expression_long <- 
  expression_example %>%
    pivot_longer(
      cols= starts_with("tcga"), 
      # column names are moved to a column that we are choosing to call "sample"
      names_to = "sample", 
      # the data points from the "body" of the table are moved to a column 
      # that we are choosing to call "expression"
      values_to = "expression" 
    )

expression_long

# A tibble: 16 × 3
   gene_symbol sample   expression
   <chr>       <chr>         <dbl>
 1 PTEN        tcga_110        4.1
 2 PTEN        tcga_332        5.6
 3 PTEN        tcga_548        2.2
 4 PTEN        tcga_523        2.1
 5 MTOR        tcga_110        5.6
 6 MTOR        tcga_332        6.6
 7 MTOR        tcga_548        3.1
 8 MTOR        tcga_523        3.2
 9 RKIP        tcga_110        1  
10 RKIP        tcga_332        1.4
11 RKIP        tcga_548        1.2
12 RKIP        tcga_523        1.2
13 KCNJ9       tcga_110        2.3
14 KCNJ9       tcga_332        1.1
15 KCNJ9       tcga_548        0.5
16 KCNJ9       tcga_523        0.2

Below is a different way of selecting the columns:

expression_long <- 
  expression_example %>%
    pivot_longer(
      cols= -gene_symbol,  # everything except gene_symbol
      names_to = "sample", # column names are moved to a column called sample
      values_to = "expression" # the data points go to a column called expression
    )

expression_long

# A tibble: 16 × 3
   gene_symbol sample   expression
   <chr>       <chr>         <dbl>
 1 PTEN        tcga_110        4.1
 2 PTEN        tcga_332        5.6
 3 PTEN        tcga_548        2.2
 4 PTEN        tcga_523        2.1
 5 MTOR        tcga_110        5.6
 6 MTOR        tcga_332        6.6
 7 MTOR        tcga_548        3.1
 8 MTOR        tcga_523        3.2
 9 RKIP        tcga_110        1  
10 RKIP        tcga_332        1.4
11 RKIP        tcga_548        1.2
12 RKIP        tcga_523        1.2
13 KCNJ9       tcga_110        2.3
14 KCNJ9       tcga_332        1.1
15 KCNJ9       tcga_548        0.5
16 KCNJ9       tcga_523        0.2

10.6 ggplot loves long data

Creating figures with ggplot works best with long data, so that we can make use of different variables for different specifications in figures.

Boxplot with fill color
Heatmaps with geom_tile

ggplot(expression_long) +
  aes(x = gene_symbol, 
      y = expression, 
      fill = gene_symbol) +
  geom_boxplot()

expression_long <- expression_long %>%
  mutate(
    gene_symbol = factor(
      gene_symbol, 
      levels = c("PTEN", "MTOR", "RKIP", "KCNJ9")
      ),
    # use forcats::fct_rev to reverse levels for plotting
    gene_symbol = fct_rev(gene_symbol)
    )

ggplot(expression_long) +
  aes(x = sample, 
      y = gene_symbol, 
      fill = log2(expression)) +
  geom_tile() +
  # new scale! this is a 3 color gradient (low-mid-high)
  scale_fill_gradient2(
    low = "yellow", 
    mid = "white", 
    high = "darkorchid4")

10.7 `facet_wrap()` - another reason to make long data!

Goal: Make a barplot of every categorical column in yrbss_all.

join the data
one barplot

# from Challenge above, but, making sure we have this joined data
yrbss_all <- left_join(
  yrbss_demo, 
  yrbss_questions, 
  join_by(record_id == record, 
          year == year)
  ) 

glimpse(yrbss_all)

Rows: 12,000
Columns: 22
$ record_id          <dbl> 34784, 40522, 32434, 38814, 42208, 31229, 40294, 41…
$ year               <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 199…
$ sex                <chr> "Male", "Female", "Male", "Female", "Female", "Male…
$ age_num            <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18,…
$ age                <chr> "17 years old", "17 years old", "16 years old", "16…
$ grade              <chr> "12th", "12th", "9th", "10th", "11th", "11th", "11t…
$ age_and_grade      <chr> "17 years old;12th", "17 years old;12th", "16 years…
$ race4              <chr> "White", "Black or African American", "White", "Whi…
$ race7              <chr> "White", "Black or African American", "White", "Whi…
$ bmi                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ weight_kg          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ grade_num          <dbl> 12, 12, 9, 10, 11, 11, 11, 12, 9, 11, 12, 9, 11, 9,…
$ age_parse          <dbl> 17, 17, 16, 16, 18, 15, 17, 18, 15, 17, 17, 15, 18,…
$ bicycle_helmet     <chr> "Most of the time wore a helmet", "Never wore a hel…
$ text_driving       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ever_smoked        <chr> "Yes", "No", NA, NA, "Yes", "Yes", "Yes", "Yes", "N…
$ age_first_smoked   <chr> "13 or 14 years old", "Never smoked a cigarette", N…
$ days_smoked        <chr> "0 days", "0 days", NA, NA, "0 days", "1 or 2 days"…
$ cigarettes_per_day <fct> Did not smoke cigarettes, Did not smoke cigarettes,…
$ ever_been_bullied  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ cigarettes10       <fct> None, None, NA, NA, None, 0-10, NA, 0-10, None, Non…
$ cigarettes10fac    <fct> None, None, NA, NA, None, 0-10, NA, 0-10, None, Non…

We can make an individual barplot of just one variable with geom_bar():

ggplot(yrbss_all, aes(x=race4)) + 
  geom_bar()

10.8 Make the data long

If we make the data long(er) we can create many barplots at once for many different variables in the dataset.
We will use pivot_longer() combined with facet_wrap():

yrbss_long <- yrbss_all %>%
  mutate(across(where(is.factor), as.character)) %>%
  pivot_longer(
    cols = where(is.character), # all our character vectors
    names_to = "variable_name", # column names are moved to a column = variable_name
    values_to = "value" # the data points are in a column = value
  )

# check:
glimpse(yrbss_long) # note the number or rows!

Rows: 168,000
Columns: 10
$ record_id         <dbl> 34784, 34784, 34784, 34784, 34784, 34784, 34784, 347…
$ year              <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991…
$ age_num           <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, …
$ bmi               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num         <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
$ age_parse         <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, …
$ ever_been_bullied <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ variable_name     <chr> "sex", "age", "grade", "age_and_grade", "race4", "ra…
$ value             <chr> "Male", "17 years old", "12th", "17 years old;12th",…

yrbss_long %>% tabyl(variable_name)

      variable_name     n    percent
                age 12000 0.07142857
      age_and_grade 12000 0.07142857
   age_first_smoked 12000 0.07142857
     bicycle_helmet 12000 0.07142857
       cigarettes10 12000 0.07142857
    cigarettes10fac 12000 0.07142857
 cigarettes_per_day 12000 0.07142857
        days_smoked 12000 0.07142857
        ever_smoked 12000 0.07142857
              grade 12000 0.07142857
              race4 12000 0.07142857
              race7 12000 0.07142857
                sex 12000 0.07142857
       text_driving 12000 0.07142857

This may seem a strange thing to do since we now have duplicated numeric data. But that’s ok.

10.9 Create many barplots with long data

What if we just make a simple bar plot with these data?

ggplot(yrbss_long, 
       aes(x = value)) +
  geom_bar()

Oops, that’s not what we want! We want each variable to have it’s own separate barplot, so we facet by the column variable_name.

ggplot(yrbss_long, 
       aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(variable_name))

Whoa. What’s going on here?
It’s hard to see, but for every barplot, every possible “value” is on the x-axis.
- But for sex, for example, there are only two options; for example we probably wouldn’t choose “17 years old” as our sex.
So we want to change the scale of the facet to only fit the data.

ggplot(yrbss_long, aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(variable_name), 
             scales = "free_x")

The labels are still a bit cramped.
We could rotate the x axis labels perhaps or use a method to dodge them to see them better,
- but I actually would probably flip the axes, so we can use y = value instead of x = value.

ggplot(yrbss_long, 
       aes(y = value)) +
  geom_bar() +
  facet_wrap(vars(variable_name), 
             scales = "free_y",
             ncol = 3) +
  labs(y= "", 
       x= "", 
       title = "Barplots of all categorical variables in YRBSS") +
  theme_bw()

Remove missing values:

ggplot(
  # we can use tidyverse inside the data argument
  data = yrbss_long %>% filter(!is.na(value)), 
       aes(y = value)) +
  geom_bar() +
  facet_wrap(vars(variable_name), 
             scales = "free_y",
             ncol = 3) +
  labs(y= "", 
       x= "", 
       title = "Barplots of all categorical variables in YRBSS",
       subtitle = "Missing values removed") +
  theme_bw()

Ok, let’s get even more advanced. What if we wanted to show this barplot, but comparing male and female?

Let’s re-make our long data first.

yrbss_long <- yrbss_all %>%
  mutate(across(where(is.factor), as.character)) %>%
  pivot_longer(
    cols = c(where(is.character), -sex), # all our character vectors EXCEPT sex
    names_to = "variable_name", # column names are moved to a column = variable_name
    values_to = "value" # the data points are in a column = value
  )

# check: 
glimpse(yrbss_long)

Rows: 156,000
Columns: 11
$ record_id         <dbl> 34784, 34784, 34784, 34784, 34784, 34784, 34784, 347…
$ year              <dbl> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991…
$ sex               <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
$ age_num           <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, …
$ bmi               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ weight_kg         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ grade_num         <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
$ age_parse         <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, …
$ ever_been_bullied <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ variable_name     <chr> "age", "grade", "age_and_grade", "race4", "race7", "…
$ value             <chr> "17 years old", "12th", "17 years old;12th", "White"…

yrbss_long %>% tabyl(variable_name)

      variable_name     n    percent
                age 12000 0.07692308
      age_and_grade 12000 0.07692308
   age_first_smoked 12000 0.07692308
     bicycle_helmet 12000 0.07692308
       cigarettes10 12000 0.07692308
    cigarettes10fac 12000 0.07692308
 cigarettes_per_day 12000 0.07692308
        days_smoked 12000 0.07692308
        ever_smoked 12000 0.07692308
              grade 12000 0.07692308
              race4 12000 0.07692308
              race7 12000 0.07692308
       text_driving 12000 0.07692308

Now create the facetted bargraphs stratified by sex:

ggplot(
  # remove NA
  data = yrbss_long %>% filter(!is.na(value)), 
       aes(y = value,
           # use sex to fill colors
           fill = sex)) +
  # default is stacked bars, so to show side by side include position = "dodge"
  geom_bar(position = "dodge") + 
  facet_wrap(vars(variable_name), 
             scales = "free_y",
             ncol = 3) +
  labs(y= "", 
       x= "", 
       title = "Barplots of all categorical variables in YRBSS by sex",
       subtitle = "Missing values removed") +
  theme_bw()

(A reminder that this is a subsample of survey data and we are ignoring all the sampling weights, including the way I resampled it, so we can’t learn much from a comparison like this without actually doing the real stats work!)

10.10 `pivot_wider()` exists too!

… and part of your assignment will be to figure out how to use it.

11 Post Class Survey

Please fill out the post-class survey.

Your responses are anonymous in that I separate your names from the survey answers before compiling/reading.

You may want to review previous years’ feedback here.

12 Acknowledgements

Part 5 is based on the BSTA 505 Winter 2023 course, taught by Jessica Minnier.
- I made modifications to update the material from RMarkdown to Quarto, and streamlined/edited content for slides.
- Also made updates for new R functions/ specifications, such as separate_wider_delim()
Minnier’s Acknowledgements:
- Originally written by Jessica Minnier, Aaron Coyner and Ted Laderas.
- The example data using YRBSS surveys were partially created when developing the OHSU BERD workshop on data wrangling with Meike Niederhausen.