Day 3 part 3 code: Data wrangling

Author

Meike Niederhausen

Published

Invalid Date

library(tidyverse)
library(janitor)

Slides pdf

Link to pdf of slides

Goals for today

More data wrangling examples and tools

  • Subsetting data
    • filtering rows
    • selecting columns
  • More wrangling for columns:
    • relocate columns
    • rename columns
  • Creating new variables
    • mutate
  • Reshaping data
    • wide vs long data
      • make wide data long
      • make long data wide

Check out the data transformation cheatsheet for many more data wrangling commands!

Case study: discrimination in developmental disability support (1.7.1)

  • In the US, individuals with developmental disabilities typically receive services and support from state governments.

    • California allocates funds to developmentally disabled residents through the Department of Developmental Services (DDS)
    • Recipients of DDS funds are referred to as “consumers.”
  • Dataset dds.discr

    • sample of 1,000 DDS consumers (out of a total of ~ 250,000)
    • age, gender, race/ethnicity, and DDS annual financial support per consumer
  • Previous research

    • Researchers examined expenditures on consumers by ethnicity
    • Found that the mean annual expenditures on Hispanics was less than that on White non-Hispanics.
  • Result: an allegation of ethnic discrimination was brought against the California DDS.

  • Question: Are the data sufficient evidence of ethnic discrimination?

  • See Section 1.7.1 for more details

Load dataset dds.discr from package oibiostat

  • The textbook’s datasets are in the R package oibiostat

  • If you haven’t already installed the package oibiostat, then first do so using directions in previous slide.

  • Load the oibiostat package and the dataset dds.discr

    • the code below needs to be run every time you restart R or knit an Rmd file
library(oibiostat)
data("dds.discr")
  • After loading the dataset dds.discr using data("dds.discr"), you will see dds.discr in the Data list of the Environment window.

Getting to know the dataset

dim(dds.discr)
[1] 1000    6
names(dds.discr)
[1] "id"           "age.cohort"   "age"          "gender"       "expenditures"
[6] "ethnicity"   
length(unique(dds.discr$id)) # How many unique id's are there?
[1] 1000

Subsetting data

filter() rows that satisfy specified conditions

filter() to select rows

filter data based on rows

  • math: >, <, >=, <=

  • double = for “is equal to”: ==

  • != (not equal)

  • & (and)

  • | (or)

  • is.na() to filter based on missing values

  • %in% to filter based on group membership

  • ! in front negates the statement, as in

    • !is.na(age)
    • !(ethnicity %in% c("Asian","Black"))
# Note: the output from the command below is not being saved 
# since it's not being assigned to a variable using <-
dds.discr %>% filter(age > 90)   
# A tibble: 4 × 6
     id age.cohort   age gender expenditures ethnicity
  <int> <fct>      <int> <fct>         <int> <fct>    
1 19250 51+           94 Female        60871 Hispanic 
2 46726 51+           95 Male          55187 Hispanic 
3 55056 51+           95 Female        54680 Black    
4 87737 51+           91 Male          54481 Asian    

filter() practice

What do these commands do? Try them out:

dds.discr %>% filter(age < 5)
dds.discr %>% filter(age/expenditures < 0.5)    # can do math within filter command
dds.discr %>% filter((age < 15) | (age > 50))

# simultaneously filter on multiple variables
dds.discr %>% filter(age < 20, expenditures > 1000, gender == "Male") 

dds.discr %>% filter(id == 10210) # note the use of == instead of just =
dds.discr %>% filter(gender == "Female")
dds.discr %>% filter(!(age.cohort == "51+"))
dds.discr %>% filter(age.cohort %in% c("0-5", "6-12"))

dds.discr %>% filter(is.na(age))
dds.discr %>% filter(!is.na(age))

Subset by columns

select() to choose columns

  • select columns (variables)
  • no quotes needed around variable names
  • can be used to rearrange columns
  • uses special syntax that is flexible and has many options
dds.discr %>% select(id, expenditures, ethnicity)
# A tibble: 1,000 × 3
      id expenditures ethnicity         
   <int>        <int> <fct>             
 1 10210         2113 White not Hispanic
 2 10409        41924 White not Hispanic
 3 10486         1454 Hispanic          
 4 10538         6400 Hispanic          
 5 10568         4412 White not Hispanic
 6 10690         4566 Hispanic          
 7 10711         3915 White not Hispanic
 8 10778         3873 Black             
 9 10820         5021 White not Hispanic
10 10823         2887 Hispanic          
# ℹ 990 more rows

Column selection syntax options

There are many ways to select a set of variable names (columns):

  • var1:var20: all columns from var1 to var20

  • one_of(c("a", "b", "c")): all columns with names in the specified character vector of names

  • Removing columns

    • -var1: remove the columnvar1
    • -(var1:var20): remove all columns from var1 to var20
  • Select by specifying text within column names

    • contains("date"), contains("_"): all variable names that contain the specified string of characters
    • starts_with("a") or ends_with("last"): all variable names that start or end with the specified string

See other examples in the data transformation cheatsheet.

select() practice

Which columns are selected & in what order using these commands?
First guess and then try them out.

dds.discr %>% select(id:gender)
dds.discr %>% select(one_of(c("age","expenditures", "notindata")))

dds.discr %>% select(-age.cohort,-gender)
dds.discr %>% select(-(id:gender))

dds.discr %>% select(contains("age"))
dds.discr %>% select(starts_with("a"))
dds.discr %>% select(-contains("a"))

relocate() columns to move them around

relocate() to change order of columns

  • change the order of columns in dataset
  • specified column names get put first,
    • and unspecified column names after that in original order
  • no quotes needed around variable names
  • similar options as with select(),
    • plus special ones such as .before and .after
dds.discr %>% relocate(age.cohort, ethnicity)
# A tibble: 1,000 × 6
   age.cohort ethnicity             id   age gender expenditures
   <fct>      <fct>              <int> <int> <fct>         <int>
 1 13-17      White not Hispanic 10210    17 Female         2113
 2 22-50      White not Hispanic 10409    37 Male          41924
 3 0-5        Hispanic           10486     3 Male           1454
 4 18-21      Hispanic           10538    19 Female         6400
 5 13-17      White not Hispanic 10568    13 Male           4412
 6 13-17      Hispanic           10690    15 Female         4566
 7 13-17      White not Hispanic 10711    13 Female         3915
 8 13-17      Black              10778    17 Male           3873
 9 13-17      White not Hispanic 10820    14 Female         5021
10 13-17      Hispanic           10823    13 Male           2887
# ℹ 990 more rows

relocate() practice

What order are the columns in using these commands?
First guess and then try them out.

dds.discr %>% relocate(age:ethnicity)

dds.discr %>% relocate(where(is.numeric))
dds.discr %>% relocate(where(is.factor))
# note: the next command doesn't do anything 
# since there are no character type variables in the dataset
dds.discr %>% relocate(where(is.character))

dds.discr %>% relocate(age,.before = ethnicity)
dds.discr %>% relocate(ethnicity, .after = age.cohort)
dds.discr %>% relocate(age, .after = last_col())

rename() columns

  • renames column variables
dds.discr %>% rename(IDnumber = id)   # order: new_name = old_name
# A tibble: 1,000 × 6
   IDnumber age.cohort   age gender expenditures ethnicity         
      <int> <fct>      <int> <fct>         <int> <fct>             
 1    10210 13-17         17 Female         2113 White not Hispanic
 2    10409 22-50         37 Male          41924 White not Hispanic
 3    10486 0-5            3 Male           1454 Hispanic          
 4    10538 18-21         19 Female         6400 Hispanic          
 5    10568 13-17         13 Male           4412 White not Hispanic
 6    10690 13-17         15 Female         4566 Hispanic          
 7    10711 13-17         13 Female         3915 White not Hispanic
 8    10778 13-17         17 Male           3873 Black             
 9    10820 13-17         14 Female         5021 White not Hispanic
10    10823 13-17         13 Male           2887 Hispanic          
# ℹ 990 more rows

Make new variables

mutate()

Use mutate() to add new columns to a tibble * Many options in how to define new column of data

# use = to define new a variable within mutate (not <- or ==) 
newdata <- dds.discr %>% 
  mutate(
    log_expenditures = log(expenditures),
    expend_per_yearage = expenditures / age)   


newdata %>% select(id, age, expenditures, log_expenditures, expend_per_yearage)
# A tibble: 1,000 × 5
      id   age expenditures log_expenditures expend_per_yearage
   <int> <int>        <int>            <dbl>              <dbl>
 1 10210    17         2113             7.66               124.
 2 10409    37        41924            10.6               1133.
 3 10486     3         1454             7.28               485.
 4 10538    19         6400             8.76               337.
 5 10568    13         4412             8.39               339.
 6 10690    15         4566             8.43               304.
 7 10711    13         3915             8.27               301.
 8 10778    17         3873             8.26               228.
 9 10820    14         5021             8.52               359.
10 10823    13         2887             7.97               222.
# ℹ 990 more rows

mutate() practice

What do the following commands do?
First guess and then try them out.

dds.discr %>% mutate(age_young = (age < 18))

dds.discr %>% mutate(male = (gender == "Male"))
dds.discr %>% mutate(male = 1 * (gender == "Male"))

case_when() to create multi-valued variables

  • Example: create age groups based off of the age variable
dds.discr2 <- dds.discr %>%
  mutate(
    age_group = case_when(
      age < 6 ~ "0 to 5",               # condition ~ new_value
      age >= 6 & age < 13 ~ "6 to 12",
      age >= 13 & age < 18 ~ "13 to 17",
      age >= 18 & age < 22 ~ "18 to 21",
      age >= 22 & age < 51 ~ "22 to 50",
      age >= 51 ~ "51 plus")
    )

dds.discr2 %>% select(age, age_group) %>% head()
# A tibble: 6 × 2
    age age_group
  <int> <chr>    
1    17 13 to 17 
2    37 22 to 50 
3     3 0 to 5   
4    19 18 to 21 
5    13 13 to 17 
6    15 13 to 17 
dds.discr2 %>% tabyl(age.cohort, age_group)
 age.cohort 0 to 5 13 to 17 18 to 21 22 to 50 51 plus 6 to 12
        0-5     82        0        0        0       0       0
       6-12      0        0        0        0       0     175
      13-17      0      212        0        0       0       0
      18-21      0        0      199        0       0       0
      22-50      0        0        0      226       0       0
        51+      0        0        0        0     106       0

arrange() to order rows

  • Use arrange() to order the rows by the values in specified columns
dds.discr %>% arrange(age, expenditures) %>% head(n=3)
# A tibble: 3 × 6
     id age.cohort   age gender expenditures ethnicity         
  <int> <fct>      <int> <fct>         <int> <fct>             
1 39131 0-5            0 Female          685 White not Hispanic
2 25613 0-5            0 Male            741 Hispanic          
3 19917 0-5            0 Male            904 White not Hispanic
dds.discr %>% arrange(desc(age), expenditures) %>% head(n=3)
# A tibble: 3 × 6
     id age.cohort   age gender expenditures ethnicity
  <int> <fct>      <int> <fct>         <int> <fct>    
1 55056 51+           95 Female        54680 Black    
2 46726 51+           95 Male          55187 Hispanic 
3 19250 51+           94 Female        60871 Hispanic 

Reshaping data

Wide vs. long data

  • Wide data has one row per individual,
    • with multiple columns for their repeated measurements
  • Long data has multiple rows per individual,
    • with one column for the measurement variable and
    • another indicating from when/where the repeated measures are from

DDS example

Mean expenditures by ethnicity and age cohort (from Day 3 slides)

Example wide toy dataset

Copy and paste the code below into R to create this example dataset

SBP_wide <- tibble(id = letters[1:4],
                     sex = c("F", "M", "M", "F"),
                     SBP_v1 = c(130, 120, 130, 119),
                     SBP_v2 = c(110, 116, 136, 106),
                     SBP_v3 = c(112, 122, 138, 118))
SBP_wide
# A tibble: 4 × 5
  id    sex   SBP_v1 SBP_v2 SBP_v3
  <chr> <chr>  <dbl>  <dbl>  <dbl>
1 a     F        130    110    112
2 b     M        120    116    122
3 c     M        130    136    138
4 d     F        119    106    118
  • What do you think the data in the table are measures of?
  • How can we tell the data are wide?

Wide to long: pivot_longer()

SBP_wide
# A tibble: 4 × 5
  id    sex   SBP_v1 SBP_v2 SBP_v3
  <chr> <chr>  <dbl>  <dbl>  <dbl>
1 a     F        130    110    112
2 b     M        120    116    122
3 c     M        130    136    138
4 d     F        119    106    118

For pivot_longer we need to specify: - cols: which columns to make long - names_to: the name of the variable that will be created from the data stored in the column names - values_to: the name of the variable that will be created from the data stored in the cell values

SBP_long <- SBP_wide %>% 
  pivot_longer(
    cols=c(SBP_v1,SBP_v2,SBP_v3), 
    names_to = "visit", 
    values_to = "SBP")
SBP_long
# A tibble: 12 × 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 a     F     SBP_v2   110
 3 a     F     SBP_v3   112
 4 b     M     SBP_v1   120
 5 b     M     SBP_v2   116
 6 b     M     SBP_v3   122
 7 c     M     SBP_v1   130
 8 c     M     SBP_v2   136
 9 c     M     SBP_v3   138
10 d     F     SBP_v1   119
11 d     F     SBP_v2   106
12 d     F     SBP_v3   118

Long to wide: pivot_wider()

SBP_long
# A tibble: 12 × 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 a     F     SBP_v2   110
 3 a     F     SBP_v3   112
 4 b     M     SBP_v1   120
 5 b     M     SBP_v2   116
 6 b     M     SBP_v3   122
 7 c     M     SBP_v1   130
 8 c     M     SBP_v2   136
 9 c     M     SBP_v3   138
10 d     F     SBP_v1   119
11 d     F     SBP_v2   106
12 d     F     SBP_v3   118

For pivot_wider we need to specify: - names_from: which column contains the names for the new columns - values_from: which column contains the values that will fill in the cell values

SBP_wide2 <- SBP_long %>% 
  pivot_wider(names_from = "visit", 
              values_from = "SBP")
SBP_wide2
# A tibble: 4 × 5
  id    sex   SBP_v1 SBP_v2 SBP_v3
  <chr> <chr>  <dbl>  <dbl>  <dbl>
1 a     F        130    110    112
2 b     M        120    116    122
3 c     M        130    136    138
4 d     F        119    106    118

Clean up visit column in the long data (1/3)

SBP_long
# A tibble: 12 × 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 a     F     SBP_v2   110
 3 a     F     SBP_v3   112
 4 b     M     SBP_v1   120
 5 b     M     SBP_v2   116
 6 b     M     SBP_v3   122
 7 c     M     SBP_v1   130
 8 c     M     SBP_v2   136
 9 c     M     SBP_v3   138
10 d     F     SBP_v1   119
11 d     F     SBP_v2   106
12 d     F     SBP_v3   118

Goal: remove the string “SBP_” from the visit variable’s values.

Method #1: tidy the visit column after making the data long

Method #2: tidy the visit column while making the data long

Clean up visit column in the long data (2/3)

Method #1: tidy the visit column after making the data long

SBP_long2 <- SBP_long %>% 
  mutate(
    visit = str_replace(
      visit,
      pattern = "SBP_",
      replacement = "")
    ) 
SBP_long2
# A tibble: 12 × 4
   id    sex   visit   SBP
   <chr> <chr> <chr> <dbl>
 1 a     F     v1      130
 2 a     F     v2      110
 3 a     F     v3      112
 4 b     M     v1      120
 5 b     M     v2      116
 6 b     M     v3      122
 7 c     M     v1      130
 8 c     M     v2      136
 9 c     M     v3      138
10 d     F     v1      119
11 d     F     v2      106
12 d     F     v3      118
  1. Note that mutate is replacing the existing visit column with new values
  2. If I wanted to keep the original visit column instead of overwriting it, I would call the new column something else, such as visit_clean
  3. Within str_replace(), double quotes need to be used around the characters specifying 1) the string of text to replace ("SBP_") and 2) what to replace the text with, where "" is used for no text
  4. Could instead use str_remove(): mutate(visit = str_remove(visit,"SBP_"))

Clean up visit column in the long data (3/3)

Method #2: tidy the visit column while making the data long

SBP_long3 <- SBP_wide %>% 
 pivot_longer(cols = c(SBP_v1, SBP_v2,SBP_v3), 
              names_to = "visit", 
              names_prefix = "SBP_", #<<
              values_to = "SBP")
SBP_long3
# A tibble: 12 × 4
   id    sex   visit   SBP
   <chr> <chr> <chr> <dbl>
 1 a     F     v1      130
 2 a     F     v2      110
 3 a     F     v3      112
 4 b     M     v1      120
 5 b     M     v2      116
 6 b     M     v3      122
 7 c     M     v1      130
 8 c     M     v2      136
 9 c     M     v3      138
10 d     F     v1      119
11 d     F     v2      106
12 d     F     v3      118

Remarks:

  1. Note the new parameter names_prefix specifying what the prefix is that needs to be stripped.
  2. More complex pivot_longer() examples are shown at https://tidyr.tidyverse.org/articles/pivot.html
    (the end of the url is articles/pivot.html)

Specifying cols in pivot_longer

In the example creating SBP_long, the columns to make the tibble longer by were explicitly listed using cols=c(SBP_v1,SBP_v2,SBP_v3).

SBP_long <- SBP_wide %>% 
 pivot_longer(cols = c(SBP_v1, SBP_v2,SBP_v3), #<<
              names_to = "visit", 
              values_to = "SBP")
SBP_long
# A tibble: 12 × 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 a     F     SBP_v2   110
 3 a     F     SBP_v3   112
 4 b     M     SBP_v1   120
 5 b     M     SBP_v2   116
 6 b     M     SBP_v3   122
 7 c     M     SBP_v1   130
 8 c     M     SBP_v2   136
 9 c     M     SBP_v3   138
10 d     F     SBP_v1   119
11 d     F     SBP_v2   106
12 d     F     SBP_v3   118

However, we can specify the columns in many different ways, just like with select():

  • cols = c(SBP_v1:SBP_v3)
  • cols = c(-id, -sex)
  • cols = starts_wth("SBP")
  • cols = contains("SBP")

Notes on pivot_*() commands

  • pivot_longer() and pivot_wider() are relatively new commands

  • previously we used gather() and spread(),

    • which have different function parameters and are less intuitive to use
  • if you search for help making data longer or wider,

    • you might still see references for gather() and spread()
  • see my workshop slides for gather() and spread() usage

  • see https://tidyr.tidyverse.org/articles/pivot.html for more pivot examples

Summary of data wrangling commands

command purpose
%>% join (pipe) together commands
filter() subset rows
arrange() sort rows
select() select or rearrange columns
rename() columns
mutate() create new columns
tabyl() summarize categorical data
summarize() summarize data; for both categorical and numerical data
group_by() group data by a categorical variable
pivot_longer make a wide dataset long;
pivot_wider to make a long dataset wide