library(tidyverse)
library(janitor)
Day 3 part 3 code: Data wrangling
Slides pdf
Link to pdf of slides
Goals for today
More data wrangling examples and tools
- Subsetting data
filter
ing rowsselect
ing columns
- More wrangling for columns:
relocate
columnsrename
columns
- Creating new variables
mutate
- Reshaping data
- wide vs long data
- make wide data long
- make long data wide
- wide vs long data
Check out the data transformation cheatsheet for many more data wrangling commands!
Case study: discrimination in developmental disability support (1.7.1)
In the US, individuals with developmental disabilities typically receive services and support from state governments.
- California allocates funds to developmentally disabled residents through the Department of Developmental Services (DDS)
- Recipients of DDS funds are referred to as “consumers.”
Dataset
dds.discr
- sample of 1,000 DDS consumers (out of a total of ~ 250,000)
- age, gender, race/ethnicity, and DDS annual financial support per consumer
Previous research
- Researchers examined expenditures on consumers by ethnicity
- Found that the mean annual expenditures on Hispanics was less than that on White non-Hispanics.
Result: an allegation of ethnic discrimination was brought against the California DDS.
Question: Are the data sufficient evidence of ethnic discrimination?
See Section 1.7.1 for more details
Load dataset dds.discr
from package oibiostat
The textbook’s datasets are in the R package
oibiostat
If you haven’t already installed the package
oibiostat
, then first do so using directions in previous slide.Load the
oibiostat
package and the datasetdds.discr
- the code below needs to be run every time you restart R or knit an Rmd file
library(oibiostat)
data("dds.discr")
- After loading the dataset
dds.discr
usingdata("dds.discr")
, you will seedds.discr
in the Data list of the Environment window.
Getting to know the dataset
dim(dds.discr)
[1] 1000 6
names(dds.discr)
[1] "id" "age.cohort" "age" "gender" "expenditures"
[6] "ethnicity"
length(unique(dds.discr$id)) # How many unique id's are there?
[1] 1000
Subsetting data
filter()
rows that satisfy specified conditions
filter()
to select rows
filter data based on rows
math:
>
,<
,>=
,<=
double = for “is equal to”:
==
!= (not equal)
&
(and)|
(or)is.na()
to filter based on missing values%in%
to filter based on group membership!
in front negates the statement, as in!is.na(age)
!(ethnicity %in% c("Asian","Black"))
# Note: the output from the command below is not being saved
# since it's not being assigned to a variable using <-
%>% filter(age > 90) dds.discr
# A tibble: 4 × 6
id age.cohort age gender expenditures ethnicity
<int> <fct> <int> <fct> <int> <fct>
1 19250 51+ 94 Female 60871 Hispanic
2 46726 51+ 95 Male 55187 Hispanic
3 55056 51+ 95 Female 54680 Black
4 87737 51+ 91 Male 54481 Asian
filter()
practice
What do these commands do? Try them out:
%>% filter(age < 5)
dds.discr %>% filter(age/expenditures < 0.5) # can do math within filter command
dds.discr %>% filter((age < 15) | (age > 50))
dds.discr
# simultaneously filter on multiple variables
%>% filter(age < 20, expenditures > 1000, gender == "Male")
dds.discr
%>% filter(id == 10210) # note the use of == instead of just =
dds.discr %>% filter(gender == "Female")
dds.discr %>% filter(!(age.cohort == "51+"))
dds.discr %>% filter(age.cohort %in% c("0-5", "6-12"))
dds.discr
%>% filter(is.na(age))
dds.discr %>% filter(!is.na(age)) dds.discr
Subset by columns
select()
to choose columns
- select columns (variables)
- no quotes needed around variable names
- can be used to rearrange columns
- uses special syntax that is flexible and has many options
%>% select(id, expenditures, ethnicity) dds.discr
# A tibble: 1,000 × 3
id expenditures ethnicity
<int> <int> <fct>
1 10210 2113 White not Hispanic
2 10409 41924 White not Hispanic
3 10486 1454 Hispanic
4 10538 6400 Hispanic
5 10568 4412 White not Hispanic
6 10690 4566 Hispanic
7 10711 3915 White not Hispanic
8 10778 3873 Black
9 10820 5021 White not Hispanic
10 10823 2887 Hispanic
# ℹ 990 more rows
Column selection syntax options
There are many ways to select a set of variable names (columns):
var1:var20
: all columns fromvar1
tovar20
one_of(c("a", "b", "c"))
: all columns with names in the specified character vector of namesRemoving columns
-var1
: remove the columnvar1
-(var1:var20)
: remove all columns fromvar1
tovar20
Select by specifying text within column names
contains("date")
,contains("_")
: all variable names that contain the specified string of charactersstarts_with("a")
orends_with("last")
: all variable names that start or end with the specified string
See other examples in the data transformation cheatsheet.
select()
practice
Which columns are selected & in what order using these commands?
First guess and then try them out.
%>% select(id:gender)
dds.discr %>% select(one_of(c("age","expenditures", "notindata")))
dds.discr
%>% select(-age.cohort,-gender)
dds.discr %>% select(-(id:gender))
dds.discr
%>% select(contains("age"))
dds.discr %>% select(starts_with("a"))
dds.discr %>% select(-contains("a")) dds.discr
relocate()
columns to move them around
relocate()
to change order of columns
- change the order of columns in dataset
- specified column names get put first,
- and unspecified column names after that in original order
- no quotes needed around variable names
- similar options as with
select()
,- plus special ones such as
.before
and.after
- plus special ones such as
%>% relocate(age.cohort, ethnicity) dds.discr
# A tibble: 1,000 × 6
age.cohort ethnicity id age gender expenditures
<fct> <fct> <int> <int> <fct> <int>
1 13-17 White not Hispanic 10210 17 Female 2113
2 22-50 White not Hispanic 10409 37 Male 41924
3 0-5 Hispanic 10486 3 Male 1454
4 18-21 Hispanic 10538 19 Female 6400
5 13-17 White not Hispanic 10568 13 Male 4412
6 13-17 Hispanic 10690 15 Female 4566
7 13-17 White not Hispanic 10711 13 Female 3915
8 13-17 Black 10778 17 Male 3873
9 13-17 White not Hispanic 10820 14 Female 5021
10 13-17 Hispanic 10823 13 Male 2887
# ℹ 990 more rows
relocate()
practice
What order are the columns in using these commands?
First guess and then try them out.
%>% relocate(age:ethnicity)
dds.discr
%>% relocate(where(is.numeric))
dds.discr %>% relocate(where(is.factor))
dds.discr # note: the next command doesn't do anything
# since there are no character type variables in the dataset
%>% relocate(where(is.character))
dds.discr
%>% relocate(age,.before = ethnicity)
dds.discr %>% relocate(ethnicity, .after = age.cohort)
dds.discr %>% relocate(age, .after = last_col()) dds.discr
rename()
columns
- renames column variables
%>% rename(IDnumber = id) # order: new_name = old_name dds.discr
# A tibble: 1,000 × 6
IDnumber age.cohort age gender expenditures ethnicity
<int> <fct> <int> <fct> <int> <fct>
1 10210 13-17 17 Female 2113 White not Hispanic
2 10409 22-50 37 Male 41924 White not Hispanic
3 10486 0-5 3 Male 1454 Hispanic
4 10538 18-21 19 Female 6400 Hispanic
5 10568 13-17 13 Male 4412 White not Hispanic
6 10690 13-17 15 Female 4566 Hispanic
7 10711 13-17 13 Female 3915 White not Hispanic
8 10778 13-17 17 Male 3873 Black
9 10820 13-17 14 Female 5021 White not Hispanic
10 10823 13-17 13 Male 2887 Hispanic
# ℹ 990 more rows
Make new variables
mutate()
Use mutate()
to add new columns to a tibble * Many options in how to define new column of data
# use = to define new a variable within mutate (not <- or ==)
<- dds.discr %>%
newdata mutate(
log_expenditures = log(expenditures),
expend_per_yearage = expenditures / age)
%>% select(id, age, expenditures, log_expenditures, expend_per_yearage) newdata
# A tibble: 1,000 × 5
id age expenditures log_expenditures expend_per_yearage
<int> <int> <int> <dbl> <dbl>
1 10210 17 2113 7.66 124.
2 10409 37 41924 10.6 1133.
3 10486 3 1454 7.28 485.
4 10538 19 6400 8.76 337.
5 10568 13 4412 8.39 339.
6 10690 15 4566 8.43 304.
7 10711 13 3915 8.27 301.
8 10778 17 3873 8.26 228.
9 10820 14 5021 8.52 359.
10 10823 13 2887 7.97 222.
# ℹ 990 more rows
mutate()
practice
What do the following commands do?
First guess and then try them out.
%>% mutate(age_young = (age < 18))
dds.discr
%>% mutate(male = (gender == "Male"))
dds.discr %>% mutate(male = 1 * (gender == "Male")) dds.discr
case_when()
to create multi-valued variables
- Example: create age groups based off of the
age
variable
<- dds.discr %>%
dds.discr2 mutate(
age_group = case_when(
< 6 ~ "0 to 5", # condition ~ new_value
age >= 6 & age < 13 ~ "6 to 12",
age >= 13 & age < 18 ~ "13 to 17",
age >= 18 & age < 22 ~ "18 to 21",
age >= 22 & age < 51 ~ "22 to 50",
age >= 51 ~ "51 plus")
age
)
%>% select(age, age_group) %>% head() dds.discr2
# A tibble: 6 × 2
age age_group
<int> <chr>
1 17 13 to 17
2 37 22 to 50
3 3 0 to 5
4 19 18 to 21
5 13 13 to 17
6 15 13 to 17
%>% tabyl(age.cohort, age_group) dds.discr2
age.cohort 0 to 5 13 to 17 18 to 21 22 to 50 51 plus 6 to 12
0-5 82 0 0 0 0 0
6-12 0 0 0 0 0 175
13-17 0 212 0 0 0 0
18-21 0 0 199 0 0 0
22-50 0 0 0 226 0 0
51+ 0 0 0 0 106 0
arrange()
to order rows
- Use
arrange()
to order the rows by the values in specified columns
%>% arrange(age, expenditures) %>% head(n=3) dds.discr
# A tibble: 3 × 6
id age.cohort age gender expenditures ethnicity
<int> <fct> <int> <fct> <int> <fct>
1 39131 0-5 0 Female 685 White not Hispanic
2 25613 0-5 0 Male 741 Hispanic
3 19917 0-5 0 Male 904 White not Hispanic
%>% arrange(desc(age), expenditures) %>% head(n=3) dds.discr
# A tibble: 3 × 6
id age.cohort age gender expenditures ethnicity
<int> <fct> <int> <fct> <int> <fct>
1 55056 51+ 95 Female 54680 Black
2 46726 51+ 95 Male 55187 Hispanic
3 19250 51+ 94 Female 60871 Hispanic
Reshaping data
Wide vs. long data
- Wide data has one row per individual,
- with multiple columns for their repeated measurements
- Long data has multiple rows per individual,
- with one column for the measurement variable and
- another indicating from when/where the repeated measures are from
DDS example
Mean expenditures by ethnicity and age cohort (from Day 3 slides)
Example wide toy dataset
Copy and paste the code below into R to create this example dataset
<- tibble(id = letters[1:4],
SBP_wide sex = c("F", "M", "M", "F"),
SBP_v1 = c(130, 120, 130, 119),
SBP_v2 = c(110, 116, 136, 106),
SBP_v3 = c(112, 122, 138, 118))
SBP_wide
# A tibble: 4 × 5
id sex SBP_v1 SBP_v2 SBP_v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a F 130 110 112
2 b M 120 116 122
3 c M 130 136 138
4 d F 119 106 118
- What do you think the data in the table are measures of?
- How can we tell the data are wide?
Wide to long: pivot_longer()
SBP_wide
# A tibble: 4 × 5
id sex SBP_v1 SBP_v2 SBP_v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a F 130 110 112
2 b M 120 116 122
3 c M 130 136 138
4 d F 119 106 118
For pivot_longer
we need to specify: - cols: which columns to make long - names_to: the name of the variable that will be created from the data stored in the column names - values_to: the name of the variable that will be created from the data stored in the cell values
<- SBP_wide %>%
SBP_long pivot_longer(
cols=c(SBP_v1,SBP_v2,SBP_v3),
names_to = "visit",
values_to = "SBP")
SBP_long
# A tibble: 12 × 4
id sex visit SBP
<chr> <chr> <chr> <dbl>
1 a F SBP_v1 130
2 a F SBP_v2 110
3 a F SBP_v3 112
4 b M SBP_v1 120
5 b M SBP_v2 116
6 b M SBP_v3 122
7 c M SBP_v1 130
8 c M SBP_v2 136
9 c M SBP_v3 138
10 d F SBP_v1 119
11 d F SBP_v2 106
12 d F SBP_v3 118
Long to wide: pivot_wider()
SBP_long
# A tibble: 12 × 4
id sex visit SBP
<chr> <chr> <chr> <dbl>
1 a F SBP_v1 130
2 a F SBP_v2 110
3 a F SBP_v3 112
4 b M SBP_v1 120
5 b M SBP_v2 116
6 b M SBP_v3 122
7 c M SBP_v1 130
8 c M SBP_v2 136
9 c M SBP_v3 138
10 d F SBP_v1 119
11 d F SBP_v2 106
12 d F SBP_v3 118
For pivot_wider
we need to specify: - names_from: which column contains the names for the new columns - values_from: which column contains the values that will fill in the cell values
<- SBP_long %>%
SBP_wide2 pivot_wider(names_from = "visit",
values_from = "SBP")
SBP_wide2
# A tibble: 4 × 5
id sex SBP_v1 SBP_v2 SBP_v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a F 130 110 112
2 b M 120 116 122
3 c M 130 136 138
4 d F 119 106 118
Clean up visit
column in the long data (1/3)
SBP_long
# A tibble: 12 × 4
id sex visit SBP
<chr> <chr> <chr> <dbl>
1 a F SBP_v1 130
2 a F SBP_v2 110
3 a F SBP_v3 112
4 b M SBP_v1 120
5 b M SBP_v2 116
6 b M SBP_v3 122
7 c M SBP_v1 130
8 c M SBP_v2 136
9 c M SBP_v3 138
10 d F SBP_v1 119
11 d F SBP_v2 106
12 d F SBP_v3 118
Goal: remove the string “SBP_
” from the visit
variable’s values.
Method #1: tidy the visit
column after making the data long
Method #2: tidy the visit
column while making the data long
Clean up visit
column in the long data (2/3)
Method #1: tidy the visit
column after making the data long
<- SBP_long %>%
SBP_long2 mutate(
visit = str_replace(
visit,pattern = "SBP_",
replacement = "")
) SBP_long2
# A tibble: 12 × 4
id sex visit SBP
<chr> <chr> <chr> <dbl>
1 a F v1 130
2 a F v2 110
3 a F v3 112
4 b M v1 120
5 b M v2 116
6 b M v3 122
7 c M v1 130
8 c M v2 136
9 c M v3 138
10 d F v1 119
11 d F v2 106
12 d F v3 118
- Note that
mutate
is replacing the existingvisit
column with new values - If I wanted to keep the original
visit
column instead of overwriting it, I would call the new column something else, such asvisit_clean
- Within
str_replace()
, double quotes need to be used around the characters specifying 1) the string of text to replace ("SBP_"
) and 2) what to replace the text with, where""
is used for no text - Could instead use
str_remove()
:mutate(visit = str_remove(visit,"SBP_"))
Clean up visit
column in the long data (3/3)
Method #2: tidy the visit
column while making the data long
<- SBP_wide %>%
SBP_long3 pivot_longer(cols = c(SBP_v1, SBP_v2,SBP_v3),
names_to = "visit",
names_prefix = "SBP_", #<<
values_to = "SBP")
SBP_long3
# A tibble: 12 × 4
id sex visit SBP
<chr> <chr> <chr> <dbl>
1 a F v1 130
2 a F v2 110
3 a F v3 112
4 b M v1 120
5 b M v2 116
6 b M v3 122
7 c M v1 130
8 c M v2 136
9 c M v3 138
10 d F v1 119
11 d F v2 106
12 d F v3 118
Remarks:
- Note the new parameter
names_prefix
specifying what the prefix is that needs to be stripped. - More complex
pivot_longer()
examples are shown at https://tidyr.tidyverse.org/articles/pivot.html
(the end of the url is articles/pivot.html)
Specifying cols
in pivot_longer
In the example creating SBP_long
, the columns to make the tibble longer by were explicitly listed using cols=c(SBP_v1,SBP_v2,SBP_v3)
.
<- SBP_wide %>%
SBP_long pivot_longer(cols = c(SBP_v1, SBP_v2,SBP_v3), #<<
names_to = "visit",
values_to = "SBP")
SBP_long
# A tibble: 12 × 4
id sex visit SBP
<chr> <chr> <chr> <dbl>
1 a F SBP_v1 130
2 a F SBP_v2 110
3 a F SBP_v3 112
4 b M SBP_v1 120
5 b M SBP_v2 116
6 b M SBP_v3 122
7 c M SBP_v1 130
8 c M SBP_v2 136
9 c M SBP_v3 138
10 d F SBP_v1 119
11 d F SBP_v2 106
12 d F SBP_v3 118
However, we can specify the columns in many different ways, just like with select()
:
cols = c(SBP_v1:SBP_v3)
cols = c(-id, -sex)
cols = starts_wth("SBP")
cols = contains("SBP")
Notes on pivot_*()
commands
pivot_longer()
andpivot_wider()
are relatively new commandspreviously we used
gather()
andspread()
,- which have different function parameters and are less intuitive to use
if you search for help making data longer or wider,
- you might still see references for
gather()
andspread()
- you might still see references for
see my workshop slides for
gather()
andspread()
usagesee https://tidyr.tidyverse.org/articles/pivot.html for more
pivot
examples
Summary of data wrangling commands
command | purpose |
---|---|
%>% |
join (pipe) together commands |
filter() |
subset rows |
arrange() |
sort rows |
select() |
select or rearrange columns |
rename() |
columns |
mutate() |
create new columns |
tabyl() |
summarize categorical data |
summarize() |
summarize data; for both categorical and numerical data |
group_by() |
group data by a categorical variable |
pivot_longer |
make a wide dataset long; |
pivot_wider |
to make a long dataset wide |