# data <- data |>
# mutate(timepoint = factor(timepoint,
# levels = c(1, 2, 3),
# labels = c(“1 month”,
# “6 months”,
# “12 months”)))
Week 7
Announcements
- HW 7 is in the part 6 folder
- HW 6: see the updated HW 6 assignment on OneDrive in the part 6 folder, called hw_06_b526-final-version.qmd
- Midterm: due date extended to 2/25/24.
- See the updated midterm file on OneDrive with new yaml, due date, and links to previous midterm projects.
Reminder to fill out post-class surveys
- This is a reminder that 5% of your grade is based on filling out post-class surveys as a way of telling us that you came to class and engaged with the material for that week.
- You only need to fill out 5 surveys (of the 10 class sessions) for the full 5%. We encourage you to fill out as many surveys as possible to provide feedback on the class though.
- Please fill out surveys by 8 pm on Sunday evenings to guarantee that they will be counted. We usually download them some time on Sunday evening or Monday. If you turn it in before we download the responses, it will get counted.
Topics
Part 6
- Practice working with real data
- Practice joining and pivoting
- Practice ggplot and learn more geometries
- Learn how to deal with missing data
Class materials
- Week 6 Readings
- OneDrive part_06 Project folders
Post-class survey
- Please fill out the post-class survey to provide feedback. Thank you!
Homework
- See OneDrive folder for homework assignment.
- HW 7 due on 2/28. Assignment is in the part 6 folder.
Recording
- In-class recording links are on Sakai. Navigate to Course Materials -> Schedule with links to in-class recordings. Note that the password to the recordings is at the top of the page.
Muddiest points from Week 7
- Week 7 feedback will be added during Week 6.
- See Week 6 page for Week 6 feedback.
Why we used full_join()
Why we used full_join() in the class example instead of the other join options
- In this case, both datasets being joined had the same IDs, so it did not matter whether we used left_join(), right_join(), full_join(), or inner_join(). All of these would have given the same results.
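To make this concrete, here is a minimal sketch with two small made-up tables (visits and labs are hypothetical, not the class datasets). It assumes only that dplyr is installed. When both tables share exactly the same IDs, all four joins return the same rows; once an ID is missing from one table, they start to differ.

```r
library(dplyr)

# Hypothetical toy data: both tables contain exactly the same IDs (1, 2, 3)
visits <- tibble(id = c(1, 2, 3), weight = c(20.1, 22.4, 19.8))
labs   <- tibble(id = c(3, 1, 2), bdnf = c(871.8, 492.5, 275.2))

joins <- list(
  left  = left_join(visits, labs, by = "id"),
  right = right_join(visits, labs, by = "id"),
  full  = full_join(visits, labs, by = "id"),
  inner = inner_join(visits, labs, by = "id")
)

# Sort each result by id so that row order cannot affect the comparison
joins <- lapply(joins, function(x) arrange(x, id))

# Compare every join result against the full_join() result
sapply(joins, function(x) isTRUE(all.equal(x, joins$full)))

# Now drop one ID from the second table: the joins no longer agree
labs_short <- filter(labs, id != 3)
nrow(inner_join(visits, labs_short, by = "id"))  # 2: only IDs 1 and 2 match
nrow(full_join(visits, labs_short, by = "id"))   # 3: ID 3 is kept, with bdnf = NA
```

In practice, full_join() is a safe default when you are not yet sure the IDs match, because it never silently drops rows.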
Visualizing pivots and joins
pivot-longer is still hard for me to mentally visualize how it alters the dataset.
I always struggle to visualize pivots and joins.
- Reshaping data takes lots of practice to get the hang of, and it is something where I still pause while coding to think through how it will work and how to code it. Especially for pivoting, I often refer back to existing code I am familiar with. It’s normal at this point to still be muddy on these topics. Keep practicing, though, and read through some more examples.
- In the week 6 muddiest points I listed some additional resources for visualizing these.
- See also Tidy Animated Verbs for visualizing joins and pivoting. The page also includes visualizing union(), intersect(), and setdiff().
- Another great resource is the R for Epidemiology website.
- Jessica also addressed pivoting in last year’s muddiest points.
Please come to office hours or set up a time to meet if this is still muddy after looking at these resources!
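As one more small example to practice with, the sketch below pivots a tiny made-up table (the values and the name wide are hypothetical, loosely modeled on the mouse data) from wide to long and back again. Printing wide, long, and wide_again side by side can help show what moves where.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per mouse, one column per time point
wide <- tibble(
  sid = c(137, 138),
  tp1 = c(3.52, 1.56),
  tp2 = c(19.81, 14.48)
)

# pivot_longer() stacks the tp1 and tp2 columns:
# the old column names go into `time`, the cell values into `learning_outcome`
long <- wide %>%
  pivot_longer(
    cols = c(tp1, tp2),
    names_to = "time",
    values_to = "learning_outcome"
  )
long  # 4 rows (2 mice x 2 time points) and 3 columns

# pivot_wider() reverses the operation: one column per value of `time`
wide_again <- long %>%
  pivot_wider(names_from = time, values_from = learning_outcome)
wide_again
```

A useful mental check: pivot_longer() makes the data taller and narrower, while pivot_wider() makes it shorter and wider.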
mutate(factor ( ))
mutate(factor ( )) problem we ran into in class where Emile posted on Slack.
Below is the code Emile posted on Slack (commented out):
- At this point we were working through the code of Section 2.8 in the Part 6 notes.
Load the mouse_data dataset we were working with:
library(tidyverse)
library(here)
library(janitor)
mouse_data <- read_csv(here("data", "mouse_data_longitudinal_clean.csv"))
glimpse(mouse_data)
Rows: 96
Columns: 18
$ sid <dbl> 137, 137, 137, 138, 138, 138, 139, …
$ strain <chr> "C3H", "C3H", "C3H", "C3H", "C3H", …
$ trt <chr> "-", "-", "-", "-", "-", "-", "-", …
$ sex <chr> "M", "M", "M", "M", "M", "M", "M", …
$ time <chr> "tp1", "tp2", "tp3", "tp1", "tp2", …
$ normalized_bdnf_amygdala_pg_mg <dbl> 492.4831, 275.1623, NA, 453.6635, 4…
$ normalized_bdnf_cortex_pg_mg <dbl> 720.0173, NA, 871.8286, 884.5668, N…
$ normalized_bdnf_hypothalamus_pg_mg <dbl> NA, 1169.2845, NA, 1215.8147, 1078.…
$ normalized_cd68_amygdala_pg_mg <dbl> 988.9628, 574.0655, NA, 775.5970, 4…
$ normalized_cd68_cortex_pg_mg <dbl> 8.393707, NA, NA, 7.901366, NA, 8.8…
$ normalized_cd68_hypothalamus_pg_mg <dbl> NA, 6800.870, NA, 4373.811, 4461.62…
$ normalized_map2_cortex_pg_mg <dbl> 352.9653, NA, 2693.9386, 1007.4147,…
$ mirna1 <dbl> 5.2630200, -0.0491371, -0.7367310, …
$ mirna2 <dbl> 1.6536200, -0.0773419, 0.1479940, -…
$ learning_outcome <dbl> 3.52, 19.81, 2.44, 1.56, 14.48, 1.1…
$ preference_obj1 <dbl> 41.72205, 37.51387, 55.96768, 74.11…
$ preference_obj2 <dbl> 58.27795, 62.48613, 44.03232, 25.88…
$ time_month <chr> "1 month", "6 months", "12 months",…
mouse_data %>% tabyl(time)
 time  n   percent
  tp1 32 0.3333333
  tp2 32 0.3333333
  tp3 32 0.3333333
- The goal was to create a factor version of the character time point column, time, with the levels 1 month, 6 months, and 12 months in place of time’s values tp1, tp2, and tp3.
- The code presented in class to accomplish this is below:
# create time_month factor
mouse_data <- mouse_data %>%
  mutate(time_month = case_when(
           time=="tp1" ~ "1 month",
           time=="tp2" ~ "6 months",
           time=="tp3" ~ "12 months"
         ),
         time_month = factor(time_month,
                             levels = c("1 month", "6 months", "12 months")))
- Compare the old and new time variables:
mouse_data %>% tabyl(time, time_month)
 time 1 month 6 months 12 months
  tp1      32        0         0
  tp2       0       32         0
  tp3       0        0        32
- The question arose as to whether we could include factor() in the same step as case_when() when creating time_month above, instead of having to write it out as a second separate line in the mutate().
- When using case_when(), we can do this by piping the result of the case_when() into factor():
mouse_data <- mouse_data %>%
  mutate(time_month2 = case_when(
    time=="tp1" ~ "1 month",
    time=="tp2" ~ "6 months",
    time=="tp3" ~ "12 months"
    ) %>% factor(., levels = c("1 month", "6 months", "12 months"))
  )
mouse_data %>% tabyl(time_month, time_month2)
 time_month 1 month 6 months 12 months
    1 month      32        0         0
   6 months       0       32         0
  12 months       0        0        32
- Another, similar option is to enclose the case_when() within the factor():
mouse_data <- mouse_data %>%
  mutate(time_month3 = factor(
    case_when(
      time=="tp1" ~ "1 month",
      time=="tp2" ~ "6 months",
      time=="tp3" ~ "12 months"
    ),
    levels = c("1 month", "6 months", "12 months")
  ))
mouse_data %>% tabyl(time_month, time_month3)
 time_month 1 month 6 months 12 months
    1 month      32        0         0
   6 months       0       32         0
  12 months       0        0        32
levels vs. labels
- Emile suggested using factor() on the time variable directly, and creating the new values using the labels option within factor():
mouse_data <- mouse_data %>%
  mutate(time_month4 = factor(time,
    levels = c("tp1", "tp2", "tp3"),
    labels = c("1 month", "6 months", "12 months")
  ))
mouse_data %>% tabyl(time_month, time_month4)
 time_month 1 month 6 months 12 months
    1 month      32        0         0
   6 months       0       32         0
  12 months       0        0        32
- What is new here is that we have not previously discussed labels.
- You can think of the levels as the input for the factor() function.
  - It’s how we specify what the different levels are for the variable we are converting to factor, as well as the order we want the levels to be in.
  - If we do not specify the levels, then R will automatically use the different values of the variable being converted and arrange them in alphanumeric order. Example:
mouse_data <- mouse_data %>%
  mutate(time_month5 = factor(time))
mouse_data %>% tabyl(time_month, time_month5)
 time_month tp1 tp2 tp3
    1 month  32   0   0
   6 months   0  32   0
  12 months   0   0  32
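The default ordering happens to work for tp1, tp2, tp3, but it is worth seeing a case where it does not. The sketch below uses a small made-up character vector (time_chr is hypothetical, not a column in mouse_data). Because the values are sorted as text, "12 months" lands before "6 months" unless we specify the levels ourselves.

```r
# Hypothetical character vector of time labels (not the course dataset)
time_chr <- c("1 month", "6 months", "12 months", "6 months")

# Without `levels`, factor() sorts the unique values as text
default_order <- factor(time_chr)
levels(default_order)   # "12 months" is placed before "6 months"

# With `levels`, we control the order ourselves
custom_order <- factor(time_chr,
                       levels = c("1 month", "6 months", "12 months"))
levels(custom_order)
```

This is why the class code always spelled out levels = c("1 month", "6 months", "12 months") when creating time_month.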
- While levels is an input for the factor() function, labels is an output for the factor() function.
- The values specified in labels are the new values for the levels:
# time_month4 added labels
# time_month5 did not add labels
mouse_data %>% tabyl(time_month4, time_month5)
 time_month4 tp1 tp2 tp3
     1 month  32   0   0
    6 months   0  32   0
   12 months   0   0  32
levels(mouse_data$time_month4)
[1] "1 month"   "6 months"  "12 months"
levels(mouse_data$time_month5)
[1] "tp1" "tp2" "tp3"
- Note that both time_month4 and time_month5 started with the same levels.
- Instead of using the labels option within factor() (the base R way), we can also accomplish this by using fct_recode() from the forcats package (loaded as a part of tidyverse):
# original tp levels:
levels(mouse_data$time_month5)
[1] "tp1" "tp2" "tp3"
mouse_data <- mouse_data %>%
  mutate(time_month6 = fct_recode(time_month5,
    # new_name = "old_name"
    "1 month" = "tp1",
    "6 months" = "tp2",
    "12 months" = "tp3"))
levels(mouse_data$time_month6)
[1] "1 month"   "6 months"  "12 months"
mouse_data %>% tabyl(time_month6, time_month5)
 time_month6 tp1 tp2 tp3
     1 month  32   0   0
    6 months   0  32   0
   12 months   0   0  32
- Learn more about fct_recode() here.
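One practical consequence of levels being the input to factor() is that the values in levels have to match the values actually present in the data. The sketch below shows this general behavior of factor() with a small made-up vector (time_chr is hypothetical). Any value that is not listed in levels is silently turned into NA.

```r
# Hypothetical character vector with the same coding as the `time` column
time_chr <- c("tp1", "tp2", "tp3", "tp1")

# Levels that match the data values: every value is kept and relabeled
ok <- factor(time_chr,
             levels = c("tp1", "tp2", "tp3"),
             labels = c("1 month", "6 months", "12 months"))
table(ok, useNA = "ifany")

# Levels that do not appear in the data (here 1, 2, 3): every value becomes NA
mismatch <- factor(time_chr,
                   levels = c(1, 2, 3),
                   labels = c("1 month", "6 months", "12 months"))
table(mismatch, useNA = "ifany")
sum(is.na(mismatch))  # 4
```

If a new factor column is unexpectedly all NA, checking the original values with tabyl() or unique() before writing levels is a good first step.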
%in%
%in% command, I feel like I understand but have some confusion and think it might just be one of those things I have to work with/apply to fully understand
We’ve used the %in% function in some examples, but I don’t think we’ve discussed it in detail.
The %in% function is used to test whether elements of one vector are contained in another vector. It returns a logical vector indicating whether each element of the first vector is found in the second vector.
Below are some examples that ChatGPT generated (and I slightly edited).
# Example 1: Using %in% with two numeric vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6)
x %in% y
[1] FALSE  TRUE FALSE  TRUE FALSE
# Example 2: Using %in% with two character vectors
fruits <- c("apple", "banana", "orange", "grape")
selected_fruits <- c("banana", "grape", "kiwi")
selected_fruits %in% fruits
[1]  TRUE  TRUE FALSE
# Example 3: Using %in% with dataframe columns
library(tidyverse)
# Create a dataframe
df <- tibble(
  ID = c(1, 2, 3, 4, 5),
  fruit = c("apple", "banana", "orange", "grape", "kiwi")
)
df
# A tibble: 5 × 2
     ID fruit
  <dbl> <chr>
1     1 apple
2     2 banana
3     3 orange
4     4 grape
5     5 kiwi
# Filter rows where 'fruit' column contains values from selected_fruits
selected_fruits <- c("banana", "grape", "kiwi")
df_filtered <- df %>%
  filter(fruit %in% selected_fruits)
df_filtered
# A tibble: 3 × 2
     ID fruit
  <dbl> <chr>
1     2 banana
2     4 grape
3     5 kiwi
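A common source of confusion is the difference between %in% and ==. The sketch below reuses the fruit idea from the examples above with a shorter, made-up selection vector. It uses base R only and shows that == compares element by element (recycling the shorter vector), while %in% checks membership in the whole set.

```r
fruits <- c("apple", "banana", "orange", "grape")
selected_fruits <- c("banana", "grape")

# %in% asks, for each fruit, "is it anywhere in selected_fruits?"
fruits %in% selected_fruits      # FALSE  TRUE FALSE  TRUE

# == compares position by position, recycling the shorter vector:
# apple vs banana, banana vs grape, orange vs banana, grape vs grape
fruits == selected_fruits        # FALSE FALSE FALSE  TRUE

# Negate %in% to keep everything that is NOT in the set
fruits[!(fruits %in% selected_fruits)]   # "apple" "orange"
```

So when filtering on more than one allowed value, %in% is usually what you want; == is for comparing against a single value.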
Clearest points
This class was all really clear. It was helpful to be reviewing some of the things we learned last week.
I appreciate the new codes on how to clean/reshape/combine messy data. I think that was the hardest parts to do in the other Biostatistics courses during projects.
Data cleaning
Most of the data cleaning exercises.
different strategies to clean data sets
The data cleaning made a lot of sense but I think I will struggle with solving problems in a really inefficient way.
Everything before Challenge 3
methods to merge datasets to create a table
inner join and full join are the same if all vectors are the same.
Pivot
ggplot and how to code data in to display what we want to display
Other comments
Is there a difference between summarize (with z) and summarise (with s)?
Great question!
- In English, summarize is American English and summarise is British English.
- In R they work the same way. The reference page for summarise() lists them as synonyms.
- In R code I see summarise more, and now keep mixing up which is American and which is British.
- In general, R accepts both American and British English, such as both color and colour.
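If you want to convince yourself, here is a quick check on a tiny made-up dataset (scores is hypothetical). It assumes dplyr is installed and compares the output of the two spellings.

```r
library(dplyr)

# Hypothetical data: two groups with a numeric value
scores <- tibble(group = c("a", "a", "b"), value = c(1, 2, 6))

with_s <- scores %>% group_by(group) %>% summarise(mean_value = mean(value))
with_z <- scores %>% group_by(group) %>% summarize(mean_value = mean(value))

# Both spellings give the same summary table
all.equal(with_s, with_z)  # TRUE
```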
Thank you for the survey reminders! The pace of the class feels much better compared to the pace at the beginning of the term
Thanks for the feedback!
I really enjoyed the walk through from start to finish of how to clean the data sheet and it really helped clear up many of the commands I was previously confused about
Thanks for the feedback! Glad the data wrangling walk through was helpful.