Week 7

Start with your goal: more data wrangling
Published

February 21, 2024

Modified

February 28, 2024

Announcements

  • HW 7 is in the part 6 folder
  • HW 6: see the updated HW 6 assignment on OneDrive in the part 6 folder called hw_06_b526-final-version.qmd
  • Midterm: due date extended to 2/25/24.
    • See the updated midterm file on OneDrive with new yaml, due date, and links to previous midterm projects.

Reminder to fill out post-class surveys

  • This is a reminder that 5% of your grade is based on filling out post-class surveys as a way of telling us that you came to class and engaged with the material for that week.
  • You only need to fill out 5 surveys (of the 10 class sessions) for the full 5%. We encourage you to fill out as many surveys as possible to provide feedback on the class though.
  • Please fill out surveys by 8 pm on Sunday evenings to guarantee that they will be counted. We usually download them some time on Sunday evening or Monday. If you turn it in before we download the responses, it will get counted.

See syllabus section on Post-class surveys

Topics

Part 6

  • Practice working with real data
  • Practice joining and pivoting
  • Practice ggplot and learn more geometries
  • Learn how to deal with missing data

Class materials

Post-class survey

Homework

  • See OneDrive folder for homework assignment.
  • HW 7 due on 2/28. Assignment is in the part 6 folder.

Recording

  • In-class recording links are on Sakai. Navigate to Course Materials -> Schedule with links to in-class recordings. Note that the password to the recordings is at the top of the page.

Muddiest points from Week 7

  • Week 7 feedback will be added during Week 8.
  • See Week 6 page for Week 6 feedback.

Why we used full_join()

Why we used full_join() in the class example instead of the other join options

  • In this case both datasets being joined had the same IDs, so it did not matter whether we used left_join(), right_join(), full_join(), or inner_join(). All of these would have given the same result.
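To see why the join type did not matter, here is a minimal sketch using two made-up tibbles (`visits` and `labs` are hypothetical, not the class datasets) that share exactly the same IDs. The last part removes one ID from one table to show when the joins start to differ.

```r
library(dplyr)

# Hypothetical toy data: both tables contain the same IDs (1, 2, 3)
visits <- tibble(id = c(1, 2, 3), weight = c(20.1, 22.4, 19.8))
labs   <- tibble(id = c(1, 2, 3), bdnf   = c(492, 454, 510))

full  <- full_join(visits, labs, by = "id")
inner <- inner_join(visits, labs, by = "id")
left  <- left_join(visits, labs, by = "id")
right <- right_join(visits, labs, by = "id")

# With identical IDs, every join keeps all 3 rows
c(full = nrow(full), inner = nrow(inner), left = nrow(left), right = nrow(right))

# If one table is missing an ID, the joins no longer agree
labs_missing <- labs %>% filter(id != 3)

full_missing  <- full_join(visits, labs_missing, by = "id")   # keeps id 3, bdnf is NA
inner_missing <- inner_join(visits, labs_missing, by = "id")  # drops id 3

nrow(full_missing)   # 3
nrow(inner_missing)  # 2
```

When in doubt about whether the IDs match, full_join() is a reasonable default because it never drops rows; unmatched rows show up with NA values that you can inspect.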

Visualizing pivots and joins

pivot-longer is still hard for me to mentally visualize how it alters the dataset.

I always struggle to visualize pivots and joins.

  • Reshaping data takes lots of practice to get the hang of, and it is something where I still pause while coding to think through how it will work and how to code it. Especially for pivoting, I often refer back to existing code I am familiar with. It’s normal at this point to still be muddy on these topics. Keep practicing and read through some more examples.
  • In the week 6 muddiest points I listed some additional resources for visualizing these.
  • See also Tidy Animated Verbs for visualizing joins and pivoting. The page also includes visualizations of union(), intersect(), and setdiff().
  • Another great resource is the R for Epidemiology website.
  • Jessica also addressed pivoting in last year’s muddiest points.

Please come to office hours or set up a time to meet if this is still muddy after looking at these resources!
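As one more way to picture pivoting, here is a small sketch on a made-up wide table (the tibble `wide` and its values are hypothetical). Printing the data before and after each step is often the quickest way to see what moved where.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per mouse, one column per time point
wide <- tibble(
  sid = c(137, 138),
  tp1 = c(3.5, 1.6),
  tp2 = c(19.8, 14.5),
  tp3 = c(2.4, 1.1)
)

# pivot_longer() stacks the tp1-tp3 columns:
# the column names go into `time`, the cell values into `learning_outcome`
long <- wide %>%
  pivot_longer(
    cols = tp1:tp3,
    names_to = "time",
    values_to = "learning_outcome"
  )

long  # 2 mice x 3 time points = 6 rows, 3 columns

# pivot_wider() reverses the operation:
# values in `time` become column names again
wide_again <- long %>%
  pivot_wider(names_from = time, values_from = learning_outcome)

wide_again
```

A rule of thumb that may help: pivot_longer() makes the data taller and narrower (column names become values), while pivot_wider() makes it shorter and wider (values become column names).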

mutate(factor ( ))

mutate(factor ( )) problem we ran into in class where Emile posted on Slack.

Below is the code Emile posted on Slack (commented out):

# data <- data |>
#   mutate(timepoint = factor(timepoint,
#                             levels = c(1, 2, 3),
#                             labels = c(“1 month”,
#                                          “6 months”,
#                                          “12 months”)))
  • At this point we were working through the code of Section 2.8 in the Part 6 notes.

Load the mouse_data dataset we were working with:

library(tidyverse)
library(here)
library(janitor)

mouse_data <- read_csv(here("data", "mouse_data_longitudinal_clean.csv"))
glimpse(mouse_data)
Rows: 96
Columns: 18
$ sid                                <dbl> 137, 137, 137, 138, 138, 138, 139, …
$ strain                             <chr> "C3H", "C3H", "C3H", "C3H", "C3H", …
$ trt                                <chr> "-", "-", "-", "-", "-", "-", "-", …
$ sex                                <chr> "M", "M", "M", "M", "M", "M", "M", …
$ time                               <chr> "tp1", "tp2", "tp3", "tp1", "tp2", …
$ normalized_bdnf_amygdala_pg_mg     <dbl> 492.4831, 275.1623, NA, 453.6635, 4…
$ normalized_bdnf_cortex_pg_mg       <dbl> 720.0173, NA, 871.8286, 884.5668, N…
$ normalized_bdnf_hypothalamus_pg_mg <dbl> NA, 1169.2845, NA, 1215.8147, 1078.…
$ normalized_cd68_amygdala_pg_mg     <dbl> 988.9628, 574.0655, NA, 775.5970, 4…
$ normalized_cd68_cortex_pg_mg       <dbl> 8.393707, NA, NA, 7.901366, NA, 8.8…
$ normalized_cd68_hypothalamus_pg_mg <dbl> NA, 6800.870, NA, 4373.811, 4461.62…
$ normalized_map2_cortex_pg_mg       <dbl> 352.9653, NA, 2693.9386, 1007.4147,…
$ mirna1                             <dbl> 5.2630200, -0.0491371, -0.7367310, …
$ mirna2                             <dbl> 1.6536200, -0.0773419, 0.1479940, -…
$ learning_outcome                   <dbl> 3.52, 19.81, 2.44, 1.56, 14.48, 1.1…
$ preference_obj1                    <dbl> 41.72205, 37.51387, 55.96768, 74.11…
$ preference_obj2                    <dbl> 58.27795, 62.48613, 44.03232, 25.88…
$ time_month                         <chr> "1 month", "6 months", "12 months",…
mouse_data %>% tabyl(time)
 time  n   percent
  tp1 32 0.3333333
  tp2 32 0.3333333
  tp3 32 0.3333333
  • The goal was to convert the character column time into a factor with the levels 1 month, 6 months, and 12 months, instead of time’s values tp1, tp2, and tp3.
  • The code presented in class to accomplish this is below:
# create time_month factor
mouse_data <- mouse_data %>%
  mutate(time_month = case_when(
    time=="tp1" ~ "1 month",
    time=="tp2" ~ "6 months",
    time=="tp3" ~ "12 months"
  ),
  time_month = factor(time_month,
                      levels = c("1 month", "6 months", "12 months")))
  • Compare the old and new time variables:
mouse_data %>% tabyl(time, time_month)
 time 1 month 6 months 12 months
  tp1      32        0         0
  tp2       0       32         0
  tp3       0        0        32
  • The question arose as to whether we could include factor() in the same step as case_when() when creating time_month above, instead of having to write it out as a second separate line in the mutate().
  • When using case_when(), we can do this by piping the result of case_when() into factor():
mouse_data <- mouse_data %>%
  mutate(time_month2 = case_when(
    time=="tp1" ~ "1 month",
    time=="tp2" ~ "6 months",
    time=="tp3" ~ "12 months"
  ) %>% factor(., levels = c("1 month", "6 months", "12 months"))
  )

mouse_data %>% tabyl(time_month, time_month2)
 time_month 1 month 6 months 12 months
    1 month      32        0         0
   6 months       0       32         0
  12 months       0        0        32
  • Another, similar option is to enclose the case_when() within the factor():
mouse_data <- mouse_data %>%
  mutate(time_month3 = factor(
    case_when(
      time=="tp1" ~ "1 month",
      time=="tp2" ~ "6 months",
      time=="tp3" ~ "12 months"
      ), 
    levels = c("1 month", "6 months", "12 months")
    ))

mouse_data %>% tabyl(time_month, time_month3)
 time_month 1 month 6 months 12 months
    1 month      32        0         0
   6 months       0       32         0
  12 months       0        0        32

levels vs. labels

  • Emile suggested using factor() on the time variable directly, and creating the new values using the labels option within factor():
mouse_data <- mouse_data %>% 
  mutate(time_month4 = factor(time,
                             levels = c("tp1", "tp2", "tp3"),
                             labels = c("1 month", "6 months", "12 months")
                             ))

mouse_data %>% tabyl(time_month, time_month4)
 time_month 1 month 6 months 12 months
    1 month      32        0         0
   6 months       0       32         0
  12 months       0        0        32
  • What is new here is that we have not previously discussed labels.
  • You can think of the levels as the input for the factor() function.
    • It’s how we specify what the different levels are for the variable we are converting to factor, as well as the order we want the levels to be in.
    • If we do not specify the levels, then R will automatically use the different values of the variable being converted and arrange them in alphanumeric order. Example:
mouse_data <- mouse_data %>% 
  mutate(time_month5 = factor(time))

mouse_data %>% tabyl(time_month, time_month5)
 time_month tp1 tp2 tp3
    1 month  32   0   0
   6 months   0  32   0
  12 months   0   0  32
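The default ordering happens to work for tp1, tp2, and tp3, but it can give an unintended order for other values. The sketch below uses a standalone toy vector (`months_chr` is hypothetical) to show why the class code spelled out the levels for the month labels: sorted as text, "12 months" lands before "6 months".

```r
# Character values that have a natural time order
months_chr <- c("1 month", "6 months", "12 months")

# Default: levels are the sorted unique values, compared as text,
# so "12 months" comes before "6 months"
default_levels <- levels(factor(months_chr))
default_levels

# Specifying levels gives the order we actually want
ordered_levels <- levels(
  factor(months_chr, levels = c("1 month", "6 months", "12 months"))
)
ordered_levels
```

The level order matters later because it controls the order of categories in tables and on ggplot axes.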
  • While levels is an input for the factor() function, labels is an output for the factor() function.
  • The values specified in labels are the new values for the levels:
# time_month4 added labels
# time_month5 did not add labels

mouse_data %>% tabyl(time_month4, time_month5)
 time_month4 tp1 tp2 tp3
     1 month  32   0   0
    6 months   0  32   0
   12 months   0   0  32
levels(mouse_data$time_month4)
[1] "1 month"   "6 months"  "12 months"
levels(mouse_data$time_month5)
[1] "tp1" "tp2" "tp3"
  • Note that both time_month4 and time_month5 started with the same levels (tp1, tp2, tp3); only time_month4 relabeled them.
  • Instead of using the labels option within factor() (the base R way), we can also accomplish this by using fct_recode() from the forcats package (loaded as a part of tidyverse):
# original tp levels:
levels(mouse_data$time_month5)
[1] "tp1" "tp2" "tp3"
mouse_data <- mouse_data %>% 
  mutate(time_month6 = fct_recode(time_month5, 
                            # new_name = "old_name"
                                 "1 month" = "tp1", 
                                 "6 months" = "tp2", 
                                 "12 months" = "tp3"))

levels(mouse_data$time_month6)
[1] "1 month"   "6 months"  "12 months"
mouse_data %>% tabyl(time_month6, time_month5)
 time_month6 tp1 tp2 tp3
     1 month  32   0   0
    6 months   0  32   0
   12 months   0   0  32
  • Learn more about fct_recode() here.
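One thing to watch for when supplying levels: any value in the data that is not listed in levels becomes NA. The sketch below uses a toy vector standing in for the time column. It may also be relevant to the code posted on Slack, which used levels = c(1, 2, 3); if that column held values like "tp1", as time does, none of them would match (this is an assumption, since we did not see that column's values).

```r
# Toy vector standing in for the time column
time <- c("tp1", "tp2", "tp3", "tp1")

# levels that do not appear in the data: every value becomes NA
mismatch <- factor(time,
                   levels = c(1, 2, 3),
                   labels = c("1 month", "6 months", "12 months"))
sum(is.na(mismatch))   # 4, all values were lost

# levels that match the data: values are relabeled as intended
matched <- factor(time,
                  levels = c("tp1", "tp2", "tp3"),
                  labels = c("1 month", "6 months", "12 months"))
sum(is.na(matched))    # 0
as.character(matched)
```

R does not warn when this happens, so it is worth checking a new factor with tabyl() or by counting missing values right after creating it.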

%in%

%in% command, I feel like I understand but have some confusion and think it might just be one of those things I have to work with/apply to fully understand

  • We’ve used the %in% function in some examples, but I don’t think we’ve discussed it in detail.

  • The %in% function is used to test whether elements of one vector are contained in another vector. It returns a logical vector indicating whether each element of the first vector is found in the second vector.

  • Below are some examples that ChatGPT generated (and I slightly edited).

# Example 1: Using %in% with two numeric vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6)

x %in% y
[1] FALSE  TRUE FALSE  TRUE FALSE
# Example 2: Using %in% with two character vectors
fruits <- c("apple", "banana", "orange", "grape")
selected_fruits <- c("banana", "grape", "kiwi")

selected_fruits %in% fruits
[1]  TRUE  TRUE FALSE
# Example 3: Using %in% with dataframe columns
library(tidyverse)

# Create a dataframe
df <- tibble(
  ID = c(1, 2, 3, 4, 5),
  fruit = c("apple", "banana", "orange", "grape", "kiwi")
)

df
# A tibble: 5 × 2
     ID fruit 
  <dbl> <chr> 
1     1 apple 
2     2 banana
3     3 orange
4     4 grape 
5     5 kiwi  
# Filter rows where 'fruit' column contains values from selected_fruits
selected_fruits <- c("banana", "grape", "kiwi")

df_filtered <- df %>%
  filter(fruit %in% selected_fruits)

df_filtered
# A tibble: 3 × 2
     ID fruit 
  <dbl> <chr> 
1     2 banana
2     4 grape 
3     5 kiwi  
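A few more patterns that may help %in% click, sketched on the same toy fruit data (recreated here so the block runs on its own): it is shorthand for a chain of == joined by |, it can be negated with !, and it returns FALSE rather than NA for missing values.

```r
library(dplyr)

df <- tibble(
  ID = c(1, 2, 3, 4, 5),
  fruit = c("apple", "banana", "orange", "grape", "kiwi")
)
selected_fruits <- c("banana", "grape", "kiwi")

# %in% gives the same rows as chaining == with |
with_in <- df %>% filter(fruit %in% selected_fruits)
with_or <- df %>% filter(fruit == "banana" | fruit == "grape" | fruit == "kiwi")

# Negate by wrapping the whole test in !( ) to keep everything NOT selected
not_selected <- df %>% filter(!(fruit %in% selected_fruits))
not_selected$fruit   # "apple" "orange"

# With missing values, %in% returns FALSE while == returns NA
c("apple", NA) %in% selected_fruits   # FALSE FALSE
c("apple", NA) == "apple"             # TRUE NA
```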

Clearest points

This class was all really clear. It was helpful to be reviewing some of the things we learned last week.

I appreciate the new codes on how to clean/reshape/combine messy data. I think that was the hardest parts to do in the other Biostatistics courses during projects.

Data cleaning

Most of the data cleaning exercises.

different strategies to clean data sets

The data cleaning made a lot of sense but I think I will struggle with solving problems in a really inefficient way.

Everything before Challenge 3

methods to merge datasets to create a table

inner join and full join are the same if all vectors are the same.

Pivot

ggplot and how to code data in to display what we want to display

Other comments

Is there a difference between summarize (with z) and summarise (with s)?

Great question!

  • In English, summarize is American English and summarise is British English.
  • In R they work the same way. The reference page for summarise() lists them as synonyms.
  • In R code I see summarise more often, and I now keep mixing up which is American and which is British.
  • In general, many tidyverse functions accept both American and British spellings, such as color and colour in ggplot2.
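A quick check of this on a made-up tibble (`scores` is hypothetical): both spellings produce the same summary.

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(group = c("a", "a", "b"), value = c(1, 3, 10))

by_s <- scores %>% group_by(group) %>% summarise(mean_value = mean(value))
by_z <- scores %>% group_by(group) %>% summarize(mean_value = mean(value))

by_s$mean_value                                 # 2 10
identical(by_s$mean_value, by_z$mean_value)     # TRUE
```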

Thank you for the survey reminders! The pace of the class feels much better compared to the pace at the beginning of the term

Thanks for the feedback!

I really enjoyed the walk through from start to finish of how to clean the data sheet and it really helped clear up many of the commands I was previously confused about

Thanks for the feedback! Glad the data wrangling walk through was helpful.