Week 5

Data summarizing, reshaping, and wrangling with multiple tables
Published: February 7, 2024

Modified: February 14, 2024

Post-class updates

Updates made on 2/9/24

  • HW 5: see the updated HW 5 assignment on OneDrive called hw_05_b526_v2.qmd
  • Midterm: due date extended to 2/25/24.
    • See the updated midterm file on OneDrive with new yaml, due date, and links to previous midterm projects.
  • Material covered in class on 2/7/24
    • Part 5 Sections 1-4.4
    • Solutions to Challenge 1 from section 4.3 of the Part 5 notes are in OneDrive (Challenge_1_solutions_Part5.html)

Topics

  • Practice loading data and using mutate() and separate()
  • Practice using here() to load data in a subfolder of the project
  • Learn how to summarize() data with group_by() to summarize within categories
  • Learn and apply bind_rows() to combine rows from two or more datasets
  • Learn about the different kinds of joins and how they merge data
  • Apply inner_join() and left_join() to join tables on columns
  • Utilize pivot_longer() to make a wide dataset long

Announcements

  • Class materials for BSTA 526 will be provided in the shared OneDrive folder BSTA_526_W24_class_materials_public.
  • For today’s class, download the folder called part_05 to your computer, and then open RStudio by double-clicking the file called part_05.Rproj.

Class materials

Post-class survey

Homework

  • See OneDrive folder for homework assignment.
  • HW 5 due on 2/14.
    • See the updated HW 5 assignment on OneDrive called hw_05_b526_v2.qmd

Recording

  • In-class recording links are on Sakai. Navigate to Course Materials -> Schedule with links to in-class recordings. Note that the password to the recordings is at the top of the page.

Muddiest points from Week 5

case_when() vs ifelse()

The difference between case_when and ifelse

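  • ifelse() is a base R function that picks between two values based on a single logical condition; tidyverse’s case_when() extends the same idea to any number of conditions.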
  • I prefer using case_when() since it’s easier to follow the logic.
  • case_when() is especially useful when there are more than two logical conditions being used.

The example below creates a binary variable for bill length (long vs not long) using both case_when() and ifelse() as a comparison.

  • Compare the crosstabs of the two variables!
library(tidyverse)
library(janitor)
library(palmerpenguins)

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
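penguins <- penguins %>%
  mutate(
    # case_when(): conditions are checked in order; rows matching none
    # (here, missing bill lengths) get NA
    long_bill1 = case_when(
      bill_length_mm >= 45 ~ "long",
      bill_length_mm < 45 ~ "not long"
    ),
    # ifelse(): one condition, one value if TRUE and another if FALSE
    long_bill2 = ifelse(bill_length_mm >= 45, "long", "not long")
  )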

penguins %>% tabyl(long_bill1, long_bill2) %>% 
  adorn_title()
            long_bill2             
 long_bill1       long not long NA_
       long        166        0   0
   not long          0      176   0
       <NA>          0        0   2

Below is an example using case_when() to create a categorical variable with 3 groups:

penguins <- penguins %>%
  mutate(
    long_bill3 = case_when(
      bill_length_mm >= 50 ~ "long",
      bill_length_mm <= 40 ~ "short",
      # TRUE is a catch-all for every remaining row,
      # including rows where bill_length_mm is NA
      TRUE ~ "medium"
    )
  )

penguins %>% tabyl(long_bill3, long_bill1) %>% 
  adorn_title()
            long_bill1             
 long_bill3       long not long NA_
       long         57        0   0
     medium        109       76   2
      short          0      100   0
  • Creating a categorical variable with 3 groups can also be done with ifelse(), but the nested calls make the logic harder to follow:
penguins <- penguins %>% 
  mutate(
    long_bill4 = ifelse(
      bill_length_mm >= 50, "long",
      ifelse(bill_length_mm <= 40, "short", "medium")
      ))

penguins %>% tabyl(long_bill3, long_bill4) %>% 
  adorn_title()
            long_bill4                 
 long_bill3       long medium short NA_
       long         57      0     0   0
     medium          0    185     0   2
      short          0      0   100   0
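
One detail in the tables above that is easy to miss: the two penguins with a missing bill length are labelled "medium" by the case_when() version but stay NA in the nested ifelse() version. The sketch below reproduces this on a small made-up vector (the values and the names bill, with_catch_all, na_preserved, and nested are illustrative, not from the class notes) and shows one possible way to keep missing values missing, by testing for NA before the catch-all.

```r
library(dplyr)

# Made-up bill lengths, including one missing value
bill <- c(38, 45, 52, NA)

# TRUE is a catch-all: it also matches the NA, which becomes "medium"
with_catch_all <- case_when(
  bill >= 50 ~ "long",
  bill <= 40 ~ "short",
  TRUE ~ "medium"
)

# Handling NA explicitly first keeps the missing value missing
na_preserved <- case_when(
  is.na(bill) ~ NA_character_,
  bill >= 50 ~ "long",
  bill <= 40 ~ "short",
  TRUE ~ "medium"
)

# Nested ifelse() propagates NA without any extra step
nested <- ifelse(bill >= 50, "long", ifelse(bill <= 40, "short", "medium"))
```

Which behavior is preferable depends on the analysis; the point is only that the two approaches treat missing values differently unless NA is handled explicitly.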

separate()

Different ways of using the function separate: it was a bit unclear when to use one or the other, or in what kinds of research data it would be most relevant to use.

  • Choosing the “best” way of using separate() is overwhelming at first.
  • I recommend starting with the simplest use case, where sep is a plain string such as sep = " ":

separate(data, col, into, sep = " ")

  • Which of the versions shown in class to use depends on how the data being separated are structured.
  • Most of the time I have a simple character, such as a space (sep = " ") or a comma (sep = ","), that I want to separate by.
  • If the data are structured in a more complex way, then one of the stringr package options might come in handy.
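
As a minimal sketch of the simple cases described above, the example below splits one made-up column on a space and another on a comma. The data frame and its column names (patients, name, visit, and so on) are hypothetical. Note that separate() treats a character sep as a regular expression, so characters with a special meaning in regular expressions (such as a period) need to be escaped.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two columns that each pack two values into one string
patients <- tibble(
  name  = c("Ada Lovelace", "Alan Turing"),
  visit = c("2024,1", "2023,2")
)

patients_split <- patients %>%
  # Split on a space: "Ada Lovelace" becomes first = "Ada", last = "Lovelace"
  separate(col = name, into = c("first", "last"), sep = " ") %>%
  # Split on a comma: "2024,1" becomes year = "2024", visit_number = "1"
  separate(col = visit, into = c("year", "visit_number"), sep = ",")
```

The new columns are character by default. Recent tidyr versions mark separate() as superseded in favor of functions such as separate_wider_delim(), but separate() continues to work.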

here::here()

TSV files, very neat… But also, I got a bit confused when you did the render process around 22:00-23:00 minutes. Also, “here:” and also “here”. Directories/root directories. I was a bit confused about in what situations we would tangibly utilize this/if it is beneficial.

  • Great question! This is definitely not intuitive, which is why I wanted to demonstrate it in class.
  • The key is that
    • when rendering a qmd file the current working directory is the folder the file is sitting in,
    • while when running code in a file within RStudio the working directory is the folder where the .Rproj file is located.
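  • This distinction matters when loading other files from our computer, and it is why here::here() makes our workflow so much easier: it builds file paths starting from the project root (the folder containing the .Rproj file), so the same path works in both situations.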
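
A minimal sketch of how this looks in practice, assuming the here package is installed and that the project contains a subfolder called data holding a file called example.tsv (both names are hypothetical):

```r
library(here)

# here() with no arguments returns the project root that the package detected
project_root <- here()

# Build a path from the project root; the folder and file names are hypothetical
tsv_path <- here("data", "example.tsv")

# The path can then be passed to a reader, for example:
# my_data <- readr::read_tsv(tsv_path)
```

Because the path is anchored at the project root, it should not matter whether the code is run interactively or while rendering the qmd file.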

what functions will only work within another function (generally)

  • I’m not aware of functions that only work within other functions. For example, the mean() function works on its own, but can also be used within summarise().
mean(penguins$bill_length_mm, na.rm = TRUE)
[1] 43.92193
penguins %>% summarise(
  m = mean(bill_length_mm, na.rm = TRUE)
)
# A tibble: 1 × 1
      m
  <dbl>
1  43.9
  • That being said, each function has its own set of parameters, so an argument such as na.rm = TRUE has to be supplied to the function that defines it (here mean(), not summarise()).
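
One case that may be worth knowing about: some dplyr helper functions are designed to be called inside verbs such as summarise() or mutate(). The sketch below uses n(), which counts the rows in the current group; the data frame and its names (measurements, group, value) are made up for illustration.

```r
library(dplyr)

# Made-up data: three measurements in two groups
measurements <- tibble(
  group = c("a", "a", "b"),
  value = c(1.5, 2.5, 4.0)
)

# n() takes no arguments; it reads the current group from the enclosing dplyr verb
group_counts <- measurements %>%
  group_by(group) %>%
  summarise(n_rows = n(), mean_value = mean(value))

# Called on its own, n() has no group to count, so it signals an error;
# tryCatch() turns that error into the string "error" for illustration
standalone <- tryCatch(n(), error = function(e) "error")
```

Here mean() would work equally well outside summarise(), while n() only makes sense inside a dplyr verb.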