Week 6

Start with your goal: more data wrangling
Published

February 14, 2024

Modified

February 28, 2024

Announcements

  • HW 5: see the updated HW 5 assignment on OneDrive called hw_05_b526_v2.qmd
  • Midterm: due date extended to 2/25/24.
    • See the updated midterm file on OneDrive with new yaml, due date, and links to previous midterm projects.

Reminder to fill out post-class surveys

  • This is a reminder that 5% of your grade is based on filling out post-class surveys as a way of telling us that you came to class and engaged with the material for that week.
  • You only need to fill out 5 surveys (of the 10 class sessions) for the full 5%. We encourage you to fill out as many surveys as possible to provide feedback on the class though.
  • Please fill out surveys by 8 pm on Sunday evenings to guarantee that they will be counted. We usually download them some time on Sunday evening or Monday. If you submit a survey before we download the responses, it will be counted.

See syllabus section on Post-class surveys

Topics

Finish week 5

  • We will first finish the material not covered from Week 5, starting with section 4.6.
    • Note: I created a new version of the code file (inside the code folder) called part_05_b526_v2.qmd/.html.
    • I decided not to move the Week 5 material not covered to the Week 6 notes.

From Week 5:

  • Learn how to summarize() data with group_by() to summarize within categories
  • Learn about the different kinds of joins and how they merge data
    • Apply inner_join() and left_join() to join tables on columns
  • Utilize pivot_longer() to make a wide dataset long

New Week 6 topics - did not get to Part 6

  • Practice working with real data
  • Practice joining and pivoting
  • Practice ggplot and learn more geometries
  • Learn how to deal with missing data

Class materials

Post-class survey

Homework

  • See OneDrive folder for homework assignment.
  • HW 6 due on 2/21.

Recording

  • In-class recording links are on Sakai. Navigate to Course Materials -> Schedule with links to in-class recordings. Note that the password to the recordings is at the top of the page.

Muddiest points from Week 6

  • See Week 5 page for Week 5 feedback.

  • Note: During Week 6 we finished covering the part 5 material. Part 6 material will be covered during week 7.

across()

what exactly the across function does

.fns, i.e. .fns=list, etc… I wasn’t really sure what that was achieving within across.

  • The across() function lets us apply a function to many columns at once.
  • For example, let’s say we want the mean value for every continuous variable in a dataset.
    • The code below calculates the mean for one variable in the penguins dataset using both base R and summarize().
    • One option to calculate the mean value for every continuous variable in the dataset is to repeat this code for the 4 other continuous variables.
library(tidyverse)
library(janitor)
library(palmerpenguins)
library(gt)

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
# base R
mean(penguins$bill_length_mm, na.rm = TRUE)
[1] 43.92193
# with summarize
penguins %>% 
  summarize(mean(bill_length_mm, na.rm = TRUE))
# A tibble: 1 × 1
  `mean(bill_length_mm, na.rm = TRUE)`
                                 <dbl>
1                                 43.9
  • In this case across() lets us apply the mean function to all the columns of interest at once:
penguins %>%
  summarize(across(.cols = where(is.numeric), 
                   .fns = ~ mean(.x, na.rm = TRUE)
                   )) %>% 
  gt()
bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  year
43.92193        17.15117       200.9152           4201.754     2008.029
  • The .fns=list part of the across code is where we specify the function(s) that we want to apply to the specified columns.
    • Above we only specified one function (mean()), but we can specify additional functions as well, in which case we need to put all the functions we want to apply in a list.
    • Below I apply the mean and standard deviation functions:
penguins %>%
  summarize(across(.cols = where(is.numeric), 
                   .fns = list(
                     mean = ~ mean(.x, na.rm = TRUE),
                     sd = ~ sd(.x, na.rm = TRUE)
                     ))) %>% 
  gt()
bill_length_mm_mean  bill_length_mm_sd  bill_depth_mm_mean  bill_depth_mm_sd  flipper_length_mm_mean  flipper_length_mm_sd  body_mass_g_mean  body_mass_g_sd  year_mean  year_sd
43.92193             5.459584           17.15117            1.974793          200.9152                14.06171              4201.754          801.9545        2008.029   0.8183559
  • In general, lists are another type of R object to store information, whether data, lists of functions, output from regression models, etc. While c() (concatenate) creates a vector whose values must all be of the same type, a list can hold elements of different types and sizes, including vectors, functions, and other lists. We will be learning more about lists in parts 7 and 8.

  • You can learn more about across() at its help file.
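
To make the vector-versus-list distinction concrete, here is a minimal sketch. The toy dataset and the object names (v, fns, toy, toy_summary) are made up for illustration and are not from the class materials. It shows that c() forces all values to one type, while list() can hold functions, which is what .fns = list(...) relies on.

```r
library(tidyverse)

# c() builds a vector: every element is coerced to one common type
v <- c(1, "a", TRUE)
class(v)

# list() keeps each element as-is, so it can even hold functions
fns <- list(
  mean = ~ mean(.x, na.rm = TRUE),
  sd   = ~ sd(.x, na.rm = TRUE)
)

# Hypothetical toy data with one missing value
toy <- tibble(a = c(1, 2, NA), b = c(10, 20, 30))

# The names in the list (mean, sd) become the suffixes of the new columns
toy_summary <- toy %>%
  summarize(across(.cols = where(is.numeric), .fns = fns))
toy_summary
```

The summary has one row with the columns a_mean, a_sd, b_mean, and b_sd, the same naming pattern as in the penguins output above.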

case_when() vs ifelse()

still a little confused on the difference between ifelse and casewhen, understand they are very similar but still confused on when it is best to use one over another

  • The two functions can be used interchangeably.
    • ifelse() is the original function from base R
    • case_when() is the user-friendly version of ifelse() from the dplyr package
  • I recommend using case_when(), and it is what I use almost exclusively in my own work. My guess is that ifelse() was included in the notes since you might run into the function when reading R code on the internet.
  • Just be careful that you preserve missing values when using case_when() as we discussed last time.
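
A small side-by-side sketch may help here. The toy data, the cutoffs (40 and 50), and the new column names are all made up for illustration. It compares the two functions on a two-category split, shows case_when() handling three categories without nesting, and shows how a catch-all condition can silently overwrite a missing value.

```r
library(tidyverse)

# Hypothetical toy data: bill lengths with one missing value
toy <- tibble(bill_length_mm = c(35, 45, NA, 55))

toy <- toy %>%
  mutate(
    # Two categories: ifelse() and case_when() give the same result
    size_ifelse = ifelse(bill_length_mm < 40, "short", "long"),
    size_case = case_when(
      bill_length_mm < 40  ~ "short",
      bill_length_mm >= 40 ~ "long"
    ),
    # Three categories: case_when() avoids nesting ifelse() calls
    size_3 = case_when(
      bill_length_mm < 40  ~ "short",
      bill_length_mm < 50  ~ "medium",
      bill_length_mm >= 50 ~ "long"
    ),
    # A catch-all TRUE ~ ... also matches the NA row,
    # so the missing value is NOT preserved here
    size_catch_all = case_when(
      bill_length_mm < 40 ~ "short",
      TRUE ~ "long"
    )
  )
toy
```

In the first three new columns the missing bill length stays NA because no condition is TRUE for it. In size_catch_all it gets labeled "long", which is the situation to watch out for.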

factor levels

working with factor levels doesn’t feel totally intuitive yet. I think that’s because I tend to get confused with anything involving a concatenated list.

  • Working with factor variables, and with their factor levels in particular, takes a while to get used to.
  • We will be looking at more examples with factor variables in the part 6 notes. See sections 2.8 and 4.
  • You can think of a concatenated list (c(...)) as a vector of values or a column of a dataset. Concatenating lets us create a set of values, which we typically create to use for some other purpose, such as specifying the levels of a factor variable.
  • Please submit a follow-up question in the post-class survey if this is still muddy after today’s class!
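
As a minimal sketch of how c() and factor levels fit together, the example below uses made-up values and object names (sizes, sizes_f) that are not from the class notes. It only needs base R.

```r
# c() concatenates values into a vector
sizes <- c("small", "large", "medium", "small")

# Without the levels argument, factor() orders the levels alphabetically
levels(factor(sizes))

# With levels, we choose the order ourselves using another c() vector
sizes_f <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_f)

# Tables (and plots) follow the order of the levels
table(sizes_f)
```

The data values are the same in both versions. Only the order of the levels changes, and that order is what tables and plots use.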

pivoting tables

  • Definitely a tricky topic, and over half of the muddiest points were about pivoting tables.
  • We will be looking at more examples in part 6.

How pivot_longer() would work on very large datasets with many rows/columns

  • It works the same way. However, the resulting long table will end up being much, much longer.
  • Extra columns in the dataset that are not being pivoted (such as an age variable) just hang out, and their values get repeated over and over again.
    • We will be pivoting a dataset in part 6 that has extra variables that are not being pivoted.
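
Here is a minimal sketch of what happens to a column that is not being pivoted. The dataset, its column names, and the values are all hypothetical and chosen only to keep the result small enough to read.

```r
library(tidyverse)

# Hypothetical wide data: one row per person, one weight column per visit
wide <- tibble(
  id = c(1, 2),
  age = c(34, 51),
  weight_visit1 = c(150, 180),
  weight_visit2 = c(148, 183)
)

# Pivot only the weight columns; id and age are left alone
long <- wide %>%
  pivot_longer(
    cols = starts_with("weight_"),
    names_to = "visit",
    values_to = "weight"
  )
long
```

The long table has two rows per person, and each person's age is repeated on both of their rows. With more columns being pivoted, the repetition grows in the same way.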

Trying to visualize the joins and pivot longer/wider

  • I recommend trying them out with small datasets where you can actually see what is happening.
  • Joins: Our BERD workshop slides have another example that might help you visualize joins.
    • Slide 18 shows two datasets x and y, and what the resulting joins look like.
    • Slide 19 shows Venn diagrams of how the different joins behave.
  • Pivoting: There’s an example with a very small dataset in my (supplemental) notes from BSTA 511. The graphic that goes along with this is on Slide 28 from the pdf.
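
Along the lines of trying things out on small datasets, here is a sketch with two tiny made-up tables. They are named x and y here, but they are not the actual tables from the workshop slides. Only ids 1 and 2 appear in both tables.

```r
library(tidyverse)

# Two hypothetical tables that share an id column
x <- tibble(id = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), y_val = c("y1", "y2", "y4"))

# inner_join() keeps only the ids found in both tables
inner <- inner_join(x, y, by = "id")
inner

# left_join() keeps every row of x; y_val is NA where y has no match
left <- left_join(x, y, by = "id")
left
```

Comparing the two results row by row shows the difference: id 3 is dropped by inner_join() but kept by left_join() with a missing y_val, and id 4 (only in y) appears in neither.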

pivot_longer makes plotting more understandable in an analysis sense, which situations would call for pivot_wider?

  • I tend to use pivot_longer() much more frequently. However, there are times when pivot_wider() comes in handy. For example, below is a long table of summary statistics created with group_by() and summarize(). I would use pivot_wider() to reshape this table so that I have columns comparing species or columns comparing islands.
penguins %>% 
  group_by(species, island) %>%
  summarize(across(.cols = bill_length_mm, 
                   .fns = list(
                     mean = ~ mean(.x, na.rm = TRUE),
                     sd = ~ sd(.x, na.rm = TRUE)
                     )))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
# Groups:   species [3]
  species   island    bill_length_mm_mean bill_length_mm_sd
  <fct>     <fct>                   <dbl>             <dbl>
1 Adelie    Biscoe                   39.0              2.48
2 Adelie    Dream                    38.5              2.47
3 Adelie    Torgersen                39.0              3.03
4 Chinstrap Dream                    48.8              3.34
5 Gentoo    Biscoe                   47.5              3.08
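
One way the reshaping described above could look is sketched below. To keep the result small, this sketch computes only the mean, and the names mean_bill and wide_means are my own choices for illustration. It also sets .groups = "drop" so the summary is ungrouped before pivoting.

```r
library(tidyverse)
library(palmerpenguins)

wide_means <- penguins %>%
  group_by(species, island) %>%
  summarize(mean_bill = mean(bill_length_mm, na.rm = TRUE),
            .groups = "drop") %>%
  # One row per species, one column per island
  pivot_wider(names_from = island, values_from = mean_bill)
wide_means
```

The result has one row per species and one column per island, so the island means can be compared side by side. Species that were not observed on an island get NA in that column.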

How to use arguments of pivot longer.

  • The arguments of the pivot functions take some practice to get used to. I sometimes still pull up an example to remind me what I need to specify for the various arguments, such as the one mentioned above that I have used in workshops and classes.
  • We have not covered all the different arguments, and I recommend reviewing the help file and in particular the examples at the end of the page.

gt::gt()

The gt::gt package does make the tables look fancier, how do we add labels to those to have them look nice as well?

  • I highly recommend the gt webpage to learn more about all the different options to create pretty tables. Note the tabs at the top of the page for “Get started” and “Reference.”
  • See also section 3 of part 6 on “Side note about gt::gt()” for more on creating pretty tables.

here::here

would also love more examples of here() I am starting to understand it better but still am a little confused

I am still having trouble getting here() to work consistently. I was going to ask during class, but I think I am just not understanding how to manually nest my files correctly so that “here” works. I am struggling to get that set up correct, and thus, struggling to use it.

Clearest points

  • group_by() function (n=3)
  • summarize() (n=2)
  • across() (n=1)
  • case_when() (n=1)
  • drop_na( ) (n=2)
  • Joining tables (n=6)