Week 6

Start with your goal: more data wrangling

Published

February 14, 2024

Modified

February 28, 2024

Announcements

HW 5: see the updated HW 5 assignment on OneDrive called hw_05_b526_v2.qmd
Midterm: due date extended to 2/25/24.
- See the updated midterm file on OneDrive with new yaml, due date, and links to previous midterm projects.

Reminder to fill out post-class surveys

This is a reminder that 5% of your grade is based on filling out post-class surveys as a way of telling us that you came to class and engaged with the material for that week.
You only need to fill out 5 surveys (of the 10 class sessions) for the full 5%. We encourage you to fill out as many surveys as possible to provide feedback on the class though.
Please fill out surveys by 8 pm on Sunday evenings to guarantee that they will be counted. We usually download them some time on Sunday evening or Monday. If you submit a survey before we download the responses, it will be counted.

See syllabus section on Post-class surveys

Topics

Finish week 5

We will first finish the material not covered from Week 5, starting with section 4.6.
- Note: I created a new version of the code file (inside the code folder) called part_05_b526_v2.qmd/.html.
- I decided not to move the Week 5 material not covered to the Week 6 notes.

From Week 5:

Learn how to summarize() data with group_by() to summarize within categories
Learn about the different kinds of joins and how they merge data
- Apply inner_join() and left_join() to join tables on columns
Utilize pivot_longer() to make a wide dataset long

New Week 6 topics - did not get to Part 6

Practice working with real data
Practice joining and pivoting
Practice ggplot and learn more geometries
Learn how to deal with missing data

Class materials

Week 5 Readings
Week 6 Readings
OneDrive part_05 & part_06 project folders

Post-class survey

Please fill out the post-class survey to provide feedback. Thank you!

Homework

See OneDrive folder for homework assignment.
HW 6 due on 2/21.

Recording

In-class recording links are on Sakai. Navigate to Course Materials -> Schedule with links to in-class recordings. Note that the password to the recordings is at the top of the page.

Muddiest points from Week 6

See Week 5 page for Week 5 feedback.
Note: During Week 6 we finished covering the part 5 material. Part 6 material will be covered during week 7.

`across()`

what exactly the across function does

.fns, i.e. .fns=list, etc… I wasn’t really sure what that was achieving within across.

The across() function lets us apply a function to many columns at once.
For example, let’s say we want the mean value for every continuous variable in a dataset.
- The code below calculates the mean for one variable in the penguins dataset using both base R and summarize().
- One option to calculate the mean value for every continuous variable in the dataset is to repeat this code for the 4 other continuous variables.

library(tidyverse)
library(janitor)
library(palmerpenguins)
library(gt)

summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2

# base R
mean(penguins$bill_length_mm, na.rm = TRUE)

[1] 43.92193

# with summarize
penguins %>% 
  summarize(mean(bill_length_mm, na.rm = TRUE))

# A tibble: 1 × 1
  `mean(bill_length_mm, na.rm = TRUE)`
                                 <dbl>
1                                 43.9

In this case across() lets us apply the mean function to all the columns of interest at once:

penguins %>%
  summarize(across(.cols = where(is.numeric), 
                   .fns = ~ mean(.x, na.rm = TRUE)
                   )) %>% 
  gt()

bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	year
43.92193	17.15117	200.9152	4201.754	2008.029

The .fns argument of across() is where we specify the function(s) that we want to apply to the selected columns.
- Above we specified only one function (mean()). To apply more than one function, we wrap them in list(), and the name given to each function inside the list becomes the suffix of the new column names.
- Below I apply the mean and standard deviation functions:

penguins %>%
  summarize(across(.cols = where(is.numeric), 
                   .fns = list(
                     mean = ~ mean(.x, na.rm = TRUE),
                     sd = ~ sd(.x, na.rm = TRUE)
                     ))) %>% 
  gt()

bill_length_mm_mean	bill_length_mm_sd	bill_depth_mm_mean	bill_depth_mm_sd	flipper_length_mm_mean	flipper_length_mm_sd	body_mass_g_mean	body_mass_g_sd	year_mean	year_sd
43.92193	5.459584	17.15117	1.974793	200.9152	14.06171	4201.754	801.9545	2008.029	0.8183559

In general, lists are another type of R object used to store information, whether data, sets of functions, output from regression models, etc. While c() (concatenate) creates a vector whose values all have the same type, a list can hold elements of different types and lengths, including functions and other lists. We will be learning more about lists in parts 7 and 8.
You can learn more about across() in its help file.
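To make the vector vs. list distinction concrete, here is a minimal base R sketch. The object names (values, mixed, stats) are made up for illustration and are not from the class notes.

```r
# A vector made with c() holds values of a single type
values <- c(1.5, 2, 3)
class(values)        # "numeric"

# Mixing types in c() silently converts everything to character
mixed <- c(1, "a")
class(mixed)         # "character"

# A list can hold different kinds of objects, including functions
stats <- list(
  numbers = values,
  label   = "penguins",
  fn      = mean
)

# Pull elements out by name with $ and use them
result <- stats$fn(stats$numbers)
result
```

This is the same idea as the list of functions passed to .fns above: each named element of the list is a function that across() can call.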

`case_when()` vs `ifelse()`

still a little confused on the difference between ifelse and casewhen, understand they are very similar but still confused on when it is best to use one over another

The two functions can be used interchangeably.
- ifelse() is the original function from base R
- case_when(), from the dplyr package, is a more user-friendly alternative that handles several conditions without nesting
I recommend using case_when(), and it is what I use almost exclusively in my own work. My guess is that ifelse() was included in the notes since you might run into the function when reading R code on the internet.
Just be careful that you preserve missing values when using case_when() as we discussed last time.
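As a small sketch of the comparison, the code below creates the same category with both functions. The toy dataset and the cutoff of 4000 are made up for illustration. Note how a missing value falls through to the catch-all TRUE condition in case_when() unless it is handled first.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(mass = c(3000, 4500, NA, 5200))

toy_cat <- toy %>%
  mutate(
    # ifelse(): a missing input stays missing
    size_ifelse = ifelse(mass > 4000, "large", "small"),
    # case_when() without handling NA: the NA falls through to TRUE ~ "small"
    size_naive = case_when(
      mass > 4000 ~ "large",
      TRUE ~ "small"
    ),
    # case_when() with an explicit first condition that preserves NA
    size_cw = case_when(
      is.na(mass) ~ NA_character_,
      mass > 4000 ~ "large",
      TRUE ~ "small"
    )
  )

toy_cat
```

The conditions in case_when() are checked in order, so the is.na() line needs to come before the catch-all.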

factor levels

working with factor levels doesn’t feel totally intuitive yet. I think that’s because I tend to get confused with anything involving a concatenated list.

Working with factor variables, and in particular their levels, takes a while to get used to.
We will be looking at more examples with factor variables in the part 6 notes. See sections 2.8 and 4.
You can think of a concatenated list (c(...)) as a vector of values or a column of a dataset. Concatenating lets us create a set of values, which we typically create to use for some other purpose, such as specifying the levels of a factor variable.
Please submit a follow-up question in the post-class survey if this is still muddy after today’s class!
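Here is a minimal base R sketch of using c(...) to set factor levels. The values and object names are made up for illustration.

```r
# A character vector of categories (hypothetical values)
sizes <- c("medium", "low", "high", "low")

# By default, factor() orders the levels alphabetically
sizes_default <- factor(sizes)
levels(sizes_default)

# Use c(...) to create the vector of levels in the order we want
size_levels <- c("low", "medium", "high")
sizes_ordered <- factor(sizes, levels = size_levels)
levels(sizes_ordered)

# Tables (and plots) follow the order of the levels
table(sizes_ordered)
```

The data values are the same in both versions; only the order in which R displays the categories changes.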

pivoting tables

Definitely a tricky topic, and over half of the muddiest points were about pivoting tables.
We will be looking at more examples in part 6.

How pivot_longer() would work on very large datasets with many rows/columns

It works the same way. However, the resulting long table will end up being much, much longer.
Extra columns that are not being pivoted (such as an age variable) just hang out, and their values get repeated over and over again.
- We will be pivoting a dataset in part 6 that has extra variables that are not being pivoted.
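To see the repeated values, here is a small sketch with a made-up wide dataset. The column names (id, age, sbp_visit1, sbp_visit2) are hypothetical and are not the part 6 data.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per person, one column per visit
wide <- tibble(
  id         = c(1, 2),
  age        = c(34, 51),
  sbp_visit1 = c(120, 135),
  sbp_visit2 = c(118, 140)
)

# Pivot only the sbp_ columns; id and age are left alone
long <- wide %>%
  pivot_longer(
    cols         = starts_with("sbp_"),
    names_to     = "visit",
    names_prefix = "sbp_",
    values_to    = "sbp"
  )

long
```

Each person now has one row per visit, and their age is repeated on every one of those rows.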

Trying to visualize the joins and pivot longer/wider

I recommend trying them out with small datasets where you can actually see what is happening.
Joins: Our BERD workshop slides have another example that might help you visualize joins.
- Slide 18 shows two datasets x and y, and what the resulting joins look like.
- Slide 19 shows Venn diagrams of how the different joins behave.
Pivoting: There’s an example with a very small dataset in my (supplemental) notes from BSTA 511. The graphic that goes along with this is on Slide 28 from the pdf.
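As one way to try this out, here is a sketch with two tiny made-up tables named x and y. They are not necessarily the same as the ones on the slides.

```r
library(dplyr)

# Two small hypothetical tables that share an id column
x <- tibble(id = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), y_val = c("y1", "y2", "y4"))

# inner_join() keeps only the ids found in both tables (1 and 2)
inner <- inner_join(x, y, by = "id")
inner

# left_join() keeps every row of x; id 3 has no match, so y_val is NA
left <- left_join(x, y, by = "id")
left
```

With tables this small you can check every row of the result by eye, which makes it easier to see what each join keeps and drops.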

pivot_longer makes plotting more understandable in an analysis sense, which situations would call for pivot_wider?

I tend to use pivot_longer() much more frequently. However, there are times when pivot_wider() comes in handy. For example, below is a long table of summary statistics created with group_by() and summarize(). I would use pivot_wider() to reshape this table so that I have columns comparing species or columns comparing islands.

penguins %>% 
  group_by(species, island) %>%
  summarize(across(.cols = bill_length_mm, 
                   .fns = list(
                     mean = ~ mean(.x, na.rm = TRUE),
                     sd = ~ sd(.x, na.rm = TRUE)
                     )))

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

# A tibble: 5 × 4
# Groups:   species [3]
  species   island    bill_length_mm_mean bill_length_mm_sd
  <fct>     <fct>                   <dbl>             <dbl>
1 Adelie    Biscoe                   39.0              2.48
2 Adelie    Dream                    38.5              2.47
3 Adelie    Torgersen                39.0              3.03
4 Chinstrap Dream                    48.8              3.34
5 Gentoo    Biscoe                   47.5              3.08
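As a sketch of that reshaping, the code below keeps only the mean and spreads the islands into columns. The column name mean_bill and the object name wide_means are my own choices for illustration.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

wide_means <- penguins %>%
  group_by(species, island) %>%
  summarize(mean_bill = mean(bill_length_mm, na.rm = TRUE),
            .groups = "drop") %>%
  # one row per species, one column per island
  pivot_wider(names_from = island, values_from = mean_bill)

wide_means
```

Species and island combinations that do not appear in the data (such as Chinstrap penguins on Biscoe) show up as NA in the wide table.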

How to use arguments of pivot longer.

The arguments of the pivot functions take some practice to get used to. I sometimes still pull up an example to remind me what I need to specify for the various arguments, such as the one mentioned above that I have used in workshops and classes.
We have not covered all the different arguments, and I recommend reviewing the help file and in particular the examples at the end of the page.

`gt::gt()`

The gt::gt package does make the tables look fancier, how do we add labels to those to have them look nice as well?

I highly recommend the gt webpage to learn more about all the different options to create pretty tables. Note the tabs at the top of the page for “Get started” and “Reference.”
See also section 3 of part 6 on “Side note about gt::gt()” for more on creating pretty tables.
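As a starting point, here is a minimal sketch that adds a title and nicer column labels using tab_header(), cols_label(), and fmt_number() from the gt package. The title text, labels, and object names are made up for illustration.

```r
library(dplyr)
library(gt)
library(palmerpenguins)

bill_table <- penguins %>%
  group_by(species) %>%
  summarize(mean_bill = mean(bill_length_mm, na.rm = TRUE)) %>%
  gt() %>%
  # add a title above the table
  tab_header(title = "Bill length by species") %>%
  # replace the variable names with readable column labels
  cols_label(
    species   = "Species",
    mean_bill = "Mean bill length (mm)"
  ) %>%
  # round the means to one decimal place
  fmt_number(columns = mean_bill, decimals = 1)

bill_table
```

Each gt function takes the table as its first argument, so the options can be piped one after another just like dplyr verbs.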

`here::here`

would also love more examples of here() I am starting to understand it better but still am a little confused

I am still having trouble getting here() to work consistently. I was going to ask during class, but I think I am just not understanding how to manually nest my files correctly so that “here” works. I am struggling to get that set up correct, and thus, struggling to use it.

We’ll have some more examples in class, but I recommend reaching out to one of us (instructors or TA) to help you troubleshoot here::here.
Here are also some resources that might help
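In the meantime, here is a minimal sketch of how here() builds file paths. The folder name data and the file name penguins.csv are hypothetical, so substitute the names used in your own project.

```r
library(here)

# here() with no arguments returns the folder that here treats as the
# project root (typically the folder containing your .Rproj file)
here()

# Build a path to a file inside the project by listing each folder
# and then the file name, separated by commas
data_path <- here("data", "penguins.csv")
data_path

# The path can then be passed to a function that reads the file, e.g.
# readr::read_csv(data_path)
```

If the path printed by here() is not your project folder, that is usually the source of the trouble: open the project through its .Rproj file and check again.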

Clearest points

group_by() function (n=3)
summarize() (n=2)
across() (n=1)
case_when() (n=1)
drop_na() (n=2)
Joining tables (n=6)