Week 8

Lists and Functions
Published

February 28, 2024

Modified

March 13, 2024

Announcements

Reminder to fill out post-class surveys

  • This is a reminder that 5% of your grade is based on filling out post-class surveys as a way of telling us that you came to class and engaged with the material for that week.
  • You only need to fill out 5 surveys (of the 10 class sessions) for the full 5%. We encourage you to fill out as many surveys as possible to provide feedback on the class though.
  • Please fill out surveys by 8 pm on Sunday evenings to guarantee that they will be counted. We usually download them some time on Sunday evening or Monday. If you turn it in before we download the responses, it will get counted.

See syllabus section on Post-class surveys

Topics

Part 7

  • Writing functions
  • lists()
  • For Loops
  • Iteration

Class materials

Post-class survey

Homework

  • See OneDrive folder for homework assignment.
  • HW 8 due on 3/6. Assignment is in the part 7 folder.

Recording

  • In-class recording links are on Sakai. Navigate to Course Materials -> Schedule with links to in-class recordings. Note that the password to the recordings is at the top of the page.

Muddiest points from Week 8

When loading a dataset, what does mean?

This occurs when you use the data() function to load a data set from a package. Per the help on this function (?data):

data() was originally intended to allow users to load datasets from packages for use in their examples, and as such it loaded the datasets into the workspace .GlobalEnv. This avoided having large datasets in memory when not in use: that need has been almost entirely superseded by lazy-loading of datasets.

data("iris")  # this doesn't actually load the data set, but makes it available for use
head(iris)    # Once it's used it will appear in the Environment as an object.
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Challenge # 2

This was where we created a function to load 3 data sets, clean them, and convert them to long format. These are tasks we’ve seen in previous classes. In this challenge, the main takeaway was to see the DRY (Don’t repeat yourself) concept at play. Instead of writing the code 3 times for each data set, we can create a function where we only write the cleaning code once, then use that function 3 times.

Reviewing the challenge solutions and taking more time to work through it on your own is a good idea. We went through it pretty quick in class. You won’t usually be limited on time to get a function like that to work. In practice, if it’s taking more time and too complicated, then it’s fine to duplicate code so that you know it’s working correctly. But, with very repetitive tasks, functions can make your code less prone to errors from copying and pasting.

If you have trouble getting your code to work for the challenges, office hours are great for helping to debug code. Else, sharing full code in an email or on Slack.

purrr::pluck

There was a specific question:

purrr::pluck seems really useful, I wonder if you can tell it to pluck a specific record_ID?

The short answer is no. Not a specific ID. But if you know the position of the specific ID then you could.

# Load packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Create sample data
df <- tibble::tibble(
  id = c("0001", "0002", "0003", "0004", "0005", "0006", "0007", "0008", "0009", "0010"), 
  sex = sample(x = c("M", "F"), size = 10, replace = TRUE), 
  age = sample(x = 18:65, size = 10, replace = TRUE)
  
)

df
# A tibble: 10 × 3
   id    sex     age
   <chr> <chr> <int>
 1 0001  F        39
 2 0002  M        32
 3 0003  F        24
 4 0004  F        29
 5 0005  F        31
 6 0006  M        56
 7 0007  F        25
 8 0008  F        56
 9 0009  F        43
10 0010  F        52
# Say we want to extract ID 0003.

# With purrr::pluck we need to know that it's in the 3rd row of the ID column

purrr::pluck(df, 
             "id", 
             3)
[1] "0003"
# Gives an error
purrr::pluck(df, 
             "id", 
             "0003")
NULL
# More than likely in this scenario, you would use a filter:
df |> 
  dplyr::filter(id == "0003")
# A tibble: 1 × 3
  id    sex     age
  <chr> <chr> <int>
1 0003  F        24

purrr::pluck was created to work with deeply nested data structures. Not necessarily data frames; there’s probably a more appropriate function out there for the task.

Lists – general confusion

  • What do we do with lists?
  • Using lists

We will get to work more with lists in Week 9 and get more opportunities to see how they are used.

Lists are more flexible and have the ability to handle various data structures which make them a powerful tool for organizing, manipulating, and representing complex data in R.

Lists – One bracket versus two brackets.

One bracket [ ] and two brackets [[ ]] serves different purposes, primarily when accessing elements in a data structure like vectors, lists, or data frames.

One Bracket [ ]:

Vectors:

When used with a single bracket, you can use it to subset or extract elements from a vector.

# Example with a vector
my_vector <- c(1, 2, 3, 4, 5)
my_vector[3]  # Extracts the element at index 3
[1] 3
Data Frames:

When used with a data frame, it can be used to extract columns or rows.

# Example with a data frame

df <- tibble::tibble(
  name = c("Alice", "Bob", "Charlie"), 
  age = c(25, 30, 22)
  )

# Extract the age column
df["age"]
# A tibble: 3 × 1
    age
  <dbl>
1    25
2    30
3    22

Two Brackets [[ ]]:

Lists:

When working with lists, double brackets are used to extract elements from the list. The result is the actual element, not a list containing the element.

# Example with a list
my_list <- list(1, 
                c(2, 3), 
                "four")

my_list[[2]]  # Extracts the second element (a vector) from the list
[1] 2 3

Compare to using []

my_list[2]
[[1]]
[1] 2 3

[[]] returned the vector contained in that slot. [] returned a list containing the vector.

Nested Data Structures:

For accessing elements in nested data structures like lists within lists.

# Example with a nested list
nested_list <- list(first = list(a = 1, b = 2), 
                    second = list(c = 3, d = 4))

nested_list
$first
$first$a
[1] 1

$first$b
[1] 2


$second
$second$c
[1] 3

$second$d
[1] 4
nested_list[[1]] # Extract the list contained in the first slot
$a
[1] 1

$b
[1] 2
nested_list[[1]][["b"]]  # Extracts the value associated with "b" in the first list
[1] 2

In summary, one bracket [ ] is used for general subsetting, whether it’s extracting elements from vectors, columns from data frames, or specific elements from lists. On the other hand, two brackets [[ ]] are specifically used for extracting elements from lists and accessing elements in nested structures.

How and when to use curly curly within a function

{{ }} will be covered in upcoming class lectures. We talked about it in Week 8 as a quick aside because a specific question came up. Not much detail was given intentionally as it is a separate topic for another day.