Week 9

Lists and Functions (continued)

Published

March 6, 2024

Modified

March 20, 2024

Announcements

Reminder to fill out post-class surveys

This is a reminder that 5% of your grade is based on filling out post-class surveys as a way of telling us that you came to class and engaged with the material for that week.
You only need to fill out 5 surveys (of the 10 class sessions) for the full 5%. We encourage you to fill out as many surveys as possible to provide feedback on the class though.
Please fill out surveys by 8 pm on Sunday evenings to guarantee that they will be counted. We usually download the responses on Sunday evening or Monday, and any survey submitted before we download them will be counted.

See syllabus section on Post-class surveys

Topics

Part 7

Writing functions
lists()
For Loops
Iteration

Class materials

Readings are still the same as for Week 8, and we are using the part_07 material on OneDrive.

Week 8 Readings
OneDrive part_07 Project folders

Post-class survey

Please fill out the post-class survey to provide feedback. Thank you!

Homework

See OneDrive folder for homework assignment.
HW 9 due on 3/13. Assignment is in the part 7 folder.

Recording

In-class recording links are on Sakai. Navigate to Course Materials -> Schedule with links to in-class recordings. Note that the password to the recordings is at the top of the page.

Muddiest points from Week 9

See Week 8 page for Week 8 feedback.

Matrices

Not entirely sure how to read or make sense of matrices yet (maybe I should have paid more attention in algebra), like when we saw the structure of a matrix here in the class script: str(output_model$coefficients)

In R, matrices are two-dimensional data structures that store elements of a single data type. They are similar to vectors but have two dimensions (rows and columns). They are widely used in statistical and mathematical operations, which makes them a fundamental data structure in R.
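One way to see the "same data type" rule in action is to mix types and check what R does with them. This is a minimal sketch with made-up values; the object names `mixed` and `nums` are just for illustration.

```r
# Mixing types in one matrix: R coerces everything to a single common type
mixed <- matrix(c(1, "a", TRUE, 2), nrow = 2)
typeof(mixed)    # "character"

# A numeric matrix: dim() reports the number of rows, then columns
nums <- matrix(1:6, nrow = 2)
dim(nums)        # 2 3
is.matrix(nums)  # TRUE
```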

Basic way to create matrices

# Create a matrix with values filled column-wise
(mat1 <- matrix(1:6, nrow = 2, ncol = 3, byrow = FALSE))

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

# Create a matrix with values filled row-wise
(mat2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE))

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Accessing elements of a matrix

# Accessing individual elements
element <- mat1[1, 2]  # Row 1, Column 2
element

[1] 3

# Accessing entire row or column
row_vector <- mat1[1, ]  # Entire first row
row_vector

[1] 1 3 5

col_vector <- mat1[, 2]  # Entire second column
col_vector

[1] 3 4
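Note that pulling out a single row or column gives back a plain vector, not a matrix. If you need to keep the two-dimensional shape, the `drop = FALSE` argument does that. A small sketch, using a fresh matrix `m` with the same values as `mat1`:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

# Default: a single row is simplified to a plain vector
row_as_vector <- m[1, ]
is.matrix(row_as_vector)  # FALSE

# drop = FALSE keeps the result as a 1 x 3 matrix
row_as_matrix <- m[1, , drop = FALSE]
dim(row_as_matrix)        # 1 3
```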

Convert to data.frame

as.data.frame(mat1)

  V1 V2 V3
1  1  3  5
2  2  4  6

library(tibble)
tibble::as_tibble(mat1)

Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.

# A tibble: 2 × 3
     V1    V2    V3
  <int> <int> <int>
1     1     3     5
2     2     4     6

# You can also name the columns (and the rows)

colnames(mat1) <- c("a", "b", "c")
mat1

     a b c
[1,] 1 3 5
[2,] 2 4 6

tibble::as_tibble(mat1)

# A tibble: 2 × 3
      a     b     c
  <int> <int> <int>
1     1     3     5
2     2     4     6
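To connect this back to the question about model coefficients: the sketch below assumes the class object was something like a model summary, and uses a hypothetical stand-in (`fit`, a simple regression on the built-in `mtcars` data) rather than the actual `output_model` from the class script. The coefficient table of a model summary is a named numeric matrix, so everything above about rows, columns, and indexing applies to it.

```r
# Hypothetical stand-in for the class model: a simple regression on mtcars
fit <- lm(mpg ~ wt, data = mtcars)

# The coefficient table from summary() is a numeric matrix
# with row names (terms) and column names (statistics)
coefs <- summary(fit)$coefficients
str(coefs)
dimnames(coefs)

# Index it like any other matrix, by name or by position
coefs["wt", "Estimate"]  # slope for wt
coefs[2, 1]              # same value, by position
```

Reading `str()` output for a matrix: `num [1:2, 1:4]` means a numeric matrix with 2 rows and 4 columns, and the `dimnames` attribute lists the row names and then the column names.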

`for()` loops

Still a little confused about the for() loops…

For loops are a staple in programming languages, not just R. They are used when we want to repeat the same operation (or a set of operations) several times.

The basic syntax in R looks like:

for (variable in sequence) {
  # Statements to be executed for each iteration
}

Here’s a breakdown of the components:

variable: This is a loop variable that takes on each value in the specified sequence during each iteration of the loop.
sequence: This is the sequence of values over which the loop iterates. It can be a vector, list, or any other iterable object.
Loop Body: The statements enclosed within the curly braces {} constitute the body of the loop. These statements are executed for each iteration of the loop.
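The sequence does not have to be a run of numbers. Here is a small sketch (with made-up objects `fruits` and `items`) that loops over a character vector directly, and over a list by position:

```r
# Looping over a character vector: the loop variable takes each value in turn
fruits <- c("apple", "banana", "cherry")
for (fruit in fruits) {
  print(toupper(fruit))
}

# Looping over a list by position: each element can have a different type and length
items <- list(numbers = 1:3, word = "hello", flag = TRUE)
item_lengths <- integer(length(items))  # pre-allocated vector for the results
for (i in seq_along(items)) {
  item_lengths[i] <- length(items[[i]])
}
item_lengths  # 3 1 1
```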

Basic example

for (i in 1:5) {
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

First iteration manually:

i <- 1
print(i)

[1] 1

Second iteration manually:

i <- 2
print(i)

[1] 2

Etc.

Adapted from 1st edition of R for Data Science

Here’s a tibble for an example

pacman::p_load(tidyverse)

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df

# A tibble: 10 × 4
        a       b      c      d
    <dbl>   <dbl>  <dbl>  <dbl>
 1 -0.462 -1.51    0.528 -0.802
 2 -0.278  0.225   0.523 -0.625
 3  0.432  1.22    1.07   1.44 
 4 -1.67   1.20   -0.203  1.63 
 5 -0.478 -0.814   0.336 -1.87 
 6 -0.466  0.529   1.74   0.984
 7 -0.934 -0.0985  0.668  1.08 
 8 -0.600 -1.98    0.674 -0.469
 9 -0.820 -0.316  -1.38   1.59 
10 -1.71   1.25   -1.26   1.33

Using a copy and paste method to calculate the median of each column would look something like this:

median(df$a)

[1] -0.5388446

median(df$b)

[1] 0.06331579

median(df$c)

[1] 0.5253658

median(df$d)

[1] 1.030272

But this breaks the rule of DRY (“Don’t repeat yourself”)

output <- vector("double", ncol(df))  # pre-allocated vector to store the results of the for loop

for (i in seq_along(df)) {
  
  output[i] <- median(df[[i]])
  
}

output

[1] -0.53884460  0.06331579  0.52536580  1.03027185

For loops in R are commonly used when you know the number of iterations in advance or when you need to iterate over a specific sequence of values. While for loops are useful, R also provides other ways to perform iteration, such as using vectorized operations (example below) and functions from the apply family (not covered). It’s often recommended to explore these alternatives when working with R for better code efficiency and readability.

# Creating two vectors
vector1 <- c(1, 2, 3, 4, 5)
vector2 <- c(6, 7, 8, 9, 10)

# Vectorized addition
result_addition <- vector1 + vector2
result_addition

[1]  7  9 11 13 15

# With a for loop
result_addition_for_loop <- numeric(length(vector1))  # pre-allocated vector for the results

for (i in seq_along(vector1)) {
  
  result_addition_for_loop[i] <- vector1[i] + vector2[i]
  
}

result_addition_for_loop

[1]  7  9 11 13 15
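For a taste of the apply family mentioned above, here is a minimal sketch that computes column medians with a for loop and with `vapply()`. It uses its own small data frame (`example_df`, with an arbitrary seed) so it can be run on its own; it is one possible alternative, not the only one.

```r
set.seed(1)  # arbitrary seed so the example is reproducible
example_df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# for loop version, with the output vector pre-allocated
loop_medians <- numeric(ncol(example_df))
for (i in seq_along(example_df)) {
  loop_medians[i] <- median(example_df[[i]])
}

# vapply version: apply median() to each column and
# require exactly one number back per column
apply_medians <- vapply(example_df, median, FUN.VALUE = numeric(1))

loop_medians
apply_medians  # same values, with the column names attached
```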

`na.rm` vs `na.omit`

Is there a difference between na.rm and na.omit?

Yes, there is a difference. In R, they are used in different contexts.

na.rm (Remove)

na.rm is an argument found in various functions (e.g. mean(), sum(), etc.) that allows you to specify whether missing values (NA or NaN) should be removed before performing the calculation.

From the help for mean() (?mean): a logical evaluating to TRUE or FALSE indicating whether NA values should be stripped before the computation proceeds.

# A vector with NA values
values_with_na <- c(1, 2, 3, NA, 5)

mean(values_with_na, na.rm = FALSE)  # Result will be NA

[1] NA

# Excluding NA values
mean(values_with_na, na.rm = TRUE)  # Result will be (1+2+3+5)/4 = 2.75

[1] 2.75

na.omit (Omit missing)

na.omit is a function that can be used to remove rows with missing values (NA) from a data frame or matrix.

# Creating a data frame with NA values
df <- data.frame(A = c(1, 2, NA, 4), B = c(5, NA, 7, 8))

# NAs in the columns of the data frame
df

# Using na.omit to remove rows with NA values
df |> 
  na.omit()

  A B
1 1 5
4 4 8
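To see the two side by side, here is a small sketch (the objects `x` and `example_df` are made up for illustration). On a single vector both approaches give the same answer, but on a data frame `na.omit()` drops entire rows, which can discard non-missing values in other columns.

```r
x <- c(1, 2, 3, NA, 5)

# On a single vector, the two approaches give the same mean
mean(x, na.rm = TRUE)  # 2.75
mean(na.omit(x))       # 2.75

# On a data frame, na.omit() drops whole rows,
# so non-missing values in other columns are lost too
example_df <- data.frame(A = c(1, 2, NA, 4), B = c(5, NA, 7, 8))
mean(example_df$B, na.rm = TRUE)  # uses 5, 7, 8
mean(na.omit(example_df)$B)       # uses only 5, 8
```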

`purrr::map()`

I am still a little foggy on the formatting of purrr::map and how to utilize it effectively.

The purrr::map function is used to apply a specified function to each element of a list or vector, returning the results in a new list.

Basic Syntax:

purrr::map(.x, .f, ...)

.x: The input list or vector.
.f: The function to apply to each element of .x.
...: Additional arguments passed to the function specified in .f.

Key Features:

Consistent Output:
- map returns a list, ensuring a consistent output format regardless of the input structure.
Function Application:
- The primary purpose is to apply a specified function to each element of the input .x.
Formula Interface:
- Supports a formula interface (~) for concise function specifications.
purrr::map(.x, ~ some_function(.x))  # some_function is a placeholder for the function you want to apply

Example:

# Sample list
my_list <- list(a = 1:3, 
                b = c(4, 5, 6), 
                c = rnorm(n = 3))

my_list

$a
[1] 1 2 3

$b
[1] 4 5 6

$c
[1]  0.246615980 -0.005590403 -0.151755392

# Using map to square each element in the list
squared_list <- purrr::map(.x = my_list, 
                           .f = ~ .x ^ 2)

squared_list

$a
[1] 1 4 9

$b
[1] 16 25 36

$c
[1] 0.0608194414 0.0000312526 0.0230296988

In this example, the map function applies the squaring function (~ .x ^ 2) to each element of the input list my_list. The resulting squared_list is a list where each element is the squared version of the corresponding element in my_list.

The purrr::map function is particularly useful when working with lists and helps to create cleaner and more readable code, especially in cases where you want to apply the same operation to each element of a collection.
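Two more patterns that often help, shown as a minimal sketch with a made-up list `scores`: passing extra arguments through `...`, and using `map_dbl()` when you want a numeric vector back instead of a list. This assumes the purrr package is installed.

```r
library(purrr)

scores <- list(a = c(1, 2, NA), b = c(4, 5, 6))

# Extra arguments after .f are passed along to it
# (here, na.rm = TRUE goes to mean())
map(scores, mean, na.rm = TRUE)

# The same call written with the formula shorthand
map(scores, ~ mean(.x, na.rm = TRUE))

# map_dbl() returns a named numeric vector instead of a list
means <- map_dbl(scores, mean, na.rm = TRUE)
means
```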

General references

Is there a good dictionary type document with “R language” or very basic function descriptions? … find it difficult to know what functions I need because it is hard to recall their name or confuse it with a different function.

R Documentation (Built-in Help): R itself provides built-in documentation that you can access using the help() function or the ? operator. For example, to get help on the mean() function, you can type help(mean) or ?mean in the R console.
R Manuals and Guides: The official R documentation, including manuals and guides, is available on the R Project website: R Manuals.
R Packages Documentation: Many R packages come with detailed documentation. You can find documentation for a specific package by visiting the CRAN website (Comprehensive R Archive Network) and searching for the package of interest.
Online Resources: Websites like RDocumentation provide a searchable database of R functions along with their documentation. You can search for a specific function and find details on its usage and parameters.
RStudio cheatsheets
Base R cheatsheet
R: A Language and Environment for Statistical Computing: Reference Index
CRAN Task Views
Part 3 section on getting help with errors.
Books like “R for Data Science” by Hadley Wickham and co-authors
When you use a function or learn to use it, make notes to yourself using Google Doc or OneNote or something similar.
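If you half-remember a function's name, the built-in tools can help you search for it. A small sketch using `median` as the example word:

```r
# ?median or help(median) opens the help page (run these interactively)

# Find functions whose names contain a word you half-remember
matches <- apropos("median")
matches

# Peek at a function's arguments without opening the full help page
args(median)
```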