Week 10
Announcements
- Grades for midterm projects will be released in 1-2 days.
- You will be notified by Sakai via your OHSU email address when grades are released.
- Please review the comments and let us know if you have any questions.
- Please fill out the course evaluations for the class.
- Course evaluations are very helpful for making improvements to our classes.
- If too few students fill out the evaluations, they are not released to us.
- You might have two evaluations since there are two instructors.
Reminder to fill out post-class surveys
- This is a reminder that 5% of your grade is based on filling out post-class surveys as a way of telling us that you came to class and engaged with the material for that week.
- You only need to fill out 5 surveys (of the 10 class sessions) for the full 5%. We encourage you to fill out as many surveys as possible to provide feedback on the class though.
- Please fill out surveys by 8 pm on Sunday evenings to guarantee that they will be counted. We usually download them some time on Sunday evening or Monday. If you turn it in before we download the responses, it will get counted.
Topics
Part 8
- Introduce simple statistical modelling
- Learn about the useful broom package
- More iteration with purrr
- Splitting up data for iteration
- Other useful purrr
Class materials
Readings are linked below, and we are using the part_08 material on OneDrive.
- Week 10 Readings
- OneDrive part_08 Project folders
Post-class survey
- Please fill out the post-class survey to provide feedback. Thank you!
Homework
- See OneDrive folder for homework assignment.
- HW 10 due on 3/20. Assignment is in the part 8 folder.
Recording
- In-class recording links are on Sakai. Navigate to Course Materials -> Schedule with links to in-class recordings. Note that the password to the recordings is at the top of the page.
Muddiest points from Week 10
- See Week 9 page for Week 9 feedback.
Confusion on details of purrr::map()
purrr::map() applies a function to each element of a vector or list and returns a new list where each element is the result of applying that function to the corresponding element of the original vector or list.
- .x: the vector or list that you operate on.
- .f: the function you want to apply to each element of the input vector or list. This function can be a built-in R function, a user-defined function, or an anonymous function defined on the fly.
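As a rough illustration of the kinds of function that .f accepts, the sketch below passes a built-in function, a user-defined function, and an anonymous function to map(). The names values and add_ten are made up for this example and are not part of the class materials.

```r
library(purrr)

# A small made-up list to iterate over
values <- list(1, 4, 9)

# 1. A built-in R function, passed by name
roots <- map(.x = values, .f = sqrt)

# 2. A user-defined function (add_ten is a hypothetical helper)
add_ten <- function(x) {
  x + 10
}
plus_ten <- map(.x = values, .f = add_ten)

# 3. An anonymous function defined on the fly
doubled <- map(.x = values, .f = function(x) x * 2)

# The same anonymous function in purrr's formula shorthand,
# where .x stands for the current element
doubled_formula <- map(.x = values, .f = ~ .x * 2)
```

In every case map() returns a list with one result per element of values.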
Simple example
library(tidyverse)
# Example list
numbers <- list(1, 2, 3, 4, 5)
numbers
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 5
# Using map to square each element of the list
squared_numbers <- purrr::map(.x = numbers,
                              .f = ~ .x ^ 2)
In this example:
- numbers is a list containing numbers from 1 to 5.
- ~ .x ^ 2 is purrr's formula shorthand for an anonymous function that squares its input, where .x stands for the current element.
- map() applies this anonymous function to each element of the numbers list, resulting in a new list where each element is the square of the corresponding element in the original list.
After executing this code, the squared_numbers variable will contain the squared values of the original list:
squared_numbers
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
Example with a list of data frames
Suppose we have a list of data frames where each data frame holds the monthly sales data for one product. We want to calculate the total sales for each product, that is, one total per data frame in the list.
# Sample list of data frames
sales_data <- list(
product1 = data.frame(month = 1:3, sales = c(100, 150, 200)),
product2 = data.frame(month = 1:3, sales = c(120, 180, 220)),
product3 = data.frame(month = 1:3, sales = c(90, 130, 170))
)
sales_data
$product1
month sales
1 1 100
2 2 150
3 3 200
$product2
month sales
1 1 120
2 2 180
3 3 220
$product3
month sales
1 1 90
2 2 130
3 3 170
Create a function and apply it to each slot in the sales_data list:
# Function to calculate total sales for each data frame
calculate_total_sales <- function(df) {
total_sales <- sum(df$sales)
return(total_sales)
}
# Applying the function to each data frame in the list
total_sales_per_product <- purrr::map(.x = sales_data,
                                      .f = calculate_total_sales)
In this example:
- sales_data is a list containing three data frames, each representing the sales data for a different product.
- calculate_total_sales() is a function that takes a data frame as input and calculates the total sales for that product.
- map() applies the calculate_total_sales() function to each data frame in the sales_data list, resulting in a new list total_sales_per_product, where each element is the total sales for a specific product across all months.
After executing this code, the total_sales_per_product variable will contain the total sales for each product:
total_sales_per_product
$product1
[1] 450
$product2
[1] 520
$product3
[1] 390
So, total_sales_per_product is a named list where each element is the total sales for one product, summed over all the months in that product's data frame. The names (product1, product2, product3) are carried over from the original list.
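If a plain numeric vector is more convenient than a list, one possible variation is purrr::map_dbl(), which works like map() but requires each result to be a single number. The sketch below rebuilds the same sales_data list so that it runs on its own; the name total_sales_vec is made up for this example.

```r
library(purrr)

# Same sample data as above, repeated so this block is self-contained
sales_data <- list(
  product1 = data.frame(month = 1:3, sales = c(100, 150, 200)),
  product2 = data.frame(month = 1:3, sales = c(120, 180, 220)),
  product3 = data.frame(month = 1:3, sales = c(90, 130, 170))
)

# map_dbl() returns a named numeric vector instead of a list
total_sales_vec <- map_dbl(.x = sales_data, .f = ~ sum(.x$sales))
total_sales_vec
```

A vector like this is often easier to pass on to functions such as max() or sort() than a list of single numbers.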
purrr::reduce()
How does it compare to purrr::map()?
The big difference between map() and reduce() has to do with what each returns:
- map() usually returns a list or data structure with the same number of elements as its input.
- The goal of reduce() is to take a list of items and return a single object.
See the purrr cheatsheet.
Simple example
# Example vector
numbers <- c(1, 2, 3, 4, 5)
numbers
[1] 1 2 3 4 5
# Using reduce to add up all the elements, one step at a time
cumulative_sum <- purrr::reduce(.x = numbers,
                                .f = `+`)
In this example:
- numbers is the vector we want to operate on.
- The function `+` is used as the operation to perform at each step of the reduction, which in this case is addition.
- reduce() starts by adding the first two elements (1 and 2), then adds the result to the third element (3), and so on, until all elements have been processed.
After executing this code, the cumulative_sum variable will contain a single value, the sum of all the numbers:
cumulative_sum
[1] 15
The steps are as follows:
(cum_numbers <- numbers[1])
[1] 1
(cum_numbers <- cum_numbers + numbers[2])
[1] 3
(cum_numbers <- cum_numbers + numbers[3])
[1] 6
(cum_numbers <- cum_numbers + numbers[4])
[1] 10
(cum_numbers <- cum_numbers + numbers[5])
[1] 15
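The intermediate values in the steps above can also be seen with purrr::accumulate(), which applies the function the same way reduce() does but keeps every intermediate result. This is a small sketch for comparison; the names running_totals and final_total are made up for this example.

```r
library(purrr)

numbers <- c(1, 2, 3, 4, 5)

# accumulate() keeps each intermediate result: 1, 1+2, 1+2+3, ...
running_totals <- accumulate(.x = numbers, .f = `+`)

# reduce() keeps only the last of those results
final_total <- reduce(.x = numbers, .f = `+`)
```

Here running_totals has the same length as numbers, while final_total is a single number equal to the last element of running_totals.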
With data frames
Using our sales data list from above:
sales_data
$product1
month sales
1 1 100
2 2 150
3 3 200
$product2
month sales
1 1 120
2 2 180
3 3 220
$product3
month sales
1 1 90
2 2 130
3 3 170
We can combine the data sets in the list with reduce() and bind_rows():
# Using an anonymous function; note that reduce() calls .f with 2 arguments.
combined_sales_data <- purrr::reduce(.x = sales_data,
                                     .f = function(x, y) bind_rows(x, y))
# Using a named function
combined_sales_data <- purrr::reduce(.x = sales_data,
                                     .f = dplyr::bind_rows)
In this example:
- We use an anonymous function within reduce() that takes two arguments x and y, representing the accumulated result and the next element in the list, respectively.
- Inside the anonymous function, we use bind_rows() to combine the accumulated result x with the next element y, effectively stacking them on top of each other.
- reduce() applies this anonymous function iteratively to the list of data frames, resulting in a single data frame combined_sales_data that contains the combined sales data for all products. Passing dplyr::bind_rows directly as a named function gives the same result.
combined_sales_data
month sales
1 1 100
2 2 150
3 3 200
4 1 120
5 2 180
6 3 220
7 1 90
8 2 130
9 3 170
Doing this in steps:
(cum_sales_data <- dplyr::bind_rows(sales_data[[1]]))
month sales
1 1 100
2 2 150
3 3 200
(cum_sales_data <- dplyr::bind_rows(cum_sales_data,
                                    sales_data[[2]]))
month sales
1 1 100
2 2 150
3 3 200
4 1 120
5 2 180
6 3 220
(cum_sales_data <- dplyr::bind_rows(cum_sales_data,
                                    sales_data[[3]]))
month sales
1 1 100
2 2 150
3 3 200
4 1 120
5 2 180
6 3 220
7 1 90
8 2 130
9 3 170
Examples of reduce
The list.files() function
The list.files() function is used to obtain a character vector of file names in a specified directory. Here’s a breakdown of how it works and its common parameters:
- Directory Path: The primary argument of list.files() is path, the directory you want to list files from. If not specified, it defaults to the current working directory.
- Pattern Matching: pattern is an optional argument, written as a regular expression, that allows you to specify a pattern for file names. Only file names matching this pattern will be returned. This can be useful for filtering specific types of files.
- Recursive Listing: If recursive = TRUE, the function will list files recursively, i.e., it will include files from subdirectories as well. By default, recursive is set to FALSE.
- Full Names: The full.names argument controls whether the returned file names have the directory path prepended (if TRUE) or are just the file names (if FALSE, the default).
- Character Encoding: list.files() does not have an encoding argument. How file names with non-ASCII characters are handled depends on the operating system and locale, which can matter on Windows systems.
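To see how these arguments change the result, here is a small sketch that builds a throwaway folder inside R's temporary directory and lists it in several ways. The folder and file names (demo_dir, a.csv, notes.txt, and so on) are made up for this example.

```r
# Build a small throwaway directory so the example is self-contained
demo_dir <- tempfile("list_files_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE)
file.create(file.path(demo_dir, c("a.csv", "b.csv", "notes.txt")))
file.create(file.path(demo_dir, "sub", "c.csv"))

# Default: names only, top level only (the folder "sub" is listed too)
top_level <- list.files(path = demo_dir)

# pattern is a regular expression; "\\.csv$" keeps names ending in .csv
csv_only <- list.files(path = demo_dir, pattern = "\\.csv$")

# recursive = TRUE also looks inside subdirectories
csv_recursive <- list.files(path = demo_dir, pattern = "\\.csv$",
                            recursive = TRUE)

# full.names = TRUE prepends the directory path to each file name
csv_full <- list.files(path = demo_dir, pattern = "\\.csv$",
                       full.names = TRUE)
```

Paths returned with full.names = TRUE can be passed straight to a file-reading function without changing the working directory.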
Here’s a simple example demonstrating the basic usage of list.files():
# List files in the current directory
files <- list.files()
# Print the file names
print(files)
 [1] "_extensions" "_quarto.yml" "about.qmd"
[4] "BSTA_526_W24.Rproj" "data" "docs"
[7] "function_week" "function_week.qmd" "images"
[10] "index.qmd" "readings" "readings.qmd"
[13] "resources" "schedule.qmd" "styles.css"
[16] "syllabus.qmd" "weeks" "weeks.qmd"
This will print the names of all files and folders in the current working directory.
# List CSV files in a specific directory
csv_files <- list.files(path = "path/to/directory", pattern = "\\.csv$")
# Print the CSV file names
print(csv_files)
character(0)
This prints the names of all CSV files in the specified directory. Here the result is character(0), an empty character vector, because "path/to/directory" is only a placeholder and no matching files were found.
Overall, list.files() is a handy function for obtaining file names within a directory. Its arguments let you tailor the result, for example by filtering with a pattern, listing subdirectories recursively, or returning names with the directory path attached.
NOTE: You need to pay attention to your working directory and your relative file paths. See Week 2 or 3 (?) for the here package and the discussion about file paths. It is best to always use R Projects and the here package.
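Putting the pieces from this page together, the sketch below shows one way list.files(), map(), and reduce() might be combined to read several CSV files and stack them into one data frame. It writes two small files to a temporary folder first so that it runs anywhere and does not depend on the working directory; the folder, the file names, and the object names are all made up for this example.

```r
library(purrr)
library(dplyr)

# Write two small CSV files to a temporary folder (made-up data)
csv_dir <- tempfile("sales_csvs")
dir.create(csv_dir)
write.csv(data.frame(month = 1:3, sales = c(100, 150, 200)),
          file.path(csv_dir, "product1.csv"), row.names = FALSE)
write.csv(data.frame(month = 1:3, sales = c(120, 180, 220)),
          file.path(csv_dir, "product2.csv"), row.names = FALSE)

# 1. Find the CSV files, keeping the directory path on each name
csv_paths <- list.files(path = csv_dir, pattern = "\\.csv$",
                        full.names = TRUE)

# 2. Read each file into a data frame: one list element per file
sales_list <- map(.x = csv_paths, .f = read.csv)

# 3. Stack the list of data frames into a single data frame
all_sales <- reduce(.x = sales_list, .f = bind_rows)
all_sales
```

In a real project the temporary folder would be replaced by a path inside your R Project.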