Week 5

Merging multiple datasets; cleaning & reshaping data
Published

February 5, 2026

Modified

February 25, 2026

Topics

Part 5:

  • Learn how to load comma-separated and tab-separated datasets using the readr package
  • Learn different techniques for cleaning data using functions from the tidyr, forcats, and stringr packages
    • Practice cleaning data with a real dataset
  • Learn and apply bind_rows() to combine rows from two or more datasets
  • Learn about ways to merge columns from different datasets
    • Apply inner_join() and left_join() to merge columns from different datasets
  • Learn about wide vs long data and how to reshape data
    • Apply pivot_longer() to make a wide dataset long
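
The combining and reshaping topics above can be sketched with a few toy tables. This is a minimal illustration only: the data and the object names (visit1, visit2, demographics, wide) are made up for this example and are not the class dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the same measurement collected in two batches
visit1 <- tibble(id = c(1, 2), weight = c(60.5, 72.0))
visit2 <- tibble(id = c(3, 4), weight = c(81.2, 55.3))

# bind_rows() stacks datasets on top of each other, matching columns by name
all_visits <- bind_rows(visit1, visit2)

# A second hypothetical table with a different set of columns
demographics <- tibble(id = c(1, 2, 3, 5), age = c(34, 51, 29, 47))

# inner_join() keeps only the ids that appear in both tables
inner <- inner_join(all_visits, demographics, by = "id")

# left_join() keeps every row of the left table; unmatched ids get NA for age
left <- left_join(all_visits, demographics, by = "id")

# A hypothetical wide table: one column per year
wide <- tibble(id = c(1, 2), y2024 = c(10, 20), y2025 = c(11, 21))

# pivot_longer() turns the year columns into one row per id and year
long <- pivot_longer(
  wide,
  cols = c(y2024, y2025),
  names_to = "year",
  values_to = "value"
)
```

Comparing nrow(inner) with nrow(left) is a quick way to see how many rows of the left table had no match in the right table.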

Announcements

  • Functions of the Week
    • Signup sheet for your functions of the week presentations. The file is in OneDrive in the functions_of_the_week folder.
    • Presentations will be during weeks 5-10. Please sign up so that there are no more than 4 presentations per week.
    • If there is a function you are interested in presenting that is not on the signup sheet, please check with me. If it hasn’t been covered before and isn’t covered in the class, then most likely I will approve it.
  • The Midterm is posted on OneDrive. It is due Sunday 2/22/26.
    • Please start early on this since finding a suitable dataset might take some time.
    • I encourage you to meet with me to discuss your research question and data, to make sure you are on the right track.
  • Cascadia R Conf is June 26-27 this year. It will be held at OHSU in RLSB. This is a great conference for meeting other R enthusiasts in the area and learning more about what they are working on.

Class materials

  • Class materials are in the OneDrive folder BSTA_526_W26_class_materials_public.
  • For today’s class, make sure to download the folder called part5 to your computer.
  • Open RStudio by double-clicking on the project file called BSTA_526_W26_class_materials_public.Rproj in the main OneDrive folder.
  • Part 5: OneDrive folder, Slides, Webpage

Readings

R4DS = R for Data Science (2e)

Required

  • R4DS book:
    • Modifying factor levels: Section 16.5
      • In particular, fct_collapse(). We will cover fct_recode() in part 6.
    • Separating into columns (separate_wider_delim()): Section 14.4.2
    • Making numbers (parse_number()): Section 13.2
    • Joins: Chapter 19
    • Lengthening and widening data (pivot_longer() and pivot_wider()): Sections 5.3 and 5.4
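
As a rough preview of the cleaning functions named in the readings above, here is a small sketch using fct_collapse(), separate_wider_delim(), and parse_number() together. The raw table and its column names are hypothetical, made up for illustration, and separate_wider_delim() assumes tidyr 1.3.0 or later.

```r
library(dplyr)
library(tidyr)
library(forcats)
library(readr)

# Hypothetical messy data, invented for this example
raw <- tibble(
  name_site = c("Ana-PDX", "Ben-SEA", "Cy-PDX"),
  income    = c("$45,000", "$1,200", "$980"),
  smoking   = c("never", "former", "current")
)

clean <- raw |>
  # Split one column into two at the "-" delimiter
  separate_wider_delim(name_site, delim = "-", names = c("name", "site")) |>
  mutate(
    # Drop the "$" and "," and convert the text to a number
    income = parse_number(income),
    # Merge the "former" and "current" levels into a single "ever" level
    smoking = fct_collapse(smoking, ever = c("former", "current"))
  )
```

Running levels(clean$smoking) afterward is a simple check that the collapse did what was intended.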

Optional

  • Feel like the cat that got the cream with {forcats} - a great read on the forcats package
    • Some of this is a review. New functions in part 5:
    • There are many other great forcats functions covered here, some of which will be presented in part 6
  • Pivoting vignette
    • This is a great supplement to the R4DS chapter 5 sections linked to in the required readings above. It has some advanced examples that we will not be covering in class but that come up frequently in practice.
  • Regular expressions: R4DS book Chapter 15
    • This goes into much more detail than we will be covering in BSTA 526. However, it’s a great resource for learning more about regular expressions and using them in R if you are interested.
    • In part 5, we will be covering just str_remove_all()
    • For now, I recommend at least a quick skim of this section so that you are aware of what we mean by “regular expressions” and how they can be used in data cleaning. Figuring out the details on how to use the more advanced examples can be postponed until you have need for them.
  • Are you ready to learn more about Quarto?
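
To give a sense of what the regular expressions item above means in data cleaning, here is a minimal sketch with str_remove_all(). The phone numbers are made up for illustration, and the pattern is just one possible choice.

```r
library(stringr)

# Hypothetical phone numbers typed in inconsistent formats
phones <- c("(503) 555-0100", "503.555.0101", "503 555 0102")

# "[^0-9]" is a regular expression matching any single character
# that is not a digit; str_remove_all() deletes every match
digits_only <- str_remove_all(phones, "[^0-9]")
```

After this step all three values share the same 10-digit format, which makes them easier to compare or join on.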

Post-class survey

  • Please fill out the post-class survey to provide feedback. Thank you!
  • Previous muddiest points and clearest points with responses are collected here.

Homework

  • See OneDrive folder for homework assignment.
  • HW 5 due on 02/05.

Recording

  • In-class recording links are on Sakai. Navigate to Course Materials -> Schedule with links to in-class recordings.