Today we will practice wrangling messy data to prepare them for analysis.
Before you get started:
Remember to save this notebook under a new name, such as part_06_b526_YOURNAME.qmd.
Load the packages in the setup code chunk
1.1 Learning Objectives
Practice thinking through and planning ahead data wrangling steps to prepare a dataset ready for analysis
Practice working with real data
Practice cleaning variables
Practice merging datasets with bind_rows and joining
Practice making data long by pivoting
Practice making complex visualizations with ggplot
2 “Real data”
Today’s goal: wrangle messy data into something we can work with.
2.1 Mouse data
Today we are going to be working with a subset of real data that Dr. Minnier has analyzed.
The actual dataset had hundreds of columns (dozens of outcome variables, a few biomarker values, hundreds of miRNA expression data, and lipidomics on a number of lipids).
The data we are using are a subset of the actual data with some random noise added to the values to protect the data privacy of the lab.
time point (1 month, 6 month, 12 month = age of mice)
two miRNA expression values (micro ribonucleic acid, or microRNA)
biomarker values from three different brain tissues
outcomes of interest
learning outcome (higher values are better)
preference for object 1 and object 2 (from behavioral and cognitive tests on mice) which is measured in % time spent with that object.
The format of the Excel sheet is pretty similar to what was originally given, other than removing some columns.
One big change is that in the original data, each time point was a different cohort of mice, but we are pretending that it is the same cohort of mice and we have longitudinal data (actually impossible due to the way the biomarkers are measured, but let’s just ignore that). This will allow you to practice more complex joins.
2.2 Challenge 1 - group work!
The data are in mouse_biomarker.xlsx (4 different sheets (tabs) with data), which is in the data folder. Open the data and look at the different sheets.
Talk about what challenges to importing the data you anticipate, just by looking at them in Excel. You are not yet loading the data in this step.
Look at the plot below. Based on the data in the Excel file, discuss what kind of data wrangling steps you would need to take to make this plot. These can be somewhat vague ideas about what kind of columns you’d need in your wrangled dataset, not necessarily specific steps. It might help to write out pseudo ggplot code, with elements (i.e. x axis, y axis, fill, facet) mapped to column names. You still haven’t loaded the data in this step!
Try to read in the four sheets into R, saving them as data frames called mouse_demo, mouse_tp1, mouse_tp2, and mouse_tp3, respectively. Use glimpse() to look at how the data were read in and determine whether you were successful in reading in the data how you intended to.
How would you combine all these data into one dataset? Write out the steps, including any data cleaning/wrangling steps you need to do.
Use .name_repair argument to clean the names somewhat and standardize the format
Include “/” as a missing value
Remove empty rows and columns (janitor::remove_empty)
Check column types and names
# .name_repair = make_clean_names is the same as applying clean_names() after reading in the data# Note this is the same code for each, just the sheet number is changing# sheet 1mouse_demo <-read_excel(here::here("part6", "data","mouse_biomarker.xlsx"), sheet =1,.name_repair = janitor::make_clean_names) %>%remove_empty(which =c("rows","cols"))# sheet 2mouse_tp1 <-read_excel(here::here("part6", "data","mouse_biomarker.xlsx"), sheet =2, na =c("","/"),n_max =34,.name_repair = janitor::make_clean_names) %>%remove_empty(which =c("rows","cols"))# sheet 3mouse_tp2 <-read_excel(here::here("part6", "data","mouse_biomarker.xlsx"), sheet =3, na =c("","/"),.name_repair = janitor::make_clean_names) %>%remove_empty(which =c("rows","cols"))# sheet 4mouse_tp3 <-read_excel(here::here("part6", "data","mouse_biomarker.xlsx"), sheet =4, na =c("","/"),.name_repair = janitor::make_clean_names) %>%remove_empty(which =c("rows","cols"))glimpse(mouse_demo)
So for our data, we use the time point to distinguish where each dataset came from:
# use the name of the dataset to create an id variable timemouse_tp <-bind_rows("tp1"= mouse_tp1, "tp2"= mouse_tp2,"tp3"= mouse_tp3,.id ="time")mouse_tp %>%glimpse()
# repalce _ng with _pg in the column names# note I tried replacing just "ng" with "pg", but then learning became learnipg!mouse_tp2 <- mouse_tp2 %>%rename_with(.fn =~str_replace(.x, "_ng", "_pg"))colnames(mouse_tp2)
mouse_tp <- mouse_tp %>%mutate(# set Obj 1 and 2 to NA, preference_obj1 =na_if(preference_obj1, "Obj 1"),preference_obj2 =na_if(preference_obj2, "Obj 2"),# then convert to numericpreference_obj1 =as.numeric(preference_obj1),preference_obj2 =as.numeric(preference_obj2) )glimpse(mouse_tp)
Those rows that used to contain “Obj 1” and “Obj 2” are now almost empty:
# when sid is NA, the rest of the data is NA# look at those rows:mouse_tp %>%filter(is.na(sid))
# A tibble: 3 × 14
time sid normalized_bdnf_amygdala_pg_mg normalized_bdnf_cortex_pg_mg
<chr> <dbl> <dbl> <dbl>
1 tp1 NA NA NA
2 tp2 NA NA NA
3 tp3 NA NA NA
# ℹ 10 more variables: normalized_bdnf_hypothalamus_pg_mg <dbl>,
# normalized_cd68_amygdala_pg_mg <dbl>, normalized_cd68_cortex_pg_mg <dbl>,
# normalized_cd68_hypothalamus_pg_mg <dbl>,
# normalized_map2_cortex_pg_mg <dbl>, mirna1 <dbl>, mirna2 <dbl>,
# learning_outcome <dbl>, preference_obj1 <dbl>, preference_obj2 <dbl>
4.6drop_na()
We want to remove the three rows that have missing data in almost all columns,
which is the same as removing rows with any missing data in sid (rows where sid = NA).
See Part5 notes and the drop_na() reference for examples.
# number of rows in complete datamouse_tp %>% nrow
[1] 99
# use tidyr::drop_na() to remove those# Note: remove_empty() from the `janitor` package won't remove them # because `time` is not empty# this removes rows were there is *any* missing data in *any* column# complete cases only, we do not want thismouse_tp %>%drop_na() %>% nrow
[1] 25
# this removes rows where there is missing data in sidmouse_tp %>%drop_na(sid) %>% nrow
[1] 96
# save our work, remove just missing sid valuesmouse_tp <- mouse_tp %>%drop_na(sid)
There are of course many ways to do this, but one way is:
Create a new column that is the sum of the two preference columns
When that sum is 0, set the two preference columns to NA
# First create the columnmouse_data <- mouse_data %>%mutate(total_pref = preference_obj1 + preference_obj2 ) # look at the rows where the sum is 0mouse_data %>%filter(total_pref==0) %>%select(sid, time, contains("pref"))
Below we use case_when() to mutate both columns, set to NA if total_pref == 0 and otherwise (.default) use the value of that column
We show two ways to do this, either:
one column at a time
or, across columns that start with “pref”
Mutate columns individually
# we can mutate one column at a time:mouse_data %>%mutate(preference_obj1 =case_when( total_pref ==0~NA,.default = preference_obj1),preference_obj2 =case_when( total_pref ==0~NA,.default = preference_obj2),) %>%# show a subset to see that it's workingfilter(sid%in%c(180, 181)) %>%select(sid, time, contains("pref"))
# A tibble: 6 × 5
sid time preference_obj1 preference_obj2 total_pref
<dbl> <chr> <dbl> <dbl> <dbl>
1 180 tp1 NA NA 0
2 180 tp2 59.0 41.0 100
3 180 tp3 63.2 36.8 100
4 181 tp1 NA NA 0
5 181 tp2 52.5 47.5 100
6 181 tp3 75.6 24.4 100
Mutate both columns simultaneously (& save our changes)
# or both columns using acrossmouse_data <- mouse_data %>%mutate(across(.cols =starts_with("pref"),.fns =~case_when( total_pref ==0~NA,.default = .) )) mouse_data %>%# show a subset to see that it's workingfilter(sid %in%c(180, 181)) %>%select(sid, time, contains("pref"))
# A tibble: 6 × 5
sid time preference_obj1 preference_obj2 total_pref
<dbl> <chr> <dbl> <dbl> <dbl>
1 180 tp1 NA NA 0
2 180 tp2 59.0 41.0 100
3 180 tp3 63.2 36.8 100
4 181 tp1 NA NA 0
5 181 tp2 52.5 47.5 100
6 181 tp3 75.6 24.4 100
Therefore, we need separate columns that contain this information, such as:
biomarker_type = bdnf, cd68, map2, etc
tissue_type = amygdala, cortex, hypothalamus, etc
biomarker_value = the biomarker numeric value for each of these
Each mouse (sid) will have multiple rows,
because they have multiple tissue samples and multiple biomarkers measured.
This means we need “long” data, where each observation is in a separate row.
7.2 Make data long
First we will make the data long, and deal with the biomarker_type/tissue_type separation second:
# make a long dataset where we have multiple biomarker data in the same columnmouse_biomarker_long <- mouse_data %>%pivot_longer(cols =starts_with("normalized"),names_to ="biomarker_type_temp",values_to ="biomarker_value")glimpse(mouse_biomarker_long) # helps to View it too
We’re going to pick up in Part 7 where we left off here.
In order to do that seamlessly and make sure we have all the objects in our workspace (environment) to access, we can save all the objects in our workspace in one .Rdata file
I recommend adding the code below to the end of your files so that in the future you have a record of what versions of packages were used in your work.