library(tidyverse)  # dplyr, ggplot2 and forcats
library(gtsummary)  # tbl_summary()

#mutating tumor_stage into a factor
#(smoke_complete is assumed to be loaded already)
smoke_complete2 <- smoke_complete %>%
  mutate(stage_factor =
           factor(tumor_stage,
                  levels = c("not reported",
                             "stage i",
                             "stage ia",
                             "stage ib",
                             "stage ii",
                             "stage iia",
                             "stage iib",
                             "stage iii",
                             "stage iiia",
                             "stage iiib",
                             "stage iv")))

forcats::fct_collapse
Function of the Week
1 fct_collapse
In this document, I will introduce the fct_collapse function that allows you to collapse factor levels into manually defined groups.
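Before turning to real data, a minimal sketch of the idea on a toy factor may help. The `grades` vector and the group names `A` and `B` are hypothetical and are not part of the smoke_complete data:

```r
library(forcats)

# Hypothetical toy factor with six fine-grained levels
grades <- factor(c("a", "a-", "b+", "b", "c", "b-"))

# Each argument name becomes a new level; the character vector
# lists the old levels that are merged into it
grades_collapsed <- fct_collapse(
  grades,
  A = c("a", "a-"),
  B = c("b+", "b", "b-")
)

# "c" is not named in any group, so it is kept unchanged
levels(grades_collapsed)
as.character(grades_collapsed)
```

Levels that are not mentioned in any group are left as they are, which is why "c" survives alongside the two new groups.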
1.1 What is it for?
The fct_collapse function accepts a factor or character vector plus one or more named character vectors. Each name becomes a new level that replaces the old levels listed in its vector, so ‘fct_collapse’ consolidates several factor levels into fewer groups. Let’s look at tumor_stage:
smoke_complete2 %>%
  select(stage_factor) %>%
  tbl_summary()

| Characteristic | N = 1,152¹ |
|---|---|
| stage_factor | |
| not reported | 99 (8.6%) |
| stage i | 7 (0.6%) |
| stage ia | 146 (13%) |
| stage ib | 266 (23%) |
| stage ii | 65 (5.6%) |
| stage iia | 112 (9.7%) |
| stage iib | 148 (13%) |
| stage iii | 86 (7.5%) |
| stage iiia | 102 (8.9%) |
| stage iiib | 30 (2.6%) |
| stage iv | 91 (7.9%) |

¹ n (%)
That’s a lot of levels, right? Here are visual examples of the tumor_stage variable before any manipulation:
#boxplot
ggplot(smoke_complete2) +
  aes(x = tumor_stage, y = cigarettes_per_day) +
  geom_boxplot() +
  labs(x = "Tumor Stage", y = "Cigarettes Per Day") +
  ylim(c(0, 15))

Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
#bar graph
ggplot(smoke_complete2) +
  aes(x = tumor_stage, y = cigarettes_per_day) +
  geom_col() +
  labs(x = "Tumor Stage", y = "Cigarettes Per Day")

Having a large number of categories can be overwhelming in some graphics. These eleven levels fall into a few broader groups (not reported and stages i through iv), and ‘fct_collapse’ can reduce them to those groups.
smoke_complete3 <- smoke_complete2 %>%
  mutate(stage_collapsed = fct_collapse(
    stage_factor,
    not_reported = c("not reported"),
    stage_i = c("stage i", "stage ia", "stage ib"),
    stage_ii = c("stage ii", "stage iia", "stage iib"),
    stage_iii = c("stage iii", "stage iiia", "stage iiib"),
    stage_iv = c("stage iv"))) %>%
  glimpse()

Rows: 1,152
Columns: 22
$ primary_diagnosis <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage <chr> "stage ia", "stage ib", "stage ib", "stage…
$ age_at_diagnosis <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status <chr> "dead", "dead", "dead", "alive", "alive", …
$ morphology <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ days_to_death <chr> "371", "136", "2304", "NA", "NA", "345", "…
$ state <chr> "live", "live", "live", "live", "live", "l…
$ tissue_or_organ_of_origin <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_birth <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_last_follow_up <chr> "NA", "NA", "2099", "3747", "3576", "NA", …
$ cigarettes_per_day <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ gender <chr> "male", "male", "female", "male", "female"…
$ year_of_birth <chr> "1936", "1931", "1927", "1930", "1942", "1…
$ race <chr> "white", "asian", "white", "white", "not r…
$ ethnicity <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death <chr> "2004", "2003", "NA", "NA", "NA", "2005", …
$ bcr_patient_barcode <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-18-3…
$ disease <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC", "L…
$ stage_factor <fct> stage ia, stage ib, stage ib, stage ia, st…
$ stage_collapsed <fct> stage_i, stage_i, stage_i, stage_i, stage_…
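After collapsing, it can be worth confirming that every old level landed in the intended group. A minimal sketch of one way to check, using a hypothetical toy vector that borrows a few of the stage labels (the object names `stage`, `stage_grouped` and `mapping` are illustrative only):

```r
library(forcats)

# Hypothetical toy vector using a few of the stage labels
stage <- factor(c("stage i", "stage ia", "stage ib", "stage ii", "stage iia"))

stage_grouped <- fct_collapse(
  stage,
  stage_i  = c("stage i", "stage ia", "stage ib"),
  stage_ii = c("stage ii", "stage iia")
)

# Cross-tabulate old against new: rows are the original levels,
# columns are the collapsed levels
mapping <- table(original = stage, collapsed = stage_grouped)
mapping

# Each original level should fall into exactly one collapsed level
all(rowSums(mapping > 0) == 1)
```

With the real data, the analogous check would presumably be a cross-tabulation of stage_factor against stage_collapsed, for example with dplyr's count().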
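You can also use the ‘other_level’ argument to control what happens to levels that are not named in any group: they are all assigned to the value of ‘other_level’. Below, "stage iv" is left out of the groups and ‘other_level’ is set to ‘NA’, so that stage is treated as missing.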
smoke_complete4 <- smoke_complete3 %>%
  mutate(stage_collapsed_na = fct_collapse(
    stage_factor,
    not_reported = c("not reported"),
    stage_i = c("stage i", "stage ia", "stage ib"),
    stage_ii = c("stage ii", "stage iia", "stage iib"),
    stage_iii = c("stage iii", "stage iiia", "stage iiib"),
    other_level = NA)) %>%
  glimpse()

Rows: 1,152
Columns: 23
$ primary_diagnosis <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage <chr> "stage ia", "stage ib", "stage ib", "stage…
$ age_at_diagnosis <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status <chr> "dead", "dead", "dead", "alive", "alive", …
$ morphology <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ days_to_death <chr> "371", "136", "2304", "NA", "NA", "345", "…
$ state <chr> "live", "live", "live", "live", "live", "l…
$ tissue_or_organ_of_origin <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_birth <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_last_follow_up <chr> "NA", "NA", "2099", "3747", "3576", "NA", …
$ cigarettes_per_day <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ gender <chr> "male", "male", "female", "male", "female"…
$ year_of_birth <chr> "1936", "1931", "1927", "1930", "1942", "1…
$ race <chr> "white", "asian", "white", "white", "not r…
$ ethnicity <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death <chr> "2004", "2003", "NA", "NA", "NA", "2005", …
$ bcr_patient_barcode <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-18-3…
$ disease <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC", "L…
$ stage_factor <fct> stage ia, stage ib, stage ib, stage ia, st…
$ stage_collapsed <fct> stage_i, stage_i, stage_i, stage_i, stage_…
$ stage_collapsed_na <fct> stage_i, stage_i, stage_i, stage_i, stage_…
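To see how other_level treats levels that are not named in any group, here is a minimal sketch on a hypothetical toy vector (not the smoke_complete data). It assumes a string label, "other", rather than NA so that the lumped group stays visible in the output:

```r
library(forcats)

# Hypothetical toy vector using a few of the stage labels
stage <- factor(c("not reported", "stage i", "stage ia", "stage ii", "stage iv"))

# Only stage_i is defined explicitly; every level that is not
# named in a group is lumped into the level given by other_level
stage_other <- fct_collapse(
  stage,
  stage_i = c("stage i", "stage ia"),
  other_level = "other"
)

# The other_level group is placed at the end of the levels
levels(stage_other)
table(stage_other)
```

Here "not reported", "stage ii" and "stage iv" all end up in the single "other" level.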
In stage_collapsed there are now 5 factor levels instead of 11.
smoke_complete3 %>%
  select(stage_collapsed) %>%
  tbl_summary()

| Characteristic | N = 1,152¹ |
|---|---|
| stage_collapsed | |
| not_reported | 99 (8.6%) |
| stage_i | 419 (36%) |
| stage_ii | 325 (28%) |
| stage_iii | 218 (19%) |
| stage_iv | 91 (7.9%) |

¹ n (%)
#boxplot
ggplot(smoke_complete3) +
  aes(x = stage_collapsed, y = cigarettes_per_day) +
  geom_boxplot() +
  labs(x = "Tumor Stage", y = "Cigarettes Per Day") +
  ylim(c(0, 15))

Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
#bar graph
ggplot(smoke_complete3) +
  aes(x = stage_collapsed, y = cigarettes_per_day) +
  geom_col() +
  labs(x = "Tumor Stage", y = "Cigarettes Per Day")

1.2 Is it helpful?
Collapsing factor levels is useful whenever a variable has many detailed levels that fall under a few broader classifications. ‘fct_collapse’ can simplify data visualizations and make comparisons between groups easier to read. It is especially useful for large data sets, or for a data set like ‘smoke_complete’ whose variables have ordered categories and subcategories. Many other ‘fct_…’ functions can be used alongside ‘fct_collapse’ to manipulate factor levels further.
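As one illustration of combining functions, the sketch below collapses a hypothetical toy vector and then reorders the result with fct_relevel and fct_infreq. The choice of these two helpers is only an example; the object names are illustrative and not part of the original analysis:

```r
library(forcats)

# Hypothetical toy vector using a few of the stage labels
stage <- factor(c("stage ia", "stage ib", "stage ib", "stage iia", "not reported"))

stage_grouped <- fct_collapse(
  stage,
  not_reported = "not reported",
  stage_i  = c("stage ia", "stage ib"),
  stage_ii = "stage iia"
)

# Move the not_reported group to the end of the level order
stage_relevel <- fct_relevel(stage_grouped, "not_reported", after = Inf)
levels(stage_relevel)

# Alternatively, order the collapsed levels by how often they occur
stage_by_freq <- fct_infreq(stage_grouped)
levels(stage_by_freq)
```

Reordering after collapsing affects the order in which groups appear on plot axes and in summary tables.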