forcats::fct_collapse

Function of the Week

Collapse factor levels into manually defined groups
Author

Alexa Bovenkamp

Published

February 6, 2025

1 fct_collapse

In this document, I will introduce the fct_collapse function, which lets you collapse factor levels into manually defined groups.

1.1 What is it for?

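The fct_collapse function accepts a factor or character vector, followed by a series of named character vectors. Each name becomes a new level, and each vector lists the existing levels to merge into it. In this way, ‘fct_collapse’ consolidates several factor levels into fewer groups. Let’s look at tumor_stage: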

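#load packages (smoke_complete is assumed to be loaded already)
library(tidyverse)
library(gtsummary)

#mutating tumor_stage into a factor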
smoke_complete2 <- smoke_complete %>% 
    mutate(stage_factor = 
               factor(tumor_stage,
                      levels = c("not reported", 
                                 "stage i", 
                                 "stage ia", 
                                 "stage ib", 
                                 "stage ii", 
                                 "stage iia", 
                                 "stage iib", 
                                 "stage iii", 
                                 "stage iiia", 
                                 "stage iiib", 
                                 "stage iv")))
smoke_complete2 %>% 
  select(stage_factor) %>% 
  tbl_summary()
Characteristic      N = 1,152¹
stage_factor
    not reported    99 (8.6%)
    stage i         7 (0.6%)
    stage ia        146 (13%)
    stage ib        266 (23%)
    stage ii        65 (5.6%)
    stage iia       112 (9.7%)
    stage iib       148 (13%)
    stage iii       86 (7.5%)
    stage iiia      102 (8.9%)
    stage iiib      30 (2.6%)
    stage iv        91 (7.9%)
¹ n (%)

That’s a lot of levels, right? Here is how the tumor_stage variable looks in plots before any manipulation:

#boxplot
ggplot(smoke_complete2) +
  aes(x = tumor_stage, y = cigarettes_per_day) +
  geom_boxplot() +
  labs(x="Tumor Stage", y="Cigarettes Per Day") +
  ylim(c(0,15))
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).

#bar graph
ggplot(smoke_complete2) +
  aes(x = tumor_stage, y = cigarettes_per_day) +
  geom_col() +
  labs(x="Tumor Stage", y="Cigarettes Per Day")

Having a large number of categories can be overwhelming in some graphics. These levels fall into a few distinguishable categories, and ‘fct_collapse’ can reduce them to those categories.

smoke_complete3 <- smoke_complete2 %>% 
  mutate(stage_collapsed = fct_collapse(
    stage_factor,
    not_reported = c("not reported"),
    stage_i = c("stage i", "stage ia", "stage ib"),
    stage_ii = c("stage ii", "stage iia", "stage iib"),
    stage_iii = c("stage iii", "stage iiia", "stage iiib"),
    stage_iv = c("stage iv"))) %>% 
  glimpse()
Rows: 1,152
Columns: 22
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage                 <chr> "stage ia", "stage ib", "stage ib", "stage…
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status                <chr> "dead", "dead", "dead", "alive", "alive", …
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ days_to_death               <chr> "371", "136", "2304", "NA", "NA", "345", "…
$ state                       <chr> "live", "live", "live", "live", "live", "l…
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_last_follow_up      <chr> "NA", "NA", "2099", "3747", "3576", "NA", …
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked                <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ gender                      <chr> "male", "male", "female", "male", "female"…
$ year_of_birth               <chr> "1936", "1931", "1927", "1930", "1942", "1…
$ race                        <chr> "white", "asian", "white", "white", "not r…
$ ethnicity                   <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death               <chr> "2004", "2003", "NA", "NA", "NA", "2005", …
$ bcr_patient_barcode         <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-18-3…
$ disease                     <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC", "L…
$ stage_factor                <fct> stage ia, stage ib, stage ib, stage ia, st…
$ stage_collapsed             <fct> stage_i, stage_i, stage_i, stage_i, stage_…
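Because glimpse() only shows the first few values, it can help to check the mapping on a small example. The sketch below uses a toy factor (hypothetical data, not taken from smoke_complete) and assumes only that the forcats package is installed. It compares the levels before and after collapsing and cross-tabulates the old levels against the new ones.

```r
library(forcats)

# Toy data for illustration only (not from smoke_complete)
stage <- factor(c("stage ia", "stage ib", "stage iia", "stage iv", "stage ia"))

# Each name is a new level; each vector lists the old levels merged into it
collapsed <- fct_collapse(
  stage,
  stage_i  = c("stage ia", "stage ib"),
  stage_ii = c("stage iia")
)

levels(stage)      # levels before collapsing
levels(collapsed)  # levels after collapsing

# Cross-tabulate old against new levels to confirm the mapping
table(stage, collapsed)
```

Note that "stage iv" is not named in the call, so it is carried over unchanged as its own level.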

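You can also use the ‘other_level’ argument, which groups every level that is not named in the call into a single level. In the example below, "stage iv" is left out of the named groups and ‘other_level’ is set to ‘NA’, so that level is set to ‘NA’.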

smoke_complete4 <- smoke_complete3 %>% 
  mutate(stage_collapsed_na = fct_collapse(
    stage_factor,
    not_reported = c("not reported"),
    stage_i = c("stage i", "stage ia", "stage ib"),
    stage_ii = c("stage ii", "stage iia", "stage iib"),
    stage_iii = c("stage iii", "stage iiia", "stage iiib"),
    other_level = NA)) %>% 
  glimpse()
Rows: 1,152
Columns: 23
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage                 <chr> "stage ia", "stage ib", "stage ib", "stage…
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status                <chr> "dead", "dead", "dead", "alive", "alive", …
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ days_to_death               <chr> "371", "136", "2304", "NA", "NA", "345", "…
$ state                       <chr> "live", "live", "live", "live", "live", "l…
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_last_follow_up      <chr> "NA", "NA", "2099", "3747", "3576", "NA", …
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked                <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ gender                      <chr> "male", "male", "female", "male", "female"…
$ year_of_birth               <chr> "1936", "1931", "1927", "1930", "1942", "1…
$ race                        <chr> "white", "asian", "white", "white", "not r…
$ ethnicity                   <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death               <chr> "2004", "2003", "NA", "NA", "NA", "2005", …
$ bcr_patient_barcode         <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-18-3…
$ disease                     <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC", "L…
$ stage_factor                <fct> stage ia, stage ib, stage ib, stage ia, st…
$ stage_collapsed             <fct> stage_i, stage_i, stage_i, stage_i, stage_…
$ stage_collapsed_na          <fct> stage_i, stage_i, stage_i, stage_i, stage_…
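As a complementary sketch, ‘other_level’ can also be given a label instead of ‘NA’, so that the unnamed levels stay visible as one catch-all group. The toy factor and the label "Other" below are illustrative choices, not part of the smoke_complete example.

```r
library(forcats)

# Toy data for illustration only (not from smoke_complete)
stage <- factor(c("stage ia", "stage ib", "stage iia", "stage iv", "not reported"))

# Levels that are not named below ("stage iv", "not reported")
# are grouped into the level given by other_level
grouped <- fct_collapse(
  stage,
  stage_i  = c("stage ia", "stage ib"),
  stage_ii = "stage iia",
  other_level = "Other"  # the label itself is an arbitrary choice
)

# Count the observations in each resulting level
fct_count(grouped)
```

Keeping a labelled catch-all level rather than ‘NA’ can be useful when the leftover observations should still appear in tables and plots.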

Going back to stage_collapsed in smoke_complete3, there are now 5 factor levels instead of 11.

smoke_complete3 %>% 
  select(stage_collapsed) %>% 
  tbl_summary()
Characteristic      N = 1,152¹
stage_collapsed
    not_reported    99 (8.6%)
    stage_i         419 (36%)
    stage_ii        325 (28%)
    stage_iii       218 (19%)
    stage_iv        91 (7.9%)
¹ n (%)

#boxplot
ggplot(smoke_complete3) +
  aes(x = stage_collapsed, y = cigarettes_per_day) +
  geom_boxplot() +
  labs(x="Tumor Stage", y="Cigarettes Per Day") +
  ylim(c(0,15))
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).

#bar graph
ggplot(smoke_complete3) +
  aes(x = stage_collapsed, y = cigarettes_per_day) +
  geom_col() +
  labs(x="Tumor Stage", y="Cigarettes Per Day")

1.2 Is it helpful?

Collapsing factor levels makes sense whenever a variable has broader classifications that many of its levels fall under. ‘fct_collapse’ can help with data visualization and with comparing groups. This function is very useful when working with large data sets, or with a data set like ‘smoke_complete’ that has variables with ordered categories and subcategories. Many other ‘fct_ …’ functions can be used in conjunction with ‘fct_collapse’ to further manipulate factor levels.
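As one possible illustration of combining functions, the sketch below collapses a toy factor (hypothetical data, not from smoke_complete) and then reorders the result with two other forcats functions, fct_infreq and fct_relevel. Which functions are worth combining depends on the analysis, so treat this as a starting point rather than a recommended workflow.

```r
library(forcats)

# Toy data for illustration only (not from smoke_complete)
stage <- factor(c("stage ia", "stage ib", "stage iia",
                  "stage iia", "stage iib", "stage iv"))

collapsed <- fct_collapse(
  stage,
  stage_i  = c("stage ia", "stage ib"),
  stage_ii = c("stage iia", "stage iib"),
  stage_iv = "stage iv"
)

# Order the collapsed levels from most to least frequent
by_freq <- fct_infreq(collapsed)
levels(by_freq)

# Move one collapsed level to the front, e.g. to use it as a reference group
stage_iv_first <- fct_relevel(collapsed, "stage_iv")
levels(stage_iv_first)
```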