forcats::fct_inorder(), fct_infreq(), fct_inseq()

Function of the Week

Switching up the order of factor levels by frequency, first appearance, or numeric sequence
Author

Katie Russell

Published

March 6, 2024

1 fct_inorder(), fct_infreq(), & fct_inseq()

In this document, I will introduce a set of functions that allows us to reorder the levels in factor variables.

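library(tidyverse)  # attaches forcats, dplyr, and ggplot2 (which provides msleep)
library(gtsummary)  # tbl_summary()
library(ggplot2)    # already attached by tidyverse; listed explicitly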

data("msleep")

1.1 What is it for?

We’ve seen how to manually reorder factor levels, which is straightforward and allows for a lot of control… but is also a lot of typing. Using a data set on the sleeping patterns of mammals, we can see how this works.

First, let’s take a look at the data, then create some factor variables:

glimpse(msleep)
Rows: 83
Columns: 11
$ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
$ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
$ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
$ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
$ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
$ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
$ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
$ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
$ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
$ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
$ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…
msleep2 <- msleep %>% 
  mutate(vore_factor = factor(vore),
         order_factor = factor(order),
         sleep_total_factor = case_when(
           # Get ready for some super inappropriate rounding:
           sleep_total < 5 ~ 5,
           sleep_total >= 5 & sleep_total < 10 ~ 10,
           sleep_total >= 10 & sleep_total < 15 ~ 15,
           sleep_total >= 15 & sleep_total < 20 ~ 20),
         sleep_total_factor = factor(sleep_total_factor))

glimpse(msleep2)
Rows: 83
Columns: 14
$ name               <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greate…
$ genus              <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos"…
$ vore               <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi",…
$ order              <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha"…
$ conservation       <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA,…
$ sleep_total        <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, …
$ sleep_rem          <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6,…
$ sleep_cycle        <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833…
$ awake              <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 2…
$ brainwt            <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07…
$ bodywt             <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490…
$ vore_factor        <fct> carni, omni, herbi, omni, herbi, herbi, carni, NA, …
$ order_factor       <fct> Carnivora, Primates, Rodentia, Soricomorpha, Artiod…
$ sleep_total_factor <fct> 15, 20, 15, 15, 5, 15, 10, 10, 15, 5, 10, 10, 15, 1…

Now let’s look at a nice table of our new factor variables:

msleep2 %>% 
  select(vore_factor, order_factor, sleep_total_factor) %>% 
  tbl_summary()
Characteristic N = 83¹
vore_factor
    carni 19 (25%)
    herbi 32 (42%)
    insecti 5 (6.6%)
    omni 20 (26%)
    Unknown 7
order_factor
    Afrosoricida 1 (1.2%)
    Artiodactyla 6 (7.2%)
    Carnivora 12 (14%)
    Cetacea 3 (3.6%)
    Chiroptera 2 (2.4%)
    Cingulata 2 (2.4%)
    Didelphimorphia 2 (2.4%)
    Diprotodontia 2 (2.4%)
    Erinaceomorpha 2 (2.4%)
    Hyracoidea 3 (3.6%)
    Lagomorpha 1 (1.2%)
    Monotremata 1 (1.2%)
    Perissodactyla 3 (3.6%)
    Pilosa 1 (1.2%)
    Primates 12 (14%)
    Proboscidea 2 (2.4%)
    Rodentia 22 (27%)
    Scandentia 1 (1.2%)
    Soricomorpha 5 (6.0%)
sleep_total_factor
    5 11 (13%)
    10 27 (33%)
    15 33 (40%)
    20 12 (14%)
¹ n (%)

OK, so our new factor levels have defaulted to alphanumeric order, but maybe that’s not how we want them. Reordering them manually might look something like this:

msleep2 <- msleep2 %>% 
  mutate(vore_factor = factor(vore_factor,
                              # ordered = TRUE also makes this an ordered factor (<ord>)
                              ordered = TRUE,
                              levels = c("omni", "insecti", "herbi", "carni")))

msleep2 %>% 
  select(vore_factor) %>% 
  tbl_summary()
Characteristic N = 83¹
vore_factor
    omni 20 (26%)
    insecti 5 (6.6%)
    herbi 32 (42%)
    carni 19 (25%)
    Unknown 7
¹ n (%)

This works, but it could be a very unwieldy process if we had a variable with a lot of factor levels, like our order variable. In addition, what if we wanted to order them in a different way, such as by frequency or by the order in which they appear in our data? This would be hard to figure out manually!

The forcats package has three ordering functions that can help us out.

Order of first appearance

fct_inorder() will organize levels by the order in which they first appear. Let’s try that out on our vore factor levels. First, a quick look at the first rows of our msleep2 data set for reference:

head(msleep2)
# A tibble: 6 × 14
  name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
# ℹ 5 more variables: brainwt <dbl>, bodywt <dbl>, vore_factor <ord>,
#   order_factor <fct>, sleep_total_factor <fct>

Now let’s ask forcats to order our vore levels by the order of first appearance:

fct_inorder(msleep2$vore_factor)
 [1] carni   omni    herbi   omni    herbi   herbi   carni   <NA>    carni  
[10] herbi   herbi   herbi   omni    herbi   omni    omni    omni    carni  
[19] herbi   omni    herbi   insecti herbi   herbi   omni    omni    herbi  
[28] carni   omni    herbi   carni   carni   herbi   omni    herbi   herbi  
[37] carni   omni    herbi   herbi   herbi   herbi   insecti herbi   carni  
[46] herbi   carni   herbi   herbi   omni    carni   carni   carni   omni   
[55] <NA>    omni    <NA>    <NA>    carni   carni   herbi   insecti <NA>   
[64] herbi   omni    omni    insecti herbi   <NA>    herbi   herbi   herbi  
[73] <NA>    omni    insecti herbi   herbi   omni    omni    carni   carni  
[82] carni   carni  
Levels: carni < omni < herbi < insecti

Within a pipeline, we can apply it inside mutate() and check the result in a table:

msleep2 %>% 
  mutate(vore_factor = vore_factor %>% 
           fct_inorder()) %>% 
  select(vore_factor) %>% 
  tbl_summary()
Characteristic N = 83¹
vore_factor
    carni 19 (25%)
    omni 20 (26%)
    herbi 32 (42%)
    insecti 5 (6.6%)
    Unknown 7
¹ n (%)
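
To see the reordering without the rest of the data set in the way, here is a minimal sketch on a made-up vector. The object name diet and its values are hypothetical and are not taken from msleep; the only forcats call used is fct_inorder().

```r
library(forcats)

# Hypothetical toy data (not from msleep)
diet <- factor(c("omni", "carni", "herbi", "carni", "omni"))

# Default level order from factor(): alphabetical
levels(diet)
#> [1] "carni" "herbi" "omni"

# Levels reordered by where each one first appears in the data
diet_inorder <- fct_inorder(diet)
levels(diet_inorder)
#> [1] "omni"  "carni" "herbi"
```

Only the level order changes; the values stored in the vector stay the same.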

Number of observations

fct_infreq() will order levels by the number of observations within each level, with the most frequent level first. Let’s try it with a graph this time. Before ordering, our graph might look like this:

msleep2 %>% 
  ggplot(
    aes(y = order_factor)) +
  geom_bar()

Alphabetical order is helpful in libraries, but not necessarily in graphs! Reordering using fct_infreq(), we get a more visually helpful graph:

msleep2 %>% 
  mutate(order_factor = order_factor %>% 
           fct_infreq()) %>% 
  ggplot(aes(
         y = order_factor)) +
  geom_bar()

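fct_infreq() puts the most frequent level first, and ggplot2 draws the first level at the bottom of the y-axis, so the longest bar ends up at the bottom. To read from highest frequency at the top to lowest at the bottom, we can reverse the levels by using fct_rev() in combination with fct_infreq():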

msleep2 %>% 
  mutate(order_factor = order_factor %>% 
           fct_infreq() %>% 
           fct_rev()) %>% 
  ggplot(aes(
         y = order_factor)) +
  geom_bar()
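
Because the plots only show the result, it may help to look at the level order directly. The sketch below uses a small made-up vector (the name taxon and its values are hypothetical) and prints the levels before and after fct_infreq() and fct_rev().

```r
library(forcats)

# Hypothetical toy data: 3 x Rodentia, 2 x Chiroptera, 1 x Primates
taxon <- factor(c("Rodentia", "Chiroptera", "Rodentia",
                  "Primates", "Chiroptera", "Rodentia"))

# Default level order: alphabetical
levels(taxon)
#> [1] "Chiroptera" "Primates"   "Rodentia"

# Most frequent level first
taxon_freq <- fct_infreq(taxon)
levels(taxon_freq)
#> [1] "Rodentia"   "Chiroptera" "Primates"

# Reversed: least frequent level first
taxon_freq_rev <- fct_rev(taxon_freq)
levels(taxon_freq_rev)
#> [1] "Primates"   "Chiroptera" "Rodentia"
```

Checking levels() like this is a quick way to confirm which end of an axis each level will land on before plotting.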

Numeric sequencing

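fct_inseq() orders levels by the numeric value of their labels. Our sleep_total_factor levels are already in numeric order, so fct_inseq() leaves them unchanged here; combining it with fct_rev() gives descending order: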

msleep2 %>% 
  select(sleep_total_factor) %>% 
  tbl_summary()
Characteristic N = 83¹
sleep_total_factor
    5 11 (13%)
    10 27 (33%)
    15 33 (40%)
    20 12 (14%)
¹ n (%)
fct_inseq(msleep2$sleep_total_factor)
 [1] 15 20 15 15 5  15 10 10 15 5  10 10 15 15 15 10 10 20 10 20 5  20 5  5  15
[26] 15 15 15 10 5  5  10 10 10 10 5  20 15 15 15 15 15 20 15 15 10 15 10 5  10
[51] 20 15 15 10 15 15 15 15 5  10 15 20 10 15 10 10 10 15 15 20 15 20 15 10 10
[76] 20 5  20 10 10 10 15 10
Levels: 5 10 15 20
msleep2 %>% 
  mutate(sleep_total_factor = sleep_total_factor %>% 
           fct_inseq() %>% 
           fct_rev()) %>% 
  select(sleep_total_factor) %>% 
  tbl_summary()
Characteristic N = 83¹
sleep_total_factor
    20 12 (14%)
    15 33 (40%)
    10 27 (33%)
    5 11 (13%)
¹ n (%)
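
One situation where fct_inseq() does change something is when the numbers arrive as text, for example after reading a file where the column was parsed as character. The sketch below assumes that scenario; the names hours_chr and hours and their values are hypothetical.

```r
library(forcats)

# Hypothetical toy data: numbers stored as character strings
hours_chr <- c("15", "20", "5", "10", "15", "5")
hours <- factor(hours_chr)

# factor() sorts character input as text, so "5" lands last
levels(hours)
#> [1] "10" "15" "20" "5"

# fct_inseq() reorders the levels by their numeric value
hours_seq <- fct_inseq(hours)
levels(hours_seq)
#> [1] "5"  "10" "15" "20"
```

In the msleep2 example the factor was built from numeric values, which is why the levels were already in sequence before fct_inseq() was applied.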

1.2 Is it helpful?

I think this family of functions could be very helpful, particularly the ordering by frequency of occurrence. I’m a little less sold on fct_inseq(), mainly because R already seems to sequence numeric factor levels correctly when the factor is built from numbers, as ours was (unless you want them reversed). I bet Meike and Emile have better knowledge of where this could come in handy, though…

1.3 Sources

https://forcats.tidyverse.org/reference/fct_inorder.html
https://livebook.manning.com/concept/r/fct_infreq