library(tidyverse)
library(gtsummary)
library(ggplot2)
data("msleep")
forcats::fct_inorder(), fct_infreq(), fct_inseq()
Function of the Week
1 fct_inorder(), fct_infreq(), & fct_inseq()
In this document, I will introduce a set of functions that allow us to reorder the levels of factor variables.
1.1 What is it for?
We’ve seen how to manually reorder factor levels, which is straightforward and allows for a lot of control… but is also a lot of typing. Using a data set on the sleeping patterns of mammals, we can see how this works.
First, let’s create some factor variables in our data:
glimpse(msleep)
Rows: 83
Columns: 11
$ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
$ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
$ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
$ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
$ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
$ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
$ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
$ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
$ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
$ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
$ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…
msleep2 <- msleep %>%
  mutate(vore_factor = factor(vore),
         order_factor = factor(order),
         sleep_total_factor = case_when(
           # Get ready for some super inappropriate rounding:
           sleep_total < 5 ~ 5,
           sleep_total >= 5 & sleep_total < 10 ~ 10,
           sleep_total >= 10 & sleep_total < 15 ~ 15,
           sleep_total >= 15 & sleep_total < 20 ~ 20),
         sleep_total_factor = factor(sleep_total_factor))
glimpse(msleep2)
Rows: 83
Columns: 14
$ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greate…
$ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos"…
$ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi",…
$ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha"…
$ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA,…
$ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, …
$ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6,…
$ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833…
$ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 2…
$ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07…
$ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490…
$ vore_factor <fct> carni, omni, herbi, omni, herbi, herbi, carni, NA, …
$ order_factor <fct> Carnivora, Primates, Rodentia, Soricomorpha, Artiod…
$ sleep_total_factor <fct> 15, 20, 15, 15, 5, 15, 10, 10, 15, 5, 10, 10, 15, 1…
Now let’s look at a nice table of our new factor variables:
msleep2 %>%
select(vore_factor, order_factor, sleep_total_factor) %>%
  tbl_summary()
| Characteristic | N = 83¹ |
|---|---|
| vore_factor | |
| carni | 19 (25%) |
| herbi | 32 (42%) |
| insecti | 5 (6.6%) |
| omni | 20 (26%) |
| Unknown | 7 |
| order_factor | |
| Afrosoricida | 1 (1.2%) |
| Artiodactyla | 6 (7.2%) |
| Carnivora | 12 (14%) |
| Cetacea | 3 (3.6%) |
| Chiroptera | 2 (2.4%) |
| Cingulata | 2 (2.4%) |
| Didelphimorphia | 2 (2.4%) |
| Diprotodontia | 2 (2.4%) |
| Erinaceomorpha | 2 (2.4%) |
| Hyracoidea | 3 (3.6%) |
| Lagomorpha | 1 (1.2%) |
| Monotremata | 1 (1.2%) |
| Perissodactyla | 3 (3.6%) |
| Pilosa | 1 (1.2%) |
| Primates | 12 (14%) |
| Proboscidea | 2 (2.4%) |
| Rodentia | 22 (27%) |
| Scandentia | 1 (1.2%) |
| Soricomorpha | 5 (6.0%) |
| sleep_total_factor | |
| 5 | 11 (13%) |
| 10 | 27 (33%) |
| 15 | 33 (40%) |
| 20 | 12 (14%) |
| ¹ n (%) | |
OK, so our new factor levels have defaulted to alphabetical order (or ascending numeric order for sleep_total_factor, which was built from numbers), but maybe that’s not how we want them. Re-ordering them manually might look something like this:
msleep2 <- msleep2 %>%
  mutate(vore_factor = factor(vore_factor,
                              ordered = TRUE,
                              levels = c("omni", "insecti", "herbi", "carni")))
msleep2 %>%
select(vore_factor) %>%
  tbl_summary()
| Characteristic | N = 83¹ |
|---|---|
| vore_factor | |
| omni | 20 (26%) |
| insecti | 5 (6.6%) |
| herbi | 32 (42%) |
| carni | 19 (25%) |
| Unknown | 7 |
| ¹ n (%) | |
This works, but it could be a very unwieldy process if we had a variable with a lot of factor levels, like our order variable. In addition, what if we wanted to order them in a different way, such as by frequency or by the order in which they appear in our data? This would be hard to figure out manually!
The forcats package has three ordering functions that can help us out.
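As a quick preview before we turn to msleep2, here is a minimal sketch on a made-up toy vector. The `fruit` vector is hypothetical and not part of the data set; it is only there to show how the default level order differs from what fct_inorder() and fct_infreq() return.

```r
library(forcats)

# Hypothetical toy data (an assumption for illustration, not from msleep)
fruit <- factor(c("pear", "apple", "banana", "apple", "banana", "banana"))

# Default: factor() sorts the levels alphabetically
levels(fruit)
#> [1] "apple"  "banana" "pear"

# Order of first appearance in the vector
levels(fct_inorder(fruit))
#> [1] "pear"   "apple"  "banana"

# Order of frequency, most common level first
levels(fct_infreq(fruit))
#> [1] "banana" "apple"  "pear"
```

Only the level order changes in each case; the underlying values of `fruit` stay the same.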
Order of first appearance
fct_inorder() will organize levels by the order in which they first appear. Let’s try that out on our vore factor levels. First, a quick look at the first rows of our msleep2 data set for reference:
head(msleep2)
# A tibble: 6 × 14
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
# ℹ 5 more variables: brainwt <dbl>, bodywt <dbl>, vore_factor <ord>,
# order_factor <fct>, sleep_total_factor <fct>
Now let’s ask forcats to order our vore levels by the order of first appearance:
fct_inorder(msleep2$vore_factor)
 [1] carni omni herbi omni herbi herbi carni <NA> carni
[10] herbi herbi herbi omni herbi omni omni omni carni
[19] herbi omni herbi insecti herbi herbi omni omni herbi
[28] carni omni herbi carni carni herbi omni herbi herbi
[37] carni omni herbi herbi herbi herbi insecti herbi carni
[46] herbi carni herbi herbi omni carni carni carni omni
[55] <NA> omni <NA> <NA> carni carni herbi insecti <NA>
[64] herbi omni omni insecti herbi <NA> herbi herbi herbi
[73] <NA> omni insecti herbi herbi omni omni carni carni
[82] carni carni
Levels: carni < omni < herbi < insecti
msleep2 %>%
mutate(vore_factor = vore_factor %>%
fct_inorder()) %>%
select(vore_factor) %>%
  tbl_summary()
| Characteristic | N = 83¹ |
|---|---|
| vore_factor | |
| carni | 19 (25%) |
| omni | 20 (26%) |
| herbi | 32 (42%) |
| insecti | 5 (6.6%) |
| Unknown | 7 |
| ¹ n (%) | |
Number of observations
fct_infreq() will order levels by the number of observations within each level, with the most frequent level first. Let’s try it with a graph this time. Before ordering, our graph might look like this:
msleep2 %>%
ggplot(
aes(y = order_factor)) +
  geom_bar()
Alphabetical order is helpful in libraries, but not necessarily in graphs! Reordering using fct_infreq(), we get a more visually helpful graph:
msleep2 %>%
mutate(order_factor = order_factor %>%
fct_infreq()) %>%
ggplot(aes(
y = order_factor)) +
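  geom_bar()
Because ggplot2 draws the first level at the bottom of the y-axis, the most frequent order ends up at the bottom of that graph. We can reverse this, so the bars run from highest frequency at the top to lowest at the bottom, by using fct_rev() in combination with fct_infreq():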
msleep2 %>%
mutate(order_factor = order_factor %>%
fct_infreq() %>%
fct_rev()) %>%
ggplot(aes(
y = order_factor)) +
  geom_bar()
Numeric sequencing
fct_inseq() will order factor levels by the numeric value of their labels:
msleep2 %>%
select(sleep_total_factor) %>%
  tbl_summary()
| Characteristic | N = 83¹ |
|---|---|
| sleep_total_factor | |
| 5 | 11 (13%) |
| 10 | 27 (33%) |
| 15 | 33 (40%) |
| 20 | 12 (14%) |
| ¹ n (%) | |
fct_inseq(msleep2$sleep_total_factor)
 [1] 15 20 15 15 5 15 10 10 15 5 10 10 15 15 15 10 10 20 10 20 5 20 5 5 15
[26] 15 15 15 10 5 5 10 10 10 10 5 20 15 15 15 15 15 20 15 15 10 15 10 5 10
[51] 20 15 15 10 15 15 15 15 5 10 15 20 10 15 10 10 10 15 15 20 15 20 15 10 10
[76] 20 5 20 10 10 10 15 10
Levels: 5 10 15 20
msleep2 %>%
mutate(sleep_total_factor = sleep_total_factor %>%
fct_inseq() %>%
fct_rev()) %>%
select(sleep_total_factor) %>%
  tbl_summary()
| Characteristic | N = 83¹ |
|---|---|
| sleep_total_factor | |
| 20 | 12 (14%) |
| 15 | 33 (40%) |
| 10 | 27 (33%) |
| 5 | 11 (13%) |
| ¹ n (%) | |
1.2 Is it helpful?
I think this family of functions could be very helpful, particularly the ordering by frequency of occurrence. I’m a little less sold on fct_inseq(), mainly because R already seems to sequence numeric factor levels correctly (unless you want them reversed), at least when the factor is built from a numeric vector, as it was here. I bet Meike and Emile have better knowledge of where this could come in handy, though…
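One case where fct_inseq() may earn its keep is when the numbers arrive as text, for example after reading a file or pasting labels together. The sketch below uses a hypothetical `hours` vector (an assumption for illustration, not taken from msleep) to show that factor() sorts character digits as strings, and that fct_inseq() puts them back in numeric sequence.

```r
library(forcats)

# Hypothetical example: numbers stored as character strings
hours <- factor(c("10", "5", "20", "15", "5"))

# factor() sorts character values as text, so "5" lands last
levels(hours)
#> [1] "10" "15" "20" "5"

# fct_inseq() orders the levels by their numeric value instead
levels(fct_inseq(hours))
#> [1] "5"  "10" "15" "20"
```

In msleep2 this problem never showed up because sleep_total_factor was created from a numeric column, so factor() had already sorted the values as numbers.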
1.3 Sources
https://forcats.tidyverse.org/reference/fct_inorder.html
https://livebook.manning.com/concept/r/fct_infreq