tidyr::uncount()

Function of the Week

Author

Cesar Cristancho

Published

February 21, 2024

uncount()

Package/library: tidyr

What is it for?

“Uncount” a data frame:

Duplicate rows according to a weighting variable.

It performs the opposite operation to dplyr::count(): it duplicates rows according to a weighting variable (or expression), expanding counts into multiple rows.
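As a minimal sketch of that idea, the example below expands a small frequency table and then counts it again. The `freq` table and its `fruit` column are made up for illustration and are not part of the penguins data used later.

```r
library(tidyr)
library(dplyr)

# Hypothetical frequency table: one row per fruit with its count
freq <- tibble(fruit = c("apple", "pear"), n = c(2L, 3L))

# Expand to one row per observation; the n column is dropped by default
long <- uncount(freq, n)
nrow(long)  # 2 apples + 3 pears

# Counting again gives back one row per fruit with its frequency
recounted <- count(long, fruit)
recounted
```

Here count() and uncount() behave as inverses of each other: one collapses observations into frequencies, the other expands frequencies back into observations.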

Usage: uncount(data, weights, ..., .remove = TRUE, .id = NULL)

Arguments

  • data
    A data frame, tibble, or grouped tibble.

  • weights (integer)
    A vector of weights. Evaluated in the context of data; supports quasi-quotation.

  • .remove
    If TRUE, and “weights” is the name of a column in the data, then this column is removed.

  • .id
    Supply a string to create a new variable which gives a unique identifier for each created row.

  • …
    Additional arguments passed on to methods.
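The arguments above can be tried on a toy tibble. This is only a sketch: `df`, its columns `x` and `n`, and the id column name "copy" are hypothetical names chosen for the example.

```r
library(tidyr)

df <- tibble(x = c("a", "b", "c"), n = c(1, 2, 0))

# Default: n is removed, and "c" disappears because its weight is 0
dropped <- uncount(df, n)

# Keep the weight column and number the copies of each original row
kept <- uncount(df, n, .remove = FALSE, .id = "copy")
kept$copy  # numbering restarts at 1 for every original row

# weights can also be an expression; n is then kept,
# because the expression is not a bare column name
expr <- uncount(df, n + 1)
```

Note that the identifier created by .id counts copies within each original row, so it is not a running row number across the whole result.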

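library(tidyverse)      # tidyr, dplyr, and the %>% pipe
library(palmerpenguins) # penguins data set
library(gt)             # table rendering
glimpse(penguins)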
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
uncount(penguins, 2) # a constant weight of 2 duplicates every row
# A tibble: 688 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.1          18.7               181        3750
 3 Adelie  Torgersen           39.5          17.4               186        3800
 4 Adelie  Torgersen           39.5          17.4               186        3800
 5 Adelie  Torgersen           40.3          18                 195        3250
 6 Adelie  Torgersen           40.3          18                 195        3250
 7 Adelie  Torgersen           NA            NA                  NA          NA
 8 Adelie  Torgersen           NA            NA                  NA          NA
 9 Adelie  Torgersen           36.7          19.3               193        3450
10 Adelie  Torgersen           36.7          19.3               193        3450
# ℹ 678 more rows
# ℹ 2 more variables: sex <fct>, year <int>
uncount(penguins, 1, .id = "id") # adds an id column numbering the copies of each row; with weight 1 every id is 1
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, id <int>
# uncount(penguins$island, 1, .id = "id") # error: data must be a data frame, not a vector
island <- penguins %>% distinct(island)
island %>% gt()
island
Torgersen
Biscoe
Dream
new_island <- uncount(island, 3)
new_island %>% count(island) %>% gt()
island n
Biscoe 3
Dream 3
Torgersen 3

Suppose you have a data frame with a column giving the number of observations in each group (a frequency table), and you want to “unfold” it into a new data frame in which each row represents a single observation. uncount() does this by replicating each row as many times as the count for its group.

table_1 <- penguins %>% count(island)
table_1 %>% gt() 
island n
Biscoe 168
Dream 124
Torgersen 52
duplicate_p <- uncount(table_1, n, .id = "id")
glimpse(duplicate_p) # only 2 columns: island and id (n is dropped because .remove = TRUE)
Rows: 344
Columns: 2
$ island <fct> Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Biscoe,…
$ id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
# Sort a copy by island so the rows line up with the expanded table
penguins_sorted <- penguins %>% arrange(island)
all(penguins_sorted$island == duplicate_p$island)
[1] TRUE

Is it helpful?

Yes. It replicates rows according to a key and a weight, and it can number the copies of each row through .id.

This can be useful for tasks like expanding data to represent individual occurrences within a group or category.
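One way this might look in practice is sketched below, assuming results that arrive already aggregated. The `agg` table, its columns, and its numbers are invented for illustration.

```r
library(tidyr)
library(dplyr)

# Hypothetical aggregated results: number of subjects per arm and outcome
agg <- tribble(
  ~arm,      ~outcome, ~n,
  "control", "yes",     3,
  "control", "no",      7,
  "treated", "yes",     6,
  "treated", "no",      4
)

# One row per subject, for tools that expect row-level data
subjects <- uncount(agg, n)

# Sanity check: the expanded data has as many rows as the total count
nrow(subjects) == sum(agg$n)

# Cross-tabulating the expanded rows reproduces the original counts
table(subjects$arm, subjects$outcome)
```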