n_distinct

Function of the Week

The n_distinct() function counts the number of unique values in a vector.

Author

Ann McMonigal

Published

February 7, 2024

1 dplyr::n_distinct

In this document, I will introduce the n_distinct() function from dplyr and show what it’s for.

#load dplyr
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
#example dataset
data(starwars)

1.1 What is it for?

The n_distinct() function counts the number of unique values in a vector or set of vectors. It has two arguments:

  • ... : One or more vectors from your dataset. If you supply several vectors, they should all have the same length.

  • na.rm : A logical value (TRUE or FALSE) that controls whether missing values (NA) are dropped before counting.

The default is na.rm = FALSE, meaning NA is counted as one distinct value. If na.rm = TRUE, missing values are excluded from the count of distinct values.
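To make the effect of na.rm concrete, here is a minimal sketch on a small made-up vector (the vector x is my own illustration, not part of starwars). It also compares n_distinct() with base R's length(unique()), which should give the same count for a single vector.

```r
library(dplyr)

# Hypothetical toy vector: two distinct letters plus two missing values
x <- c("a", "b", NA, "a", NA)

# NA is counted once, as its own distinct value
with_na <- n_distinct(x)

# Missing values are dropped before counting
without_na <- n_distinct(x, na.rm = TRUE)

# Base R equivalent for a single vector
base_count <- length(unique(x))

print(c(with_na = with_na, without_na = without_na, base_count = base_count))
```

Here with_na and base_count should both be 3, and without_na should be 2.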

1.2 Example using starwars

#Let's see what is in our dataset.
glimpse(starwars)
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
#Let's use n_distinct on a categorical variable, such as species.
n_distinct(starwars$species)
[1] 38
#Let's examine how na.rm works.
n_distinct(starwars$hair_color, na.rm = FALSE)
[1] 13
#Now let's change to na.rm = TRUE
n_distinct(starwars$hair_color, na.rm = TRUE)
[1] 12
#Let's try with multiple vectors. Missing values will be included in the count.
n_distinct(starwars$hair_color, starwars$eye_color)
[1] 35
#What are the distinct pairs?
starwars %>% distinct(eye_color, hair_color)
# A tibble: 35 × 2
   eye_color hair_color   
   <chr>     <chr>        
 1 blue      blond        
 2 yellow    <NA>         
 3 red       <NA>         
 4 yellow    none         
 5 brown     brown        
 6 blue      brown, grey  
 7 blue      brown        
 8 brown     black        
 9 blue-gray auburn, white
10 blue      auburn, grey 
# ℹ 25 more rows
#How many distinct hair colors appear within each eye color?
(tibble1 <- starwars %>% group_by(eye_color) %>%
  summarise(count = n_distinct(hair_color)))
# A tibble: 15 × 2
   eye_color     count
   <chr>         <int>
 1 black             2
 2 blue              8
 3 blue-gray         1
 4 brown             4
 5 dark              1
 6 gold              1
 7 green, yellow     1
 8 hazel             2
 9 orange            2
10 pink              1
11 red               2
12 red, blue         1
13 unknown           2
14 white             1
15 yellow            6
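#Each distinct pair belongs to exactly one eye color group, so the group counts add up to the 35 distinct pairs.
sum(tibble1$count)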
[1] 35
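The same check can be reproduced on a small hand-made tibble, where the counts are easy to verify by eye. This is only a sketch: the tibble toy and its columns eye and hair are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: five characters, one missing hair color
toy <- tibble(
  eye  = c("blue", "blue", "brown", "brown", "brown"),
  hair = c("blond", NA, "black", "black", "brown")
)

# Distinct (eye, hair) pairs across the whole tibble; NA counts as a value
n_pairs <- n_distinct(toy$eye, toy$hair)

# Distinct hair colors within each eye color
by_eye <- toy %>%
  group_by(eye) %>%
  summarise(count = n_distinct(hair))

# Number of rows that distinct() returns for the same two columns
n_rows <- toy %>% distinct(eye, hair) %>% nrow()

print(by_eye)
print(c(n_pairs = n_pairs, sum_of_groups = sum(by_eye$count), n_rows = n_rows))
```

All three numbers in the last line should be 4: the pair (brown, black) appears twice but is counted once, and the missing hair color counts as its own value within the blue group.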

1.3 Is n_distinct() helpful?

The n_distinct() function is helpful when exploring categorical variables because it quickly reports how many distinct values a variable takes.

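On its own, however, n_distinct() returns only a single number. It is more helpful when used in combination with other functions, such as group_by() and summarise() in the example above, where it counts distinct values within each group.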
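As one example of such a combination, n_distinct() can be applied to every column at once with summarise() and across(). The sketch below assumes dplyr 1.0.0 or later (where across() is available) and uses a made-up tibble called survey, which is not part of the original example.

```r
library(dplyr)

# Hypothetical toy data with one missing value in the color column
survey <- tibble(
  city  = c("Oslo", "Oslo", "Lima", "Lima"),
  color = c("red", NA, "red", "blue")
)

# Count distinct values in every column, with NA included
counts <- survey %>%
  summarise(across(everything(), n_distinct))

# Same summary, but dropping missing values before counting
counts_no_na <- survey %>%
  summarise(across(everything(), ~ n_distinct(.x, na.rm = TRUE)))

print(counts)
print(counts_no_na)
```

With this data, city has 2 distinct values in both summaries, while color has 3 when NA is included and 2 when it is removed. A one-row summary like this can be a quick first look at which columns are worth exploring further.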