dplyr::sample_n() & dplyr::sample_frac()

BSTA 526 Functions of the Week

Randomly selecting a number of rows, or a proportion of rows, from a data set.
Author

Jaclyn McCrea

Published

February 19, 2026

1 dplyr::sample_n() & dplyr::sample_frac()

The two functions presented here are sample_n() and sample_frac() from dplyr, a package widely used for data wrangling.

2 What is it for?

  • The sample_n() function is used to randomly select a specified number of rows within a data set.

  • Similarly, the sample_frac() function is used to randomly select a specified fraction (proportion) of rows within a data set.

3 Examples

For all examples, I will be using the penguins data from the palmerpenguins package. Let’s take a quick look at it before we begin. Note two things: the rows are lined up by species (see the species column), and there are 344 observations in total.

library(dplyr)
library(palmerpenguins)
library(gt)
# Penguins lined up by species.
penguins %>%
  slice(c(1:3, 153:155, 277:279)) %>%
  gt()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Gentoo Biscoe 46.1 13.2 211 4500 female 2007
Gentoo Biscoe 50.0 16.3 230 5700 male 2007
Gentoo Biscoe 48.7 14.1 210 4450 female 2007
Chinstrap Dream 46.5 17.9 192 3500 female 2007
Chinstrap Dream 50.0 19.5 196 3900 male 2007
Chinstrap Dream 51.3 19.2 193 3650 male 2007
# Number of rows.
nrow(penguins)
[1] 344

3.1 Functions

If you want 100 randomly selected observations from a data set, sample_n() will draw them for you.

penguins %>%
  sample_n(100) %>%
  slice(1:8) %>%
  gt()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Biscoe 42.2 19.5 197 4275 male 2009
Adelie Torgersen 39.0 17.1 191 3050 female 2009
Chinstrap Dream 42.4 17.3 181 3600 female 2007
Gentoo Biscoe 49.3 15.7 217 5850 male 2007
Chinstrap Dream 46.2 17.5 187 3650 female 2008
Gentoo Biscoe 48.8 16.2 222 6000 male 2009
Chinstrap Dream 50.2 18.8 202 3800 male 2009
Adelie Torgersen 34.1 18.1 193 3475 NA 2007
  • The result contains 100 observations drawn at random from the original data set; the table shows only the first 8 because of slice(1:8). The mixed-up species order is consistent with random selection, since the original rows were lined up by species. The number of rows requested can be any whole number up to the number of rows in the data set. That upper limit goes away when replace = TRUE is specified, which is discussed in the arguments section.

If you want to randomly select half of the observations in the original data set, sample_frac() will do so.

penguins %>%
  sample_frac(0.5) %>%
  slice(1:8) %>%
  gt()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Chinstrap Dream 50.5 19.6 201 4050 male 2007
Chinstrap Dream 51.3 19.2 193 3650 male 2007
Gentoo Biscoe 45.5 13.9 210 4200 female 2008
Chinstrap Dream 50.2 18.7 198 3775 female 2009
Adelie Dream 39.2 21.1 196 4150 male 2007
Gentoo Biscoe 50.1 15.0 225 5000 male 2008
Chinstrap Dream 48.1 16.4 199 3325 female 2009
Gentoo Biscoe 49.4 15.8 216 4925 male 2009
  • Here, sample_frac(0.5) selected 172 observations, half of the 344 in the original data set (again, slice(1:8) displays only the first 8). As before, the rows are no longer lined up by species, which is what we expect from random selection. The fraction to sample needs to be a value between 0 and 1, unless replace = TRUE is specified, in which case it may exceed 1.
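
Because the tables only display the first few rows, the sample sizes themselves are not visible in the output. A minimal sketch that checks them directly, assuming dplyr and palmerpenguins are installed; the set.seed() value is an arbitrary choice of mine and only makes the draws repeatable:

```r
library(dplyr)
library(palmerpenguins)

set.seed(526)  # arbitrary seed so the same rows are drawn on every run

# Draw a fixed number of rows and a fixed fraction of rows.
n_sample    <- penguins %>% sample_n(100)
frac_sample <- penguins %>% sample_frac(0.5)

nrow(n_sample)     # 100
nrow(frac_sample)  # 172, half of the 344 rows
```

Without the set.seed() call, each run would return a different set of rows, although the row counts would stay the same.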

3.2 Function Arguments

Both functions have arguments that can be changed to fit what you want to do. We have already seen the two required pieces: the data to sample from, and the number or fraction of rows to select. There are two other arguments you can specify.

  • For both functions, you can set the replace argument to TRUE or FALSE. TRUE allows an observation to be selected more than once (sampling with replacement). FALSE, the default, means each observation can be selected at most once (sampling without replacement). For a fun example: with sample_n(), the number of rows you request is normally capped at the number of observations in the original data set. With replace = TRUE, we can exceed that cap, as seen below with 400 observations.
# Sample 400 rows with replacement, then count the rows.
penguins %>%
  sample_n(400, replace = TRUE) %>%
  nrow()
[1] 400
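
To see what replace changes, the sketch below first asks for 400 rows without replacement (which dplyr refuses, since only 344 rows exist) and then counts how many distinct rows end up in a sample drawn with replacement. It assumes dplyr and palmerpenguins are installed; row_id is a hypothetical helper column added only for this illustration, and the seed is arbitrary.

```r
library(dplyr)
library(palmerpenguins)

set.seed(526)  # arbitrary seed for repeatability

# Without replacement, requesting more rows than exist raises an error.
too_many <- tryCatch(
  penguins %>% sample_n(400),
  error = function(e) "error"
)
too_many

# With replacement, tag each row first so repeated draws can be counted.
with_repeats <- penguins %>%
  mutate(row_id = row_number()) %>%  # row_id: hypothetical helper column
  sample_n(400, replace = TRUE)

nrow(with_repeats)               # 400
n_distinct(with_repeats$row_id)  # at most 344, so some rows were drawn more than once
```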
  • Also for both functions, you can specify the weight argument, which gives certain rows a higher chance of being selected. The default, weight = NULL, applies no weighting, so every row is equally likely. To use it, create a new column holding a weight for each row and pass that column to weight. The weights must be non-negative; they do not need to sum to 1, because dplyr standardizes them automatically. In the first 10 rows of the table below, a Chinstrap penguin appears only once (low weight), while the other species appear more often (higher weights).
# First, create the weight column.
penguins %>%
  mutate(
    weight = case_when(
      species == "Chinstrap" ~ 0.1,
      species == "Adelie" ~ 0.3,
      species == "Gentoo" ~ 0.6
    )
  ) %>%
  # Then use the weight argument.
  sample_frac(0.5, replace = FALSE, weight = weight) %>%
  slice(1:10) %>%
  gt()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year weight
Gentoo Biscoe 46.9 14.6 222 4875 female 2009 0.6
Adelie Biscoe 40.6 18.8 193 3800 male 2008 0.3
Gentoo Biscoe 46.5 14.4 217 4900 female 2008 0.6
Chinstrap Dream 49.7 18.6 195 3600 male 2008 0.1
Adelie Dream 35.6 17.5 191 3175 female 2009 0.3
Gentoo Biscoe 48.5 14.1 220 5300 male 2008 0.6
Gentoo Biscoe 46.5 13.5 210 4550 female 2007 0.6
Gentoo Biscoe 52.5 15.6 221 5450 male 2009 0.6
Gentoo Biscoe 44.5 14.7 214 4850 female 2009 0.6
Adelie Torgersen 39.7 18.4 190 3900 male 2008 0.3
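
The weights act as relative per-row chances rather than as probabilities that must add up to 1. A minimal sketch of that idea, assuming dplyr and palmerpenguins are installed; the weights 1, 3, and 6 and the sample size of 50 are my own choices for illustration, and the seed is arbitrary:

```r
library(dplyr)
library(palmerpenguins)

set.seed(526)  # arbitrary seed for repeatability

weighted <- penguins %>%
  mutate(
    # Relative per-row weights; they are not probabilities and do not sum to 1.
    weight = case_when(
      species == "Chinstrap" ~ 1,
      species == "Adelie"    ~ 3,
      species == "Gentoo"    ~ 6
    )
  ) %>%
  sample_n(50, weight = weight)

# Species counts in the weighted sample, next to the counts in the full data.
count(weighted, species)
count(penguins, species)
```

Keep in mind that the weight applies to each row, not to each species as a whole, so the species mix of the sample also depends on how many rows each species has in the original data.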

4 Is it helpful?

  • I think these two functions are most helpful when you want to bootstrap your data, especially when you want bootstrap samples of a specific size. Bootstrapping requires sampling with replacement, which both functions provide through replace = TRUE.
  • However, these functions have been superseded by a single, newer function: dplyr::slice_sample(). It does the same things as the two functions above and has similar arguments.
  • You can still use sample_n() and sample_frac(), but it is recommended to move to slice_sample(), since the older functions will only receive critical bug fixes.
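
For anyone making the switch, here is a rough sketch of how the earlier examples might look with slice_sample(), assuming dplyr 1.0.0 or later and palmerpenguins are installed. The n and prop arguments take the place of the two separate functions, and weights are passed through weight_by instead of weight. The seed is arbitrary.

```r
library(dplyr)
library(palmerpenguins)

set.seed(526)  # arbitrary seed for repeatability

# Counterpart of sample_n(100)
by_n <- penguins %>% slice_sample(n = 100)

# Counterpart of sample_frac(0.5)
by_prop <- penguins %>% slice_sample(prop = 0.5)

# Counterpart of sample_n(400, replace = TRUE), e.g. for a bootstrap-style draw
boot <- penguins %>% slice_sample(n = 400, replace = TRUE)

c(nrow(by_n), nrow(by_prop), nrow(boot))  # 100 172 400
```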