`dplyr::sample_n()` & `dplyr::sample_frac()`

BSTA 526 Functions of the Week

Randomly selecting a number of rows, or a proportion of rows, from a data set.

Author

Jaclyn McCrea

Published

February 19, 2026

1 `dplyr::sample_n()` & `dplyr::sample_frac()`

The two functions presented here are the sample_n() and sample_frac() functions from the dplyr package, a package widely used for data wrangling.

2 What is it for?

The sample_n() function is used to randomly select a specified number of rows within a data set.

Similarly, the sample_frac() function is used to randomly select a specified proportion (fraction) of the rows within a data set.

3 Examples

For all examples, I will be using the penguins data from the palmerpenguins package. Let’s take a quick look at it before we begin. Pay close attention to the species column: the penguins are lined up by species (Adelie, then Gentoo, then Chinstrap), and there are a total of 344 observations.

# Penguins lined up by species.
penguins %>%
  slice(c(1:3, 153:155, 277:279)) %>%
  gt()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Gentoo	Biscoe	46.1	13.2	211	4500	female	2007
Gentoo	Biscoe	50.0	16.3	230	5700	male	2007
Gentoo	Biscoe	48.7	14.1	210	4450	female	2007
Chinstrap	Dream	46.5	17.9	192	3500	female	2007
Chinstrap	Dream	50.0	19.5	196	3900	male	2007
Chinstrap	Dream	51.3	19.2	193	3650	male	2007

# Number of rows.
nrow(penguins)

[1] 344

If you want 100 randomly selected observations from a data set, you can use sample_n() to get them.

penguins %>%
  sample_n(100) %>%
  slice(1:8) %>%
  gt()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Biscoe	42.2	19.5	197	4275	male	2009
Adelie	Torgersen	39.0	17.1	191	3050	female	2009
Chinstrap	Dream	42.4	17.3	181	3600	female	2007
Gentoo	Biscoe	49.3	15.7	217	5850	male	2007
Chinstrap	Dream	46.2	17.5	187	3650	female	2008
Gentoo	Biscoe	48.8	16.2	222	6000	male	2009
Chinstrap	Dream	50.2	18.8	202	3800	male	2009
Adelie	Torgersen	34.1	18.1	193	3475	NA	2007

The output above shows the first 8 of the 100 randomly selected observations taken from the original data set. The species are no longer in order, as they were in the original, which is what we would expect from a random sample. The number of rows you ask for can be any whole number up to the number of rows in your data set, although that limit changes when a different argument is specified, which is discussed in the arguments section.
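Rather than reading the sample size off the first few rows, one way to check it is to count the rows directly. The sketch below also calls set.seed() before sampling so that the same random rows come back each time; the seed value 526 and the object names sample_a and sample_b are my own choices for illustration, not part of the examples above.

```r
library(dplyr)
library(palmerpenguins)

# Arbitrary seed (an assumption for illustration) so the draw is repeatable.
set.seed(526)
sample_a <- penguins %>% sample_n(100)

# Resetting the same seed before sampling reproduces the same 100 rows.
set.seed(526)
sample_b <- penguins %>% sample_n(100)

nrow(sample_a)                 # number of rows in the sample
identical(sample_a, sample_b)  # compares the two draws row by row
```

Without the set.seed() calls, each run would usually return a different set of 100 rows.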

If you want to randomly select half of the observations in the original data set, you can use sample_frac() to do so.

penguins %>%
  sample_frac(0.5) %>%
  slice(1:8) %>%
  gt()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Chinstrap	Dream	50.5	19.6	201	4050	male	2007
Chinstrap	Dream	51.3	19.2	193	3650	male	2007
Gentoo	Biscoe	45.5	13.9	210	4200	female	2008
Chinstrap	Dream	50.2	18.7	198	3775	female	2009
Adelie	Dream	39.2	21.1	196	4150	male	2007
Gentoo	Biscoe	50.1	15.0	225	5000	male	2008
Chinstrap	Dream	48.1	16.4	199	3325	female	2009
Gentoo	Biscoe	49.4	15.8	216	4925	male	2009

This selected 172 observations, which is half of the 344 observations in the original data set; the table shows the first 8. Again, the rows are not lined up by species like the original data set was, as expected from a random sample. The fraction you want to sample needs to be a value between 0 and 1, unless you sample with replacement (see the replace argument in the next section).
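Because the table only displays the first few rows, a quick way to confirm the size of the sample is to count its rows and compare that count to the original data. This is a minimal sketch; the object name half is hypothetical.

```r
library(dplyr)
library(palmerpenguins)

# Draw half of the rows at random (without replacement, the default).
half <- penguins %>% sample_frac(0.5)

nrow(half)                   # rows in the sample
nrow(half) / nrow(penguins)  # share of the original rows that were sampled
```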

3.2 Function Arguments

Both functions have arguments that can be changed to fit what you want to do. We have already shown that you need to specify the data you want to use, as well as the number or fraction of observations to randomly select. However, there are two other arguments you can specify.

replace
weight

For both functions, you can specify the argument replace as either TRUE or FALSE. Setting it to TRUE allows an observation to be selected more than once (sampling with replacement). FALSE, which is the default setting, means each observation can be selected at most once (sampling without replacement). For a fun example, when using the sample_n() function, you are usually limited by the number of observations in the original data set when choosing how many random observations you want. However, with replace = TRUE, we can exceed that maximum, as seen below with 400 observations.

penguins %>%
  sample_n(400, replace = TRUE) %>%
  # New observation number.
  nrow()

[1] 400
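To see what sampling with replacement actually does, the sketch below tags each row with an identifier before sampling and then counts how many distinct original rows end up in the 400-row sample. The row_id column and the object names are hypothetical helpers added only for this illustration. The last step tries the same request without replacement, which I expect to stop with an error because 400 exceeds the 344 available rows.

```r
library(dplyr)
library(palmerpenguins)

# row_id is a hypothetical helper column used only to track the original rows.
penguins_id <- penguins %>% mutate(row_id = row_number())

with_repl <- penguins_id %>% sample_n(400, replace = TRUE)

nrow(with_repl)               # size of the sample
n_distinct(with_repl$row_id)  # distinct original rows; cannot exceed 344, so some rows repeat

# Without replacement, asking for more rows than exist should fail.
too_many <- tryCatch(
  penguins_id %>% sample_n(400),
  error = function(e) "error: sample size exceeds the number of rows"
)
too_many
```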

Also for both functions, you can specify the argument weight. This argument makes it so that certain rows have a higher chance of being selected. weight = NULL is the default setting, where no weight is applied and every row is equally likely to be chosen. To use this argument, you can create a new column in your data set holding a weight for each row, and then pass that new column to the weight argument. The weight values must be non-negative. They do not need to sum to 1, because they are automatically rescaled to sum to 1; what matters is their size relative to one another. In the first 10 rows of the table below, a Chinstrap penguin was only chosen once (low weight), while the other species were chosen more often (higher weights).

# First, create the weight column.
penguins %>%
  mutate(
    weight = case_when(
      species == "Chinstrap" ~ 0.1,
      species == "Adelie" ~ 0.3,
      species == "Gentoo" ~ 0.6
    )
  ) %>%
  # Use the weight argument.
  sample_frac(0.5, replace = FALSE, weight = weight) %>%
  slice(1:10) %>%
  gt()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year	weight
Gentoo	Biscoe	46.9	14.6	222	4875	female	2009	0.6
Adelie	Biscoe	40.6	18.8	193	3800	male	2008	0.3
Gentoo	Biscoe	46.5	14.4	217	4900	female	2008	0.6
Chinstrap	Dream	49.7	18.6	195	3600	male	2008	0.1
Adelie	Dream	35.6	17.5	191	3175	female	2009	0.3
Gentoo	Biscoe	48.5	14.1	220	5300	male	2008	0.6
Gentoo	Biscoe	46.5	13.5	210	4550	female	2007	0.6
Gentoo	Biscoe	52.5	15.6	221	5450	male	2009	0.6
Gentoo	Biscoe	44.5	14.7	214	4850	female	2009	0.6
Adelie	Torgersen	39.7	18.4	190	3900	male	2008	0.3
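Ten rows are a small window into a 172-row sample, so a fuller check of the weighting is to count species in the whole sample and compare against the original data. The sketch below reuses the same weights; the seed value and the object name weighted_half are my own assumptions for illustration. If the weights work as described, you would expect Chinstrap penguins to make up a smaller share of the sample than of the original data.

```r
library(dplyr)
library(palmerpenguins)

set.seed(526)  # arbitrary seed, chosen only to make the sketch repeatable

weighted_half <- penguins %>%
  mutate(
    weight = case_when(
      species == "Chinstrap" ~ 0.1,
      species == "Adelie" ~ 0.3,
      species == "Gentoo" ~ 0.6
    )
  ) %>%
  sample_frac(0.5, weight = weight)

# Species counts in the original data versus the weighted sample.
penguins %>% count(species)
weighted_half %>% count(species)
```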

4 Is it helpful?

I think these two functions are the most helpful when you want to bootstrap your data, especially when you want bootstrap samples of a specific size. A bootstrap sample is drawn with replacement, so it needs the replace = TRUE argument, which both functions provide.
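As a rough illustration of that idea, here is a minimal bootstrap sketch that resamples all rows with replacement and records the mean body mass each time. The number of replicates (500), the seed, the choice of body_mass_g, and the percentile interval are all assumptions of mine, not something prescribed by the functions themselves.

```r
library(dplyr)
library(palmerpenguins)

set.seed(526)  # arbitrary seed for repeatability

# Each replicate resamples every row with replacement and
# records the mean body mass, ignoring missing values.
boot_means <- replicate(500, {
  penguins %>%
    sample_frac(1, replace = TRUE) %>%
    summarise(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
    pull(mean_mass)
})

# Percentile interval for the mean body mass (g).
quantile(boot_means, probs = c(0.025, 0.975))
```

Using sample_frac(1, replace = TRUE) keeps each bootstrap sample the same size as the original data; sample_n() with replace = TRUE would let you pick a different size instead.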

However, these functions have been superseded by a single, newer function: dplyr::slice_sample(). This function covers the same tasks as the two functions above and has similar arguments: n for a number of rows, prop for a proportion of rows, replace, and weight_by in place of weight.

You can still use sample_n() and sample_frac(), but it is recommended to move to the newer function, as the old functions will only receive necessary bug fixes.
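For anyone making that switch, the sketch below shows how I would translate the earlier examples to slice_sample(). The object names and the weighting expression (giving Gentoo penguins twice the weight of the others) are hypothetical and only there for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Equivalent of sample_n(100).
by_n <- penguins %>% slice_sample(n = 100)

# Equivalent of sample_frac(0.5).
by_prop <- penguins %>% slice_sample(prop = 0.5)

# Equivalent of sample_n(400, replace = TRUE).
by_repl <- penguins %>% slice_sample(n = 400, replace = TRUE)

# The weight argument is called weight_by here; it accepts a column
# or an expression. Gentoo rows get twice the weight of the others.
by_weight <- penguins %>%
  slice_sample(n = 100, weight_by = if_else(species == "Gentoo", 2, 1))

c(nrow(by_n), nrow(by_prop), nrow(by_repl), nrow(by_weight))
```

One difference to be aware of, as far as I know: when n is larger than the number of rows and replace = FALSE, slice_sample() returns all the rows it has instead of stopping with an error.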