Function of the Week:

janitor::get_dupes

Author

Sofia Chapela Lara

Published

March 6, 2024

1 `janitor::get_dupes()`

In this document, I will introduce the get_dupes() function and show what it’s for. This function is part of the janitor package, so we will need to load janitor first.

#loading janitor
library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

1.1 What is it for?

This function is quite simple but useful, especially when we are working with big data frames. It identifies the rows of a data frame that share the same values in one or more variables.

Here is an example to illustrate how it works:

First, I will create a data frame called example:

example <- data.frame(
  ID = c(01,02,03,04,05,06,07,08,09,10,06,12,13,07,15,16,17,06),
  group = c("T", "T", "C", "C", "C", "T", "T", "C", "T", "C", "C", "C", "T", "T", "C", "C","T","C"),
  age = c(26, 26, 22, 25, 29, 33, 26, 32, 33, 31, 35, 32, 24, 25, 28, 20, 29, 35))
example

   ID group age
1   1     T  26
2   2     T  26
3   3     C  22
4   4     C  25
5   5     C  29
6   6     T  33
7   7     T  26
8   8     C  32
9   9     T  33
10 10     C  31
11  6     C  35
12 12     C  32
13 13     T  24
14  7     T  25
15 15     C  28
16 16     C  20
17 17     T  29
18  6     C  35

Since this is a small data frame, we can visually inspect it and notice that participants with ID numbers 6 and 7 are repeated. However, if our data set were larger, identifying duplicates visually would be a tedious and error-prone process.

Now, we can use get_dupes() following this format: get_dupes(dat, ...), where:

dat: the data frame to check

...: the names of the variables to search for duplicates (unquoted)

get_dupes(example, ID)

  ID dupe_count group age
1  6          3     T  33
2  6          3     C  35
3  6          3     C  35
4  7          2     T  26
5  7          2     T  25

The output gives us the rows with duplicate records in the specified variable (ID) and a count of the duplicates (dupe_count).

This confirms that ID 6 appears 3 times and ID 7 appears 2 times.
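To make the meaning of dupe_count more concrete, here is a minimal sketch that reproduces the same idea with dplyr: count how many rows share each ID, then keep only the IDs that appear more than once. This is my own approximation for illustration, not necessarily how janitor implements get_dupes() internally. The name manual_dupes is made up for this example, and the example data frame is repeated so the block runs on its own.

```r
library(janitor)

# Same example data frame as above, repeated so this block is self-contained
example <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 6, 12, 13, 7, 15, 16, 17, 6),
  group = c("T", "T", "C", "C", "C", "T", "T", "C", "T", "C", "C", "C", "T", "T", "C", "C", "T", "C"),
  age = c(26, 26, 22, 25, 29, 33, 26, 32, 33, 31, 35, 32, 24, 25, 28, 20, 29, 35)
)

# Assumption: an approximation of get_dupes(example, ID) written with dplyr
manual_dupes <- example |>
  dplyr::add_count(ID, name = "dupe_count") |>   # rows sharing each ID
  dplyr::filter(dupe_count > 1) |>               # keep repeated IDs only
  dplyr::arrange(dplyr::desc(dupe_count), ID)    # most repeated first

manual_dupes
```

The columns come out in a different order than with get_dupes(), but the same rows should be returned.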

We can also use pipes with the get_dupes function:

example |>
  get_dupes(age)

   age dupe_count ID group
1   26          3  1     T
2   26          3  2     T
3   26          3  7     T
4   25          2  4     C
5   25          2  7     T
6   29          2  5     C
7   29          2 17     T
8   32          2  8     C
9   32          2 12     C
10  33          2  6     T
11  33          2  9     T
12  35          2  6     C
13  35          2  6     C

The output lists the duplicated values of age. The rows are sorted by dupe_count in descending order, so the most frequently repeated values appear at the top of the table; in this output, ties appear in ascending order of age. Here, we notice that age 26 appears 3 times, age 25 appears 2 times, and so forth. These results are not very informative, since it is normal for several participants to share an age, so we need to be careful to select a meaningful variable when checking for duplicates.

If we don’t specify any variables, get_dupes() will look for duplicates using all columns:

example |>
  get_dupes()

No variable names specified - using all columns.

  ID group age dupe_count
1  6     C  35          2
2  6     C  35          2

Here, we have two rows with the exact same values in all columns.

We can also use tidyselect helpers. For example, we can look for duplicates among all variables except age:

example |>
  get_dupes(-age)

  ID group dupe_count age
1  6     C          2  35
2  6     C          2  35
3  7     T          2  26
4  7     T          2  25

Even though the output still displays the age column, age is not used to identify duplicates here: the two rows for ID 7 are returned although their ages differ (26 and 25).
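Since the data frame only has three columns, excluding age should be the same as naming ID and group explicitly. The short check below compares both calls. The object names by_exclusion and by_name are made up for this illustration, and the example data frame is repeated so the block runs on its own.

```r
library(janitor)

# Same example data frame as above, repeated so this block is self-contained
example <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 6, 12, 13, 7, 15, 16, 17, 6),
  group = c("T", "T", "C", "C", "C", "T", "T", "C", "T", "C", "C", "C", "T", "T", "C", "C", "T", "C"),
  age = c(26, 26, 22, 25, 29, 33, 26, 32, 33, 31, 35, 32, 24, 25, 28, 20, 29, 35)
)

# Look for duplicates in every column except age
by_exclusion <- get_dupes(example, -age)

# Look for duplicated combinations of ID and group, named explicitly
by_name <- get_dupes(example, ID, group)

# TRUE if both calls return the same rows and columns
isTRUE(all.equal(by_exclusion, by_name))
```

Naming the columns explicitly is a bit longer, but it keeps working as intended if new columns are added to the data frame later.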

1.2 Is it helpful?

Yes, this function can save us a lot of time during the data cleaning process. I use it in the early stages of data analysis to identify potential coding errors. For me, it is especially useful before merging datasets, to ensure that each subject ID appears only once.
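As a rough sketch of that pre-merge check, one option is to drop the rows that are identical in every column with dplyr::distinct() and then run get_dupes() again to see which IDs still need a manual decision. The names deduped, remaining and ids_are_unique are hypothetical, and how to resolve the remaining conflicts depends on the study, so this is only one possible workflow.

```r
library(janitor)

# Same example data frame as above, repeated so this block is self-contained
example <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 6, 12, 13, 7, 15, 16, 17, 6),
  group = c("T", "T", "C", "C", "C", "T", "T", "C", "T", "C", "C", "C", "T", "T", "C", "C", "T", "C"),
  age = c(26, 26, 22, 25, 29, 33, 26, 32, 33, 31, 35, 32, 24, 25, 28, 20, 29, 35)
)

# Drop rows that are identical in every column, keeping the first copy
deduped <- dplyr::distinct(example)

# IDs that still appear more than once have conflicting records
remaining <- get_dupes(deduped, ID)
remaining

# Hypothetical guard before a merge: TRUE only when every ID is unique
ids_are_unique <- nrow(remaining) == 0
ids_are_unique
```

In this example the exact copy of ID 6 is removed, but IDs 6 and 7 still have conflicting group or age values, so they would have to be checked against the original records before merging.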
