#loading janitor
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
janitor::get_dupes
Sofia Chapela Lara
March 6, 2024
janitor::get_dupes()
In this document, I will introduce the get_dupes() function and show what it is for. This function is part of the janitor package, so we need to load janitor first with library(janitor).
This function is simple but useful, especially when we are working with large data frames. It identifies duplicate rows in a given data frame.
Here is an example to illustrate how it works:
First, I will create a data frame called example:
example <- data.frame(
ID = c(01,02,03,04,05,06,07,08,09,10,06,12,13,07,15,16,17,06),
group = c("T", "T", "C", "C", "C", "T", "T", "C", "T", "C", "C", "C", "T", "T", "C", "C","T","C"),
age = c(26, 26, 22, 25, 29, 33, 26, 32, 33, 31, 35, 32, 24, 25, 28, 20, 29, 35))
example
ID group age
1 1 T 26
2 2 T 26
3 3 C 22
4 4 C 25
5 5 C 29
6 6 T 33
7 7 T 26
8 8 C 32
9 9 T 33
10 10 C 31
11 6 C 35
12 12 C 32
13 13 T 24
14 7 T 25
15 15 C 28
16 16 C 20
17 17 T 29
18 6 C 35
Since this is a small data frame, we can visually inspect it and notice that participants with ID numbers 6 and 7 are repeated. However, if our data set were larger, identifying duplicates visually would be a tedious and error-prone process.
Now, we can use get_dupes() following this format: get_dupes(dat, ...), where:
dat: name of the data frame
...: names of the variables to search for duplicates (unquoted)
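The call that produces the output below is not shown in the text. A minimal sketch, assuming the example data frame defined earlier (repeated here so the snippet runs on its own) and ID as the variable to check, would be:

```r
library(janitor)

# Same data as the `example` data frame defined earlier
example <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 6, 12, 13, 7, 15, 16, 17, 6),
  group = c("T", "T", "C", "C", "C", "T", "T", "C", "T", "C", "C", "C", "T", "T", "C", "C", "T", "C"),
  age = c(26, 26, 22, 25, 29, 33, 26, 32, 33, 31, 35, 32, 24, 25, 28, 20, 29, 35)
)

# Return every row whose ID appears more than once,
# plus a dupe_count column with the number of occurrences
dupes_by_id <- get_dupes(example, ID)
dupes_by_id
```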
ID dupe_count group age
1 6 3 T 33
2 6 3 C 35
3 6 3 C 35
4 7 2 T 26
5 7 2 T 25
The output gives us the rows with duplicate records in the specified variable (ID) and a count of the duplicates (dupe_count).
This confirms that ID number 6 appears 3 times and ID number 7 appears 2 times.
We can also use pipes with the get_dupes() function:
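The piped call itself is missing from the text. A plausible sketch, assuming the same example data frame and age as the variable of interest, is shown here with the native pipe |> (which requires R 4.1 or later; the magrittr pipe %>% would work the same way):

```r
library(janitor)

# Same data as the `example` data frame defined earlier
example <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 6, 12, 13, 7, 15, 16, 17, 6),
  group = c("T", "T", "C", "C", "C", "T", "T", "C", "T", "C", "C", "C", "T", "T", "C", "C", "T", "C"),
  age = c(26, 26, 22, 25, 29, 33, 26, 32, 33, 31, 35, 32, 24, 25, 28, 20, 29, 35)
)

# Pipe the data frame into get_dupes() and look for repeated ages
dupes_by_age <- example |> get_dupes(age)
dupes_by_age
```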
age dupe_count ID group
1 26 3 1 T
2 26 3 2 T
3 26 3 7 T
4 25 2 4 C
5 25 2 7 T
6 29 2 5 C
7 29 2 17 T
8 32 2 8 C
9 32 2 12 C
10 33 2 6 T
11 33 2 9 T
12 35 2 6 C
13 35 2 6 C
The output shows the duplicates for age. The rows are sorted by dupe_count in descending order, so the most frequently repeated values appear at the top of the table. Here, age 26 appears 3 times, and ages 25, 29, 32, 33, and 35 appear 2 times each. These results are not very informative, because unrelated participants can share the same age, so we need to be careful to select a meaningful variable when checking for duplicates.
If we don’t specify any variables, get_dupes() will look for duplicates using all columns:
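The call is not shown in the text. Assuming the same example data frame, it would presumably be get_dupes() with the data frame as its only argument:

```r
library(janitor)

# Same data as the `example` data frame defined earlier
example <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 6, 12, 13, 7, 15, 16, 17, 6),
  group = c("T", "T", "C", "C", "C", "T", "T", "C", "T", "C", "C", "C", "T", "T", "C", "C", "T", "C"),
  age = c(26, 26, 22, 25, 29, 33, 26, 32, 33, 31, 35, 32, 24, 25, 28, 20, 29, 35)
)

# No variables given: rows must match on every column to count as duplicates
full_dupes <- get_dupes(example)
full_dupes
```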
No variable names specified - using all columns.
ID group age dupe_count
1 6 C 35 2
2 6 C 35 2
Here, we have two rows with the exact same values in all columns.
We can also use tidyselect helpers. For example, we can look for duplicates among all variables except age:
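Neither the call nor its output appears in the text. A minimal sketch, assuming the same example data frame and using -age to exclude age from the comparison:

```r
library(janitor)

# Same data as the `example` data frame defined earlier
example <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 6, 12, 13, 7, 15, 16, 17, 6),
  group = c("T", "T", "C", "C", "C", "T", "T", "C", "T", "C", "C", "C", "T", "T", "C", "C", "T", "C"),
  age = c(26, 26, 22, 25, 29, 33, 26, 32, 33, 31, 35, 32, 24, 25, 28, 20, 29, 35)
)

# Compare rows on every column except age (here: ID and group)
dupes_except_age <- get_dupes(example, -age)
dupes_except_age
```

With this data, only the ID and group combinations 6/C and 7/T occur more than once, so the result should contain those four rows.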
Even though the output displays a column for age, age is not used to identify duplicates here, so repeated age values are not counted as they were before.
In short, this function can save us a lot of time during the data cleaning process. I use it at early stages of data analysis to identify potential coding errors. For me, it is especially useful before merging datasets, to ensure we have only one ID code per subject.
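As an illustration of the pre-merge check, here is a small sketch. The demographics and scores data frames are hypothetical and invented for this example; they are not part of the original analysis.

```r
library(janitor)

# Hypothetical data: subject 3 was entered twice in `demographics`
demographics <- data.frame(ID = c(1, 2, 3, 3), age = c(26, 22, 25, 25))
scores <- data.frame(ID = c(1, 2, 3), score = c(10, 12, 9))

# Check for repeated IDs before merging
id_dupes <- get_dupes(demographics, ID)

if (nrow(id_dupes) > 0) {
  # In this toy case the repeated rows are exact copies, so we keep one of each.
  # Real duplicates usually need to be inspected before deciding what to drop.
  demographics <- unique(demographics)
}

# Confirm that each subject now has a single row, then merge
stopifnot(nrow(get_dupes(demographics, ID)) == 0)
merged <- merge(demographics, scores, by = "ID")
merged
```

Without the check, the merge would silently return two rows for subject 3.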