#load tidyverse up
library(tidyverse)
#example dataset
library(palmerpenguins)
data(penguins)
tidyr::drop_na()
Function of the Week
1 tidyr::drop_na()
In this document, I will introduce the drop_na() function and show what it’s for.
1.1 What is it for?
The drop_na() function drops rows from your dataset that have missing values (NA) in the columns you ask it to inspect. It takes two arguments:
data: the data frame to filter.
...: the columns to inspect for missing values. If none are specified, all columns are inspected.
Example code setup: data %>% drop_na() or data %>% drop_na(column_name)
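To make the two call patterns concrete, here is a minimal sketch on a made-up data frame. The object `toy` and its columns `id`, `height`, and `colour` are hypothetical and are not part of the original examples; only `drop_na()` itself comes from tidyr.

```r
library(tidyr)

# Hypothetical toy data frame, invented for illustration only
toy <- data.frame(
  id     = 1:4,
  height = c(10.2, NA, 8.7, 9.1),
  colour = c("red", "blue", NA, "green")
)

# No columns given: a row is dropped if ANY column holds an NA
toy_complete <- drop_na(toy)        # rows with id 1 and 4 remain

# One column given: a row is dropped only if `height` is NA
toy_height <- drop_na(toy, height)  # rows with id 1, 3 and 4 remain
```

The row with id 3 survives the second call even though its `colour` is missing, because only `height` was inspected.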
1.2 Example with Penguins dataset
#How many NA's in our dataset?
sum(is.na(penguins))
[1] 19
#What columns contain NA's?
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
                                 Mean   :43.92   Mean   :17.15
                                 3rd Qu.:48.50   3rd Qu.:18.70
                                 Max.   :59.60   Max.   :21.50
                                 NA's   :2       NA's   :2
 flipper_length_mm  body_mass_g       sex           year
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007
 Median :197.0     Median :4050   NA's  : 11   Median :2008
 Mean   :200.9     Mean   :4202                Mean   :2008
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009
 Max.   :231.0     Max.   :6300                Max.   :2009
 NA's   :2         NA's   :2
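Reading NA counts off a summary table works, but a direct per-column count can be easier to scan. The following is a small sketch using base R's `is.na()` and `colSums()`; the name `na_per_column` is introduced here for illustration.

```r
library(palmerpenguins)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(penguins))

# Show only the columns that actually contain missing values
na_per_column[na_per_column > 0]
```

The counts should agree with the NA's rows in the summary above: 2 each for the four measurement columns and 11 for sex, for a total of 19.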
1.2.1 Drop from one specific column
#Drop NA's from just one column
penguins_dropNA_mass <- penguins %>% drop_na(body_mass_g)
summary(penguins_dropNA_mass)
      species          island    bill_length_mm  bill_depth_mm
 Adelie   :151   Biscoe   :167   Min.   :32.10   Min.   :13.10
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
 Gentoo   :123   Torgersen: 51   Median :44.45   Median :17.30
                                 Mean   :43.92   Mean   :17.15
                                 3rd Qu.:48.50   3rd Qu.:18.70
                                 Max.   :59.60   Max.   :21.50
 flipper_length_mm  body_mass_g       sex           year
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007
 Median :197.0     Median :4050   NA's  :  9   Median :2008
 Mean   :200.9     Mean   :4202                Mean   :2008
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009
 Max.   :231.0     Max.   :6300                Max.   :2009
1.2.2 Drop from all columns
#Drop NA's from all columns
penguins_dropNA_all <- penguins %>% drop_na()
summary(penguins_dropNA_all)
      species          island    bill_length_mm  bill_depth_mm
 Adelie   :146   Biscoe   :163   Min.   :32.10   Min.   :13.10
 Chinstrap: 68   Dream    :123   1st Qu.:39.50   1st Qu.:15.60
 Gentoo   :119   Torgersen: 47   Median :44.50   Median :17.30
                                 Mean   :43.99   Mean   :17.16
                                 3rd Qu.:48.60   3rd Qu.:18.70
                                 Max.   :59.60   Max.   :21.50
 flipper_length_mm  body_mass_g       sex           year
 Min.   :172       Min.   :2700   female:165   Min.   :2007
 1st Qu.:190       1st Qu.:3550   male  :168   1st Qu.:2007
 Median :197       Median :4050                Median :2008
 Mean   :201       Mean   :4207                Mean   :2008
 3rd Qu.:213       3rd Qu.:4775                3rd Qu.:2009
 Max.   :231       Max.   :6300                Max.   :2009
1.3 Example with Glucose dataset
library(nlme)
summary(Glucose)
 Subject      Time            conc          Meal
 6:63    Min.   :-0.25   Min.   : 2.070   2am :66
 2:63    1st Qu.: 0.50   1st Qu.: 4.348   6am :66
 3:63    Median : 2.00   Median : 4.900   10am:60
 5:63    Mean   : 2.50   Mean   : 5.511   2pm :60
 1:63    3rd Qu.: 4.00   3rd Qu.: 6.357   6pm :60
 4:63    Max.   : 7.00   Max.   :10.470   10pm:66
                         NA's   :2
#Drop all NA's
Glucose_noNA <- Glucose %>% drop_na()
summary(Glucose_noNA)
 Subject      Time             conc          Meal
 6:62    Min.   :-0.250   Min.   : 2.070   2am :65
 2:63    1st Qu.: 0.500   1st Qu.: 4.348   6am :66
 3:63    Median : 2.000   Median : 4.900   10am:59
 5:63    Mean   : 2.484   Mean   : 5.511   2pm :60
 1:63    3rd Qu.: 4.000   3rd Qu.: 6.357   6pm :60
 4:62    Max.   : 7.000   Max.   :10.470   10pm:66
1.4 Graphical considerations
Does having NA’s in the graph make sense?
ggplot(data = penguins,
       aes(x = species, fill = species)) +
  geom_bar() +
  facet_wrap(vars(sex)) +
  labs(x = "Species",
       y = "Count",
       title = "Frequency of Species by Sex")
Does having NA’s in the graph make a difference?
ggplot(data = Glucose,
       aes(x = Subject, y = conc)) +
  geom_boxplot() +
  labs(x = "Subject Number",
       y = "Glucose Concentration",
       title = "Boxplot of Glucose by Subject")
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
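For the bar chart of species by sex, one option is to drop only the rows where sex is missing before plotting, so that no panel is drawn for NA. This is a sketch of that idea rather than the only way to handle it; the names `penguins_known_sex` and `p` are introduced here for illustration.

```r
library(tidyr)
library(ggplot2)
library(palmerpenguins)

# Inspect only `sex`, so rows with other missing measurements are kept
penguins_known_sex <- drop_na(penguins, sex)

# Same plot as before, now faceted over female and male only
p <- ggplot(data = penguins_known_sex,
            aes(x = species, fill = species)) +
  geom_bar() +
  facet_wrap(vars(sex)) +
  labs(x = "Species",
       y = "Count",
       title = "Frequency of Species by Sex")

p
```

For the glucose boxplot the choice matters less, since ggplot2 already removes the two non-finite values itself and says so in the warning.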
1.5 Is it helpful?
I find this tool very helpful, but it must be used with care. Some modeling techniques do not work with missing values, so the NA’s have to be handled one way or another. However, dropping every row that has an NA in any column can drastically reduce the size of your dataset. There are several things to consider before dropping NA’s:
How many NA’s are in your dataset? How much data do you lose if you drop them?
What types of data have NA’s?
Are the NA’s in the variables you will be considering?
Does the modeling technique you want to use accept NA’s?
Does the graph you want to generate make sense with NA’s?
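The first question in the list can be answered with a few row counts before committing to a drop. Below is a minimal sketch on the penguins data; the variable names are chosen here for illustration.

```r
library(tidyr)
library(palmerpenguins)

n_before     <- nrow(penguins)                         # all rows
n_after_all  <- nrow(drop_na(penguins))                # NA-free in every column
n_after_mass <- nrow(drop_na(penguins, body_mass_g))   # NA-free in body_mass_g

# Rows lost, and the share of the dataset they represent
rows_lost_all  <- n_before - n_after_all
share_lost_all <- rows_lost_all / n_before

rows_lost_all
round(100 * share_lost_all, 1)  # percentage of rows lost
```

Here dropping across all columns costs 11 of 344 rows, while inspecting only body_mass_g costs 2, which is the kind of difference worth checking before choosing between the two calls.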