tidyr::drop_na()

Function of the Week

The drop_na() function drops rows from your dataset in which the inspected columns contain missing values.

Author

Ariel Weingarten

Published

January 31, 2024

1 `tidyr::drop_na()`

In this document, I will introduce the drop_na() function and show what it’s for.

#Load the tidyverse (includes tidyr, dplyr and ggplot2)
library(tidyverse)
#Example dataset
library(palmerpenguins)
data(penguins)

1.1 What is it for?

The drop_na() function drops every row in which at least one of the inspected columns contains a missing value (NA). It takes two arguments:

data : A data frame.
... : Columns to inspect for missing values. If not specified, all columns are inspected.

Example code setup: data %>% drop_na() or data %>% drop_na(column_name)
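To make the difference between the two calls concrete, here is a minimal sketch on a made-up data frame. The names toy, id, height and weight are hypothetical and only serve the illustration.

```r
library(tidyr)

# Hypothetical toy data (not part of the penguins example)
toy <- data.frame(
  id     = 1:4,
  height = c(150, NA, 165, 170),
  weight = c(55, 60, NA, 72)
)

# No columns named: a row is dropped if ANY column is NA (ids 2 and 3 go)
toy_all <- drop_na(toy)
toy_all

# One column named: only rows with an NA in height are dropped (id 2 goes)
toy_height <- drop_na(toy, height)
toy_height

# Several columns can be named at once
toy_both <- drop_na(toy, height, weight)
toy_both
```

Note that the row with a missing weight survives drop_na(toy, height), because only height is inspected.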

1.2 Example with Penguins dataset

#How many NA's in our dataset?
sum(is.na(penguins))

[1] 19

#Which columns contain NA's?
summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2
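If the full summary is more than you need, a more compact alternative is to count the NA's per column with base R. This is a small sketch; the name na_per_column is made up for the example.

```r
library(palmerpenguins)

# Count the missing values in each column
na_per_column <- colSums(is.na(penguins))
na_per_column

# Show only the columns that actually contain NA's
na_per_column[na_per_column > 0]
```

The per-column counts add up to the 19 NA's found above.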

1.2.1 Drop from one specific column

#Drop rows with an NA in just one column (body_mass_g)
penguins_dropNA_mass <- penguins %>% drop_na(body_mass_g)
summary(penguins_dropNA_mass)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :151   Biscoe   :167   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :123   Torgersen: 51   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  :  9   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009

1.2.2 Drop from all columns

#Drop rows with an NA in any column
penguins_dropNA_all <- penguins %>% drop_na()
summary(penguins_dropNA_all)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :146   Biscoe   :163   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :123   1st Qu.:39.50   1st Qu.:15.60  
 Gentoo   :119   Torgersen: 47   Median :44.50   Median :17.30  
                                 Mean   :43.99   Mean   :17.16  
                                 3rd Qu.:48.60   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172       Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190       1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197       Median :4050                Median :2008  
 Mean   :201       Mean   :4207                Mean   :2008  
 3rd Qu.:213       3rd Qu.:4775                3rd Qu.:2009  
 Max.   :231       Max.   :6300                Max.   :2009
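The summaries show which values remain, but the cost of each approach is easier to see from the row counts. A short sketch that compares them directly:

```r
library(tidyr)
library(palmerpenguins)

n_original <- nrow(penguins)                        # 344 rows in the original data
n_mass     <- nrow(drop_na(penguins, body_mass_g))  # 342 rows: 2 rows have no body mass
n_all      <- nrow(drop_na(penguins))               # 333 rows: 11 rows have an NA somewhere

c(original = n_original, drop_mass = n_mass, drop_all = n_all)
```

Inspecting only body_mass_g costs 2 rows, while inspecting every column costs 11.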

1.3 Example with Glucose dataset

library(nlme)
summary(Glucose)

 Subject      Time            conc          Meal   
 6:63    Min.   :-0.25   Min.   : 2.070   2am :66  
 2:63    1st Qu.: 0.50   1st Qu.: 4.348   6am :66  
 3:63    Median : 2.00   Median : 4.900   10am:60  
 5:63    Mean   : 2.50   Mean   : 5.511   2pm :60  
 1:63    3rd Qu.: 4.00   3rd Qu.: 6.357   6pm :60  
 4:63    Max.   : 7.00   Max.   :10.470   10pm:66  
                         NA's   :2

#Drop every row that contains an NA in any column
Glucose_noNA <- Glucose %>% drop_na()
summary(Glucose_noNA)

 Subject      Time             conc          Meal   
 6:62    Min.   :-0.250   Min.   : 2.070   2am :65  
 2:63    1st Qu.: 0.500   1st Qu.: 4.348   6am :66  
 3:63    Median : 2.000   Median : 4.900   10am:59  
 5:63    Mean   : 2.484   Mean   : 5.511   2pm :60  
 1:63    3rd Qu.: 4.000   3rd Qu.: 6.357   6pm :60  
 4:62    Max.   : 7.000   Max.   :10.470   10pm:66

1.4 Graphical considerations

Does having NA’s in the graph make sense?

ggplot(data = penguins,
       aes(x = species, fill = species)) +
  geom_bar() +
  facet_wrap(vars(sex)) +
  labs(x = "Species",
       y = "Count",
       title = "Frequency of Species by Sex")
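Because sex contains NA's, facet_wrap() draws a separate NA panel next to female and male. If that panel is not wanted, one option is to drop the rows with a missing sex before plotting. This is a sketch; the names penguins_sexed and p_sexed are made up for the example.

```r
library(tidyverse)
library(palmerpenguins)

# Keep only penguins with a recorded sex, so no NA panel is drawn
penguins_sexed <- penguins %>% drop_na(sex)

p_sexed <- ggplot(data = penguins_sexed,
                  aes(x = species, fill = species)) +
  geom_bar() +
  facet_wrap(vars(sex)) +
  labs(x = "Species",
       y = "Count",
       title = "Frequency of Species by Sex")
p_sexed
```

Only sex is inspected here, so penguins with other missing measurements still appear in the plot as long as their sex was recorded.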

Does having NA’s in the graph make a difference?

ggplot(data = Glucose,
       aes(x = Subject, y = conc)) +
  geom_boxplot() +
  labs(x = "Subject Number", 
       y = "Glucose Concentration",
       title = "Boxplot of Glucose by Subject")

Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
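The warning says that ggplot2 already leaves out the two rows with a missing conc, so dropping them beforehand should give the same boxplot without the warning. A sketch of that version, with glucose_complete as a made-up name:

```r
library(tidyverse)
library(nlme)

# Remove the rows with a missing glucose concentration before plotting
glucose_complete <- Glucose %>% drop_na(conc)

ggplot(data = glucose_complete,
       aes(x = Subject, y = conc)) +
  geom_boxplot() +
  labs(x = "Subject Number",
       y = "Glucose Concentration",
       title = "Boxplot of Glucose by Subject")
```

In this case the NA's make no visible difference to the graph; dropping them only makes the removal explicit.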

1.5 Is it helpful?

I find this tool very helpful, but it must be used with care. Some modeling techniques do not work with missing values, so the NA’s have to be handled in some way before modeling. However, dropping every row that has an NA in any column can drastically reduce the size of your dataset. There are several things to consider before dropping NA’s:

How many NA’s are in your dataset? How much data do you lose if you drop them?
What types of data have NA’s?
Are the NA’s in the variables you will be considering?
Does the modeling technique you want to use accept NA’s?
Does the graph you want to generate make sense with NA’s?
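As an illustration of the modeling question, here is a sketch with lm(), chosen only as a familiar example. With R's default options, lm() drops incomplete rows on its own, so it is worth checking how many observations a model really used. The names fit, penguins_model and fit_explicit are made up for the example.

```r
library(tidyr)
library(palmerpenguins)

# With the default na.action (na.omit), lm() silently drops incomplete rows
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
nobs(fit)  # number of rows the model actually used

# The same model with the rows dropped explicitly beforehand
penguins_model <- drop_na(penguins, body_mass_g, flipper_length_mm)
fit_explicit <- lm(body_mass_g ~ flipper_length_mm, data = penguins_model)

# Both fits use the same rows and give the same coefficients
nobs(fit) == nobs(fit_explicit)
all.equal(coef(fit), coef(fit_explicit))
```

Dropping only the columns that enter the model keeps more rows than a blanket drop_na() on the whole dataset.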
