tidyr::drop_na()

Function of the Week

The drop_na() function drops rows from your dataset when the inspected columns contain missing values.
Author

Ariel Weingarten

Published

January 31, 2024

1 tidyr::drop_na()

In this document, I will introduce the drop_na() function and show what it’s for.

# Load the tidyverse (attaches tidyr, dplyr, and ggplot2, among others)
library(tidyverse)
# Load the example dataset
library(palmerpenguins)
data(penguins)

1.1 What is it for?

The drop_na() function drops every row in which any of the inspected columns contains a missing value (NA). It takes two arguments:

  • data : a data frame

  • ... : the columns to inspect for missing values. If none are specified, all columns are inspected.

Example code setup: data %>% drop_na() or data %>% drop_na(column_name)
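To make the role of the ... argument concrete, here is a minimal sketch on a made-up three-row tibble. The data and the names toy, height, and label are hypothetical and exist only for illustration. It shows the same call with no columns, one column, and a tidyselect helper.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: NA's sit in different rows of different columns
toy <- tibble(
  id     = 1:3,
  height = c(1.2, NA, 3.4),
  label  = c("a", "b", NA)
)

# No columns given: a row is dropped if ANY column has an NA
all_cols <- drop_na(toy)

# One column given: only NA's in `height` cause a row to be dropped
height_only <- drop_na(toy, height)

# A tidyselect helper also works: inspect every column starting with "l"
label_only <- drop_na(toy, starts_with("l"))

all_cols$id      # 1
height_only$id   # 1 3
label_only$id    # 1 2
```

Row 2 survives the last call because its only NA is in height, which was not inspected.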

1.2 Example with Penguins dataset

#How many NA's in our dataset?
(sum(is.na(penguins)))
[1] 19
#What columns contain NA's?
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
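The summary() output works, but the NA counts are spread across the table. As an alternative sketch using only base R, colSums() over is.na() returns one count per column. The name na_per_column is just an illustrative choice.

```r
library(palmerpenguins)

# is.na() on a data frame gives a logical matrix; colSums() counts TRUEs per column
na_per_column <- colSums(is.na(penguins))

# Show only the columns that actually contain missing values
na_per_column[na_per_column > 0]
```

The counts should match the NA's rows in the summary above (2 for each of the four measurement columns and 11 for sex) and add up to the total of 19.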

1.2.1 Drop rows with NA’s in one specific column

#Drop rows where body_mass_g is NA
penguins_dropNA_mass <- penguins %>% drop_na(body_mass_g)
summary(penguins_dropNA_mass)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :151   Biscoe   :167   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :123   Torgersen: 51   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  :  9   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  

1.2.2 Drop rows with NA’s in any column

#Drop rows that have an NA in any column
penguins_dropNA_all <- penguins %>% drop_na()
summary(penguins_dropNA_all)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :146   Biscoe   :163   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :123   1st Qu.:39.50   1st Qu.:15.60  
 Gentoo   :119   Torgersen: 47   Median :44.50   Median :17.30  
                                 Mean   :43.99   Mean   :17.16  
                                 3rd Qu.:48.60   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172       Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190       1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197       Median :4050                Median :2008  
 Mean   :201       Mean   :4207                Mean   :2008  
 3rd Qu.:213       3rd Qu.:4775                3rd Qu.:2009  
 Max.   :231       Max.   :6300                Max.   :2009  
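The summaries show which values changed, but not directly how many rows each call removed. A small sketch that compares row counts side by side (the variable names are illustrative):

```r
library(tidyr)
library(palmerpenguins)

# Row counts before dropping, after dropping on one column, and after dropping on all
n_original  <- nrow(penguins)
n_mass_only <- nrow(drop_na(penguins, body_mass_g))
n_complete  <- nrow(drop_na(penguins))

c(original = n_original, mass_only = n_mass_only, complete = n_complete)
```

Going by the species counts in the summaries above, this should give 344, 342, and 333 rows: dropping on body_mass_g alone costs 2 rows, while dropping on every column costs 11.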

1.3 Example with Glucose dataset

library(nlme)
summary(Glucose)
 Subject      Time            conc          Meal   
 6:63    Min.   :-0.25   Min.   : 2.070   2am :66  
 2:63    1st Qu.: 0.50   1st Qu.: 4.348   6am :66  
 3:63    Median : 2.00   Median : 4.900   10am:60  
 5:63    Mean   : 2.50   Mean   : 5.511   2pm :60  
 1:63    3rd Qu.: 4.00   3rd Qu.: 6.357   6pm :60  
 4:63    Max.   : 7.00   Max.   :10.470   10pm:66  
                         NA's   :2                 
#Drop every row that has an NA in any column
Glucose_noNA <- Glucose %>% drop_na()
summary(Glucose_noNA)
 Subject      Time             conc          Meal   
 6:62    Min.   :-0.250   Min.   : 2.070   2am :65  
 2:63    1st Qu.: 0.500   1st Qu.: 4.348   6am :66  
 3:63    Median : 2.000   Median : 4.900   10am:59  
 5:63    Mean   : 2.484   Mean   : 5.511   2pm :60  
 1:63    3rd Qu.: 4.000   3rd Qu.: 6.357   6pm :60  
 4:62    Max.   : 7.000   Max.   :10.470   10pm:66  

1.4 Graphical considerations

Does having NA’s in the graph make sense? Here, faceting by sex gives the penguins with a missing sex their own NA panel.

ggplot(data = penguins,
       aes(x = species, fill = species)) +
  geom_bar() +
  facet_wrap(vars(sex)) +
  labs(x = "Species",
       y = "Count",
       title = "Frequency of Species by Sex")
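If the NA panel is not wanted, one option is to drop the rows with a missing sex before plotting. The sketch below reuses the plot above unchanged. The name penguins_known_sex is an illustrative choice, and whether dropping these rows is appropriate depends on the question being asked.

```r
library(tidyr)
library(ggplot2)
library(palmerpenguins)

# Keep only the penguins whose sex was recorded
penguins_known_sex <- drop_na(penguins, sex)

# Same plot as before, now with only the female and male panels
ggplot(data = penguins_known_sex,
       aes(x = species, fill = species)) +
  geom_bar() +
  facet_wrap(vars(sex)) +
  labs(x = "Species",
       y = "Count",
       title = "Frequency of Species by Sex")
```

Only sex is passed to drop_na() here, so a penguin would be kept even if another measurement were missing.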

Does having NA’s in the graph make a difference? Here, ggplot2 removes the two missing conc values on its own and reports this with a warning.

ggplot(data = Glucose,
       aes(x = Subject, y = conc)) +
  geom_boxplot() +
  labs(x = "Subject Number", 
       y = "Glucose Concentration",
       title = "Boxplot of Glucose by Subject")
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

1.5 Is it helpful?

I find this function very helpful, but it must be used with care. Some modeling techniques do not work with missing values, so you need a way to handle them. However, dropping every row that has an NA in any column can drastically reduce the size of your dataset. There are several things to consider before dropping NA’s:

  • How many NA’s are in your dataset? How much data do you lose if you drop them?

  • What types of data have NA’s?

  • Are the NA’s in the variables you will be considering?

  • Does the modeling technique you want to use accept NA’s?

  • Does the graph you want to generate make sense with NA’s?
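The first and third questions can be checked numerically before committing to a drop. Below is a minimal sketch using a hypothetical helper, share_dropped(), which is not part of tidyr. It reports the fraction of rows that drop_na() would remove for a given choice of columns.

```r
library(tidyr)
library(palmerpenguins)

# Hypothetical helper (not a tidyr function): fraction of rows drop_na() would remove.
# Any columns passed through ... are forwarded to drop_na() unchanged.
share_dropped <- function(data, ...) {
  1 - nrow(drop_na(data, ...)) / nrow(data)
}

share_all     <- share_dropped(penguins)                   # inspect every column
share_mass    <- share_dropped(penguins, body_mass_g)      # only one variable of interest
share_no_loss <- share_dropped(penguins, species, island)  # columns without NA's

round(c(all = share_all, mass = share_mass, none = share_no_loss), 4)
```

For penguins the loss is small either way (11 of 344 rows at most), but on a dataset with NA’s scattered across many columns the gap between the first two numbers can be large.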