dplyr::na_if

Function of the Week

Replace annoying values with NA

Author

Emma Coonfield

Published

February 19, 2025

1 Function Of Interest: `dplyr - na_if(x,y)`

In this document, I will introduce the dplyr na_if() function and show what it’s for.

#load tidyverse up
knitr::opts_chunk$set(echo = TRUE)
pacman::p_load(
  tidyverse,
  readxl,
  here,         
  janitor,
  gt
  )

#load example dataset
clinical_data <- read_excel(here("function_week", "data", "tcga_clinical_data.xlsx"), 
                             sheet = 2,
                             na = "NA")

2 What is it for?

This function replaces unwanted values with NA. It can even replace NaN with NA, although NaN == NaN returns NA, so an ordinary equality test cannot locate NaN values.
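The NaN case can be sketched with a small made-up vector `z` (hypothetical, not part of the clinical data). This assumes dplyr 1.1.0 or later, where the NaN behavior is documented. `is.nan()` is used for the check because `is.na()` returns TRUE for both NaN and NA.

```r
library(dplyr)

# Hypothetical vector mixing NaN, NA, and ordinary numbers
z <- c(1, NaN, NA, 2, NaN)

# NaN == NaN evaluates to NA, so an equality filter cannot flag NaN entries
NaN == NaN

# na_if() still converts every NaN to NA (assumes dplyr >= 1.1.0)
z_clean <- na_if(z, NaN)
z_clean

# No NaN values remain; the former NaN positions are now plain NA
is.nan(z_clean)
```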

2.1 Example 1: The Basics

# Example 1: The basics
# na_if functions as `na_if(x, y)`; where x is the vector to modify and y is the value to replace with NA.

x <- c(1, 25, -5, 0, 10)

x_inf <- 100/x
# Dividing by zero produces an infinite value, which has downstream effects on common data analysis.
x_inf

[1] 100   4 -20 Inf  10

mean(x_inf, na.rm = TRUE)

[1] Inf

# We see that we are not given a proper mean.

x_na_if <- 100 / na_if(x, 0)
x_na_if

[1] 100   4 -20  NA  10

mean(x_na_if, na.rm = TRUE)

[1] 23.5

# Success! What a meaningful change!

The previous example was adapted from Rdocumentation.org.
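In Example 1, `y` is a single value. `y` can also be a vector the same length as `x`, in which case the comparison is made element by element. A minimal sketch, using two made-up vectors (`observed` and `sentinel` are hypothetical names):

```r
library(dplyr)

# Hypothetical paired vectors: each position has its own value to treat as missing
observed <- 1:5
sentinel <- 5:1

# Only position 3 matches its partner (3 == 3), so only that entry becomes NA
masked <- na_if(observed, sentinel)
masked
```

This form is handy when the "bad" value differs row by row, for example when a second column records the placeholder code used for each record.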

2.2 Let's Set The Table: Data Clean Up

Now that we have glimpsed the power of na_if, let's see how to use it in an actual data set.

# First, let's clean up column names to aid data viewing.

clinical_clean <- clinical_data %>%
  rename(tumor_class = classification_of_tumor,
         last_status = last_known_disease_status,
         vital = vital_status,
         morph = morphology,
         diagnosis = primary_diagnosis,
         stage = tumor_stage,
         last_diseasestat = days_to_last_known_disease_status,
         datetime = created_datetime,
         recurrence = days_to_recurrence,
         origin = tissue_or_organ_of_origin,
         progression = progression_or_recurrence,
         biopsy_site = site_of_resection_or_biopsy,
         last_follow_up = days_to_last_follow_up,
         intent_type = treatment_intent_type,
         treatment = treatment_or_therapy) %>%
  select(-updated_datetime)

2.3 Putting It All On The Table

clinical_table <- clinical_clean %>%
  head(2)

gt(clinical_table)

submitter_id	tumor_class	last_status	diagnosis	stage	age_at_diagnosis	vital	morph	days_to_death	last_diseasestat	datetime	state	recurrence	diagnosis_id	tumor_grade	origin	days_to_birth	progression	prior_malignancy	biopsy_site	last_follow_up	cigarettes_per_day	weight	alcohol_history	alcohol_intensity	bmi	years_smoked	exposure_id	height	gender	year_of_birth	race	demographic_id	ethnicity	year_of_death	treatment_id	therapeutic_agents	intent_type	treatment	bcr_patient_barcode	disease
TCGA-2W-A8YY	not reported	not reported	C53.9	not reported	18886	alive	8560/3	NA	NA	NA	live	NA	908ee155-bfca-5240-b78b-6b82f565aedd	not reported	C53.9	-18886	not reported	not reported	C53.9	533	NA	42	NA	NA	16.40625	NA	67aa3949-ad62-5f81-ad08-e8e295f84cfb	160	female	1962	white	b89e4409-f7c6-53f2-a85f-31448e2ae1f6	not hispanic or latino	NA	026fa545-ac02-5915-ac23-5984d67a75f8	NA	NA	NA	TCGA-2W-A8YY	CESC
TCGA-4J-AA1J	not reported	not reported	C53.9	not reported	11611	alive	8070/3	NA	NA	NA	live	NA	20b61f8a-5efb-5bcc-aaad-5f79cd8ff313	not reported	C53.9	-11611	not reported	not reported	C53.9	542	NA	48	NA	NA	17.63085	NA	93ddbaf1-67b9-59a9-8a04-ef00db42fd54	165	female	1982	white	1c2c712d-0a6c-5b52-a4b0-8e1d61256f6c	not hispanic or latino	NA	f68ae36d-85e6-558c-91a7-f5b69b9dde19	NA	NA	NA	TCGA-4J-AA1J	CESC

# As you can see, many entries use the phrase "not reported". This phrase was not caught when we loaded the data into R, because only "NA" was listed as a missing-value string.
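Before replacing anything, it can help to count how often the placeholder appears in each column. The sketch below uses a small made-up tibble, `toy`, as a stand-in for `clinical_clean`; its column values are invented for illustration only.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for clinical_clean (values are invented)
toy <- tibble(
  submitter_id = c("A1", "A2", "A3"),
  stage        = c("not reported", "stage i", "not reported"),
  tumor_grade  = c("not reported", "not reported", "not reported"),
  weight       = c(42, 48, NA)
)

# Count "not reported" entries in every character column
placeholder_counts <- toy %>%
  summarise(across(where(is.character),
                   ~ sum(.x == "not reported", na.rm = TRUE)))

placeholder_counts
```

Columns with a count of zero need no cleaning, and numeric columns such as `weight` are skipped by `where(is.character)`.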

2.4 Example 2: Are You Tired of Data Being `not reported`?

# Now, if we want all the "not reported" entries to be categorized as NA, we can use the na_if function.

clinical_na_if <- clinical_clean %>%
  mutate(tumor_class = na_if(tumor_class, "not reported"),
         last_status = na_if(last_status, "not reported"),
         stage = na_if(stage, "not reported"),
         tumor_grade = na_if(tumor_grade, "not reported"))

gt(head(clinical_na_if, 2))

submitter_id	tumor_class	last_status	diagnosis	stage	age_at_diagnosis	vital	morph	days_to_death	last_diseasestat	datetime	state	recurrence	diagnosis_id	tumor_grade	origin	days_to_birth	progression	prior_malignancy	biopsy_site	last_follow_up	cigarettes_per_day	weight	alcohol_history	alcohol_intensity	bmi	years_smoked	exposure_id	height	gender	year_of_birth	race	demographic_id	ethnicity	year_of_death	treatment_id	therapeutic_agents	intent_type	treatment	bcr_patient_barcode	disease
TCGA-2W-A8YY	NA	NA	C53.9	NA	18886	alive	8560/3	NA	NA	NA	live	NA	908ee155-bfca-5240-b78b-6b82f565aedd	NA	C53.9	-18886	not reported	not reported	C53.9	533	NA	42	NA	NA	16.40625	NA	67aa3949-ad62-5f81-ad08-e8e295f84cfb	160	female	1962	white	b89e4409-f7c6-53f2-a85f-31448e2ae1f6	not hispanic or latino	NA	026fa545-ac02-5915-ac23-5984d67a75f8	NA	NA	NA	TCGA-2W-A8YY	CESC
TCGA-4J-AA1J	NA	NA	C53.9	NA	11611	alive	8070/3	NA	NA	NA	live	NA	20b61f8a-5efb-5bcc-aaad-5f79cd8ff313	NA	C53.9	-11611	not reported	not reported	C53.9	542	NA	48	NA	NA	17.63085	NA	93ddbaf1-67b9-59a9-8a04-ef00db42fd54	165	female	1982	white	1c2c712d-0a6c-5b52-a4b0-8e1d61256f6c	not hispanic or latino	NA	f68ae36d-85e6-558c-91a7-f5b69b9dde19	NA	NA	NA	TCGA-4J-AA1J	CESC

As you can see, editing one column is easy, but once you get past three columns, writing out the na_if function gets tedious. There has got to be a better way! Akin to any late night infomercial, there is a better way: our good old friend, the across function.

2.5 Example 3: Have No Fear `na_if` Is Here!

## Example 3: Multiple columns

clinical_across <- clinical_clean %>%
  mutate(across(where(is.character), 
                ~na_if(., "not reported")))

gt(head(clinical_across, 2))

submitter_id	tumor_class	last_status	diagnosis	stage	age_at_diagnosis	vital	morph	days_to_death	last_diseasestat	datetime	state	recurrence	diagnosis_id	tumor_grade	origin	days_to_birth	progression	prior_malignancy	biopsy_site	last_follow_up	cigarettes_per_day	weight	alcohol_history	alcohol_intensity	bmi	years_smoked	exposure_id	height	gender	year_of_birth	race	demographic_id	ethnicity	year_of_death	treatment_id	therapeutic_agents	intent_type	treatment	bcr_patient_barcode	disease
TCGA-2W-A8YY	NA	NA	C53.9	NA	18886	alive	8560/3	NA	NA	NA	live	NA	908ee155-bfca-5240-b78b-6b82f565aedd	NA	C53.9	-18886	NA	NA	C53.9	533	NA	42	NA	NA	16.40625	NA	67aa3949-ad62-5f81-ad08-e8e295f84cfb	160	female	1962	white	b89e4409-f7c6-53f2-a85f-31448e2ae1f6	not hispanic or latino	NA	026fa545-ac02-5915-ac23-5984d67a75f8	NA	NA	NA	TCGA-2W-A8YY	CESC
TCGA-4J-AA1J	NA	NA	C53.9	NA	11611	alive	8070/3	NA	NA	NA	live	NA	20b61f8a-5efb-5bcc-aaad-5f79cd8ff313	NA	C53.9	-11611	NA	NA	C53.9	542	NA	48	NA	NA	17.63085	NA	93ddbaf1-67b9-59a9-8a04-ef00db42fd54	165	female	1982	white	1c2c712d-0a6c-5b52-a4b0-8e1d61256f6c	not hispanic or latino	NA	f68ae36d-85e6-558c-91a7-f5b69b9dde19	NA	NA	NA	TCGA-4J-AA1J	CESC

3 Is it helpful?

This function is particularly useful if you want to change NaN values to NA, or if you have cluttered data with many “unknown”, “not reported”, or other unusual stand-ins for missing values. It also helps after loading, when a placeholder was not listed in the na argument of read_excel() (which accepts a character vector of strings to treat as missing) or when only certain columns should be converted.
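Each na_if call compares against one placeholder at a time, so data with several different stand-ins needs either chained calls or a membership test. A minimal sketch on a made-up tibble (`toy`, `placeholders`, and the new column names are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical data with two different stand-ins for missing values
toy <- tibble(
  id    = c("A1", "A2", "A3", "A4"),
  stage = c("not reported", "stage i", "unknown", "stage ii")
)

placeholders <- c("not reported", "unknown")

toy_clean <- toy %>%
  mutate(
    # Option 1: chain one na_if() call per placeholder
    stage_chained = stage %>% na_if("not reported") %>% na_if("unknown"),
    # Option 2: a membership test that covers every placeholder at once
    stage_in = if_else(stage %in% placeholders, NA_character_, stage)
  )

toy_clean
```

Chaining reads well for two or three placeholders. The `%in%` version scales better when the list of placeholders is long or stored in a variable.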

What makes this function particularly useful is nesting it inside mutate and across, because you can then make the same edit to many columns at once.
