NA (Not Available / Not Applicable) is R’s way of denoting empty or missing values. When doing comparisons - such as equal to, greater than, etc. - extra care and thought need to go into how missing values (NAs) are handled. More on this can be found in Chapter 2: R basics of our book, which is freely available on the HealthyR website.
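
For instance, any comparison involving NA returns NA rather than TRUE or FALSE, which is why is.na() is used throughout this post instead of == NA (a minimal illustration):

NA == "Value"  # comparisons with NA give NA, not FALSE
## [1] NA
is.na(NA)      # use is.na() to test for missingness instead
## [1] TRUE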

This post lists a couple of different ways of keeping or discarding rows based on how important the variables with missing values are to you.

For example, I want to keep rows that have a value for important_a and/or important_b (so rows 1, 3, and 4). I don’t care whether whatever_c is empty or not, but I do want to keep the column.

library(tidyverse)
mydata  = tibble(important_a = c("Value", NA, "Value", NA, NA),
                 important_b = c(NA, NA, "Value", "Value", NA),
                 whatever_c  = c(NA, "Value", NA, NA, NA))

mydata %>% knitr::kable()
|important_a |important_b |whatever_c |
|:-----------|:-----------|:----------|
|Value       |NA          |NA         |
|NA          |NA          |Value      |
|Value       |Value       |NA         |
|NA          |Value       |NA         |
|NA          |NA          |NA         |

Some functions for missing values that are very useful, but don’t do what I want here, are:

  1. This keeps complete cases based on all columns:
mydata %>% 
  drop_na()
## # A tibble: 0 × 3
## # … with 3 variables: important_a <chr>,
## #   important_b <chr>, whatever_c <chr>

(Returns 0 rows, as we don’t have any rows where all 3 columns have a value.)

  2. This keeps complete cases based on specified columns:
mydata %>% 
  drop_na(important_a, important_b)
## # A tibble: 1 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       Value       <NA>

This only keeps the row where both important_a and important_b have a value.

  3. This keeps rows that have a value in any column:
mydata %>% 
  filter_all(any_vars(! is.na(.)))
## # A tibble: 4 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       <NA>        <NA>      
## 2 <NA>        <NA>        Value     
## 3 Value       Value       <NA>      
## 4 <NA>        Value       <NA>

The third example is better achieved using the janitor package:

mydata %>% 
  janitor::remove_empty()
## # A tibble: 4 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       <NA>        <NA>      
## 2 <NA>        <NA>        Value     
## 3 Value       Value       <NA>      
## 4 <NA>        Value       <NA>
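
Note that remove_empty() drops both empty rows and empty columns by default; if you only want to drop empty rows, you can say so explicitly with the which argument (a small sketch):

mydata %>% 
  janitor::remove_empty(which = "rows")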

Now, (3) is pretty close, but I’m still not interested in row 2 - where both important_a and important_b are empty but whatever_c has a value (which is why that row is kept).

  4. Simple solution

A quick solution is to use ! is.na() for each variable inside a filter():

mydata %>% 
  filter(! is.na(important_a) | ! is.na(important_b))
## # A tibble: 3 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       <NA>        <NA>      
## 2 Value       Value       <NA>      
## 3 <NA>        Value       <NA>

And this is definitely what I do when I only have a couple of these variables. But if you have tens of them, the filtering logic becomes horrendously long and it’s easy to miss one out or make a mistake.

  5. Powerful solution

A scalable solution is to use filter_at() with vars() and a select helper (e.g., starts_with()), combined with the any_vars(! is.na(.)) that was introduced in (3).

mydata %>% 
  filter_at(vars(starts_with("important_")), any_vars(! is.na(.)))
## # A tibble: 3 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       <NA>        <NA>      
## 2 Value       Value       <NA>      
## 3 <NA>        Value       <NA>
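
As a side note, filter_at() and any_vars() are superseded in recent dplyr releases. If you are on dplyr 1.0.4 or later, the same result should be achievable with if_any() inside an ordinary filter() (a sketch, not part of the approach above):

mydata %>% 
  filter(if_any(starts_with("important_"), ~ ! is.na(.x)))

This keeps a row if any of the selected columns is non-missing, so it should return the same three rows as the filter_at() version.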