R: filtering with NA values

NA - Not Available/Not applicable is R’s way of denoting empty or missing values. When doing comparisons - such as equal to, greater than, etc. - extra care and thought needs to go into how missing values (NAs) are handled. More explanations about this can be found in the Chapter 2: R basics of our book that is freely available at the HealthyR website

This post lists a couple of different ways of keeping or discarding rows based on how important the variables with missing values are to you.

For example, I want to keep rows that have a value for important_a and/or important_b (so rows 1, 3, 4). I don’t care whether whatever_c is empty or not, but I do want to keep it.

library(tidyverse)

## Warning: replacing previous import
## 'lifecycle::last_warnings' by 'rlang::last_warnings'
## when loading 'pillar'

## Warning: replacing previous import
## 'lifecycle::last_warnings' by 'rlang::last_warnings'
## when loading 'tibble'

## Warning: replacing previous import
## 'lifecycle::last_warnings' by 'rlang::last_warnings'
## when loading 'hms'

mydata  = tibble(important_a = c("Value", NA, "Value", NA, NA),
                 important_b = c(NA, NA, "Value", "Value", NA),
                 whatever_c  = c(NA, "Value", NA, NA, NA))

mydata %>% knitr::kable()

important_a	important_b	whatever_c
Value	NA	NA
NA	NA	Value
Value	Value	NA
NA	Value	NA
NA	NA	NA

Functions for missing values that are very useful, but don’t do what I want are:

This keeps complete cases based on all columns:

mydata %>% 
  drop_na()

## # A tibble: 0 × 3
## # … with 3 variables: important_a <chr>,
## #   important_b <chr>, whatever_c <chr>

(Returns 0 as we don’t have rows where all 3 columns have a value).

This keeps complete cases based on specified columns:

mydata %>% 
  drop_na(important_a, important_b)

## # A tibble: 1 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       Value       <NA>

This only keeps the row where both a and b have a value.

This keeps rows that have a value in any column:

mydata %>% 
  filter_all(any_vars(! is.na(.)))

## # A tibble: 4 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       <NA>        <NA>      
## 2 <NA>        <NA>        Value     
## 3 Value       Value       <NA>      
## 4 <NA>        Value       <NA>

The third example is better achieved using the janitor package:

mydata %>% 
  janitor::remove_empty()

## # A tibble: 4 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       <NA>        <NA>      
## 2 <NA>        <NA>        Value     
## 3 Value       Value       <NA>      
## 4 <NA>        Value       <NA>

Now, (3) is pretty close, but still, I’m not interested in row 2 - where both a and b are empty but c has a value (which is why it’s kept).

Simple solution

A quick solution is to use ! is.na() for each variable inside a filter():

mydata %>% 
  filter(! is.na(important_a) | ! is.na(important_b))

## # A tibble: 3 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       <NA>        <NA>      
## 2 Value       Value       <NA>      
## 3 <NA>        Value       <NA>

And this is definitely what I do when I only have a couple of these variables. But if you have tens, then the filtering logic becomes horrendously long and it’s easy to miss one out/make a mistake.

Powerful solution:

A scalable solution is to use filter_at() with vars() with a select helper (e.g., starts with()), and then the any_vars(! is.na(.)) that was introduced in (3).

mydata %>% 
  filter_at(vars(starts_with("important_")), any_vars(! is.na(.)))

## # A tibble: 3 × 3
##   important_a important_b whatever_c
##   <chr>       <chr>       <chr>     
## 1 Value       <NA>        <NA>      
## 2 Value       Value       <NA>      
## 3 <NA>        Value       <NA>

R: filtering with NA values

Riinu Pius (Ots)