The dplyr package contains many functions used to manipulate data.

filter

The filter function in the dplyr package constructs a subset of the original data set, keeping only the observations whose values of one or more variables meet the given conditions. In other words, the filter function selects rows from a data set.

The code below generates a data set that contains only cars with 3 gears and 8 cylinders.

library(dplyr) # Loads the dplyr package

mtcars_cyl8 <- mtcars %>% filter(gear==3, cyl==8) # creates the subset

mtcars_cyl8 # displays the subset
## # A tibble: 12 x 11
##      mpg cyl    disp    hp  drat    wt  qsec    vs am        gear   carb
##    <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>     <fct> <dbl>
##  1  18.7 8      360    175  3.15  3.44  17.0     0 Automatic 3         2
##  2  14.3 8      360    245  3.21  3.57  15.8     0 Automatic 3         4
##  3  16.4 8      276.   180  3.07  4.07  17.4     0 Automatic 3         3
##  4  17.3 8      276.   180  3.07  3.73  17.6     0 Automatic 3         3
##  5  15.2 8      276.   180  3.07  3.78  18       0 Automatic 3         3
##  6  10.4 8      472    205  2.93  5.25  18.0     0 Automatic 3         4
##  7  10.4 8      460    215  3     5.42  17.8     0 Automatic 3         4
##  8  14.7 8      440    230  3.23  5.34  17.4     0 Automatic 3         4
##  9  15.5 8      318    150  2.76  3.52  16.9     0 Automatic 3         2
## 10  15.2 8      304    150  3.15  3.44  17.3     0 Automatic 3         2
## 11  13.3 8      350    245  3.73  3.84  15.4     0 Automatic 3         4
## 12  19.2 8      400    175  3.08  3.84  17.0     0 Automatic 3         2
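Conditions separated by commas inside filter are combined with AND. As a minimal sketch of other ways to combine conditions, the code below uses the built-in mtcars data frame, assuming gear and cyl are stored as numbers (the tibble shown above appears to store them as factors). The names mtcars_and, mtcars_or, and mtcars_cyl46 are chosen only for illustration.

```r
library(dplyr)

# Comma-separated conditions are combined with AND:
# cars with 3 gears and 8 cylinders.
mtcars_and <- mtcars %>% filter(gear == 3, cyl == 8)

# Use | for OR: cars with 3 gears, 8 cylinders, or both.
mtcars_or <- mtcars %>% filter(gear == 3 | cyl == 8)

# Use %in% to keep rows matching any of several values:
# cars with 4 or 6 cylinders.
mtcars_cyl46 <- mtcars %>% filter(cyl %in% c(4, 6))

nrow(mtcars_and) # number of rows kept by the AND condition
nrow(mtcars_or)  # number of rows kept by the OR condition
```

The OR version keeps every row that the AND version keeps, plus any row that satisfies only one of the two conditions.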

select

The select function in the dplyr package creates a dataset that contains only the specified variables. In other words, the select function selects entire columns to include in the new dataset.

The code below creates a dataset with only the cyl, wt, and gear variables.

library(dplyr)

mtcars_small <- mtcars %>% select(cyl, wt, gear) # Creates the subset

mtcars_small # displays the subset
## # A tibble: 32 x 3
##    cyl      wt gear 
##    <fct> <dbl> <fct>
##  1 6      2.62 4    
##  2 6      2.88 4    
##  3 4      2.32 4    
##  4 6      3.22 3    
##  5 8      3.44 3    
##  6 6      3.46 3    
##  7 8      3.57 3    
##  8 4      3.19 4    
##  9 4      3.15 4    
## 10 6      3.44 4    
## # … with 22 more rows
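Besides listing columns by name, select accepts a few other forms. The sketch below shows three of them on the built-in mtcars data frame; the result names (mtcars_no_cyl, mtcars_range, mtcars_d) are chosen only for illustration.

```r
library(dplyr)

# Drop a column by prefixing its name with a minus sign.
mtcars_no_cyl <- mtcars %>% select(-cyl)

# Select a contiguous range of columns by name.
mtcars_range <- mtcars %>% select(mpg:hp)

# Select columns whose names start with a given prefix.
mtcars_d <- mtcars %>% select(starts_with("d"))

names(mtcars_range) # column names in the range selection
names(mtcars_d)     # column names matching the prefix
```

In each case the number of rows is unchanged; only the set of columns differs.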

drop_na

The drop_na function in the tidyr package removes observations from a dataset that have missing values in the specified variable or variables. This operation could also be done with filter.

The example dataset below contains variables a and b. Each variable has missing values denoted with NA.

library(tidyr)
library(tibble)

data_tbl <- tibble( a = c("A", "A", "B", NA, NA), b = c(NA, "C", "D", NA, "D") )

data_tbl
## # A tibble: 5 x 2
##   a     b    
##   <chr> <chr>
## 1 A     <NA> 
## 2 A     C    
## 3 B     D    
## 4 <NA>  <NA> 
## 5 <NA>  D

# removes all rows in the dataset where variable a has missing values.
data_tbl %>% drop_na(a)
## # A tibble: 3 x 2
##   a     b    
##   <chr> <chr>
## 1 A     <NA> 
## 2 A     C    
## 3 B     D

# removes all rows in the dataset where variable b has missing values.
data_tbl %>% drop_na(b)
## # A tibble: 3 x 2
##   a     b    
##   <chr> <chr>
## 1 A     C    
## 2 B     D    
## 3 <NA>  D

# removes all rows in the dataset where either variable a or b has missing values.
data_tbl %>% drop_na(a, b)
## # A tibble: 2 x 2
##   a     b    
##   <chr> <chr>
## 1 A     C    
## 2 B     D
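
As noted at the start of this section, the same rows can be removed with filter. A minimal sketch, assuming the same data_tbl as above and using base R's is.na to test for missing values (the names no_na_a and no_na_ab are chosen only for illustration):

```r
library(dplyr)
library(tibble)

data_tbl <- tibble(
  a = c("A", "A", "B", NA, NA),
  b = c(NA, "C", "D", NA, "D")
)

# Keeps the rows where a is not missing; same rows as drop_na(a).
no_na_a <- data_tbl %>% filter(!is.na(a))

# Keeps the rows where neither a nor b is missing; same rows as drop_na(a, b).
no_na_ab <- data_tbl %>% filter(!is.na(a), !is.na(b))

no_na_a
no_na_ab
```

drop_na is shorter when several variables are involved, while filter makes it easy to combine the missing-value check with other conditions.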

Mathematics, Computer Science, and Statistics Department, Gustavus Adolphus College