Data Manipulation in R

Select Data Frame Columns in R

 

In this tutorial, you will learn how to select or subset data frame columns by names and position using the R function select() and pull() [in dplyr package]. We’ll also show how to remove columns from a data frame.

You will learn how to use the following functions:

  • pull(): Extract column values as a vector. The column of interest can be specified either by name or by index.
  • select(): Extract one or multiple columns as a data table. It can be also used to remove columns from the data frame.
  • select_if(): Select columns based on a particular condition. One can use this function to, for example, select columns if they are numeric.
  • Helper functions - starts_with(), ends_with(), contains(), matches(), one_of(): Select columns/variables based on their names

Select Columns of a Data Frame in R



Contents:

Required packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

Demo dataset

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 144 more rows

Extract column values as a vector

my_data %>% pull(Species)
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

Extract columns as a data table

Select column by position

  • Select columns 1 to 3:
my_data %>% select(1:3)
  • Select column 1 and 3 but not 2:
my_data %>% select(1, 3)

Select columns by names

Select columns by names: Sepal.Length and Petal.Length

my_data %>% select(Sepal.Length, Petal.Length)
## # A tibble: 150 x 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3
## 4          4.6          1.5
## 5          5            1.4
## 6          5.4          1.7
## # ... with 144 more rows

Select all columns from Sepal.Length to Petal.Length

my_data %>% select(Sepal.Length:Petal.Length)
## # A tibble: 150 x 3
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## 4          4.6         3.1          1.5
## 5          5           3.6          1.4
## 6          5.4         3.9          1.7
## # ... with 144 more rows

There are several special functions that can be used inside select(): starts_with(), ends_with(), contains(), matches(), one_of(), etc.

# Select column whose name starts with "Petal"
my_data %>% select(starts_with("Petal"))

# Select column whose name ends with "Width"
my_data %>% select(ends_with("Width"))

# Select columns whose names contains "etal"
my_data %>% select(contains("etal"))
  
# Select columns whose name maches a regular expression
my_data %>% select(matches(".t."))

# selects variables provided in a character vector.
my_data %>% select(one_of(c("Sepal.Length", "Petal.Length")))

Select column based on a condtion

It’s possible to apply a function to the columns. The columns for which the function returns TRUE are selected.

Select only numeric columns:

my_data %>% select_if(is.numeric)
## # A tibble: 150 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3            1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5           3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
## # ... with 144 more rows

Remove columns

Note that, to remove a column from a data frame, prepend its name by minus -.

Removing Sepal.Length and Petal.Length columns:

my_data %>% select(-Sepal.Length, -Petal.Length)

Removing all columns from Sepal.Length to Petal.Length:

my_data %>% select(-(Sepal.Length:Petal.Length))
## # A tibble: 150 x 2
##   Petal.Width Species
##         <dbl> <fct>  
## 1         0.2 setosa 
## 2         0.2 setosa 
## 3         0.2 setosa 
## 4         0.2 setosa 
## 5         0.2 setosa 
## 6         0.4 setosa 
## # ... with 144 more rows

Removing all columns whose name starts with “Petal”:

my_data %>% select(-starts_with("Petal"))
## # A tibble: 150 x 3
##   Sepal.Length Sepal.Width Species
##          <dbl>       <dbl> <fct>  
## 1          5.1         3.5 setosa 
## 2          4.9         3   setosa 
## 3          4.7         3.2 setosa 
## 4          4.6         3.1 setosa 
## 5          5           3.6 setosa 
## 6          5.4         3.9 setosa 
## # ... with 144 more rows

Note that, if you want to drop columns by position, the syntax is as follow.

# Drop column 1
my_data %>% select(-1)

# Drop columns 1 to 3
my_data %>% select(-(1:3))

# Drop columns 1 and 3 but not 2
my_data %>% select(-1, -3)

Summary

In this tutorial, we describe how to select columns by positions and by names. Additionally, we present how to remove columns from a data frame.



(Next Lesson) Subset Data Frame Rows in R
Back to Data Manipulation in R

Comments ( 9 )

  • Pearson

    Thank you for sharing!

    • Kassambara

      Your feedback is appreciated!

  • Barry

    Thanks for the example!

    One of the things I am always trying and failing to do is to work with a set of variables from a data frame – not pulling them out, but applying a function to them.

    If I use the ‘select’ command, then I’ve created a copy of the data frame, and have to match it back to the original.

    Is there any way to use pipes and dplyr to apply a command to a subset of columns?

  • Monhieq

    thank you for your sharing

    i am beginner in R. i have some questions.
    how combine multi table ( i used .csv) to be one, because i saw that from those tables (8 tables) the field name is all same, but the file name is different.
    can you help me how to combine those tables and add 1 field from each table (field name is Value) and added field name change to the table name?

    e.g : number of table : 8 (T1 – T8)
    number of field in every table = 10 (F1 – F10). F1 – F9 in T1 – T10 fill with the same data value
    Data value of F10 is the only field that filled different value in every tables
    How to combine (T9 ) with F1-F9, and add F10-F17 named as the tables name and will fills with the value of F10 in every tables?

    I hope you can help and give your response to me. Thank you so much

  • MadDeveloper

    very helpful. Thanks a lot!!!

  • Fiona

    Hello, I tried the pull method on a column in my dataframe but received the following error message:

    Error in UseMethod(“pull”) :
    no applicable method for ‘pull’ applied to an object of class “factor”

    Is there anything I can do to fix this?

    • Kassambara

      Did this work?

      iris %>% dplyr::pull(Species)

  • Sathira

    Thanks !!!

Give a comment

Want to post an issue with R? If yes, please make sure you have read this: How to Include Reproducible R Script Examples in Datanovia Comments

Teacher
Alboukadel Kassambara
Role : Founder of Datanovia
Read More