Data Manipulation in R

7 Lessons

4 hours 0 mins

Free

Course description

In this course, you will learn how to easily perform data manipulation using R software. We’ll cover the following data manipulation techniques:

filtering and ordering rows,
renaming and adding columns,
computing summary statistics

We’ll mainly use the popular dplyr R package, which provides the key functions for everyday data manipulation. In the final section, we’ll show you how to group your data by a grouping variable and then compute summary statistics on each subset. You will also learn how to chain your data manipulation operations.

At the end of this course, you will be familiar with data manipulation tools and approaches that will allow you to efficiently manipulate data.

Required R packages

We recommend installing the tidyverse packages, which include the dplyr package (for data manipulation) as well as additional R packages for reading (readr), transforming (tidyr) and visualizing (ggplot2) data sets.

Install:

install.packages("tidyverse")

Load the tidyverse packages, which also include the dplyr package:

library("tidyverse")

Demo datasets

We’ll mainly use the R built-in iris data set, which we start by converting into a tibble (tbl_df) for easier data analysis. A tibble is a data frame with a nicer printing method: it shows only the first rows and the columns that fit on screen, which is useful when working with large data sets.

library("tidyverse")
my_data <- as_tibble(iris)
my_data

## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 144 more rows

Note that the type of data in each column is specified. Common types include:

int: integers,
dbl: doubles (real numbers),
chr: character vectors (strings, text),
fct: factors,
dttm: date-times (date + time),
lgl: logicals (TRUE or FALSE),
date: dates
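As a quick check of the column types listed above, the following sketch prints the class of each column of the iris tibble. It assumes only that the dplyr package is installed; the name col_classes is ours and is not part of the course material.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Class of each column: the four measurement columns are "numeric"
# (shown as <dbl> in the tibble header) and Species is a "factor" (<fct>)
col_classes <- vapply(my_data, function(col) class(col)[1], character(1))
col_classes

# glimpse() prints one line per column, with its type abbreviation
glimpse(my_data)
```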

Main data manipulation functions

There are 8 fundamental data manipulation verbs that you will use to do most of your data manipulations. These functions are included in the dplyr package:

filter(): Pick rows (observations/samples) based on their values.
distinct(): Remove duplicate rows.
arrange(): Reorder the rows.
select(): Select columns (variables) by their names.
rename(): Rename columns.
mutate() and transmute(): Add/create new variables.
summarise(): Compute statistical summaries (e.g., computing the mean or the sum)

It’s also possible to combine each of these verbs with the function group_by() to operate on subsets of the data set (group-by-group).
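To make the verbs above concrete, here is a minimal sketch that applies each of them once to the iris tibble. The column name Sepal.Ratio and the result names are chosen for illustration only. Note that transmute() is still available but is marked as superseded in recent dplyr releases.

```r
library(dplyr)

my_data <- as_tibble(iris)

# filter(): keep the rows that satisfy a condition
setosa <- filter(my_data, Species == "setosa")

# distinct(): drop duplicated rows
unique_rows <- distinct(my_data)

# arrange(): sort rows by Sepal.Length (ascending by default)
sorted <- arrange(my_data, Sepal.Length)

# select(): keep columns by name
sepal_cols <- select(my_data, Sepal.Length, Sepal.Width)

# rename(): the syntax is new_name = old_name
renamed <- rename(my_data, sepal_length = Sepal.Length)

# mutate(): add a new column and keep the existing ones
with_ratio <- mutate(my_data, Sepal.Ratio = Sepal.Length / Sepal.Width)

# transmute(): compute a new column and drop the existing ones
ratio_only <- transmute(my_data, Sepal.Ratio = Sepal.Length / Sepal.Width)

# summarise(): collapse the whole table into one row
overall <- summarise(my_data, mean_sepal_length = mean(Sepal.Length))

# group_by() + summarise(): one row per species
by_species <- summarise(
  group_by(my_data, Species),
  mean_sepal_length = mean(Sepal.Length)
)
by_species
```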

All these functions work in a similar way:

The first argument is a data frame
The subsequent arguments are a comma-separated list of unquoted variable names and the specification of what you want to do
The result is a new data frame
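The calling convention described in this list can be sketched with a single filter() call. The thresholds and the name large_setosa are illustrative choices, not taken from the course.

```r
library(dplyr)

my_data <- as_tibble(iris)

# 1st argument: the data frame
# Following arguments: conditions written with unquoted column names;
# several conditions separated by commas are combined with AND
large_setosa <- filter(my_data, Species == "setosa", Sepal.Length > 5)

# The result is a new data frame; the input is left unchanged
nrow(my_data)
nrow(large_setosa)
```

Because the input is never modified in place, you need to assign the result to a name if you want to keep it.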

You will learn how to use these functions, as well as how to chain your data manipulation operations using the pipe operator (%>%).

Note that dplyr lets you use the forward-pipe operator (%>%), which comes from the magrittr package, to combine multiple operations. For example, x %>% f is equivalent to f(x). With the pipe, the output of each operation is passed as the first argument of the next one, so a sequence of steps reads from left to right instead of as nested function calls.
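The sketch below compares a nested call with its piped equivalent. The particular filter threshold and the column name mean_petal_length are assumptions made for the example.

```r
library(dplyr)

my_data <- as_tibble(iris)

# x %>% f is equivalent to f(x)
same_count <- identical(my_data %>% nrow, nrow(my_data))

# Nested calls: must be read from the inside out
nested <- arrange(
  summarise(
    group_by(filter(my_data, Sepal.Length > 5), Species),
    mean_petal_length = mean(Petal.Length)
  ),
  desc(mean_petal_length)
)

# Piped version: same steps, read from top to bottom
piped <- my_data %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal.Length)) %>%
  arrange(desc(mean_petal_length))

# Both approaches give the same result
isTRUE(all.equal(nested, piped))
```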

Lessons

Select Data Frame Columns in R
Easy
40 mins
Alboukadel Kassambara

You will learn how to select data frame columns by name and by position. We’ll also show how to remove columns from a data frame.

Subset Data Frame Rows in R
Easy
50 mins
Alboukadel Kassambara

This tutorial describes how to subset or extract data frame rows based on certain criteria. Additionally, we'll describe how to subset a random number or fraction of rows. You will also learn how to remove rows with missing values in a given column.

Identify and Remove Duplicate Data in R
Easy
30 mins
Alboukadel Kassambara

You will learn how to identify and remove duplicate data using base R and dplyr functions.

Reorder Data Frame Rows in R
Easy
30 mins
Alboukadel Kassambara

This tutorial describes how to reorder the rows of a data table by the value of one or more variables. You will learn how to sort data frame rows in ascending and descending order.

Rename Data Frame Columns in R
Easy
20 mins
Alboukadel Kassambara

You will learn how to rename data frame columns in R.

Compute and Add new Variables to a Data Frame in R
Hard
30 mins
Alboukadel Kassambara

This tutorial describes how to compute and add new variables to a data frame in R.

Compute Summary Statistics in R
Easy
40 mins
Alboukadel Kassambara

This tutorial introduces how to compute statistical summaries in R using the dplyr package. You will learn how to compute summary statistics for ungrouped data, as well as for data grouped by one or multiple variables.
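As a preview of the lesson topics listed above, here is a rough sketch of a few operations that the overview does not show elsewhere. It is not taken from the lessons themselves: it assumes dplyr 1.0.0 or later for slice_sample(), and the missing value is injected artificially because iris has none.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Select columns by position, and remove a column with a minus sign
first_two  <- select(my_data, 1:2)
no_species <- select(my_data, -Species)

# Random subset of rows: a fixed number, then a fraction
set.seed(123)  # for a reproducible sample
five_rows   <- slice_sample(my_data, n = 5)
ten_percent <- slice_sample(my_data, prop = 0.1)

# Remove rows with a missing value in a given column
with_na <- my_data
with_na$Sepal.Length[1] <- NA  # artificial NA, for illustration only
complete <- filter(with_na, !is.na(Sepal.Length))

# Identify duplicated rows with base R, remove them with dplyr
n_duplicates <- sum(duplicated(my_data))
deduplicated <- distinct(my_data)

# Sort rows in descending order
longest_first <- arrange(my_data, desc(Sepal.Length))

# Summary statistics per group
stats <- my_data %>%
  group_by(Species) %>%
  summarise(
    n = n(),
    mean_petal_length = mean(Petal.Length),
    sd_petal_length = sd(Petal.Length)
  )
stats
```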

Anteneh Abewa

17 Feb 2019

How can I put/display the first column from numeric to text?

Kassambara

You can simply use this:

 my_data[[1]] <- as.character(my_data[[1]])

or use dplyr verbs and specify the column by name:

library(dplyr)
iris <- iris %>%
  mutate(Sepal.Length = as.character(Sepal.Length))

Abdoulaye Sarr

08 Oct 2019

I am trying to put my data into a format compatible with HiClimR, like the TestCase data set of the package:

$x
               1949  1950  1951  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961  1962  1963
-19.75,-39.75    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
-19.75,-38.75    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
$lon
 [1] -19.75 -18.75 -17.75 -16.75 -15.75 -14.75 -13.75 -12.75 -11.75 -10.75  -9.75  -8.75  -7.75  -6.75
[15]  
$lat
 [1] -39.75 -38.75 -37.75 -36.75 -35.75 -34.75 -33.75 -32.75 -31.75 -30.75 -29.75 -28.75 -27.75 -26.75
[15] -

My data is in NetCDF format, which I read using the commands below:
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")
time <- ncvar_get(nc, "time")
pr <- ncvar_get(nc, "pre")
How can I create a data frame compatible with HiClimR, similar to the TestCase in the package?

Kassambara

08 Oct 2019

Your question is very specific to the HiClimR package. You need to refer to the package documentation.

18 Dec 2019

I have a matrix with column data as years as date but when using as.Date it expects something %y%m%d how to rename column to %Y only as date but not character?
example 2001-01-01 rename as 2001

Jing Lyu

07 May 2020

Hi, the courses only have text, no video?

Kassambara

07 May 2020

Hi, there is no video for the course

Azzeddine REGHAIS

02 Jan 2021

How can I start lessons

Andi

24 Feb 2021

May i know how you create those green chunks and that check mark at the top left corner?

Oh, how to add those square icon of unordered list?

Teachers

Alboukadel Kassambara

Founder of Datanovia
