# Data Manipulation in R

## Compute Summary Statistics in R

This tutorial shows how to compute statistical summaries in R using the dplyr package.

You will learn how to:

• Compute summary statistics for ungrouped data, as well as for data grouped by one or more variables. R functions: summarise() and group_by().
• Summarise multiple columns at once. R functions:
  • summarise_all(): apply summary functions to every column in the data frame.
  • summarise_at(): apply summary functions to specific columns selected with a character vector.
  • summarise_if(): apply summary functions to columns for which a predicate function returns TRUE.

## Required packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

## Demo dataset

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 144 more rows

## Summary statistics of ungrouped data

Compute the mean of Sepal.Length and Petal.Length as well as the number of observations using the function n():

my_data %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length, na.rm = TRUE),
    mean_pet = mean(Petal.Length, na.rm = TRUE)
  )
## # A tibble: 1 x 3
##   count mean_sep mean_pet
##   <int>    <dbl>    <dbl>
## 1   150     5.84     3.76

Note that we used the additional argument na.rm = TRUE to remove missing values (NAs) before computing the means.
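To see why na.rm matters, here is a minimal sketch on a small made-up vector (x is hypothetical and not part of the iris data). Without na.rm = TRUE, a single missing value makes the mean missing as well:

```r
# A small illustrative vector containing one missing value
x <- c(1, 2, NA, 4)

mean(x)                # NA: the missing value propagates
mean(x, na.rm = TRUE)  # mean of 1, 2 and 4 only
```

The iris data contain no missing values, so na.rm = TRUE does not change the results here; it is a safeguard for data that do.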

## Summary statistics of grouped data

Key R functions: group_by() and summarise()

### Group by one variable

my_data %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length),
    mean_pet = mean(Petal.Length)
  )
## # A tibble: 3 x 4
##   Species    count mean_sep mean_pet
##   <fct>      <int>    <dbl>    <dbl>
## 1 setosa        50     5.01     1.46
## 2 versicolor    50     5.94     4.26
## 3 virginica     50     6.59     5.55

Note that multiple operations can be chained with the magrittr forward-pipe operator, %>%. For example, x %>% f is equivalent to f(x).

In the R code above:

• first, my_data is passed to the group_by() function
• next, the output of group_by() is passed to the summarise() function
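As a sketch of this equivalence, the same computation can be written with the pipe or as nested function calls. The example uses iris directly so that it runs on its own; the names piped and nested are illustrative only.

```r
library(dplyr)

# Piped form: read left to right, top to bottom
piped <- iris %>%
  group_by(Species) %>%
  summarise(mean_sep = mean(Sepal.Length))

# Nested form: the same calls, read from the inside out
nested <- summarise(group_by(iris, Species), mean_sep = mean(Sepal.Length))

# Both forms give the same group means
all.equal(piped$mean_sep, nested$mean_sep)
```

The piped form is usually easier to read once more than two steps are chained.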

### Group by multiple variables

# ToothGrowth demo data set
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
# Summarize
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n = n(),
    mean = mean(len),
    sd = sd(len)
  )
## # A tibble: 6 x 5
## # Groups:   supp [?]
##   supp   dose     n  mean    sd
##   <fct> <dbl> <int> <dbl> <dbl>
## 1 OJ      0.5    10 13.2   4.46
## 2 OJ      1      10 22.7   3.91
## 3 OJ      2      10 26.1   2.66
## 4 VC      0.5    10  7.98  2.75
## 5 VC      1      10 16.8   2.52
## 6 VC      2      10 26.1   4.80
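The "# Groups: supp" line in the output shows that the result is still grouped by supp: by default, summarise() removes only the last grouping variable (here dose). If an ungrouped result is wanted, one option is to finish with ungroup(), as in the sketch below (the name res is illustrative). In dplyr 1.0.0 and later, summarise() also accepts a .groups argument for the same purpose.

```r
library(dplyr)

res <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(n = n(), mean = mean(len), sd = sd(len)) %>%
  ungroup()  # drop the remaining grouping by supp

group_vars(res)  # character(0): no grouping variables remain
```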

## Summarise multiple variables

### Key R functions

The functions summarise_all(), summarise_at() and summarise_if() can be used to summarise multiple columns at once.

The simplified formats are as follows:

summarise_all(.tbl, .funs, ...)
summarise_if(.tbl, .predicate, .funs, ...)
summarise_at(.tbl, .vars, .funs, ...)
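• .tbl: a tbl data frame.
• .vars: the columns to summarise, given as a character vector of column names, a numeric vector of column positions, or a list of columns generated by vars().
• .funs: a function, a character vector of function names, or a list of function calls generated by funs(). Note that funs() is deprecated as of dplyr 0.8.0; a named list of functions, such as list(mean = mean, sd = sd), is used instead.
• …: additional arguments for the function calls in .funs.
• .predicate: a predicate function to be applied to the columns, or a logical vector. The variables for which .predicate is or returns TRUE are selected.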
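As an illustration of .funs, the sketch below passes a named list of two functions to summarise_at(). It assumes dplyr 0.8.0 or later, where a named list can be supplied in place of funs(); the name res is illustrative. Each output column name combines the column name and the function name.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise_at(
    c("Sepal.Length", "Sepal.Width"),  # .vars: columns to summarise
    list(mean = mean, sd = sd)         # .funs: named list of functions
  )

# One column per (variable, function) pair, e.g. Sepal.Length_mean
names(res)
```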

### Summarise variables

• Summarise all variables - compute the mean of all variables:
my_data %>%
  group_by(Species) %>%
  summarise_all(mean)
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33
## 3 virginica          6.59        2.97         5.55       2.03
• Summarise specific variables selected with a character vector:
my_data %>%
  group_by(Species) %>%
  summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
• Summarise specific variables selected with a predicate function:
my_data %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
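In dplyr 1.0.0 and later, these scoped functions are superseded by across(), used inside summarise(). The following sketch shows what the three calls might look like in that style. It assumes dplyr 1.0.0 or later and uses iris directly; the names by_all, by_name and by_type are illustrative.

```r
library(dplyr)

# Counterpart of summarise_all(mean)
by_all <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

# Counterpart of summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
by_name <- iris %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Sepal.Width), ~ mean(.x, na.rm = TRUE)))

# Counterpart of summarise_if(is.numeric, mean, na.rm = TRUE)
by_type <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

The scoped functions still work, so existing code does not need to change; across() is simply the style used in more recent dplyr documentation.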

## Useful statistical summary functions

This section presents some R functions for computing statistical summaries.

Measure of location:

• mean(x): sum of x divided by its length
• median(x): the middle value; 50% of x lies above it and 50% below

Measure of variation:

• sd(x): standard deviation
• IQR(x): interquartile range (robust equivalent of sd when outliers are present in the data)
• mad(x): median absolute deviation (robust equivalent of sd when outliers are present in the data)
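To illustrate what "robust" means here, the sketch below compares the three measures on a small made-up vector before and after adding one extreme value (x and x_out are hypothetical data). The standard deviation grows sharply, while IQR and MAD change only slightly.

```r
x     <- c(10, 11, 12, 13, 14)
x_out <- c(x, 100)  # same data plus one outlier

# Compare each measure of variation with and without the outlier
c(sd = sd(x),   sd_outlier = sd(x_out))
c(iqr = IQR(x), iqr_outlier = IQR(x_out))
c(mad = mad(x), mad_outlier = mad(x_out))
```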

Measure of rank:

• min(x): minimum value of x
• max(x): maximum value of x
• quantile(x, 0.25): 25% of x is below this value

Measure of position:

• first(x): equivalent to x[1]
• nth(x, 2): equivalent to n<-2; x[n]
• last(x): equivalent to x[length(x)]
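A quick check of these equivalences on a small made-up vector (x is hypothetical):

```r
library(dplyr)

x <- c(5, 3, 8)

first(x)   == x[1]          # TRUE
nth(x, 2)  == x[2]          # TRUE
last(x)    == x[length(x)]  # TRUE
```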

Counts:

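• n(): the number of observations in the current group; it takes no arguments and can only be used inside dplyr verbs such as summarise()
• sum(!is.na(x)): count the non-missing values
• n_distinct(x): count the number of unique values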
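The three counts can differ when missing values are present. The sketch below uses a small made-up data frame (df and its columns g and x are hypothetical) to show them side by side. Note that n_distinct() counts NA as a distinct value unless na.rm = TRUE is given.

```r
library(dplyr)

df <- tibble(
  g = c("a", "a", "b"),
  x = c(1, NA, 1)
)

counts <- df %>%
  group_by(g) %>%
  summarise(
    rows        = n(),            # all rows in the group
    non_missing = sum(!is.na(x)), # rows where x is not NA
    distinct    = n_distinct(x)   # unique values of x, NA included
  )
counts
```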

Counts and proportions of logical values:

• sum(x > 10): count the number of elements where x > 10
• mean(y == 0): proportion of elements where y = 0
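These work because TRUE is treated as 1 and FALSE as 0 in arithmetic, so the sum of a logical vector is a count and its mean is a proportion. A minimal sketch on iris, where the threshold of 6 and the name long_sepals are chosen only for illustration:

```r
library(dplyr)

long_sepals <- iris %>%
  group_by(Species) %>%
  summarise(
    n_long    = sum(Sepal.Length > 6),   # number of flowers with sepals longer than 6
    prop_long = mean(Sepal.Length > 6)   # the same, as a proportion of the group
  )
long_sepals
```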

## Summary

In this tutorial, we described how to compute statistical summaries using the R functions summarise() and group_by() from the dplyr package.
