dplyr: How to Compute Summary Statistics Across Multiple Columns

04 Apr


Alboukadel

Data Manipulation, dplyr, tidyverse



This article describes how to compute summary statistics, such as the mean, standard deviation (sd), and quantiles, across multiple numeric columns.

Key R functions and packages

The dplyr package (version >= 1.0.0) is required. We’ll use the function across() to apply the same computation to multiple columns.

Usage:

across(.cols = everything(), .fns = NULL, ..., .names = NULL)

.cols: Columns you want to operate on. You can pick columns by position, name, function of name, type, or any combination thereof using Boolean operators.
.fns: Function or list of functions to apply to each column.
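...: Additional arguments for the function calls in .fns. Note that dplyr 1.1.0 deprecates passing arguments this way; supply them inside an anonymous function or formula instead, for example ~ mean(.x, na.rm = TRUE).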
.names: A glue specification that describes how to name the output columns. This can use {col} to stand for the selected column name, and {fn} to stand for the name of the function being applied. The default (NULL) is equivalent to "{col}" for the single function case and "{col}_{fn}" for the case where a list is used for .fns.
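To make the arguments above concrete, here is a minimal sketch using the built-in iris data. The selection (numeric columns whose names start with "Petal") and the "mean_{col}" naming pattern are illustrative choices, not something prescribed by dplyr.

```r
library(dplyr)

# .cols: combine a type predicate and a name helper with a Boolean operator
# .fns:  a single function, written as a formula so na.rm can be set inline
# .names: glue pattern; {col} is replaced by each selected column name
res <- iris %>%
  summarise(across(
    .cols = where(is.numeric) & starts_with("Petal"),
    .fns = ~ mean(.x, na.rm = TRUE),
    .names = "mean_{col}"
  ))

res
```

The result should be a one-row data frame with the columns mean_Petal.Length and mean_Petal.Width.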

# Load required R packages
library(dplyr)

# Data preparation
df <- as_tibble(iris)
head(df)

## # A tibble: 6 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa

# Compute the mean of multiple columns
# (na.rm is set inside the formula; passing it through ... is deprecated in dplyr >= 1.1.0)
df %>%
  group_by(Species) %>%
  summarise(across(Sepal.Length:Petal.Length, ~ mean(.x, na.rm = TRUE)))

## # A tibble: 3 x 4
##   Species    Sepal.Length Sepal.Width Petal.Length
## * <fct>             <dbl>       <dbl>        <dbl>
## 1 setosa             5.01        3.43         1.46
## 2 versicolor         5.94        2.77         4.26
## 3 virginica          6.59        2.97         5.55

# Compute the mean and the sd of all numeric columns
# (predicates such as is.numeric must be wrapped in where())
df %>%
  group_by(Species) %>%
  summarise(across(
    .cols = where(is.numeric),
    .fns = list(
      Mean = ~ mean(.x, na.rm = TRUE),
      SD = ~ sd(.x, na.rm = TRUE)
    ),
    .names = "{col}_{fn}"
  ))

## # A tibble: 3 x 9
##   Species Sepal.Length_Me… Sepal.Length_SD Sepal.Width_Mean Sepal.Width_SD Petal.Length_Me… Petal.Length_SD
## * <fct>              <dbl>           <dbl>            <dbl>          <dbl>            <dbl>           <dbl>
## 1 setosa              5.01           0.352             3.43          0.379             1.46           0.174
## 2 versic…             5.94           0.516             2.77          0.314             4.26           0.470
## 3 virgin…             6.59           0.636             2.97          0.322             5.55           0.552
## # … with 2 more variables: Petal.Width_Mean <dbl>, Petal.Width_SD <dbl>
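
Quantiles can be computed the same way. The following is a minimal sketch, assuming one output column per quantile is wanted; the labels Q25, Median, and Q75 are illustrative names chosen for this example. Each function returns a single value per group, so the code works with summarise() without relying on multi-row summaries.

```r
library(dplyr)

# One named function per statistic; names = FALSE drops the "25%" style
# labels that quantile() attaches to its result by default.
res_q <- iris %>%
  group_by(Species) %>%
  summarise(across(
    .cols = where(is.numeric),
    .fns = list(
      Q25 = ~ quantile(.x, probs = 0.25, na.rm = TRUE, names = FALSE),
      Median = ~ median(.x, na.rm = TRUE),
      Q75 = ~ quantile(.x, probs = 0.75, na.rm = TRUE, names = FALSE)
    ),
    .names = "{col}_{fn}"
  ))

res_q
```

With four numeric columns and three statistics, the result has one row per species and 13 columns (Species plus 12 summaries).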