This tutorial introduces how to easily compute **statistical summaries** in R using the **dplyr** package.

You will learn how to:

- Compute summary statistics for ungrouped data, as well as for data grouped by one or multiple variables. R functions: **summarise**() and **group_by**().
- Summarise multiple variable columns. R functions:
    - **summarise_all**(): apply summary functions to every column in the data frame.
    - **summarise_at**(): apply summary functions to specific columns selected with a character vector.
    - **summarise_if**(): apply summary functions to columns selected with a predicate function that returns TRUE.

## Required packages

Load the `tidyverse` packages, which include `dplyr`:

`library(tidyverse)`

## Demo dataset

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

```
my_data <- as_tibble(iris)
my_data
```

```
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## # ... with 144 more rows
```

## Summary statistics of ungrouped data

Compute the mean of Sepal.Length and Petal.Length as well as the number of observations using the function *n*():

```
my_data %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length, na.rm = TRUE),
    mean_pet = mean(Petal.Length, na.rm = TRUE)
  )
```

```
## # A tibble: 1 x 3
## count mean_sep mean_pet
## <int> <dbl> <dbl>
## 1 150 5.84 3.76
```

Note that we used the additional argument *na.rm* to remove NAs before computing the means.
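The iris data contain no missing values, so the effect of *na.rm* is not visible above. The following minimal sketch uses a small made-up data frame (`toy` is a hypothetical name, not part of the tutorial data) to show what happens with and without it:

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(x = c(2, 4, NA, 6))

res <- toy %>%
  summarise(
    mean_default = mean(x),               # NA, because x contains a missing value
    mean_na_rm   = mean(x, na.rm = TRUE)  # 4, computed on 2, 4 and 6 only
  )
res
```

Without *na.rm = TRUE*, a single missing value is enough to make the whole summary NA.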

## Summary statistics of grouped data

Key R functions: `group_by()` and `summarise()`

### Group by one variable

```
my_data %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length),
    mean_pet = mean(Petal.Length)
  )
```

```
## # A tibble: 3 x 4
## Species count mean_sep mean_pet
## <fct> <int> <dbl> <dbl>
## 1 setosa 50 5.01 1.46
## 2 versicolor 50 5.94 4.26
## 3 virginica 50 6.59 5.55
```

Note that it’s possible to combine multiple operations using the *magrittr* forward-pipe operator: *%>%*. For example, *x %>% f* is equivalent to *f(x)*.

In the R code above:

- first, my_data is passed to the group_by() function
- next, the output of group_by() is passed to the summarise() function
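To make the equivalence concrete, here is a short sketch that writes the same computation once as nested function calls and once with the pipe. The object names `nested` and `piped` are only illustrative:

```r
library(dplyr)

my_data <- as_tibble(iris)

# Nested form: read from the inside out
nested <- summarise(group_by(my_data, Species), mean_sep = mean(Sepal.Length))

# Piped form: read from top to bottom
piped <- my_data %>%
  group_by(Species) %>%
  summarise(mean_sep = mean(Sepal.Length))

# Both forms compute the same group means
all.equal(nested$mean_sep, piped$mean_sep)
```

The piped form tends to be easier to read as the number of steps grows, because each operation sits on its own line in the order it is applied.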

### Group by multiple variables

```
# ToothGrowth demo data sets
head(ToothGrowth)
```

```
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
```

```
# Summarize
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n = n(),
    mean = mean(len),
    sd = sd(len)
  )
```

```
## # A tibble: 6 x 5
## # Groups: supp [?]
## supp dose n mean sd
## <fct> <dbl> <int> <dbl> <dbl>
## 1 OJ 0.5 10 13.2 4.46
## 2 OJ 1 10 22.7 3.91
## 3 OJ 2 10 26.1 2.66
## 4 VC 0.5 10 7.98 2.75
## 5 VC 1 10 16.8 2.52
## 6 VC 2 10 26.1 4.80
```
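The `Groups: supp` line in the output indicates that the result is still grouped by supp: by default, summarise() removes only the last grouping variable. If an ungrouped result is preferred, one option is the `.groups` argument. This sketch assumes dplyr 1.0.0 or later, where that argument is available:

```r
library(dplyr)

by_supp_dose <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n = n(),
    mean = mean(len),
    sd = sd(len),
    .groups = "drop"  # return a result with no grouping left
  )

# Lists the remaining grouping variables (none after .groups = "drop")
group_vars(by_supp_dose)
```

With older dplyr versions, piping the result into ungroup() is an alternative way to remove the remaining grouping.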

## Summarise multiple variables

### Key R functions

The functions `summarise_all()`, `summarise_at()` and `summarise_if()` can be used to summarise multiple columns at once.

The simplified formats are as follows:

```
summarise_all(.tbl, .funs, ...)
summarise_if(.tbl, .predicate, .funs, ...)
summarise_at(.tbl, .vars, .funs, ...)
```

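- .tbl: a tbl data frame
- .funs: a function, a character vector of function names, or a list of function calls generated by `funs()`. Note that `funs()` is deprecated as of dplyr 0.8.0; a list of functions can be used instead.
- …: Additional arguments for the function calls in .funs.
- .predicate: A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns TRUE are selected.
- .vars: a character vector of column names, or a selection of columns generated by `vars()`.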
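As an illustration of the .funs and .vars arguments, the sketch below applies two functions at once by passing a named list, and selects the columns with `vars()`. It assumes dplyr 0.8.0 or later; the result name `res` is only illustrative:

```r
library(dplyr)

my_data <- as_tibble(iris)

res <- my_data %>%
  group_by(Species) %>%
  summarise_at(
    vars(Sepal.Length, Sepal.Width),  # columns to summarise
    list(mean = mean, sd = sd)        # named list: the names become column suffixes
  )

# Output columns are named <variable>_<function>, e.g. Sepal.Length_mean
names(res)
```

Naming the list elements is what makes the output columns distinguishable when more than one function is applied.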

### Summarise variables

- Summarise all variables - compute the mean of all variables:

```
my_data %>%
  group_by(Species) %>%
  summarise_all(mean)
```

```
## # A tibble: 3 x 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
```

- Summarise specific variables selected with a character vector:

```
my_data %>%
  group_by(Species) %>%
  summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
```

- Summarise specific variables selected with a predicate function:

```
my_data %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
```
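In dplyr 1.0.0 and later, the scoped functions `summarise_all()`, `summarise_at()` and `summarise_if()` are superseded by `across()`, although they continue to work. As a sketch for readers on a recent dplyr version, the predicate-based summary can be written either way (the names `scoped` and `with_across` are only illustrative):

```r
library(dplyr)

my_data <- as_tibble(iris)

# Scoped variant, as used in this tutorial
scoped <- my_data %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)

# across() variant: where() selects columns with a predicate function
with_across <- my_data %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# Compare the two results
all.equal(as.data.frame(scoped), as.data.frame(with_across))
```

Both variants skip the grouping column Species and return one row per group.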

## Useful statistical summary functions

This section presents some R functions for computing statistical summaries.

Measure of location:

- *mean*(x): sum of x divided by the length
- *median*(x): 50% of x is above and 50% is below

Measure of variation:

- *sd*(x): standard deviation
- *IQR*(x): interquartile range (robust equivalent of sd when outliers are present in the data)
- *mad*(x): median absolute deviation (robust equivalent of sd when outliers are present in the data)

Measure of rank:

- *min*(x): minimum value of x
- *max*(x): maximum value of x
- *quantile*(x, 0.25): 25% of x is below this value

Measure of position:

- *first*(x): equivalent to x[1]
- *nth*(x, 2): equivalent to n <- 2; x[n]
- *last*(x): equivalent to x[length(x)]

Counts:

- *n*(): the number of rows in the current group (it takes no argument)
- *sum*(!is.na(x)): count non-missing values
- *n_distinct*(x): count the number of unique values

Counts and proportions of logical values:

- *sum*(x > 10): count the number of elements where x > 10
- *mean*(y == 0): proportion of elements where y = 0
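Several of these functions can be combined in a single summarise() call. The sketch below applies a selection of them to Sepal.Length in the iris data; the output column names and the threshold of 5 are arbitrary choices made for illustration:

```r
library(dplyr)

my_data <- as_tibble(iris)

sepal_summary <- my_data %>%
  group_by(Species) %>%
  summarise(
    count        = n(),                           # rows per group
    n_unique     = n_distinct(Sepal.Length),      # distinct values
    median_sep   = median(Sepal.Length),          # location
    iqr_sep      = IQR(Sepal.Length),             # robust spread
    mad_sep      = mad(Sepal.Length),             # robust spread
    q25_sep      = quantile(Sepal.Length, 0.25),  # first quartile
    first_sep    = first(Sepal.Length),           # first value in the group
    last_sep     = last(Sepal.Length),            # last value in the group
    n_above_5    = sum(Sepal.Length > 5),         # count of values above 5
    prop_above_5 = mean(Sepal.Length > 5)         # proportion of values above 5
  )
sepal_summary
```

The last two lines rely on logical values being treated as 0 and 1, so that sum() counts the TRUE values and mean() gives their proportion.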

## Summary

In this tutorial, we described how to easily compute statistical summaries using the R functions `summarise()` and `group_by()` in the **dplyr** package.
