This tutorial introduces how to easily compute **statistical summaries** in R using the **dplyr** package.

You will learn how to:

- Compute summary statistics for ungrouped data, as well as for data grouped by one or multiple variables. R functions: **summarise**() and **group_by**().
- Summarise multiple variable columns. R functions:
    - **summarise_all**(): apply summary functions to every column in the data frame.
    - **summarise_at**(): apply summary functions to specific columns selected with a character vector.
    - **summarise_if**(): apply summary functions to columns selected with a predicate function that returns TRUE.

## Required packages

Load the `tidyverse` packages, which include `dplyr`:

`library(tidyverse)`

## Demo dataset

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

```
my_data <- as_tibble(iris)
my_data
```

```
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## # ... with 144 more rows
```

## Summary statistics of ungrouped data

Compute the mean of Sepal.Length and Petal.Length as well as the number of observations using the function *n*():

```
my_data %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length, na.rm = TRUE),
    mean_pet = mean(Petal.Length, na.rm = TRUE)
  )
```

```
## # A tibble: 1 x 3
## count mean_sep mean_pet
## <int> <dbl> <dbl>
## 1 150 5.84 3.76
```

Note that we used the additional argument *na.rm* to remove NAs before computing the means.
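
The iris data contain no missing values, so the effect of *na.rm* is not visible above. As a minimal sketch, assuming a small hypothetical data frame `toy` (not part of the original tutorial) with one missing value, the difference looks like this:

```r
library(dplyr)

# Hypothetical toy data with one missing value (for illustration only)
toy <- tibble(x = c(1, 2, NA, 5))

res <- toy %>%
  summarise(
    mean_default = mean(x),               # the NA propagates, so the result is NA
    mean_na_rm   = mean(x, na.rm = TRUE)  # mean of the non-missing values 1, 2 and 5
  )
res
```

Without `na.rm = TRUE`, a single missing value makes the whole summary missing.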

## Summary statistics of grouped data

Key R functions: `group_by()` and `summarise()`

### Group by one variable

```
my_data %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length),
    mean_pet = mean(Petal.Length)
  )
```

```
## # A tibble: 3 x 4
## Species count mean_sep mean_pet
## <fct> <int> <dbl> <dbl>
## 1 setosa 50 5.01 1.46
## 2 versicolor 50 5.94 4.26
## 3 virginica 50 6.59 5.55
```

Note that it’s possible to combine multiple operations using the *magrittr* forward-pipe operator: *%>%*. For example, *x %>% f* is equivalent to *f(x)*.

In the R code above:

- first, my_data is passed to the group_by() function
- next, the output of group_by() is passed to the summarise() function
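
To make the equivalence concrete, here is a short sketch that writes the same computation in piped and nested form. The object names `piped` and `nested` are illustrative only:

```r
library(dplyr)

# Piped form: read from left to right
piped <- iris %>%
  group_by(Species) %>%
  summarise(mean_sep = mean(Sepal.Length))

# Nested form: the same calls written inside out
nested <- summarise(group_by(iris, Species), mean_sep = mean(Sepal.Length))

# Check that both forms give the same result
all.equal(piped, nested)
```

The piped form tends to be easier to read once more than two operations are chained.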

### Group by multiple variables

```
# ToothGrowth demo data set
head(ToothGrowth)
```

```
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
```

```
# Summarize
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n = n(),
    mean = mean(len),
    sd = sd(len)
  )
```

```
## # A tibble: 6 x 5
## # Groups: supp [?]
## supp dose n mean sd
## <fct> <dbl> <int> <dbl> <dbl>
## 1 OJ 0.5 10 13.2 4.46
## 2 OJ 1 10 22.7 3.91
## 3 OJ 2 10 26.1 2.66
## 4 VC 0.5 10 7.98 2.75
## 5 VC 1 10 16.8 2.52
## 6 VC 2 10 26.1 4.80
```
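
The `# Groups: supp` line in the output above indicates that the result is still grouped by `supp`: `summarise()` removes only the last grouping variable. The following is a minimal sketch of how to return an ungrouped result, assuming dplyr 1.0.0 or later, where the `.groups` argument is available. The object names are illustrative:

```r
library(dplyr)

# Default behaviour: the result stays grouped by supp
res_default <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(mean = mean(len))
group_vars(res_default)

# Drop all grouping in the result (assumes dplyr >= 1.0.0)
res_drop <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(mean = mean(len), .groups = "drop")
group_vars(res_drop)
```

With older dplyr versions, piping the result into `ungroup()` should achieve the same effect.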

## Summarise multiple variables

### Key R functions

The functions `summarise_all()`, `summarise_at()` and `summarise_if()` can be used to summarise multiple columns at once.

The simplified formats are as follows:

```
summarise_all(.tbl, .funs, ...)
summarise_if(.tbl, .predicate, .funs, ...)
summarise_at(.tbl, .vars, .funs, ...)
```

- .tbl: a tbl data frame
- .vars: a character vector of column names, or a selection of columns generated by `vars()`.
- .funs: a function, a character vector of function names, or a list of functions. Older code builds this list with `funs()`, which has been deprecated since dplyr 0.8.0.
- …: additional arguments for the function calls in .funs.
- .predicate: a predicate function to be applied to the columns, or a logical vector. The variables for which .predicate is or returns TRUE are selected.
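
As a sketch of how `.funs` and `…` work together, the example below passes a named list of two functions and forwards `na.rm = TRUE` to both of them. It assumes dplyr 0.8.0 or later, where a named list of functions is accepted. The choice of columns is illustrative:

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise_at(
    c("Sepal.Length", "Petal.Length"),  # .vars: columns to summarise
    list(mean = mean, sd = sd),         # .funs: named list of functions
    na.rm = TRUE                        # ...: forwarded to mean() and sd()
  )

# Output columns are named <column>_<function>, e.g. Sepal.Length_mean
names(res)
```

Naming the list elements is what produces the `_mean` and `_sd` suffixes in the output column names.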

### Summarise variables

- Summarise all variables - compute the mean of all variables:

```
my_data %>%
  group_by(Species) %>%
  summarise_all(mean)
```

```
## # A tibble: 3 x 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
```

- Summarise specific variables selected with a character vector:

```
my_data %>%
  group_by(Species) %>%
  summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
```

- Summarise specific variables selected with a predicate function:

```
my_data %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
```
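
In dplyr 1.0.0 and later, the scoped verbs `summarise_all()`, `summarise_at()` and `summarise_if()` are superseded by `across()` used inside `summarise()`. The sketch below, which assumes dplyr >= 1.0.0, shows one possible way to write the last two examples in that style. The object names are illustrative:

```r
library(dplyr)

# Counterpart of summarise_if(is.numeric, mean, na.rm = TRUE)
res_if <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), function(x) mean(x, na.rm = TRUE)))

# Counterpart of summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
res_at <- iris %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Sepal.Width), function(x) mean(x, na.rm = TRUE)))

res_if
res_at
```

The scoped verbs still work, so existing code does not need to change. The anonymous function is used here so that `na.rm = TRUE` is set explicitly rather than passed through `across()`.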

## Useful statistical summary functions

This section presents some R functions for computing statistical summaries.

Measure of location:

- *mean*(x): sum of x divided by the length
- *median*(x): 50% of x is above and 50% is below

Measure of variation:

- *sd*(x): standard deviation
- *IQR*(x): interquartile range (robust equivalent of sd when outliers are present in the data)
- *mad*(x): median absolute deviation (robust equivalent of sd when outliers are present in the data)

Measure of rank:

- *min*(x): minimum value of x
- *max*(x): maximum value of x
- *quantile*(x, 0.25): 25% of x is below this value

Measure of position:

- *first*(x): equivalent to x[1]
- *nth*(x, 2): equivalent to n <- 2; x[n]
- *last*(x): equivalent to x[length(x)]

Counts:

- *n*(): the number of rows in the current group (takes no argument)
- *sum*(!is.na(x)): count non-missing values
- *n_distinct*(x): count the number of unique values

Counts and proportions of logical values:

- *sum*(x > 10): count the number of elements where x > 10
- *mean*(y == 0): proportion of elements where y = 0
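
As an illustrative sketch, several of the functions listed above can be combined in a single `summarise()` call. The output column names and the threshold of 6 are arbitrary choices made for this example:

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    n            = n(),                           # rows per group
    median_sl    = median(Sepal.Length),          # location
    iqr_sl       = IQR(Sepal.Length),             # robust variation
    mad_sl       = mad(Sepal.Length),             # robust variation
    min_sl       = min(Sepal.Length),             # rank
    q25_sl       = quantile(Sepal.Length, 0.25),  # rank
    first_sl     = first(Sepal.Length),           # position
    n_unique     = n_distinct(Sepal.Length),      # number of distinct values
    n_above_6    = sum(Sepal.Length > 6),         # count of a logical condition
    prop_above_6 = mean(Sepal.Length > 6)         # proportion of a logical condition
  )
res
```

Each expression is evaluated once per group, so the result has one row per species.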

## Summary

In this tutorial, we described how to easily compute statistical summaries using the R functions `summarise()` and `group_by()` [in the **dplyr** package].
