This tutorial introduces how to compute statistical summaries in R using the dplyr package.
You will learn how to:
- Compute summary statistics for ungrouped data, as well as for data grouped by one or multiple variables. R functions: summarise() and group_by().
- Summarise multiple variable columns. R functions:
  - summarise_all(): apply summary functions to every column in the data frame.
  - summarise_at(): apply summary functions to specific columns selected with a character vector.
  - summarise_if(): apply summary functions to columns selected with a predicate function that returns TRUE.
This tutorial uses the tidyverse packages, which include dplyr.
We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.
# Load the tidyverse, which provides dplyr and as_tibble()
library(tidyverse)
my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 144 more rows
Summary statistics of ungrouped data
Compute the mean of Sepal.Length and Petal.Length as well as the number of observations using the function n():
my_data %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length, na.rm = TRUE),
    mean_pet = mean(Petal.Length, na.rm = TRUE)
  )
## # A tibble: 1 x 3
##   count mean_sep mean_pet
##   <int>    <dbl>    <dbl>
## 1   150     5.84     3.76
Note that we used the additional argument na.rm = TRUE to remove missing values (NAs) before computing the means.
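The iris data set has no missing values, so na.rm makes no difference in the example above. As a minimal sketch of why the argument matters, the following uses a small made-up vector (x is a hypothetical example, not part of the tutorial data) that contains one NA:

```r
# Hypothetical vector with one missing value
x <- c(5.1, NA, 4.7)

# Without na.rm, a single NA makes the whole mean NA
mean_with_na <- mean(x)

# With na.rm = TRUE, the NA is dropped before averaging
mean_without_na <- mean(x, na.rm = TRUE)

mean_with_na      # NA
mean_without_na   # 4.9
```

Other base R summary functions such as sd(), median(), min() and max() accept the same na.rm argument.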
Summary statistics of grouped data
Key R functions: group_by() and summarise()
Group by one variable
my_data %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length),
    mean_pet = mean(Petal.Length)
  )
## # A tibble: 3 x 4
##   Species    count mean_sep mean_pet
##   <fct>      <int>    <dbl>    <dbl>
## 1 setosa        50     5.01     1.46
## 2 versicolor    50     5.94     4.26
## 3 virginica     50     6.59     5.55
Note that it’s possible to combine multiple operations using the magrittr forward-pipe operator, %>%. For example, x %>% f is equivalent to f(x).
In the R code above:
- first, my_data is passed to the group_by() function
- next, the output of group_by() is passed to the summarise() function
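To make the equivalence between x %>% f and f(x) concrete, here is a small sketch that writes the same grouped count in both forms. The object names piped and nested are illustrative only:

```r
library(dplyr)

# Piped form: read from left to right
piped <- iris %>%
  group_by(Species) %>%
  summarise(count = n())

# Equivalent nested form: read from the inside out
nested <- summarise(group_by(iris, Species), count = n())

# Both forms produce the same counts per species
identical(piped$count, nested$count)
```

The piped form tends to be easier to read once more than two steps are chained.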
Group by multiple variables
# ToothGrowth demo data set
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
# Summarize
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n = n(),
    mean = mean(len),
    sd = sd(len)
  )
## # A tibble: 6 x 5
## # Groups:   supp [?]
##   supp   dose     n  mean    sd
##   <fct> <dbl> <int> <dbl> <dbl>
## 1 OJ      0.5    10 13.2   4.46
## 2 OJ      1      10 22.7   3.91
## 3 OJ      2      10 26.1   2.66
## 4 VC      0.5    10  7.98  2.75
## 5 VC      1      10 16.8   2.52
## 6 VC      2      10 26.1   4.80
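The "Groups: supp" line in the output above indicates that the result is still grouped by supp: by default, summarise() removes only the last grouping variable. If a fully ungrouped result is wanted, one option is sketched below. It assumes dplyr 1.0.0 or later, where summarise() accepts a .groups argument; the name by_supp_dose is illustrative only.

```r
library(dplyr)

# Assumes dplyr >= 1.0.0, where summarise() has a .groups argument
by_supp_dose <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n = n(),
    mean = mean(len),
    sd = sd(len),
    .groups = "drop"   # return an ungrouped tibble
  )

# No grouping variables remain
group_vars(by_supp_dose)
```

With older dplyr versions, adding ungroup() at the end of the pipeline gives the same ungrouped result.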
Summarise multiple variables
Key R functions
The functions summarise_all(), summarise_at() and summarise_if() can be used to summarise multiple columns at once.
The simplified formats are as follows:
summarise_all(.tbl, .funs, ...)
summarise_if(.tbl, .predicate, .funs, ...)
summarise_at(.tbl, .vars, .funs, ...)
- .tbl: a tbl data frame
- .funs: a list of function calls generated by funs() (deprecated since dplyr 0.8.0 in favor of a plain list of functions), or a character vector of function names, or simply a function.
- …: Additional arguments for the function calls in .funs.
- .vars: The columns to summarise, for example a character vector of column names.
- .predicate: A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns TRUE are selected.
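As a sketch of the .funs argument described above, several summary functions can be supplied at once as a named list (supported in dplyr 0.8.0 and later). The list names are appended to the column names as suffixes; the object name stats is illustrative only.

```r
library(dplyr)

# A named list of functions: the names (mean, sd) become column suffixes
stats <- iris %>%
  group_by(Species) %>%
  summarise_at(
    c("Sepal.Length", "Sepal.Width"),
    list(mean = mean, sd = sd),
    na.rm = TRUE   # passed on to both mean() and sd()
  )

# Columns such as Sepal.Length_mean and Sepal.Width_sd are created
names(stats)
```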
- Summarise all variables - compute the mean of all variables:
my_data %>%
  group_by(Species) %>%
  summarise_all(mean)
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33
## 3 virginica          6.59        2.97         5.55       2.03
- Summarise specific variables selected with a character vector:
my_data %>%
  group_by(Species) %>%
  summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
- Summarise specific variables selected with a predicate function:
my_data %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
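In dplyr 1.0.0 and later, the scoped verbs summarise_all(), summarise_at() and summarise_if() are superseded by across(), although they continue to work. The following is a sketch of how the three examples above might be written with across(), assuming dplyr 1.0.0 or later; the object names are illustrative only.

```r
library(dplyr)

# Assumes dplyr >= 1.0.0, which provides across()

# Counterpart of summarise_all(mean)
all_means <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

# Counterpart of summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
at_means <- iris %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Sepal.Width), ~ mean(.x, na.rm = TRUE)))

# Counterpart of summarise_if(is.numeric, mean, na.rm = TRUE)
if_means <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

The grouping variable Species is excluded from the across() selection automatically, so everything() covers only the four measurement columns.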
Useful statistical summary functions
This section presents some R functions for computing statistical summaries.
Measure of location:
- mean(x): sum of x divided by the length
- median(x): 50% of x is above and 50% is below
Measure of variation:
- sd(x): standard deviation
- IQR(x): interquartile range (robust equivalent of sd when outliers are present in the data)
- mad(x): median absolute deviation (robust equivalent of sd when outliers are present in the data)
Measure of rank:
- min(x): minimum value of x
- max(x): maximum value of x
- quantile(x, 0.25): 25% of x is below this value
Measure of position:
- first(x): equivalent to x[1]
- nth(x, 2): equivalent to n <- 2; x[n]
- last(x): equivalent to x[length(x)]
Counts:
- n(): the number of rows in the current group (takes no arguments)
- sum(!is.na(x)): count non-missing values
- n_distinct(x): count the number of unique values
Counts and proportions of logical values:
- sum(x > 10): count the number of elements where x > 10
- mean(y == 0): proportion of elements where y = 0
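As an illustration of how the functions listed above can be combined, the sketch below computes one statistic from most categories for each supplement type in the ToothGrowth data. The column names (median_len, n_over_10 and so on) and the threshold of 10 are arbitrary choices for this example.

```r
library(dplyr)

tooth_summary <- ToothGrowth %>%
  group_by(supp) %>%
  summarise(
    median_len  = median(len),        # measure of location
    iqr_len     = IQR(len),           # measure of variation
    min_len     = min(len),           # measure of rank
    first_len   = first(len),         # measure of position
    n_obs       = n(),                # number of rows in the group
    n_doses     = n_distinct(dose),   # number of unique dose values
    n_over_10   = sum(len > 10),      # count of rows where len > 10
    prop_dose_2 = mean(dose == 2)     # proportion of rows where dose is 2
  )

tooth_summary
```

The last two lines work because TRUE and FALSE are treated as 1 and 0, so sum() counts the TRUE values and mean() gives their proportion.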
In this tutorial, we described how to compute statistical summaries using the R functions summarise() and group_by() [in dplyr package].