Display Beautiful Summary Statistics in R Using the skimr Package




This article describes how to quickly display summary statistics using the R package skimr.

skimr handles different data types and returns a skim_df object, which can be used in a tidyverse pipeline or printed as a readable summary.

Key features of skimr:

  • Provides a larger set of statistics than the base R function summary(), including the number of missing values (n_missing), the completion rate (complete_rate), and the standard deviation (sd).
  • Reports each data type separately.
  • Handles dates, logicals, and a variety of other types.
  • Supports inline spark-bar histograms and spark-line graphs.



Prerequisite

Install the stable version from CRAN:

install.packages("skimr")

Load the package:

library(skimr)

Summarize a whole dataset

skim(iris)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Species        0          1              FALSE    3         set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd    p0   p25  p50   p75  p100  hist
Sepal.Length   0          1              5.84  0.83  4.3  5.1  5.80  6.4  7.9   ▆▇▇▅▂
Sepal.Width    0          1              3.06  0.44  2.0  2.8  3.00  3.3  4.4   ▁▆▇▂▁
Petal.Length   0          1              3.76  1.77  1.0  1.6  4.35  5.1  6.9   ▇▁▆▇▂
Petal.Width    0          1              1.20  0.76  0.1  0.3  1.30  1.8  2.5   ▇▁▇▅▃
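
The summary printed above is a skim_df, so it can also be processed like a regular data frame. The following is a minimal sketch, assuming skimr version 2 or later (where statistic columns carry a type prefix such as numeric.mean) and that dplyr is installed. The object names skimmed, numeric_means, and numeric_part are only illustrative.

```r
library(skimr)

skimmed <- skim(iris)

# The result carries the skim_df class and is also a data frame.
inherits(skimmed, "skim_df")
is.data.frame(skimmed)

# Statistic columns are prefixed with the variable type (e.g. numeric.mean),
# so they can be filtered and selected like any other column.
numeric_means <- skimmed %>%
  dplyr::filter(skim_type == "numeric") %>%
  dplyr::select(skim_variable, numeric.mean)

# yank() extracts the sub-table for a single type and drops the type prefix.
numeric_part <- yank(skimmed, "numeric")
names(numeric_part)
```

Using yank() is convenient when only one variable type is of interest, while filtering on skim_type keeps everything in a single pipeline.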

Select specific columns to summarize

skim(iris, Sepal.Length, Petal.Length)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd    p0   p25  p50   p75  p100  hist
Sepal.Length   0          1              5.84  0.83  4.3  5.1  5.80  6.4  7.9   ▆▇▇▅▂
Petal.Length   0          1              3.76  1.77  1.0  1.6  4.35  5.1  6.9   ▇▁▆▇▂
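
Besides listing column names one by one, columns can be chosen with tidyselect helpers. This is a small sketch, assuming the tidyselect package is available (it is installed as a dependency of skimr). The name sepal_summary is only illustrative.

```r
library(skimr)

# Summarize every column whose name starts with "Sepal".
sepal_summary <- skim(iris, tidyselect::starts_with("Sepal"))

# List the variables that were summarized.
sepal_summary$skim_variable
```

This is handy for wide datasets where typing each column name would be tedious.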

Handle grouped data

skim() can handle data that has been grouped using dplyr::group_by(). Each statistic is then reported separately for every group.

iris %>% 
  dplyr::group_by(Species) %>% 
  skim() 
Data summary
Name Piped data
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 4
________________________
Group variables Species

Variable type: numeric

skim_variable  Species     n_missing  complete_rate  mean  sd    p0   p25   p50   p75   p100  hist
Sepal.Length   setosa      0          1              5.01  0.35  4.3  4.80  5.00  5.20  5.8   ▃▃▇▅▁
Sepal.Length   versicolor  0          1              5.94  0.52  4.9  5.60  5.90  6.30  7.0   ▂▇▆▃▃
Sepal.Length   virginica   0          1              6.59  0.64  4.9  6.23  6.50  6.90  7.9   ▁▃▇▃▂
Sepal.Width    setosa      0          1              3.43  0.38  2.3  3.20  3.40  3.68  4.4   ▁▃▇▅▂
Sepal.Width    versicolor  0          1              2.77  0.31  2.0  2.52  2.80  3.00  3.4   ▁▅▆▇▂
Sepal.Width    virginica   0          1              2.97  0.32  2.2  2.80  3.00  3.18  3.8   ▂▆▇▅▁
Petal.Length   setosa      0          1              1.46  0.17  1.0  1.40  1.50  1.58  1.9   ▁▃▇▃▁
Petal.Length   versicolor  0          1              4.26  0.47  3.0  4.00  4.35  4.60  5.1   ▂▂▇▇▆
Petal.Length   virginica   0          1              5.55  0.55  4.5  5.10  5.55  5.88  6.9   ▃▇▇▃▂
Petal.Width    setosa      0          1              0.25  0.11  0.1  0.20  0.20  0.30  0.6   ▇▂▂▁▁
Petal.Width    versicolor  0          1              1.33  0.20  1.0  1.20  1.30  1.50  1.8   ▅▇▃▆▁
Petal.Width    virginica   0          1              2.03  0.27  1.4  1.80  2.00  2.30  2.5   ▂▇▆▅▇
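
The grouped summary can be narrowed down to compare a single variable across groups. The sketch below assumes dplyr is installed and skimr version 2 or later. The name petal_by_species is only illustrative.

```r
library(skimr)

petal_by_species <- iris %>%
  dplyr::group_by(Species) %>%
  skim() %>%
  # Keep only the rows describing Petal.Length (one row per species).
  dplyr::filter(skim_variable == "Petal.Length")

# Inspect the per-group mean and standard deviation.
petal_by_species[, c("Species", "numeric.mean", "numeric.sd")]
```

Because the grouping variable appears as an ordinary column, the result can be sorted, joined, or passed on to other tidyverse functions.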

Specify your own statistics and classes

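Users can specify their own statistics by passing a skimr function list, created with sfl(), to the skim_with() function. skim_with() returns a new skimming function, and its arguments can target any named class found in your data. Setting append = FALSE replaces the default statistics for that class instead of adding to them.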

my_skim <- skim_with(
  numeric = sfl(iqr = IQR, mad = mad, p99 = ~ quantile(., probs = .99)),
  append = FALSE
)
my_skim(iris, Sepal.Length)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate  iqr  mad   p99
Sepal.Length   0          1              1.3  1.04  7.7
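
The example above replaces the default numeric statistics. As a complementary sketch, assuming skimr version 2 or later, leaving append at its default (TRUE) adds a statistic to the defaults, and setting a statistic to NULL removes it. The names skim_plus_iqr and skim_no_hist are only illustrative.

```r
library(skimr)

# Add the interquartile range to the default numeric statistics.
skim_plus_iqr <- skim_with(numeric = sfl(iqr = IQR))
with_iqr <- skim_plus_iqr(iris, Sepal.Length)

# Drop the inline histogram from the default numeric statistics.
skim_no_hist <- skim_with(numeric = sfl(hist = NULL))
without_hist <- skim_no_hist(iris, Sepal.Length)

names(with_iqr)
names(without_hist)
```

Removing the histogram can be useful when the output is rendered in an environment that does not display the spark characters correctly.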








