The traditional clustering methods, such as hierarchical clustering and k-means clustering, are heuristic and are not based on formal models. Furthermore, k-means algorithm is commonly randomnly initialized, so different runs of k-means will often yield different results. Additionally, k-means requires the user to specify the the optimal number of clusters.
An alternative is model-based clustering, which consider the data as coming from a distribution that is mixture of two or more clusters (Fraley and Raftery 2002, Fraley et al. (2012)). Unlike k-means, the model-based clustering uses a soft assignment, where each data point has a probability of belonging to each cluster.
In this chapter, we illustrate model-based clustering using the R package mclust.
Concept of model-based clustering
In model-based clustering, the data is considered as coming from a mixture of density.
Each component (i.e. cluster) k is modeled by the normal or Gaussian distribution which is characterized by the parameters:
- \(\mu_k\): mean vector,
- \(\sum_k\): covariance matrix,
- An associated probability in the mixture. Each point has a probability of belonging to each cluster.
For example, consider the “old faithful geyser data” [in MASS R package], which can be illustrated as follow using the ggpubr R package:
# Load the data library("MASS") data("geyser") # Scatter plot library("ggpubr") ggscatter(geyser, x = "duration", y = "waiting")+ geom_density2d() # Add 2D density
The plot above suggests at least 3 clusters in the mixture. The shape of each of the 3 clusters appears to be approximately elliptical suggesting three bivariate normal distributions. As the 3 ellipses seems to be similar in terms of volume, shape and orientation, we might anticipate that the three components of this mixture might have homogeneous covariance matrices.
Estimating model parameters
The model parameters can be estimated using the Expectation-Maximization (EM) algorithm initialized by hierarchical model-based clustering. Each cluster k is centered at the means \(\mu_k\), with increased density for points near the mean.
Geometric features (shape, volume, orientation) of each cluster are determined by the covariance matrix \(\sum_k\).
Different possible parameterizations of \(\sum_k\) are available in the R package mclust (see ?mclustModelNames).
The available model options, in mclust package, are represented by identifiers including: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.
The first identifier refers to volume, the second to shape and the third to orientation. E stands for “equal”, V for “variable” and I for “coordinate axes”.
- EVI denotes a model in which the volumes of all clusters are equal (E), the shapes of the clusters may vary (V), and the orientation is the identity (I) or “coordinate axes.
- EEE means that the clusters have the same volume, shape and orientation in p-dimensional space.
- VEI means that the clusters have variable volume, the same shape and orientation equal to coordinate axes.
Choosing the best model
The Mclust package uses maximum likelihood to fit all these models, with different covariance matrix parameterizations, for a range of k components.
The best model is selected using the Bayesian Information Criterion or BIC. A large BIC score indicates strong evidence for the corresponding model.
Computing model-based clustering in R
We start by installing the mclust package as follow: install.packages(“mclust”)
Note that, model-based clustering can be applied on univariate or multivariate data.
Here, we illustrate model-based clustering on the diabetes data set [mclust package] giving three measurements and the diagnosis for 145 subjects described as follow:
library("mclust") data("diabetes") head(diabetes, 3)
## class glucose insulin sspg ## 1 Normal 80 356 124 ## 2 Normal 97 289 117 ## 3 Normal 105 319 143
- class: the diagnosis: normal, chemically diabetic, and overtly diabetic. Excluded from the cluster analysis.
- glucose: plasma glucose response to oral glucose
- insulin: plasma insulin response to oral glucose
- sspg: steady-state plasma glucose (measures insulin resistance)
Model-based clustering can be computed using the function Mclust() as follow:
library(mclust) df <- scale(diabetes[, -1]) # Standardize the data mc <- Mclust(df) # Model-based-clustering
summary(mc) # Print a summary
## ---------------------------------------------------- ## Gaussian finite mixture model fitted by EM algorithm ## ---------------------------------------------------- ## ## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components: ## ## log.likelihood n df BIC ICL ## -169 145 29 -483 -501 ## ## Clustering table: ## 1 2 3 ## 81 36 28
For this data, it can be seen that model-based clustering selected a model with three components (i.e. clusters). The optimal selected model name is VVV model. That is the three components are ellipsoidal with varying volume, shape, and orientation. The summary contains also the clustering table specifying the number of observations in each clusters.
You can access to the results as follow:
mc$modelName # Optimal selected model ==> "VVV" mc$G # Optimal number of cluster => 3 head(mc$z, 30) # Probality to belong to a given cluster head(mc$classification, 30) # Cluster assignement of each observation
Visualizing model-based clustering
Model-based clustering results can be drawn using the base function plot.Mclust() [in mclust package]. Here we’ll use the function fviz_mclust() [in factoextra package] to create beautiful plots based on ggplot2.
In the situation, where the data contain more than two variables, fviz_mclust() uses a principal component analysis to reduce the dimensionnality of the data. The first two principal components are used to produce a scatter plot of the data. However, if you want to plot the data using only two variables of interest, let say here c(“insulin”, “sspg”), you can specify that in the fviz_mclust() function using the argument choose.vars = c(“insulin”, “sspg”).
library(factoextra) # BIC values used for choosing the number of clusters fviz_mclust(mc, "BIC", palette = "jco") # Classification: plot showing the clustering fviz_mclust(mc, "classification", geom = "point", pointsize = 1.5, palette = "jco") # Classification uncertainty fviz_mclust(mc, "uncertainty", palette = "jco")
Note that, in the uncertainty plot, larger symbols indicate the more uncertain observations.
Fraley, Chris, and Adrian E Raftery. 2002. “Model-Based Clustering, Discriminant Analysis, and Density Estimation.” Journal of the American Statistical Association 97 (458): 611–31.
Fraley, Chris, Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca. 2012. “Mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.” Technical Report No. 597, Department of Statistics, University of Washington. https://www.stat.washington.edu/research/reports/2012/tr597.pdf.