In this chapter, we start by presenting the data format and preparation for cluster analysis. Next, we introduce two main R packages - cluster and factoextra - for computing and visualizing clusters.
Practical Guide to Cluster Analysis
To perform a cluster analysis in R, the data should generally be prepared as follows:
- Rows are observations (individuals) and columns are variables
- Any missing value in the data must be removed or estimated.
- The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that each has a mean of zero and a standard deviation of one. Read more about data standardization in chapter @ref(clustering-distance-measures).
Here, we’ll use the built-in R data set “USArrests”, which contains arrest statistics per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percentage of the population living in urban areas.
data("USArrests")  # Load the data set
df <- USArrests    # Use df as a shorter name
- To remove any missing value that might be present in the data, type this:
df <- na.omit(df)
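The preparation rules above allow missing values to be either removed or estimated. As a minimal sketch of the second option, the code below fills each missing entry with its column mean. USArrests has no missing values, so two are introduced artificially in a hypothetical copy named `df_demo`; mean imputation is only one simple choice of estimate, not a recommendation from this chapter.

```r
data("USArrests")

# Hypothetical copy with two artificially introduced missing values
df_demo <- USArrests
df_demo[c(2, 5), "Rape"] <- NA

# Option 1: drop every row that contains a missing value
df_dropped <- na.omit(df_demo)

# Option 2: replace each missing value with the mean of its column
df_imputed <- df_demo
for (col in names(df_imputed)) {
  missing <- is.na(df_imputed[[col]])
  if (any(missing)) {
    df_imputed[[col]][missing] <- mean(df_imputed[[col]], na.rm = TRUE)
  }
}

nrow(df_dropped)   # rows kept after removal
nrow(df_imputed)   # rows kept after imputation
anyNA(df_imputed)  # checks that no missing value remains
```

Removal discards whole observations, whereas imputation keeps them at the cost of introducing estimated values.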
- As we don’t want the clustering algorithm to depend on an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale():
df <- scale(df)
head(df, n = 3)
##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288
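As a quick sanity check on the standardization step, the sketch below confirms that each scaled column has mean zero and standard deviation one, and that scale() matches the manual computation (subtract the column mean, divide by the column standard deviation). The names `df_scaled` and `df_manual` are introduced here for illustration only.

```r
data("USArrests")
df_scaled <- scale(na.omit(USArrests))

# Column means should be (numerically) zero, standard deviations one
col_means <- colMeans(df_scaled)
col_sds   <- apply(df_scaled, 2, sd)
round(col_means, 10)
round(col_sds, 10)

# Manual standardization: center each column, then divide by its sd
raw       <- as.matrix(USArrests)
centered  <- sweep(raw, 2, colMeans(raw), "-")
df_manual <- sweep(centered, 2, apply(raw, 2, sd), "/")

# Compare values only, ignoring the attributes that scale() attaches
same_result <- isTRUE(all.equal(df_manual, df_scaled, check.attributes = FALSE))
same_result
```

Note that scale() returns a numeric matrix rather than a data frame, with the centering and scaling values stored as attributes.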
Required R Packages
In this book, we’ll mainly use the following R packages:
- cluster for computing clustering algorithms, and
- factoextra for ggplot2-based elegant visualization of clustering results. The official online documentation is available at: https://rpkgs.datanovia.com/factoextra/.
factoextra contains many functions for cluster analysis and visualization, including:
| Function | Description |
|---|---|
| dist(fviz_dist, get_dist) | Distance Matrix Computation and Visualization |
| get_clust_tendency | Assessing Clustering Tendency |
| fviz_nbclust(fviz_gap_stat) | Determining the Optimal Number of Clusters |
| fviz_dend | Enhanced Visualization of Dendrogram |
| fviz_cluster | Visualize Clustering Results |
| fviz_mclust | Visualize Model-based Clustering Results |
| fviz_silhouette | Visualize Silhouette Information from Clustering |
| hcut | Computes Hierarchical Clustering and Cut the Tree |
| hkmeans | Hierarchical k-means clustering |
| eclust | Visual enhancement of clustering analysis |
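To give a feel for how a few of these functions fit together, here is a minimal sketch that assumes both packages are already installed. It computes and plots a distance matrix, then cuts a hierarchical tree into clusters. The variable names and the choice of k = 4 are arbitrary illustrations, not values recommended by this chapter.

```r
library(cluster)
library(factoextra)

data("USArrests")
df_scaled <- scale(na.omit(USArrests))

# Distance matrix between observations, then a heatmap of it
res_dist <- get_dist(df_scaled, method = "euclidean")
fviz_dist(res_dist)

# Hierarchical clustering cut into k groups (k = 4 is an arbitrary choice)
res_hc <- hcut(df_scaled, k = 4)
table(res_hc$cluster)  # number of states per cluster

# Dendrogram of the same result
fviz_dend(res_hc)
```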
To install the two packages, type this:
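A typical way to do this, assuming both packages are installed from CRAN with the default repository settings:

```r
# Install both packages from CRAN (only needed once per R installation)
install.packages(c("cluster", "factoextra"))

# Load them in each new R session
library(cluster)
library(factoextra)
```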
This chapter introduced how to prepare your data for cluster analysis and described the essential R packages for cluster analysis.