Practical Guide to Cluster Analysis

In this chapter, we start by presenting the data format and preparation for cluster analysis. Next, we introduce two main R packages - cluster and factoextra - for computing and visualizing clusters.
To perform a cluster analysis in R, the data should generally be prepared as follows:
- Rows are observations (individuals) and columns are variables.
- Any missing value in the data must be removed or estimated.
- The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables such that they have mean zero and standard deviation one. Read more about data standardization in chapter @ref(clustering-distance-measures).
Here, we’ll use the built-in R data set “USArrests”, which contains statistics on arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percent of the population living in urban areas.
data("USArrests") # Load the data set
df <- USArrests   # Use df as shorter name
- To remove any missing value that might be present in the data, type this:
df <- na.omit(df)
- As we don’t want the clustering algorithm to depend on an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale():
df <- scale(df)
head(df, n = 3)
##          Murder Assault UrbanPop     Rape
## Alabama  1.2426   0.783   -0.521 -0.00342
## Alaska   0.5079   1.107   -1.212  2.48420
## Arizona  0.0716   1.479    0.999  1.04288
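As a quick sanity check, scale() should leave every variable with mean zero and standard deviation one. You can verify this directly on the standardized data:

```r
# Standardize the USArrests data and verify the result
df <- scale(USArrests)

# Column means should be ~0 (up to floating-point rounding error)
round(colMeans(df), 10)

# Column standard deviations should all be exactly 1
apply(df, 2, sd)
```

If a column does not come out with mean 0 and SD 1, the data was probably not numeric or contained missing values, so run na.omit() first.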
Required R Packages
In this book, we’ll use mainly the following R packages:
- cluster for computing clustering algorithms, and
- factoextra for elegant, ggplot2-based visualization of clustering results. The official online documentation is available at: https://rpkgs.datanovia.com/factoextra/.
factoextra contains many functions for cluster analysis and visualization, including:
| Functions | Description |
|---|---|
| get_dist, fviz_dist | Distance matrix computation and visualization |
| get_clust_tendency | Assessing clustering tendency |
| fviz_nbclust, fviz_gap_stat | Determining the optimal number of clusters |
| fviz_dend | Enhanced visualization of dendrograms |
| fviz_cluster | Visualize clustering results |
| fviz_mclust | Visualize model-based clustering results |
| fviz_silhouette | Visualize silhouette information from clustering |
| hcut | Compute hierarchical clustering and cut the tree |
| hkmeans | Hierarchical k-means clustering |
| eclust | Visual enhancement of clustering analysis |
To install the two packages, type this:

install.packages(c("cluster", "factoextra"))
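Once the packages are installed, a typical session chains the preparation steps above with a clustering algorithm. The sketch below uses base R's kmeans() together with silhouette() from the cluster package; the choice of 4 clusters is purely illustrative, and later chapters cover how to choose the number of clusters properly:

```r
library(cluster)  # provides silhouette(); cluster ships with every R installation

# Prepare the data: remove missing values, then standardize
df <- scale(na.omit(USArrests))

set.seed(123)  # k-means uses random starting centers, so fix the seed
km <- kmeans(df, centers = 4, nstart = 25)

# Average silhouette width: a rough measure of clustering quality
# (values near 1 = well separated, near 0 = overlapping clusters)
sil <- silhouette(km$cluster, dist(df))
mean(sil[, "sil_width"])
```

With factoextra installed, the resulting partition can be displayed with fviz_cluster(km, data = df).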
This chapter introduces how to prepare your data for cluster analysis and describes the essential R package for cluster analysis.