# Data Clustering Basics

## Data Preparation and R Packages for Cluster Analysis

In this chapter, we start by presenting the data format and preparation for cluster analysis. Next, we introduce two main R packages - cluster and factoextra - for computing and visualizing clusters.

#### Related Book

Practical Guide to Cluster Analysis

## Data preparation

To perform a cluster analysis in R, the data should generally be prepared as follows:

1. Rows are observations (individuals) and columns are variables.
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have a mean of zero and a standard deviation of one. Read more about data standardization in the chapter on clustering distance measures.
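The first two requirements can be illustrated on a small example. The following is only a minimal sketch: the data frame `toy` and its values are hypothetical, and median imputation is just one simple way to "estimate" a missing value, chosen here as an assumption rather than a recommendation from this chapter.

```r
# Hypothetical toy data: rows are observations, columns are variables
toy <- data.frame(
  height = c(170, 165, NA, 180),
  weight = c(65, NA, 72, 80)
)

# Count the missing values in each variable
n_missing <- colSums(is.na(toy))
print(n_missing)

# Option 1: remove every row that contains a missing value
toy_removed <- na.omit(toy)

# Option 2 (assumption): estimate each missing value by the column median
toy_imputed <- as.data.frame(lapply(toy, function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}))

print(nrow(toy_removed))   # number of complete rows that remain
print(anyNA(toy_imputed))  # FALSE when every gap has been filled
```

Removal is simpler but discards observations, while estimation keeps every row at the cost of introducing values that were never measured.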

Here, we’ll use the built-in R data set “USArrests”, which contains the number of arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percentage of the population living in urban areas.

    data("USArrests")  # Load the data set
    df <- USArrests    # Use df as shorter name

To remove any missing values that might be present in the data, type this:

    df <- na.omit(df)

As we don’t want the clustering algorithm to depend on an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale():

    df <- scale(df)
    head(df, n = 3)
    ##         Murder Assault UrbanPop     Rape
    ## Alabama 1.2426   0.783   -0.521 -0.00342
    ## Alaska  0.5079   1.107   -1.212  2.48420
    ## Arizona 0.0716   1.479    0.999  1.04288
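
To see what scale() does, the standardization can be reproduced by hand. The sketch below is illustrative only; the object names `x`, `manual`, and `scaled` are introduced here and are not part of the original text. It subtracts each column mean, divides by each column standard deviation, and compares the result with the output of scale().

```r
data("USArrests")
x <- as.matrix(na.omit(USArrests))

# Standardize by hand: subtract the column mean, then divide by the column sd
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# Standardize with the built-in function
scaled <- scale(x)

# The two results should agree up to floating-point error
print(all.equal(manual, scaled, check.attributes = FALSE))

# After scaling, each column has mean (numerically) zero and sd one
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

After this step, a difference of one unit means "one standard deviation" in every column, so no single variable dominates distance computations merely because of its measurement scale.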

## Required R Packages

In this book, we’ll mainly use the following R packages:

• cluster for computing clustering algorithms, and
• factoextra for ggplot2-based elegant visualization of clustering results. The official online documentation is available at: https://rpkgs.datanovia.com/factoextra/.

factoextra contains many functions for cluster analysis and visualization, including:

| Functions | Description |
|---|---|
| dist(fviz_dist, get_dist) | Distance Matrix Computation and Visualization |
| get_clust_tendency | Assessing Clustering Tendency |
| fviz_nbclust(fviz_gap_stat) | Determining the Optimal Number of Clusters |
| fviz_dend | Enhanced Visualization of Dendrogram |
| fviz_cluster | Visualize Clustering Results |
| fviz_mclust | Visualize Model-based Clustering Results |
| fviz_silhouette | Visualize Silhouette Information from Clustering |
| hcut | Computes Hierarchical Clustering and Cut the Tree |
| hkmeans | Hierarchical k-means clustering |
| eclust | Visual enhancement of clustering analysis |
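
As a brief illustration of the first row of the table, the sketch below computes a distance matrix on the scaled USArrests data and, if factoextra happens to be installed, visualizes it. This is a minimal example under stated assumptions: the object names are ours, the Euclidean method is chosen only for illustration, and the factoextra part is skipped when the package is not available.

```r
df_scaled <- scale(USArrests)

# Base R: Euclidean distances between the 50 states
d <- dist(df_scaled, method = "euclidean")
print(round(as.matrix(d)[1:3, 1:3], 2))

# factoextra equivalents, run only if the package is installed
if (requireNamespace("factoextra", quietly = TRUE)) {
  d_fx <- factoextra::get_dist(df_scaled, method = "euclidean")
  p <- factoextra::fviz_dist(d_fx)  # heatmap of the distance matrix
  print(p)
}
```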

To install the two packages, type this:

    install.packages(c("cluster", "factoextra"))
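
Before installing, it can be useful to check which of the two packages are already present. The following is a small sketch, not part of the original instructions; the variable names `pkgs` and `installed` are introduced here for illustration, and the installation line is left commented out so that nothing is downloaded unintentionally.

```r
pkgs <- c("cluster", "factoextra")

# TRUE for each package that is already installed and loadable
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Uncomment to install only the packages that are missing
# install.packages(pkgs[!installed])
```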

## Summary

This chapter introduces how to prepare your data for cluster analysis and describes the essential R packages for cluster analysis.

## Comments

• Alexis Idlette-Wilson

This site is awesome. Thank you!

• PS

Hi,

I love this site. It really is helping me out in a “Cluster Analysis” project.

I wanted to understand what kinds of techniques should be used to perform clustering on very large datasets (my data set has about 3 million rows). I am stuck because functions like “get_clust_tendency”, and even the kmeans and hclust algorithms, throw a “cannot allocate vector of 17000 Gb” error.

Is there a better way to approach this problem with clustering on big datasets?

• Julian

Dear Dr Kassambara,
as many others have already said – thank you very much for this great site! It is indeed a resource I see myself coming back to again and again.
Regarding data preprocessing, I have been wondering how to deal with skewed data – should some form of power transformation be applied to get them into a more “Gaussian” shape, or are different distance metrics better suited than the Euclidean distance, or does it not matter in the end?