**CLARA** (Clustering Large Applications, (Kaufman and Rousseeuw 1990)) is an extension to k-medoids (PAM) methods to deal with data containing a large number of objects (more than several thousand observations) in order to reduce computing time and RAM storage problem. This is achieved using the sampling approach.

In this article, you will learn:

- The basic steps of
**CLARA algorithm** - Examples of computing
**CLARA in R**software using practical examples

Contents:

#### Related Book

Practical Guide to Cluster Analysis in R## CLARA concept

Instead of finding medoids for the entire data set, CLARA considers a small sample of the data with fixed size (*sampsize*) and applies the PAM algorithm to generate an optimal set of medoids for the sample. The quality of resulting medoids is measured by the average dissimilarity between every object in the entire data set and the medoid of its cluster, defined as the cost function.

CLARA repeats the sampling and clustering processes a pre-specified number of times in order to minimize the sampling bias. The final clustering results correspond to the set of medoids with the minimal cost. The CLARA algorithm is summarized in the next section.

## CLARA Algorithm

The algorithm is as follow:

- Create randomly, from the original dataset, multiple subsets with fixed size (sampsize)
- Compute PAM algorithm on each subset and choose the corresponding k representative objects (medoids). Assign each observation of the entire data set to the closest medoid.
- Calculate the mean (or the sum) of the dissimilarities of the observations to their closest medoid. This is used as a measure of the goodness of the clustering.
- Retain the sub-dataset for which the mean (or sum) is minimal. A further analysis is carried out on the final partition.

Note that, each sub-data set is forced to contain the medoids obtained from the best sub-data set until then. Randomly drawn observations are added to this set until sampsize has been reached.

## Computing CLARA in R

### Data format and preparation

To compute the CLARA algorithm in R, the data should be prepared as indicated in Chapter data preparation and R packages.

Here, we’ll generate use a random data set. To make the result reproducible, we start by using the function *set.seed*().

```
set.seed(1234)
# Generate 500 objects, divided into 2 clusters.
df <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
cbind(rnorm(300,50,8), rnorm(300,50,8)))
# Specify column and row names
colnames(df) <- c("x", "y")
rownames(df) <- paste0("S", 1:nrow(df))
# Previewing the data
head(df, nrow = 6)
```

```
## x y
## S1 -9.66 3.88
## S2 2.22 5.57
## S3 8.68 1.48
## S4 -18.77 5.61
## S5 3.43 2.49
## S6 4.05 6.08
```

### Required R packages and functions

The function *clara*() [*cluster* package] can be used to compute *CLARA*. The simplified format is as follow:

```
clara(x, k, metric = "euclidean", stand = FALSE,
samples = 5, pamLike = FALSE)
```

**x**: a numeric data matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. Missing values (NAs) are allowed.**k**: the number of clusters.**metric**: the distance metrics to be used. Available options are “euclidean” and “manhattan”. Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. Read more on distance measures (Chapter ). Note that, manhattan distance is less sensitive to outliers.**stand**: logical value; if true, the variables (columns) in x are standardized before calculating the dissimilarities. Note that, it’s recommended to standardize variables before clustering.**samples**: number of samples to be drawn from the data set. Default value is 5 but it’s recommended a much larger value.**pamLike**: logical indicating if the same algorithm in the**pam**() function should be used. This should be always true.

To create a beautiful graph of the clusters generated with the *pam*() function, will use the *factoextra* package.

- Installing required packages:

`install.packages(c("cluster", "factoextra"))`

- Loading the packages:

```
library(cluster)
library(factoextra)
```

### Estimating the optimal number of clusters

To estimate the optimal number of clusters in your data, it’s possible to use the average silhouette method as described in PAM clustering chapter (Chapter @ref(k-medoids)). The R function *fviz_nbclust*() [*factoextra* package] provides a solution to facilitate this step.

Claim Your Membership Now.

From the plot, the suggested number of clusters is 2. In the next section, we’ll classify the observations into 2 clusters.

### Computing CLARA

The R code below computes PAM algorithm with k = 2:

```
# Compute CLARA
clara.res <- clara(df, 2, samples = 50, pamLike = TRUE)
# Print components of clara.res
print(clara.res)
```

```
## Call: clara(x = df, k = 2, samples = 50, pamLike = TRUE)
## Medoids:
## x y
## S121 -1.53 1.15
## S455 48.36 50.23
## Objective function: 9.88
## Clustering vector: Named int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "names")= chr [1:500] "S1" "S2" "S3" "S4" "S5" "S6" "S7" ...
## Cluster sizes: 200 300
## Best sample:
## [1] S37 S49 S54 S63 S68 S71 S76 S80 S82 S101 S103 S108 S109 S118
## [15] S121 S128 S132 S138 S144 S162 S203 S210 S216 S231 S234 S249 S260 S261
## [29] S286 S299 S304 S305 S312 S315 S322 S350 S403 S450 S454 S455 S456 S465
## [43] S488 S497
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
```

The output of the function *clara*() includes the following components:

**medoids**: Objects that represent clusters**clustering**: a vector containing the cluster number of each object**sample**: labels or case numbers of the observations in the best sample, that is, the sample used by the clara algorithm for the final partition.

If you want to add the point classifications to the original data, use this:

```
dd <- cbind(df, cluster = clara.res$cluster)
head(dd, n = 4)
```

```
## x y cluster
## S1 -9.66 3.88 1
## S2 2.22 5.57 1
## S3 8.68 1.48 1
## S4 -18.77 5.61 1
```

You can access to the results returned by *clara*() as follow:

```
# Medoids
clara.res$medoids
```

```
## x y
## S121 -1.53 1.15
## S455 48.36 50.23
```

```
# Clustering
head(clara.res$clustering, 10)
```

```
## S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
## 1 1 1 1 1 1 1 1 1 1
```

The **medoids** are S121, S455

### Visualizing CLARA clusters

To visualize the partitioning results, we’ll use the function *fviz_cluster*() [*factoextra* package]. It draws a scatter plot of data points colored by cluster numbers.

Claim Your Membership Now.

## Summary

The CLARA (Clustering Large Applications) algorithm is an extension to the PAM (Partitioning Around Medoids) clustering method for large data sets. It intended to reduce the computation time in the case of large data set.

As almost all partitioning algorithm, it requires the user to specify the appropriate number of clusters to be produced. This can be estimated using the function *fviz_nbclust* [in *factoextra* R package].

The R function *clara*() [*cluster* package] can be used to compute CLARA algorithm. The simplified format is clara(x, k, pamLike = TRUE), where “x” is the data and k is the number of clusters to be generated.

After, computing CLARA, the R function *fviz_cluster*() [*factoextra* package] can be used to visualize the results. The format is fviz_cluster(clara.res), where clara.res is the CLARA results.

## References

Kaufman, Leonard, and Peter Rousseeuw. 1990. *Finding Groups in Data: An Introduction to Cluster Analysis*.

Hi !

I would appreciatte your help for indicating:

1) if CLARA algorithm can be applied with categorical, binary and numeric columns.

2) How can I get the optimal number of clusters if my dataset has categorical, binary and numeric columns

Thanks in advance !

Javier

Javier, yes you can use this with mixed data types. CLARA is based on PAM (aka k-medoids) methods. The difference is that in traditional k-medoids methods the algorithm is applied to the whole data set at once. CLARA applies the algorithm to many samples of the data set. This technique is done when you have a large dataset because the algorithm is too computationally-intensive to handle datasets above a certain size.

Dear all,

thank you for the nice post. My question is the following:

– I have large datset (100000 obs and 20 numerical and 2 categorical variables).

-I need to cluster these observations based on the 22 variables, I have no idea how many clusters/groups I need to indicate in clara(). I guess I have to produce trees with clara() with different number of clusters and compare them.

-> the main quesitons is: based on which factor/value do I compare the different produced trees and select one?

Best regards

Aöessandro