This course presents advanced **clustering techniques**, including hierarchical k-means clustering, fuzzy clustering, model-based clustering, and density-based clustering.

### Advanced Clustering

#### Lessons

Density-based clustering (DBSCAN) is a clustering method introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers. In this chapter, we’ll describe the DBSCAN algorithm and demonstrate how to compute DBSCAN using the fpc R package.

In model-based clustering, the data are viewed as coming from a distribution that is a mixture of two or more clusters. It finds the best fit of models to the data and estimates the number of clusters. In this chapter, we illustrate model-based clustering using the R package mclust.

Fuzzy clustering is also known as a soft clustering method. Standard clustering approaches (k-means, PAM) produce partitions in which each observation belongs to only one cluster. This is known as hard clustering. In fuzzy clustering, items can be members of more than one cluster. Each item has a set of membership coefficients corresponding to the degree of being in a given cluster. In this article, we’ll describe how to compute fuzzy clustering using the R software.

This article presents the fuzzy c-means clustering algorithm and describes how to compute fuzzy clustering using the function cmeans() [in e1071 R package].
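To make the idea of membership coefficients concrete, here is a minimal base R sketch of the fuzzy c-means update rules. It is an illustration only, not the e1071 implementation that the lesson uses; the function name `fuzzy_cmeans`, its arguments, and the simulated data are all assumptions made for this example.

```r
# Minimal fuzzy c-means in base R (illustrative; NOT the e1071 cmeans() code).
# fuzzy_cmeans() and its arguments are hypothetical names for this sketch.
fuzzy_cmeans <- function(x, k, m = 2, iters = 100, tol = 1e-6) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Random initial memberships; each row is normalized to sum to 1
  u <- matrix(runif(n * k), n, k)
  u <- u / rowSums(u)
  for (it in seq_len(iters)) {
    w <- u^m
    # Centers: membership-weighted means of the observations (k x p matrix)
    centers <- t(w) %*% x / colSums(w)
    # Squared Euclidean distance from every observation to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    d2 <- pmax(d2, 1e-12)  # guard against division by zero
    # Membership update: u_ij proportional to d_ij^(-2 / (m - 1))
    inv <- d2^(-1 / (m - 1))
    u_new <- inv / rowSums(inv)
    converged <- max(abs(u_new - u)) < tol
    u <- u_new
    if (converged) break
  }
  list(membership = u, centers = centers, cluster = max.col(u))
}

# Simulated data: two groups of 30 points each (made up for illustration)
set.seed(10)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))
res <- fuzzy_cmeans(x, k = 2)
round(head(res$membership), 3)  # one membership coefficient per cluster
```

Each row of `res$membership` sums to 1, so a value close to 1 means the item sits firmly in one cluster, while values near 0.5 indicate an item shared between clusters.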

Hierarchical k-means clustering is a hybrid approach for improving k-means results. In this article, you will learn how to compute hierarchical k-means clustering in R.
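One common way to realize this hybrid idea is to use a hierarchical clustering to choose the starting centers for k-means, instead of random starts. The sketch below uses only base R and the built-in `USArrests` data. The choice of `k = 4` and of Ward linkage is an assumption for illustration, not a recommendation from the lesson.

```r
# Hierarchical k-means sketch using base R only (stats package).
x <- scale(USArrests)   # built-in dataset, standardized
k <- 4                  # illustrative number of clusters (an assumption)

# Step 1: hierarchical clustering, cut into k groups
hc <- hclust(dist(x), method = "ward.D2")
grp <- cutree(hc, k = k)

# Step 2: the mean of each group becomes an initial k-means center (k x p)
init_centers <- apply(x, 2, function(col) tapply(col, grp, mean))

# Step 3: k-means started from those centers rather than random ones
km <- kmeans(x, centers = init_centers)

# Compare the hierarchical partition with the refined k-means partition
table(hierarchical = grp, kmeans = km$cluster)
```

Because the starting centers are fixed by the tree, the k-means step gives the same result on every run.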

Thank you so much for the very clear and excellent teaching on cluster analysis! I am wondering if I want to cluster observations based on three ordered categorical variables and one continuous variable in panel data, which method should I use? I would appreciate if you would like to answer my question.

For mixed data, you can first compute a distance matrix between observations using the daisy() R function [in cluster package].

Next, you can apply hierarchical clustering on the computed distance matrix.

For example:
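A minimal sketch of these two steps, assuming a small made-up data frame with three ordered factors and one continuous column. The variable names, the simulated values, the average linkage, and `k = 3` are all illustrative assumptions, not taken from the question.

```r
library(cluster)  # provides daisy()

set.seed(42)
n <- 30
lev <- c("low", "medium", "high")

# Hypothetical mixed data: three ordered factors and one continuous variable
df <- data.frame(
  q1 = factor(sample(lev, n, replace = TRUE), levels = lev, ordered = TRUE),
  q2 = factor(sample(lev, n, replace = TRUE), levels = lev, ordered = TRUE),
  q3 = factor(sample(lev, n, replace = TRUE), levels = lev, ordered = TRUE),
  income = rnorm(n, mean = 50, sd = 10)
)

# Gower dissimilarity between observations (handles mixed column types)
d <- daisy(df, metric = "gower")

# Hierarchical clustering on the dissimilarity matrix
hc <- hclust(as.dist(d), method = "average")
groups <- cutree(hc, k = 3)  # k = 3 is an arbitrary choice for illustration
table(groups)
```

Note that the full dissimilarity matrix grows quadratically with the number of observations, so this approach suits small to moderate datasets.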

Kassambara, thank you for the quick reply. Your explanation is always clear and straightforward. As I have 20,000 observations, my first thought was to use CLARA. I will adopt hierarchical clustering as you suggested. As a beginner in cluster analysis, may I ask why hierarchical clustering is better than CLARA in my case? That is the question I need to answer when I write the method section of the paper.

When I read Fraley’s paper (2002), I liked the idea of ‘soft’ clustering, which, however, has some limitations, such as with large datasets. I don’t want to make things complicated in the first place. But in the future, after running a basic method, is it possible to apply ‘soft’ clustering in my case? Which soft clustering method do you recommend? Thank you!

Hi Connie,

My previous comment shows just one example of how to perform clustering on mixed data. Note that the CLARA algorithm doesn’t take a distance matrix as input, so you can’t apply it to a Gower distance matrix.

For soft clustering, I would suggest the fuzzy clustering method using the fanny() R function [in cluster R package]. It accepts a distance matrix as input.
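As a rough sketch, fanny() can be run on a Gower dissimilarity like this. The data frame below is simulated for illustration, and `k = 2` is an arbitrary assumption.

```r
library(cluster)  # provides daisy() and fanny()

set.seed(1)
lev <- c("low", "medium", "high")

# Hypothetical mixed data: one ordered factor and one continuous variable
df <- data.frame(
  grade = factor(sample(lev, 40, replace = TRUE), levels = lev, ordered = TRUE),
  score = c(rnorm(20, mean = 0), rnorm(20, mean = 5))
)

# Gower dissimilarity, then fuzzy clustering on the dissimilarity object
d <- daisy(df, metric = "gower")
fz <- fanny(d, k = 2, diss = TRUE)

# Membership coefficients: one row per observation, one column per cluster
head(round(fz$membership, 2))
```

The `membership` matrix holds the soft assignments, and `fz$clustering` gives the closest hard cluster for each observation.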

You might be interested in Hierarchical Clustering on Principal Components (HCPC), which can also be used for clustering mixed data.

Hi, Kassambara. Please make a post/tutorial about K-Prototype Clustering for mixed attribute and how to get the cluster accuracy.

I am working on a very large dataset (over 90 lac observations), and the dataset has both categorical and continuous variables. I tried using Gower and PAM, but it simply fails to work because the dataset is too large. I’m thinking of using the k-prototypes algorithm in the clustMixType package. Do you have any suggestions? Thanks.

You can also try the CLARA algorithm (https://www.datanovia.com/en/lessons/clara-in-r-clustering-large-applications/) for large data set.
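A minimal sketch of calling CLARA through clara() from the cluster package, assuming purely numeric data. The simulated matrix `x`, `k = 2`, and the `samples` and `sampsize` values are made up for illustration.

```r
library(cluster)  # provides clara()

set.seed(2024)
# Hypothetical numeric dataset: 20,000 rows in two well-separated groups
x <- rbind(
  matrix(rnorm(20000, mean = 0), ncol = 2),
  matrix(rnorm(20000, mean = 6), ncol = 2)
)

# CLARA runs PAM on several random subsamples, then assigns every row
# to its nearest medoid, so it never builds the full distance matrix
cl <- clara(x, k = 2, samples = 10, sampsize = 200)

table(cl$clustering)  # cluster sizes
cl$medoids            # one representative row per cluster
```

Larger `sampsize` values generally give medoids that are closer to what PAM would find on the full data, at the cost of more computation.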

For me, 90 observations is not a big dataset… But, it depends on the number of variables you have in the dataset

Thank you for your reply. But the number of observations is 90 lacs, or 9 million, not 90.