Advanced Clustering

6 Lessons

1 hour 50 mins

Free

This course presents advanced clustering techniques, including hierarchical k-means clustering, fuzzy clustering, model-based clustering, and density-based clustering.

Related Book

Practical Guide to Cluster Analysis in R

Lessons

Hierarchical K-Means Clustering: Optimize Clusters
10 mins
Alboukadel Kassambara

Hierarchical k-means clustering is a hybrid approach for improving k-means results. In this article, you will learn how to compute hierarchical k-means clustering in R.
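
The hybrid idea can be sketched with base R alone. The sketch below assumes the usual three-step recipe: cut a hierarchical tree into k groups, take each group's mean as a starting center, then run k-means from those centers. The USArrests data, k = 4, and Ward linkage are illustrative assumptions, not choices stated in this summary.

```r
# Hierarchical k-means sketch using base R only (stats package).
df <- scale(USArrests)   # standardize the variables
k <- 4                   # assumed number of clusters

# Step 1: hierarchical clustering, cut into k groups
hc <- hclust(dist(df, method = "euclidean"), method = "ward.D2")
grp <- cutree(hc, k = k)

# Step 2: mean of each variable within each hierarchical group
# (result is a k x p matrix: one row per group, one column per variable)
init.centers <- apply(df, 2, function(col) tapply(col, grp, mean))

# Step 3: k-means started from those centers instead of random ones
km <- kmeans(df, centers = init.centers)

# Compare the hierarchical groups with the final k-means clusters
table(hierarchical = grp, kmeans = km$cluster)
```

The factoextra package also provides an hkmeans() function that wraps these steps.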
Fuzzy Clustering Essentials
15 mins
Alboukadel Kassambara

Fuzzy clustering is also known as soft clustering. Standard clustering approaches (K-means, PAM) produce partitions in which each observation belongs to only one cluster; this is known as hard clustering. In fuzzy clustering, an item can be a member of more than one cluster. Each item has a set of membership coefficients corresponding to its degree of membership in each cluster. In this article, we’ll describe how to compute fuzzy clustering using the R software.
1. Fuzzy C-Means Clustering Algorithm
  10 mins
  Alboukadel Kassambara
  
  This article presents the fuzzy c-means clustering algorithm.
2. cmeans() R function: Compute Fuzzy clustering
  15 mins
  Alboukadel Kassambara
  
  This article describes how to compute fuzzy clustering using the function cmeans() [in e1071 R package].
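
As a rough illustration of what the cmeans() lesson covers, the sketch below runs fuzzy c-means on a standardized copy of the built-in iris measurements. The data set, the choice of 3 clusters, and the fuzzifier m = 2 are assumptions made for the example, not values taken from the lesson.

```r
library(e1071)

set.seed(123)                 # cmeans() starts from random centers
df <- scale(iris[, 1:4])      # numeric columns only, standardized

# Fuzzy c-means with 3 clusters; m controls how fuzzy the partition is
res.fcm <- cmeans(df, centers = 3, m = 2)

# Membership coefficients: one row per observation, one column per cluster
head(res.fcm$membership, 3)

# Hard assignment: the cluster with the highest membership for each observation
head(res.fcm$cluster)
```

Each row of the membership matrix sums to 1, which is what distinguishes this soft assignment from the single label returned by k-means.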
Model Based Clustering Essentials
30 mins
Alboukadel Kassambara

In model-based clustering, the data are viewed as coming from a distribution that is a mixture of two or more clusters. It finds the best-fitting model for the data and estimates the number of clusters. In this chapter, we illustrate model-based clustering using the R package mclust.
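
A minimal sketch of this workflow with mclust follows. It assumes the built-in faithful data set as a stand-in for the data used in the chapter, and lets Mclust() choose both the covariance model and the number of mixture components.

```r
library(mclust)

df <- scale(faithful)     # two numeric variables, standardized

# Fit Gaussian mixture models and keep the best one according to BIC
mc <- Mclust(df)

summary(mc)               # selected model and cluster sizes
mc$modelName              # covariance structure of the selected model
mc$G                      # estimated number of clusters

head(mc$z, 3)             # probability of belonging to each cluster
head(mc$classification)   # most probable cluster for each observation
```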
DBSCAN: Density-Based Clustering Essentials
30 mins
Alboukadel Kassambara

Density-based clustering (DBSCAN) is a partitioning method introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers. In this chapter, we’ll describe the DBSCAN algorithm and demonstrate how to compute DBSCAN using the fpc R package.
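
The sketch below shows one way to call dbscan() from fpc. The toy data (two concentric rings plus scattered outliers) and the values eps = 0.4 and MinPts = 5 are assumptions chosen for illustration; they are not taken from the chapter and would need tuning on real data.

```r
library(fpc)

set.seed(123)

# Toy data (hypothetical): two concentric rings of 200 points each
theta <- rep(seq(0, 2 * pi, length.out = 201)[-1], times = 2)
r <- rep(c(1, 3), each = 200) + rnorm(400, sd = 0.1)
rings <- cbind(x = r * cos(theta), y = r * sin(theta))

# A few scattered points that belong to neither ring
noise <- cbind(x = runif(20, -4, 4), y = runif(20, -4, 4))
df <- rbind(rings, noise)

# eps: neighborhood radius; MinPts: minimum neighbors for a core point
db <- fpc::dbscan(df, eps = 0.4, MinPts = 5)

# Cluster 0 collects the points labelled as noise
table(db$cluster)
```

Ring-shaped groups like these are a case where k-means typically struggles, because it favors compact, roughly spherical clusters.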

Comments (10)

Connie

05 Dec 2018

Thank you so much for the very clear and excellent teaching on cluster analysis! If I want to cluster observations based on three ordered categorical variables and one continuous variable in panel data, which method should I use? I would appreciate it if you could answer my question.

- Kassambara
  
  05 Dec 2018
  For mixed data, you can first compute a distance matrix between observations using the daisy() R function [in cluster package].
  
  Next, you can apply hierarchical clustering on the computed distance matrix.
  
  For example:
  
      library(cluster)
      library(factoextra)

      # Load data
      data(flower)
      head(flower, 3)

      # Compute the Gower distance matrix and visualize it
      gower.dist <- daisy(flower, metric = "gower")
      fviz_dist(gower.dist)

      # Perform agglomerative hierarchical clustering
      hc.clust <- agnes(gower.dist)
      fviz_dend(hc.clust)
  - Connie
    
    06 Dec 2018
    
    Kassambara, thank you for the quick reply. Your explanation is always clear and straightforward. As I have 20,000 observations, my first thought was to use CLARA. I will adopt hierarchical clustering as you suggested. As a beginner in cluster analysis, may I ask why hierarchical clustering is better than CLARA in my case? That is the question I need to answer when I write the methods section of the paper.
    
    When I read Fraley’s paper (2002), I liked the idea of ‘soft’ clustering, which however has some limitations, such as with large datasets. I don’t want to make things complicated in the first place. But in the future, after running a basic method, is it possible to apply ‘soft’ clustering in my case? Which soft clustering method do you recommend? Thank you!
    
    - Kassambara
      
      08 Dec 2018
      
      Hi Connie,
      
      My previous comment just shows an example of how to perform clustering on mixed data. Note that the CLARA algorithm doesn’t take a distance matrix as input, so you can’t apply it to a Gower distance.
      
      For soft clustering, I would suggest the fuzzy clustering method using the fanny() R function [in cluster R package]. It supports a distance matrix as input.
      
      You might also be interested in Hierarchical Clustering on Principal Components (HCPC), which can also be used for clustering mixed data.
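      
      As an illustration of the fanny() suggestion, here is a minimal sketch. It assumes the flower data from the cluster package as a stand-in for mixed data; k = 3 and memb.exp = 1.5 are illustrative choices, not recommendations.
      
      ```r
      library(cluster)
      
      # Mixed-type example data and its Gower dissimilarity matrix
      data(flower)
      gower.dist <- daisy(flower, metric = "gower")
      
      # Fuzzy clustering directly on the dissimilarities
      # (k and memb.exp are assumptions for this example)
      res.fanny <- fanny(gower.dist, k = 3, memb.exp = 1.5)
      
      head(res.fanny$membership, 3)   # membership coefficients per cluster
      head(res.fanny$clustering)      # nearest crisp cluster per observation
      ```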
      
      - Noven
        
        30 Jan 2019
        
        Hi, Kassambara. Please make a post/tutorial about K-Prototype clustering for mixed attributes and how to get the cluster accuracy.
Poorwa_kunwar

16 Jan 2019

I am working on a very large dataset (over 90 lac observations), and the dataset has both categorical and continuous variables. I tried using Gower and PAM, but it simply fails to work because the dataset is too large. I’m thinking of using the k-prototypes algorithm in the clustMixType package. Do you have any suggestions? Thanks.

- Kassambara
  
  16 Jan 2019
  
  You can also try the CLARA algorithm (https://www.datanovia.com/en/lessons/clara-in-r-clustering-large-applications/) for large data sets.
  
  For me, 90 observations is not a big dataset… But it depends on the number of variables you have in the dataset.
  
  - Poorwa_kunwar
    
    16 Jan 2019
    
    Thank you for your reply. But the number of observations is 90 lacs (9 million), not 90.
    
Yulin

25 Aug 2019

Excellent course! Many thanks for sharing the knowledge!

Hema Latha Krishna Nair

20 Jan 2021

Hi,
It would be helpful if anyone could explain how I can use K-means clustering in a situation where I have more than 2 dimensions/arguments for evaluation. I would like to group the observations into 5 clusters (K = 5), but I am afraid basic K-means only takes 2 dimensions for the distance measure. Any best practice?
