<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>
	Comments on: Advanced Clustering	</title>
	<atom:link href="https://www.datanovia.com/en/courses/advanced-clustering/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.datanovia.com/en/courses/advanced-clustering/</link>
	<description>Data Mining and Statistics for Decision Support</description>
	<lastBuildDate>Wed, 20 Jan 2021 06:41:25 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.2</generator>
	<item>
		<title>
		By: Hema Latha Krishna Nair		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-21600</link>

		<dc:creator><![CDATA[Hema Latha Krishna Nair]]></dc:creator>
		<pubDate>Wed, 20 Jan 2021 06:41:25 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-21600</guid>

					<description><![CDATA[Hi,
It would be helpful if anyone could explain how I may use k-means clustering when I have more than 2 dimensions/variables for evaluation. I would like to group the observations into 5 clusters (k = 5), but I am afraid basic k-means only takes 2 dimensions for the distance measure. Any best practices?]]></description>
			<content:encoded><![CDATA[<p>Hi,<br />
It would be helpful if anyone could explain how I may use k-means clustering when I have more than 2 dimensions/variables for evaluation. I would like to group the observations into 5 clusters (k = 5), but I am afraid basic k-means only takes 2 dimensions for the distance measure. Any best practices?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Yulin		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-2226</link>

		<dc:creator><![CDATA[Yulin]]></dc:creator>
		<pubDate>Sat, 24 Aug 2019 23:42:36 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-2226</guid>

					<description><![CDATA[Excellent course! Many thanks for sharing the knowledge!]]></description>
			<content:encoded><![CDATA[<p>Excellent course! Many thanks for sharing the knowledge!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Noven		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-1665</link>

		<dc:creator><![CDATA[Noven]]></dc:creator>
		<pubDate>Wed, 30 Jan 2019 00:54:51 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-1665</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.datanovia.com/en/courses/advanced-clustering/#comment-1561&quot;&gt;kassambara&lt;/a&gt;.

Hi, Kassambara. Please make a post/tutorial about k-prototypes clustering for mixed attributes and how to assess the cluster accuracy.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.datanovia.com/en/courses/advanced-clustering/#comment-1561">kassambara</a>.</p>
<p>Hi, Kassambara. Please make a post/tutorial about k-prototypes clustering for mixed attributes and how to assess the cluster accuracy.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: poorwa_kunwar		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-1633</link>

		<dc:creator><![CDATA[poorwa_kunwar]]></dc:creator>
		<pubDate>Wed, 16 Jan 2019 21:18:23 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-1633</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.datanovia.com/en/courses/advanced-clustering/#comment-1632&quot;&gt;kassambara&lt;/a&gt;.

Thank you for your reply. But the number of observations is 90 lakhs (9 million), not 90.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.datanovia.com/en/courses/advanced-clustering/#comment-1632">kassambara</a>.</p>
<p>Thank you for your reply. But the number of observations is 90 lakhs (9 million), not 90.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: kassambara		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-1632</link>

		<dc:creator><![CDATA[kassambara]]></dc:creator>
		<pubDate>Wed, 16 Jan 2019 21:10:13 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-1632</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.datanovia.com/en/courses/advanced-clustering/#comment-1629&quot;&gt;poorwa_kunwar&lt;/a&gt;.

You can also try the CLARA algorithm (https://www.datanovia.com/en/lessons/clara-in-r-clustering-large-applications/) for large data sets.

To me, 90 observations is not a big dataset... but it also depends on the number of variables in the dataset.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.datanovia.com/en/courses/advanced-clustering/#comment-1629">poorwa_kunwar</a>.</p>
<p>You can also try the CLARA algorithm (<a href="https://www.datanovia.com/en/lessons/clara-in-r-clustering-large-applications/" rel="ugc">https://www.datanovia.com/en/lessons/clara-in-r-clustering-large-applications/</a>) for large data sets.</p>
<p>To me, 90 observations is not a big dataset&#8230; but it also depends on the number of variables in the dataset.</p>
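<p>As a minimal sketch of how clara() can be called (using the built-in iris data purely for illustration; k = 3 and samples = 50 are example settings, not recommendations):</p>
<pre class = "r_code">
library(cluster)

# CLARA (Clustering Large Applications) works on the raw data matrix,
# drawing repeated sub-samples instead of computing the full distance
# matrix, which is what makes it usable on large data sets
cl <- clara(iris[, -5], k = 3, samples = 50)

# Cluster sizes
table(cl$clustering)
</pre>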
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: poorwa_kunwar		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-1629</link>

		<dc:creator><![CDATA[poorwa_kunwar]]></dc:creator>
		<pubDate>Wed, 16 Jan 2019 12:05:23 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-1629</guid>

					<description><![CDATA[I am working on a very large dataset (over 90 lakh observations), and the dataset has both categorical and continuous variables. I tried using the Gower distance with PAM, but it simply fails to work because the dataset is too large. I&#039;m thinking of using the k-prototypes algorithm in the clustMixType package. Do you have any suggestions? Thanks.]]></description>
			<content:encoded><![CDATA[<p>I am working on a very large dataset (over 90 lakh observations), and the dataset has both categorical and continuous variables. I tried using the Gower distance with PAM, but it simply fails to work because the dataset is too large. I&#8217;m thinking of using the k-prototypes algorithm in the clustMixType package. Do you have any suggestions? Thanks.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: kassambara		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-1561</link>

		<dc:creator><![CDATA[kassambara]]></dc:creator>
		<pubDate>Sat, 08 Dec 2018 06:47:54 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-1561</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.datanovia.com/en/courses/advanced-clustering/#comment-1558&quot;&gt;Connie&lt;/a&gt;.

Hi Connie,

My previous comment shows just one example of how to perform clustering on mixed data. Note that the CLARA algorithm doesn&#039;t take a distance matrix as input, so you can&#039;t apply it to the Gower distance.

For soft clustering, I would suggest the fuzzy clustering method using the fanny() R function [in the cluster R package]. It supports a distance matrix as input.

You might also be interested in &lt;a href=&quot;http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/117-hcpc-hierarchical-clustering-on-principal-components-essentials/&quot; target=&quot;_blank&quot; rel=&quot;noopener nofollow&quot;&gt;Hierarchical Clustering on Principal Components (HCPC)&lt;/a&gt;, which can also be used for clustering mixed data.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.datanovia.com/en/courses/advanced-clustering/#comment-1558">Connie</a>.</p>
<p>Hi Connie,</p>
<p>My previous comment shows just one example of how to perform clustering on mixed data. Note that the CLARA algorithm doesn&#8217;t take a distance matrix as input, so you can&#8217;t apply it to the Gower distance.</p>
<p>For soft clustering, I would suggest the fuzzy clustering method using the fanny() R function [in the cluster R package]. It supports a distance matrix as input.</p>
<p>You might also be interested in <a href="http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/117-hcpc-hierarchical-clustering-on-principal-components-essentials/" target="_blank" rel="noopener nofollow">Hierarchical Clustering on Principal Components (HCPC)</a>, which can also be used for clustering mixed data.</p>
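<p>For instance, a minimal sketch of fanny() applied to a Gower distance matrix (using the flower data from the cluster package for illustration; k = 3 is just an example value):</p>
<pre class = "r_code">
library(cluster)

# Gower distance on mixed (categorical + numeric) data
data(flower)
gower.dist <- daisy(flower, metric = "gower")

# Fuzzy clustering directly on the distance matrix
fz <- fanny(gower.dist, k = 3, diss = TRUE)

# Membership degrees (each row sums to 1) and the closest hard partition
head(fz$membership, 3)
fz$clustering
</pre>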
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Connie		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-1558</link>

		<dc:creator><![CDATA[Connie]]></dc:creator>
		<pubDate>Thu, 06 Dec 2018 02:38:01 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-1558</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.datanovia.com/en/courses/advanced-clustering/#comment-1555&quot;&gt;kassambara&lt;/a&gt;.

Kassambara, thank you for the quick reply. Your explanation is always clear and straightforward. As I have 20,000 observations, my first thought was to use CLARA. I will adopt hierarchical clustering as you suggested. As a beginner to cluster analysis, may I ask why hierarchical clustering is better than CLARA in my case? That is the question I need to answer when I write the method section of the paper.

When I read Fraley&#039;s paper (2002), I liked the idea of &#039;soft&#039; clustering, which, however, has some limitations, such as with large datasets. I don&#039;t want to make things complicated in the first place. But in the future, after running the basic method, would it be possible to apply &#039;soft&#039; clustering in my case? Which soft clustering method would you recommend? Thank you!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.datanovia.com/en/courses/advanced-clustering/#comment-1555">kassambara</a>.</p>
<p>Kassambara, thank you for the quick reply. Your explanation is always clear and straightforward. As I have 20,000 observations, my first thought was to use CLARA. I will adopt hierarchical clustering as you suggested. As a beginner to cluster analysis, may I ask why hierarchical clustering is better than CLARA in my case? That is the question I need to answer when I write the method section of the paper.</p>
<p>When I read Fraley&#8217;s paper (2002), I liked the idea of &#8216;soft&#8217; clustering, which, however, has some limitations, such as with large datasets. I don&#8217;t want to make things complicated in the first place. But in the future, after running the basic method, would it be possible to apply &#8216;soft&#8217; clustering in my case? Which soft clustering method would you recommend? Thank you!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: kassambara		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-1555</link>

		<dc:creator><![CDATA[kassambara]]></dc:creator>
		<pubDate>Wed, 05 Dec 2018 14:00:28 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-1555</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.datanovia.com/en/courses/advanced-clustering/#comment-1554&quot;&gt;Connie&lt;/a&gt;.

For mixed data, you can first compute a distance matrix between observations using the daisy() R function [in the cluster package].

Next, you can apply hierarchical clustering on the computed distance matrix.

For example:

&lt;pre class = &quot;r_code&quot;&gt;
library(cluster)
library(factoextra)

# Load data
data(flower)
head(flower, 3)

# Compute the Gower distance matrix and visualize it
gower.dist &lt;- daisy(flower, metric = &quot;gower&quot;)
fviz_dist(gower.dist)

# Perform agglomerative hierarchical clustering
hc.clust &lt;- agnes(gower.dist)
fviz_dend(hc.clust)
&lt;/pre&gt;]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.datanovia.com/en/courses/advanced-clustering/#comment-1554">Connie</a>.</p>
<p>For mixed data, you can first compute a distance matrix between observations using the daisy() R function [in the cluster package].</p>
<p>Next, you can apply hierarchical clustering on the computed distance matrix.</p>
<p>For example:</p>
<pre class = "r_code">
library(cluster)
library(factoextra)

# Load data
data(flower)
head(flower, 3)

# Compute the Gower distance matrix and visualize it
gower.dist <- daisy(flower, metric = "gower")
fviz_dist(gower.dist)

# Perform agglomerative hierarchical clustering
hc.clust <- agnes(gower.dist)
fviz_dend(hc.clust)
</pre>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Connie		</title>
		<link>https://www.datanovia.com/en/courses/advanced-clustering/#comment-1554</link>

		<dc:creator><![CDATA[Connie]]></dc:creator>
		<pubDate>Wed, 05 Dec 2018 13:17:46 +0000</pubDate>
		<guid isPermaLink="false">https://www.datanovia.com/en/?post_type=dt_courses&#038;p=8077#comment-1554</guid>

					<description><![CDATA[Thank you so much for the very clear and excellent teaching on cluster analysis! I am wondering: if I want to cluster observations based on three ordered categorical variables and one continuous variable in panel data, which method should I use? I would appreciate it if you could answer my question.]]></description>
			<content:encoded><![CDATA[<p>Thank you so much for the very clear and excellent teaching on cluster analysis! I am wondering: if I want to cluster observations based on three ordered categorical variables and one continuous variable in panel data, which method should I use? I would appreciate it if you could answer my question.</p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
