Comments on: Model Based Clustering Essentials

By: Tayesh

Tayesh — Thu, 22 Oct 2020 12:31:31 +0000

Hi!

Thank you very much for a very clear and hands-on post. If I understand correctly, model-based clustering assumes that the covariates are normally distributed. What if one or more of your variables follow another distribution, say Poisson? What do you do then? Looking forward to hearing from you.

thanks,
Tayesh

By: Mahesh

Mahesh — Tue, 20 Aug 2019 13:25:17 +0000

Hi Kas,

How can I identify the observations that fall outside all of the cluster ellipses in the fviz_clust classification plot? Can we get the index numbers, or a data frame, of those observations? Also, how can we find the boundary of each cluster ellipse?

Thanks,
Mahesh

By: kassambara

kassambara — Wed, 31 Jul 2019 17:18:18 +0000

In reply to Teadora Tyler.

Thank you for your positive feedback, highly appreciated!

Yes, you can use model-based clustering on two-dimensional data sets:

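# Data preparation
data("geyser", package = "MASS")
df <- scale(geyser)  # standardize the two variables

# Model-based clustering
library(mclust)
library(factoextra)
mc <- Mclust(df)  # fit Gaussian mixture models; the best one is selected by BIC
fviz_mclust(mc, "uncertainty", palette = "jco")  # plot classification uncertainty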

You might find the following article useful for evaluating and validating clustering: https://www.datanovia.com/en/courses/cluster-validation-essentials/

By: Teadora Tyler

Teadora Tyler — Wed, 31 Jul 2019 15:50:26 +0000

Thank you SO much for this to-the-point, beautiful post!
It really helped me with my first steps.

I rarely come across such an easy-to-understand and useful post on this topic that helps beginners too!

Do you think using this mclust approach is adequate for a dataset that contains only X and Y coordinates of objects? I mean it works beautifully, but is it the proper way? I can get very lost in all the possible spatial clustering methods. I was also looking at DBSCAN but mclust gives much better plots (I think).

Thanks again:)
Teadora

By: kassambara

kassambara — Mon, 12 Nov 2018 05:38:00 +0000

In reply to San Emmanuel.

Hi San Emmanuel,

Thank you for your feedback!

The choice of the (dis)similarity metric should be based on the research question and the type of dataset.

For example, Euclidean distance works well for continuous variables, while Bray-Curtis is designed for count (abundance) data. For binary presence/absence data, an index such as Jaccard is the usual choice.

For continuous data in particular, all variables are expected to be on roughly the same scale and to have comparable distributions. If yours are not, you will need to standardize or normalize them.
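As a minimal sketch of why standardization matters, the following R code compares Euclidean distances before and after scaling. The data frame and its column names (`height_cm`, `weight_g`) are made up for illustration and do not come from the article.

```r
# Hypothetical toy data: two continuous variables on very different scales
df <- data.frame(
  height_cm = c(150, 160, 170, 180),
  weight_g  = c(50000, 65000, 60000, 80000)
)

# On the raw data, Euclidean distances are driven almost entirely by weight_g
d_raw <- dist(df, method = "euclidean")

# scale() centers each column to mean 0 and rescales it to standard deviation 1
df_scaled <- scale(df)
d_scaled <- dist(df_scaled, method = "euclidean")

print(d_raw)
print(d_scaled)

# Check that each column now has mean 0 and standard deviation 1
print(round(colMeans(df_scaled), 10))
print(apply(df_scaled, 2, sd))
```

After scaling, both variables contribute comparably to the distance, which is usually what you want before clustering continuous data.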

If you want to reflect ecological differences, then Bray-Curtis will do a much better job, since it is used to quantify the compositional dissimilarity between two different sites, based on counts at each site.

The Bray–Curtis dissimilarity is often erroneously called a distance. It is not a distance since it does not satisfy triangle inequality, and should always be called a dissimilarity to avoid confusion.

Whether to use Euclidean (a metric distance) or Bray-Curtis (a semimetric) depends on your data and the way you want to handle it. Metric distances satisfy the triangle inequality (the length of any one side of a triangle is at most the sum of the lengths of the other two sides), while semimetrics do not.
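To make the triangle inequality point concrete, here is a small sketch with three made-up count vectors. The helper functions `bray_curtis()` and `euclid()` are written here for illustration only; they are not part of any package used in the article.

```r
# Bray-Curtis dissimilarity between two non-negative count vectors
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Euclidean distance between two vectors
euclid <- function(x, y) sqrt(sum((x - y)^2))

# Hypothetical counts of two species at three sites
a <- c(1, 0)
b <- c(0, 1)
m <- c(1, 1)

bc_ab <- bray_curtis(a, b)  # 2 / 2 = 1
bc_am <- bray_curtis(a, m)  # 1 / 3
bc_mb <- bray_curtis(m, b)  # 1 / 3

# Triangle inequality: d(a, b) <= d(a, m) + d(m, b)
bc_ok <- bc_ab <= bc_am + bc_mb                      # FALSE: violated
eu_ok <- euclid(a, b) <= euclid(a, m) + euclid(m, b) # TRUE: satisfied

print(c(bray_curtis = bc_ok, euclidean = eu_ok))
```

Going from `a` to `b` through `m` is "shorter" than going directly under Bray-Curtis, which cannot happen with a true metric such as Euclidean distance.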

The choice is particularly relevant when zeros are not true absences (e.g., when you sample species from a site, you never know for sure whether a species is truly absent or is present but you failed to sample it; the same applies to metals in your case).

This matters because, if your zeros are not true absences and you use Euclidean distance, the dissimilarities among sites will not describe your data well: two sites with many shared zeros will appear more similar to each other than two sites with a few shared observations. This is why, when dealing with compositional data, Bray-Curtis is more appropriate than Euclidean distance.
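The shared-zeros effect can be illustrated with a small sketch. The site vectors below are invented for the example, and `bray_curtis()` and `euclid()` are helper functions defined here, not functions from the article.

```r
# Helper functions (defined here for illustration)
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)
euclid <- function(x, y) sqrt(sum((x - y)^2))

# Hypothetical counts of six species at four sites
site_A <- c(3, 0, 0, 0, 0, 0)  # A and B share no species, only zeros
site_B <- c(0, 3, 0, 0, 0, 0)
site_C <- c(6, 6, 6, 6, 0, 0)  # C and D share two species
site_D <- c(6, 6, 0, 0, 6, 6)

eu_AB <- euclid(site_A, site_B)       # sqrt(18), about 4.24
eu_CD <- euclid(site_C, site_D)       # 12
bc_AB <- bray_curtis(site_A, site_B)  # 1 (completely dissimilar)
bc_CD <- bray_curtis(site_C, site_D)  # 0.5

# Euclidean distance ranks A-B as closer than C-D, although A and B
# have no species in common; Bray-Curtis ranks them the other way round.
print(c(eu_AB = eu_AB, eu_CD = eu_CD, bc_AB = bc_AB, bc_CD = bc_CD))
```

For real abundance tables, the `vegdist()` function in the vegan package with `method = "bray"` computes Bray-Curtis dissimilarities for all pairs of rows, assuming that package is installed.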

By: San Emmanuel

San Emmanuel — Sun, 11 Nov 2018 18:15:12 +0000

Hi Kas,

Thanks for sharing. Like Chafia, I agree that this is very helpful.

I have a quick question about the intuition around scaling versus using a distance matrix (or method). I find that in certain instances the data is scaled, and in others a distance method such as Bray-Curtis or Jaccard is used. I am referring to microbiome studies. What are your thoughts?

Thanks,

By: kassambara

kassambara — Sun, 28 Oct 2018 19:22:11 +0000

In reply to Chafia.

Hi Chafia,

Thank you very much for the feedback.
This kind of appreciation really helps and motivates us to perform well and keep delivering better content.

Thank you again.

Best regards

By: Chafia

Chafia — Sun, 28 Oct 2018 15:52:45 +0000

MERCI BEAUCOUP
THANK YOU SO MUCH

FOR THE COLORS YOU PUT ON THE DATA TO MAKE THEM
REALLY TALK TO US
I ENJOY PLOTTING AND MODELLING AND CLUSTERING
LEARNING FROM YOU HOW TO DO A BEAUTIFUL DATA ANALYSIS

THANK YOU SIR FOR SHARING THIS JOY
WISH YOU THE BEST
REGARDS