{"id":8066,"date":"2018-10-21T12:02:19","date_gmt":"2018-10-21T10:02:19","guid":{"rendered":"https:\/\/www.datanovia.com\/en\/?post_type=dt_lessons&#038;p=8066"},"modified":"2018-10-21T14:06:10","modified_gmt":"2018-10-21T12:06:10","slug":"choosing-the-best-clustering-algorithms","status":"publish","type":"dt_lessons","link":"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/","title":{"rendered":"Choosing the Best Clustering Algorithms"},"content":{"rendered":"<div id=\"rdoc\">\n<p><strong>Choosing the best clustering method<\/strong> for a given data set can be a hard task for the analyst. This article describes the R package <strong>clValid<\/strong> <span class=\"citation\">(Brock et al. 2008)<\/span>, which can be used to compare multiple clustering algorithms simultaneously in a single function call, in order to identify the best clustering approach and the optimal number of clusters.<\/p>\n<div class=\"block\">\n<p>We\u2019ll start by describing the different measures in the clValid package for comparing clustering algorithms. Next, we\u2019ll present the function <em>clValid<\/em>(). 
Finally, we\u2019ll provide R scripts for validating clustering results and comparing clustering algorithms.<\/p>\n<\/div>\n<p>Contents:<\/p>\n<div id=\"TOC\">\n<ul>\n<li><a href=\"#measures-for-comparing-clustering-algorithms\">Measures for comparing clustering algorithms<\/a><\/li>\n<li><a href=\"#compare-clustering-algorithms-in-r\">Compare clustering algorithms in R<\/a><\/li>\n<li><a href=\"#summary\">Summary<\/a><\/li>\n<li><a href=\"#references\">References<\/a><\/li>\n<\/ul>\n<\/div>\n<div class='dt-sc-hr-invisible-medium  '><\/div>\n<div class='dt-sc-ico-content type1'><div class='custom-icon' ><a href='https:\/\/www.datanovia.com\/en\/product\/practical-guide-to-cluster-analysis-in-r\/' target='_blank'><span class='fa fa-book'><\/span><\/a><\/div><h4><a href='https:\/\/www.datanovia.com\/en\/product\/practical-guide-to-cluster-analysis-in-r\/' target='_blank'> Related Book <\/a><\/h4>Practical Guide to Cluster Analysis in R<\/div>\n<div class='dt-sc-hr-invisible-medium  '><\/div>\n<div id=\"measures-for-comparing-clustering-algorithms\" class=\"section level2\">\n<h2>Measures for comparing clustering algorithms<\/h2>\n<p>The clValid package compares clustering algorithms using two types of cluster validation measures:<\/p>\n<ol style=\"list-style-type: decimal;\">\n<li><em>Internal measures<\/em>, which use intrinsic information in the data to assess the quality of the clustering. 
Internal measures include the connectivity, the silhouette coefficient and the Dunn index, as described in the chapter <a href=\"https:\/\/www.datanovia.com\/en\/lessons\/cluster-validation-statistics\/\">cluster validation statistics<\/a>.<\/li>\n<li><em>Stability measures<\/em>, a special version of internal measures, which evaluate the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time.<\/li>\n<\/ol>\n<p>Cluster stability measures include:<\/p>\n<ul>\n<li>The average proportion of non-overlap (APN)<\/li>\n<li>The average distance (AD)<\/li>\n<li>The average distance between means (ADM)<\/li>\n<li>The figure of merit (FOM)<\/li>\n<\/ul>\n<p>The APN, AD, and ADM are all based on the cross-classification table of the original clustering on the full data with the clustering based on the removal of one column.<\/p>\n<ul>\n<li>The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.<\/li>\n<li>The AD measures the average distance between observations placed in the same cluster under both cases (full data set and removal of one column).<\/li>\n<li>The ADM measures the average distance between cluster centers for observations placed in the same cluster under both cases.<\/li>\n<li>The FOM measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns.<\/li>\n<\/ul>\n<div class=\"warning\">\n<p>The APN ranges from 0 to 1, with values close to 0 corresponding to highly consistent clustering results. 
The AD, ADM and FOM take values between 0 and infinity, and smaller values are also preferred.<\/p>\n<\/div>\n<div class=\"notice\">\n<p>Note that the clValid package also provides biological validation measures, which evaluate the ability of a clustering algorithm to produce biologically meaningful clusters. A typical application is microarray or RNAseq data, where observations correspond to genes.<\/p>\n<\/div>\n<\/div>\n<div id=\"compare-clustering-algorithms-in-r\" class=\"section level2\">\n<h2>Compare clustering algorithms in R<\/h2>\n<p>We\u2019ll use the function <strong>clValid<\/strong>() [in the <em>clValid<\/em> package], whose simplified format is as follows:<\/p>\n<pre class=\"r\"><code>clValid(obj, nClust, clMethods = \"hierarchical\", \r\n        validation = \"stability\", maxitems = 600,\r\n        metric = \"euclidean\", method = \"average\")<\/code><\/pre>\n<div class=\"block\">\n<ul>\n<li><strong>obj<\/strong>: A numeric matrix or data frame. Rows are the items to be clustered and columns are samples.<\/li>\n<li><strong>nClust<\/strong>: A numeric vector specifying the numbers of clusters to be evaluated, e.g., 2:10.<\/li>\n<li><strong>clMethods<\/strong>: The clustering method(s) to be used. Available options are \u201chierarchical\u201d, \u201ckmeans\u201d, \u201cdiana\u201d, \u201cfanny\u201d, \u201csom\u201d, \u201cmodel\u201d, \u201csota\u201d, \u201cpam\u201d, \u201cclara\u201d, and \u201cagnes\u201d, with multiple choices allowed.<\/li>\n<li><strong>validation<\/strong>: The type of validation measures to be used. Allowed values are \u201cinternal\u201d, \u201cstability\u201d, and \u201cbiological\u201d, with multiple choices allowed.<\/li>\n<li><strong>maxitems<\/strong>: The maximum number of items (rows in the matrix) that can be clustered.<\/li>\n<li><strong>metric<\/strong>: The metric used to determine the distance matrix. 
Possible choices are \u201ceuclidean\u201d, \u201ccorrelation\u201d, and \u201cmanhattan\u201d.<\/li>\n<li><strong>method<\/strong>: For hierarchical clustering (hclust and agnes), the agglomeration method to be used. Available choices are \u201cward\u201d, \u201csingle\u201d, \u201ccomplete\u201d and \u201caverage\u201d.<\/li>\n<\/ul>\n<\/div>\n<p>For example, using the iris data set, the <em>clValid<\/em>() function can be applied as follows.<\/p>\n<p>We start with the internal measures, which include the connectivity, silhouette width and Dunn index. These internal measures can be computed simultaneously for multiple clustering algorithms in combination with a range of cluster numbers.<\/p>\n<pre class=\"r\"><code>library(clValid)\r\n# Iris data set:\r\n# - Remove Species column and scale\r\ndf &lt;- scale(iris[, -5])\r\n\r\n# Compute clValid\r\nclmethods &lt;- c(\"hierarchical\",\"kmeans\",\"pam\")\r\nintern &lt;- clValid(df, nClust = 2:6, \r\n              clMethods = clmethods, validation = \"internal\")\r\n# Summary\r\nsummary(intern)<\/code><\/pre>\n<pre><code>##  Length   Class    Mode \r\n##       1 clValid      S4<\/code><\/pre>\n<div class=\"success\">\n<p>The printout above shows only the class of the returned object rather than the score table. In the full summary, hierarchical clustering with two clusters performs best for each measure (connectivity, Dunn index and silhouette width). 
Regardless of the clustering algorithm, the optimal number of clusters seems to be two according to all three measures.<\/p>\n<\/div>\n<p>The stability measures can be computed as follows:<\/p>\n<pre class=\"r\"><code># Stability measures\r\nclmethods &lt;- c(\"hierarchical\",\"kmeans\",\"pam\")\r\nstab &lt;- clValid(df, nClust = 2:6, clMethods = clmethods, \r\n                validation = \"stability\")\r\n# Display only optimal Scores\r\noptimalScores(stab)<\/code><\/pre>\n<pre><code>##       Score       Method Clusters\r\n## APN 0.00327 hierarchical        2\r\n## AD  1.00429          pam        6\r\n## ADM 0.01609 hierarchical        2\r\n## FOM 0.45575          pam        6<\/code><\/pre>\n<div class=\"success\">\n<p>For the APN and ADM measures, hierarchical clustering with two clusters again gives the best score. For the other two measures (AD and FOM), PAM with six clusters has the best score.<\/p>\n<\/div>\n<\/div>\n<div id=\"summary\" class=\"section level2\">\n<h2>Summary<\/h2>\n<p>Here, we described how to compare clustering algorithms using the <em>clValid<\/em> R package.<\/p>\n<\/div>\n<div id=\"references\" class=\"section level2 unnumbered\">\n<h2>References<\/h2>\n<div id=\"refs\" class=\"references\">\n<div id=\"ref-brock2008\">\n<p>Brock, Guy, Vasyl Pihur, Susmita Datta, and Somnath Datta. 2008. \u201cclValid: An R Package for Cluster Validation.\u201d <em>Journal of Statistical Software<\/em> 25 (4): 1\u201322. <a class=\"uri\" href=\"https:\/\/www.jstatsoft.org\/v025\/i04\">https:\/\/www.jstatsoft.org\/v025\/i04<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><!--end rdoc--><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we\u2019ll start by describing the different measures in the clValid R package for comparing clustering algorithms. Next, we\u2019ll present the function clValid(). 
Finally, we\u2019ll provide R scripts for validating clustering results and comparing clustering algorithms.<\/p>\n","protected":false},"author":1,"featured_media":7799,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","class_list":["post-8066","dt_lessons","type-dt_lessons","status-publish","has-post-thumbnail","hentry"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Choosing the Best Clustering Algorithms - Datanovia<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Choosing the Best Clustering Algorithms - Datanovia\" \/>\n<meta property=\"og:description\" content=\"In this article, we\u2019ll start by describing the different measures in the clValid R package for comparing clustering algorithms. Next, we\u2019ll present the function clValid(). 
Finally, we\u2019ll provide R scripts for validating clustering results and comparing clustering algorithms.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/\" \/>\n<meta property=\"og:site_name\" content=\"Datanovia\" \/>\n<meta property=\"article:modified_time\" content=\"2018-10-21T12:06:10+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030410.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/\",\"name\":\"Choosing the Best Clustering Algorithms - 
Datanovia\",\"isPartOf\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030410.jpg\",\"datePublished\":\"2018-10-21T10:02:19+00:00\",\"dateModified\":\"2018-10-21T12:06:10+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#primaryimage\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030410.jpg\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030410.jpg\",\"width\":1024,\"height\":512},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.datanovia.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Lessons\",\"item\":\"https:\/\/www.datanovia.com\/en\/lessons\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Choosing the Best Clustering Algorithms\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"name\":\"Datanovia\",\"description\":\"Data Mining and Statistics for Decision 
Support\",\"publisher\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.datanovia.com\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\",\"name\":\"Datanovia\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"width\":98,\"height\":99,\"caption\":\"Datanovia\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Choosing the Best Clustering Algorithms - Datanovia","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/","og_locale":"en_US","og_type":"article","og_title":"Choosing the Best Clustering Algorithms - Datanovia","og_description":"In this article, we\u2019ll start by describing the different measures in the clValid R package for comparing clustering algorithms. Next, we\u2019ll present the function clValid(). 
Finally, we\u2019ll provide R scripts for validating clustering results and comparing clustering algorithms.","og_url":"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/","og_site_name":"Datanovia","article_modified_time":"2018-10-21T12:06:10+00:00","og_image":[{"width":1024,"height":512,"url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030410.jpg","type":"image\/jpeg"}],"twitter_card":"summary_large_image","twitter_misc":{"Est. reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/","url":"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/","name":"Choosing the Best Clustering Algorithms - Datanovia","isPartOf":{"@id":"https:\/\/www.datanovia.com\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#primaryimage"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#primaryimage"},"thumbnailUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030410.jpg","datePublished":"2018-10-21T10:02:19+00:00","dateModified":"2018-10-21T12:06:10+00:00","breadcrumb":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#primaryimage","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030410.jpg","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030410.jpg","width":1024,"height":512},{"@type":"BreadcrumbList","@id":"https:\/\/ww
w.datanovia.com\/en\/lessons\/choosing-the-best-clustering-algorithms\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.datanovia.com\/en\/"},{"@type":"ListItem","position":2,"name":"Lessons","item":"https:\/\/www.datanovia.com\/en\/lessons\/"},{"@type":"ListItem","position":3,"name":"Choosing the Best Clustering Algorithms"}]},{"@type":"WebSite","@id":"https:\/\/www.datanovia.com\/en\/#website","url":"https:\/\/www.datanovia.com\/en\/","name":"Datanovia","description":"Data Mining and Statistics for Decision Support","publisher":{"@id":"https:\/\/www.datanovia.com\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.datanovia.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.datanovia.com\/en\/#organization","name":"Datanovia","url":"https:\/\/www.datanovia.com\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","width":98,"height":99,"caption":"Datanovia"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/"}}]}},"multi-rating":{"mr_rating_results":[]},"_links":{"self":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons\/8066","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons"}],"about":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/types\/dt_lessons"}],"author":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.datanovi
a.com\/en\/wp-json\/wp\/v2\/comments?post=8066"}],"version-history":[{"count":0,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons\/8066\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/media\/7799"}],"wp:attachment":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/media?parent=8066"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}