{"id":7644,"date":"2018-10-14T14:57:44","date_gmt":"2018-10-14T14:57:44","guid":{"rendered":"https:\/\/www.datanovia.com\/en\/?post_type=dt_lessons&#038;p=7644"},"modified":"2018-10-21T08:26:28","modified_gmt":"2018-10-21T06:26:28","slug":"data-preparation-and-r-packages-for-cluster-analysis","status":"publish","type":"dt_lessons","link":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/","title":{"rendered":"Data Preparation and R Packages for Cluster Analysis"},"content":{"rendered":"<div id=\"rdoc\">\n<p>In this chapter, we start by presenting the <strong>data format and preparation<\/strong> for <strong>cluster<\/strong> analysis. Next, we introduce two main R packages - <strong>cluster<\/strong> and <strong>factoextra<\/strong> - for computing and visualizing clusters.<\/p>\n<div class='dt-sc-hr-invisible-medium  '><\/div>\n<div class='dt-sc-ico-content type1'><div class='custom-icon' ><a href='https:\/\/www.datanovia.com\/en\/product\/practical-guide-to-cluster-analysis-in-r\/' target='_blank'><span class='fa fa-book'><\/span><\/a><\/div><h4><a href='https:\/\/www.datanovia.com\/en\/product\/practical-guide-to-cluster-analysis-in-r\/' target='_blank'> Related Book <\/a><\/h4><p>Practical Guide to Cluster Analysis<\/p><\/div>\n<div class='dt-sc-hr-invisible-medium  '><\/div>\n<div id=\"data-preparation\" class=\"section level2\">\n<h2>Data preparation<\/h2>\n<p>To perform a cluster analysis in R, the data should generally be prepared as follows:<\/p>\n<ol style=\"list-style-type: decimal;\">\n<li>Rows are observations (individuals) and columns are variables.<\/li>\n<li>Any missing value in the data must be removed or estimated.<\/li>\n<li>The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables such that they have mean zero and standard deviation one. 
Read more about data standardization in the chapter on clustering distance measures.<\/li>\n<\/ol>\n<p>Here, we\u2019ll use the built-in R data set \u201cUSArrests\u201d, which contains statistics on arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percentage of the population living in urban areas.<\/p>\n<pre class=\"r\"><code>data(\"USArrests\")  # Load the data set\r\ndf &lt;- USArrests    # Use df as shorter name<\/code><\/pre>\n<ol style=\"list-style-type: decimal;\">\n<li>To remove any missing value that might be present in the data, type this:<\/li>\n<\/ol>\n<pre class=\"r\"><code>df &lt;- na.omit(df)  # Drop rows containing NA<\/code><\/pre>\n<ol style=\"list-style-type: decimal;\" start=\"2\">\n<li>As we don\u2019t want the clustering algorithm to depend on arbitrary variable units, we start by scaling\/standardizing the data using the R function <em>scale<\/em>():<\/li>\n<\/ol>\n<pre class=\"r\"><code>df &lt;- scale(df)    # Center each column to mean 0 and scale to SD 1\r\nhead(df, n = 3)    # Show the first 3 rows<\/code><\/pre>\n<pre><code>##         Murder Assault UrbanPop     Rape\r\n## Alabama 1.2426   0.783   -0.521 -0.00342\r\n## Alaska  0.5079   1.107   -1.212  2.48420\r\n## Arizona 0.0716   1.479    0.999  1.04288<\/code><\/pre>\n<\/div>\n<div id=\"required-r-packages\" class=\"section level2\">\n<h2>Required R Packages<\/h2>\n<p>In this book, we\u2019ll mainly use the following R packages:<\/p>\n<ul>\n<li><strong>cluster<\/strong> for computing clustering algorithms, and<\/li>\n<li><strong>factoextra<\/strong> for ggplot2-based elegant visualization of clustering results. 
The official online documentation is available at: <a class=\"uri\" href=\"https:\/\/rpkgs.datanovia.com\/factoextra\/\">https:\/\/rpkgs.datanovia.com\/factoextra\/<\/a>.<\/li>\n<\/ul>\n<p>factoextra contains many functions for cluster analysis and visualization, including:<\/p>\n<table>\n<thead>\n<tr class=\"header\">\n<th>Functions<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"odd\">\n<td><em>get_dist<\/em>, <em>fviz_dist<\/em><\/td>\n<td>Distance Matrix Computation and Visualization<\/td>\n<\/tr>\n<tr class=\"even\">\n<td><em>get_clust_tendency<\/em><\/td>\n<td>Assessing Clustering Tendency<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td><em>fviz_nbclust<\/em>, <em>fviz_gap_stat<\/em><\/td>\n<td>Determining the Optimal Number of Clusters<\/td>\n<\/tr>\n<tr class=\"even\">\n<td><em>fviz_dend<\/em><\/td>\n<td>Enhanced Visualization of Dendrogram<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td><em>fviz_cluster<\/em><\/td>\n<td>Visualize Clustering Results<\/td>\n<\/tr>\n<tr class=\"even\">\n<td><em>fviz_mclust<\/em><\/td>\n<td>Visualize Model-based Clustering Results<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td><em>fviz_silhouette<\/em><\/td>\n<td>Visualize Silhouette Information from Clustering<\/td>\n<\/tr>\n<tr class=\"even\">\n<td><em>hcut<\/em><\/td>\n<td>Compute Hierarchical Clustering and Cut the Tree<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td><em>hkmeans<\/em><\/td>\n<td>Hierarchical K-Means Clustering<\/td>\n<\/tr>\n<tr class=\"even\">\n<td><em>eclust<\/em><\/td>\n<td>Visual Enhancement of Clustering Analysis<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>To install the two packages, type this:<\/p>\n<pre class=\"r\"><code>install.packages(c(\"cluster\", \"factoextra\"))<\/code><\/pre>\n<\/div>\n<div id=\"summary\" class=\"section level2\">\n<h2>Summary<\/h2>\n<p>This chapter explains how to prepare your data for cluster analysis and describes the essential R packages for cluster analysis.<\/p>\n<\/div>\n<\/div>\n<p><!--end 
rdoc--><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This chapter introduces how to prepare your data for cluster analysis and describes the essential R package for cluster analysis.<\/p>\n","protected":false},"author":1,"featured_media":8010,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","class_list":["post-7644","dt_lessons","type-dt_lessons","status-publish","has-post-thumbnail","hentry"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Data Preparation and R Packages for Cluster Analysis - Datanovia<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Data Preparation and R Packages for Cluster Analysis - Datanovia\" \/>\n<meta property=\"og:description\" content=\"This chapter introduces how to prepare your data for cluster analysis and describes the essential R package for cluster analysis.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/\" \/>\n<meta property=\"og:site_name\" content=\"Datanovia\" \/>\n<meta property=\"article:modified_time\" content=\"2018-10-21T06:26:28+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0074.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/\",\"name\":\"Data Preparation and R Packages for Cluster Analysis - Datanovia\",\"isPartOf\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0074.jpg\",\"datePublished\":\"2018-10-14T14:57:44+00:00\",\"dateModified\":\"2018-10-21T06:26:28+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#primaryimage\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0074.jpg\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0074.jpg\",\"width\":1024,\"height\":512},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\
",\"item\":\"https:\/\/www.datanovia.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Lessons\",\"item\":\"https:\/\/www.datanovia.com\/en\/lessons\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Data Preparation and R Packages for Cluster Analysis\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"name\":\"Datanovia\",\"description\":\"Data Mining and Statistics for Decision Support\",\"publisher\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.datanovia.com\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\",\"name\":\"Datanovia\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"width\":98,\"height\":99,\"caption\":\"Datanovia\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Data Preparation and R Packages for Cluster Analysis - Datanovia","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/","og_locale":"en_US","og_type":"article","og_title":"Data Preparation and R Packages for Cluster Analysis - Datanovia","og_description":"This chapter introduces how to prepare your data for cluster analysis and describes the essential R package for cluster analysis.","og_url":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/","og_site_name":"Datanovia","article_modified_time":"2018-10-21T06:26:28+00:00","og_image":[{"width":1024,"height":512,"url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0074.jpg","type":"image\/jpeg"}],"twitter_card":"summary_large_image","twitter_misc":{"Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/","url":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/","name":"Data Preparation and R Packages for Cluster Analysis - Datanovia","isPartOf":{"@id":"https:\/\/www.datanovia.com\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#primaryimage"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#primaryimage"},"thumbnailUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0074.jpg","datePublished":"2018-10-14T14:57:44+00:00","dateModified":"2018-10-21T06:26:28+00:00","breadcrumb":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#primaryimage","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0074.jpg","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0074.jpg","width":1024,"height":512},{"@type":"BreadcrumbList","@id":"https:\/\/www.datanovia.com\/en\/lessons\/data-preparation-and-r-packages-for-cluster-analysis\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.datanovia.com\/en\/"},{"@type":"ListItem","position":2,"name":"Lessons","item":"https:\/\/www.datanovia.com\/en\/lessons\/"},{"@type":"ListItem","position":3,"name":"Data Preparation and R Packages 
for Cluster Analysis"}]},{"@type":"WebSite","@id":"https:\/\/www.datanovia.com\/en\/#website","url":"https:\/\/www.datanovia.com\/en\/","name":"Datanovia","description":"Data Mining and Statistics for Decision Support","publisher":{"@id":"https:\/\/www.datanovia.com\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.datanovia.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.datanovia.com\/en\/#organization","name":"Datanovia","url":"https:\/\/www.datanovia.com\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","width":98,"height":99,"caption":"Datanovia"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/"}}]}},"multi-rating":{"mr_rating_results":[]},"_links":{"self":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons\/7644","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons"}],"about":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/types\/dt_lessons"}],"author":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/comments?post=7644"}],"version-history":[{"count":1,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons\/7644\/revisions"}],"predecessor-version":[{"id":8054,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons\/7644\/revisions\/8054"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ww
w.datanovia.com\/en\/wp-json\/wp\/v2\/media\/8010"}],"wp:attachment":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/media?parent=7644"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}