{"id":7674,"date":"2018-10-18T00:37:21","date_gmt":"2018-10-17T22:37:21","guid":{"rendered":"https:\/\/www.datanovia.com\/en\/?post_type=dt_lessons&#038;p=7674"},"modified":"2018-10-21T08:28:42","modified_gmt":"2018-10-21T06:28:42","slug":"k-means-clustering-in-r-algorith-and-practical-examples","status":"publish","type":"dt_lessons","link":"https:\/\/www.datanovia.com\/en\/lessons\/k-means-clustering-in-r-algorith-and-practical-examples\/","title":{"rendered":"K-Means Clustering in R: Algorithm and Practical Examples"},"content":{"rendered":"<div id=\"rdoc\">\n<p><strong>K-means clustering<\/strong> <span class=\"citation\">(MacQueen 1967)<\/span> is one of the most commonly used unsupervised machine learning algorithms for partitioning a given data set into a set of k groups (i.e., <em>k clusters<\/em>), where k represents the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high <em>intra-class similarity<\/em>), whereas objects from different clusters are as dissimilar as possible (i.e., low <em>inter-class similarity<\/em>). 
In k-means clustering, each cluster is represented by its center (i.e., <em>centroid<\/em>), which corresponds to the mean of the points assigned to the cluster.<\/p>\n<div class=\"block\">\n<p>In this article, you will learn:<\/p>\n<ul>\n<li>The basic steps of the <strong>k-means algorithm<\/strong><\/li>\n<li>How to compute <strong>k-means in R<\/strong> software using practical examples<\/li>\n<li>Advantages and disadvantages of k-means clustering<\/li>\n<\/ul>\n<\/div>\n<p>Contents:<\/p>\n<div id=\"TOC\">\n<ul>\n<li><a href=\"#k-means-basic-ideas\">K-means basic ideas<\/a><\/li>\n<li><a href=\"#k-means-algorithm\">K-means algorithm<\/a><\/li>\n<li><a href=\"#computing-k-means-clustering-in-r\">Computing k-means clustering in R<\/a>\n<ul>\n<li><a href=\"#data\">Data<\/a><\/li>\n<li><a href=\"#required-r-packages-and-functions\">Required R packages and functions<\/a><\/li>\n<li><a href=\"#estimating-the-optimal-number-of-clusters\">Estimating the optimal number of clusters<\/a><\/li>\n<li><a href=\"#computing-k-means-clustering\">Computing k-means clustering<\/a><\/li>\n<li><a href=\"#accessing-to-the-results-of-kmeans-function\">Accessing the results of the kmeans() function<\/a><\/li>\n<li><a href=\"#visualizing-k-means-clusters\">Visualizing k-means clusters<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"#k-means-clustering-advantages-and-disadvantages\">K-means clustering advantages and disadvantages<\/a><\/li>\n<li><a href=\"#alternative-to-k-means-clustering\">Alternative to k-means clustering<\/a><\/li>\n<li><a href=\"#summary\">Summary<\/a><\/li>\n<li><a href=\"#references\">References<\/a><\/li>\n<\/ul>\n<\/div>\n<div class='dt-sc-hr-invisible-medium  '><\/div>\n<div class='dt-sc-ico-content type1'><div class='custom-icon' ><a href='https:\/\/www.datanovia.com\/en\/product\/practical-guide-to-cluster-analysis-in-r\/' target='_blank'><span class='fa fa-book'><\/span><\/a><\/div><h4><a 
href='https:\/\/www.datanovia.com\/en\/product\/practical-guide-to-cluster-analysis-in-r\/' target='_blank'> Related Book <\/a><\/h4>Practical Guide to Cluster Analysis in R<\/div>\n<div class='dt-sc-hr-invisible-medium  '><\/div>\n<div id=\"k-means-basic-ideas\" class=\"section level2\">\n<h2>K-means basic ideas<\/h2>\n<p>The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized.<\/p>\n<p>There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm <span class=\"citation\">(Hartigan and Wong 1979)<\/span>, which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:<\/p>\n<p><span class=\"math display\">\\[<br \/>\nW(C_k) = \\sum\\limits_{x_i \\in C_k} (x_i - \\mu_k)^2<br \/>\n\\]<\/span><\/p>\n<ul>\n<li><span class=\"math inline\">\\(x_i\\)<\/span> denotes a data point belonging to the cluster <span class=\"math inline\">\\(C_k\\)<\/span><\/li>\n<li><span class=\"math inline\">\\(\\mu_k\\)<\/span> is the mean value of the points assigned to the cluster <span class=\"math inline\">\\(C_k\\)<\/span><\/li>\n<\/ul>\n<p>Each observation (<span class=\"math inline\">\\(x_i\\)<\/span>) is assigned to a given cluster such that the sum-of-squares (SS) distance between the observation and its assigned cluster center <span class=\"math inline\">\\(\\mu_k\\)<\/span> is minimized.<\/p>\n<p>We define the total within-cluster variation as follows:<\/p>\n<p><span class=\"math display\">\\[<br \/>\ntot.withinss = \\sum\\limits_{k=1}^k W(C_k) = \\sum\\limits_{k=1}^k \\sum\\limits_{x_i \\in C_k} (x_i - \\mu_k)^2<br \/>\n\\]<\/span><\/p>\n<p><span class=\"success\">The <em>total within-cluster sum of squares<\/em> measures the compactness (i.e., the <em>goodness<\/em>) of the clustering, and we want it to be as small as possible.<\/span><\/p>\n<\/div>\n<div 
id=\"k-means-algorithm\" class=\"section level2\">\n<h2>K-means algorithm<\/h2>\n<p>The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution.<\/p>\n<p>The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids.<\/p>\n<p>Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the <a href=\"https:\/\/www.datanovia.com\/en\/lessons\/clustering-distance-measures\/\">Euclidean distance<\/a> between the object and the cluster mean. This step is called the \u201ccluster assignment step\u201d. Note that, to use correlation distance, the data are input as z-scores.<\/p>\n<p>After the assignment step, the algorithm computes the new mean value of each cluster. The term \u201ccentroid update\u201d is used to describe this step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned using the updated cluster means.<\/p>\n<p>The cluster assignment and centroid update steps are iteratively repeated until the cluster assignments stop changing (i.e., until <em>convergence<\/em> is achieved). 
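These two alternating steps can be sketched in base R. This is a rough illustration only: the function name kmeans_sketch is hypothetical, and R's kmeans() uses the more efficient Hartigan-Wong algorithm rather than this naive loop.

```r
# Naive sketch of the two alternating k-means steps (illustration only).
# Assumes no cluster becomes empty during the iterations.
kmeans_sketch <- function(x, k, iter.max = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centroids
  for (i in seq_len(iter.max)) {
    # Cluster assignment step: each point goes to its closest centroid
    d2 <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    cluster <- max.col(-d2)
    # Centroid update step: recompute each centroid as its cluster mean
    new_centers <- apply(x, 2, function(v) tapply(v, factor(cluster, 1:k), mean))
    if (isTRUE(all.equal(centers, new_centers, check.attributes = FALSE))) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}
```

In practice the built-in kmeans() should of course be preferred; the sketch only makes the assignment and update steps explicit.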
That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.<\/p>\n<p>The k-means algorithm can be summarized as follows:<\/p>\n<div class=\"block\">\n<ol style=\"list-style-type: decimal;\">\n<li>Specify the number of clusters (K) to be created (by the analyst)<\/li>\n<li>Select randomly k objects from the dataset as the initial cluster centers or means<\/li>\n<li>Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid<\/li>\n<li>For each of the k clusters, update the <em>cluster centroid<\/em> by calculating the new mean values of all the data points in the cluster. The centroid of the <span class=\"math inline\"><em>k<\/em><sub><em>t<\/em><em>h<\/em><\/sub><\/span> cluster is a vector of length <span class=\"math inline\"><em>p<\/em><\/span> containing the means of all variables for the observations in the <span class=\"math inline\"><em>k<\/em><sub><em>t<\/em><em>h<\/em><\/sub><\/span> cluster; <em>p<\/em> is the number of variables.<\/li>\n<li>Iteratively minimize the total within-cluster sum of squares. That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. The <strong>R<\/strong> software uses 10 as the default value for the maximum number of iterations.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<div id=\"computing-k-means-clustering-in-r\" class=\"section level2\">\n<h2>Computing k-means clustering in R<\/h2>\n<div id=\"data\" class=\"section level3\">\n<h3>Data<\/h3>\n<p>We\u2019ll use the demo data set \u201cUSArrests\u201d. The data should be prepared as described in chapter @ref(data-preparation-and-r-packages). The data must contain only continuous variables, as the k-means algorithm uses variable means. 
As we don\u2019t want the k-means algorithm to depend on an arbitrary variable unit, we start by scaling the data using the R function <em>scale()<\/em> as follows:<\/p>\n<pre class=\"r\"><code>data(\"USArrests\")      # Loading the data set\r\ndf &lt;- scale(USArrests) # Scaling the data\r\n\r\n# View the first 3 rows of the data\r\nhead(df, n = 3)<\/code><\/pre>\n<pre><code>##         Murder Assault UrbanPop     Rape\r\n## Alabama 1.2426   0.783   -0.521 -0.00342\r\n## Alaska  0.5079   1.107   -1.212  2.48420\r\n## Arizona 0.0716   1.479    0.999  1.04288<\/code><\/pre>\n<\/div>\n<div id=\"required-r-packages-and-functions\" class=\"section level3\">\n<h3>Required R packages and functions<\/h3>\n<p>The standard R function for k-means clustering is <em>kmeans<\/em>() [<em>stats<\/em> package], whose simplified format is as follows:<\/p>\n<pre class=\"r\"><code>kmeans(x, centers, iter.max = 10, nstart = 1)<\/code><\/pre>\n<div class=\"block\">\n<ul>\n<li><strong>x<\/strong>: a numeric matrix, numeric data frame or numeric vector<\/li>\n<li><strong>centers<\/strong>: Possible values are the number of clusters (k) or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.<\/li>\n<li><strong>iter.max<\/strong>: The maximum number of iterations allowed. Default value is 10.<\/li>\n<li><strong>nstart<\/strong>: The number of random starting partitions when centers is a number. 
Trying nstart &gt; 1 is often recommended.<\/li>\n<\/ul>\n<\/div>\n<p>To create a beautiful graph of the clusters generated with the <em>kmeans<\/em>() function, we will use the <em>factoextra<\/em> package.<\/p>\n<ul>\n<li>Installing the <em>factoextra<\/em> package:<\/li>\n<\/ul>\n<pre class=\"r\"><code>install.packages(\"factoextra\")<\/code><\/pre>\n<ul>\n<li>Loading <em>factoextra<\/em>:<\/li>\n<\/ul>\n<pre class=\"r\"><code>library(factoextra)<\/code><\/pre>\n<\/div>\n<div id=\"estimating-the-optimal-number-of-clusters\" class=\"section level3\">\n<h3>Estimating the optimal number of clusters<\/h3>\n<p>K-means clustering requires the user to specify the number of clusters to be generated.<\/p>\n<p><span class=\"question\">One fundamental question is: how to choose the right number of expected clusters (k)?<\/span><\/p>\n<p>Different methods will be presented in the chapter \u201ccluster evaluation and validation statistics\u201d.<\/p>\n<p>Here, we provide a simple solution. The idea is to compute k-means clustering using different values of k. Next, the WSS (within-cluster sum of squares) is plotted against the number of clusters. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.<\/p>\n<p>The R function <em>fviz_nbclust<\/em>() [in the <em>factoextra<\/em> package] provides a convenient solution to estimate the optimal number of clusters.<\/p>\n<\/p>\n<div class=\"error\">The contents\/codes here are hidden from non-premium members. 
Sign up now to read all of our premium content and to be awarded a certificate of course completion.<br \/>\n<a href='https:\/\/www.datanovia.com\/en\/pricing\/' target='_self'  class='dt-sc-button   medium  '  style=\"background-color:#FF6600;border-color:#FF6600;color:#ffffff;\">Claim Your Membership Now<\/a>.<\/div>\n<p>\n<p><img decoding=\"async\" src=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/dn-tutorials\/002-partitional-clustering\/figures\/006b-kmeans-clustering-k-means-optimal-clusters-wss-1.png\" width=\"518.4\" \/><\/p>\n<div class=\"success\">\n<p>The plot above represents the variance within the clusters. It decreases as k increases, but a bend (or \u201celbow\u201d) can be seen at k = 4. This bend indicates that additional clusters beyond the fourth have little value. In the next section, we\u2019ll classify the observations into 4 clusters.<\/p>\n<\/div>\n<\/div>\n<div id=\"computing-k-means-clustering\" class=\"section level3\">\n<h3>Computing k-means clustering<\/h3>\n<p>As the k-means clustering algorithm starts with k randomly selected centroids, it\u2019s always recommended to use the <em>set.seed()<\/em> function in order to set a seed for <em>R\u2019s random number generator<\/em>. The aim is to make the results reproducible, so that the reader of this article will obtain exactly the same results as those shown below.<\/p>\n<p>The R code below performs <em>k-means clustering<\/em> with k = 4:<\/p>\n<pre class=\"r\"><code># Compute k-means with k = 4\r\nset.seed(123)\r\nkm.res &lt;- kmeans(df, 4, nstart = 25)<\/code><\/pre>\n<div class=\"warning\">\n<p>As the final result of k-means clustering is sensitive to the random starting assignments, we specify <em>nstart = 25<\/em>. This means that R will try 25 different random starting assignments and then select the best result, i.e., the one with the lowest within-cluster variation. The default value of <em>nstart<\/em> in R is one. 
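To see the effect of nstart, one can compare the compactness obtained with a single start against 25 restarts (a quick check; df is the scaled USArrests data created above):

```r
# Compare a single random start with 25 restarts
df <- scale(USArrests)
set.seed(123)
wss1 <- kmeans(df, 4, nstart = 1)$tot.withinss
set.seed(123)
wss25 <- kmeans(df, 4, nstart = 25)$tot.withinss
# With the same seed, the first of the 25 starts matches the single start,
# so wss25 can never be larger than wss1
c(single_start = wss1, best_of_25 = wss25)
```

The two values may coincide when the single start already finds the optimum; in general, the best of many starts is the safer choice.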
But, it\u2019s strongly recommended to compute <em>k-means clustering<\/em> with a large value of <em>nstart<\/em> such as 25 or 50, in order to have a more stable result.<\/p>\n<\/div>\n<pre class=\"r\"><code># Print the results\r\nprint(km.res)<\/code><\/pre>\n<pre><code>## K-means clustering with 4 clusters of sizes 13, 16, 13, 8\r\n## \r\n## Cluster means:\r\n##   Murder Assault UrbanPop    Rape\r\n## 1 -0.962  -1.107   -0.930 -0.9668\r\n## 2 -0.489  -0.383    0.576 -0.2617\r\n## 3  0.695   1.039    0.723  1.2769\r\n## 4  1.412   0.874   -0.815  0.0193\r\n## \r\n## Clustering vector:\r\n##        Alabama         Alaska        Arizona       Arkansas     California \r\n##              4              3              3              4              3 \r\n##       Colorado    Connecticut       Delaware        Florida        Georgia \r\n##              3              2              2              3              4 \r\n##         Hawaii          Idaho       Illinois        Indiana           Iowa \r\n##              2              1              3              2              1 \r\n##         Kansas       Kentucky      Louisiana          Maine       Maryland \r\n##              2              1              4              1              3 \r\n##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri \r\n##              2              3              1              4              3 \r\n##        Montana       Nebraska         Nevada  New Hampshire     New Jersey \r\n##              1              1              3              1              2 \r\n##     New Mexico       New York North Carolina   North Dakota           Ohio \r\n##              3              3              4              1              2 \r\n##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina \r\n##              2              2              2              2              4 \r\n##   South Dakota      Tennessee          Texas           Utah        Vermont \r\n##      
        1              4              3              2              1 \r\n##       Virginia     Washington  West Virginia      Wisconsin        Wyoming \r\n##              2              2              1              1              2 \r\n## \r\n## Within cluster sum of squares by cluster:\r\n## [1] 11.95 16.21 19.92  8.32\r\n##  (between_SS \/ total_SS =  71.2 %)\r\n## \r\n## Available components:\r\n## \r\n## [1] \"cluster\"      \"centers\"      \"totss\"        \"withinss\"    \r\n## [5] \"tot.withinss\" \"betweenss\"    \"size\"         \"iter\"        \r\n## [9] \"ifault\"<\/code><\/pre>\n<div class=\"success\">\n<p>The printed output displays:<\/p>\n<ul>\n<li>the cluster means or centers: a matrix whose rows are the cluster numbers (1 to 4) and whose columns are the variables<\/li>\n<li>the clustering vector: a vector of integers (from 1:k) indicating the cluster to which each point is allocated<\/li>\n<\/ul>\n<\/div>\n<p>It\u2019s possible to compute the mean of each variable by cluster using the original data:<\/p>\n<pre class=\"r\"><code>aggregate(USArrests, by=list(cluster=km.res$cluster), mean)<\/code><\/pre>\n<pre><code>##   cluster Murder Assault UrbanPop Rape\r\n## 1       1   3.60    78.5     52.1 12.2\r\n## 2       2   5.66   138.9     73.9 18.8\r\n## 3       3  10.82   257.4     76.0 33.2\r\n## 4       4  13.94   243.6     53.8 21.4<\/code><\/pre>\n<p>If you want to add the point classifications to the original data, use this:<\/p>\n<pre class=\"r\"><code>dd &lt;- cbind(USArrests, cluster = km.res$cluster)\r\nhead(dd)<\/code><\/pre>\n<pre><code>##            Murder Assault UrbanPop Rape cluster\r\n## Alabama      13.2     236       58 21.2       4\r\n## Alaska       10.0     263       48 44.5       3\r\n## Arizona       8.1     294       80 31.0       3\r\n## Arkansas      8.8     190       50 19.5       4\r\n## California    9.0     276       91 40.6       3\r\n## Colorado      7.9     204       78 38.7       3<\/code><\/pre>\n<\/div>\n<div 
id=\"accessing-to-the-results-of-kmeans-function\" class=\"section level3\">\n<h3>Accessing the results of the kmeans() function<\/h3>\n<p>The <strong>kmeans()<\/strong> function returns a list of components, including:<\/p>\n<ul>\n<li><strong>cluster<\/strong>: A vector of integers (from 1:k) indicating the cluster to which each point is allocated<\/li>\n<li><strong>centers<\/strong>: A matrix of cluster centers (cluster means)<\/li>\n<li><strong>totss<\/strong>: The total sum of squares (TSS), i.e., <span class=\"math inline\">\\(\\sum{(x_i - \\bar{x})^2}\\)<\/span>. TSS measures the total variance in the data.<\/li>\n<li><strong>withinss<\/strong>: Vector of within-cluster sum of squares, one component per cluster<\/li>\n<li><strong>tot.withinss<\/strong>: Total within-cluster sum of squares, i.e., <span class=\"math inline\">\\(sum(withinss)\\)<\/span><\/li>\n<li><strong>betweenss<\/strong>: The between-cluster sum of squares, i.e., <span class=\"math inline\">\\(totss - tot.withinss\\)<\/span><\/li>\n<li><strong>size<\/strong>: The number of observations in each cluster<\/li>\n<\/ul>\n<p>These components can be accessed as follows:<\/p>\n<pre class=\"r\"><code># Cluster number for each of the observations\r\nkm.res$cluster<\/code><\/pre>\n<pre class=\"r\"><code>head(km.res$cluster, 4)<\/code><\/pre>\n<pre><code>##  Alabama   Alaska  Arizona Arkansas \r\n##        4        3        3        4<\/code><\/pre>\n<p>\u2026..<\/p>\n<pre class=\"r\"><code># Cluster size\r\nkm.res$size<\/code><\/pre>\n<pre><code>## [1] 13 16 13  8<\/code><\/pre>\n<pre class=\"r\"><code># Cluster means\r\nkm.res$centers<\/code><\/pre>\n<pre><code>##   Murder Assault UrbanPop    Rape\r\n## 1 -0.962  -1.107   -0.930 -0.9668\r\n## 2 -0.489  -0.383    0.576 -0.2617\r\n## 3  0.695   1.039    0.723  1.2769\r\n## 4  1.412   0.874   -0.815  0.0193<\/code><\/pre>\n<\/div>\n<div id=\"visualizing-k-means-clusters\" class=\"section level3\">\n<h3>Visualizing k-means clusters<\/h3>\n<p>It is a good idea to 
plot the cluster results. These can be used to assess the choice of the number of clusters as well as to compare two different cluster analyses.<\/p>\n<p>Now, we want to visualize the data in a scatter plot, coloring each data point according to its cluster assignment.<\/p>\n<p>The problem is that the data contains more than two variables, and the question is which variables to choose for the xy scatter plot.<\/p>\n<p>A solution is to reduce the number of dimensions by applying a dimensionality reduction algorithm, such as <a href=\"http:\/\/www.sthda.com\/english\/wiki\/factominer-and-factoextra-principal-component-analysis-visualization-r-software-and-data-mining\"><strong>Principal Component Analysis (PCA)<\/strong><\/a>, that operates on the four variables and outputs two new variables (that represent the original variables) that you can use to do the plot.<\/p>\n<div class=\"success\">\n<p>In other words, if we have a multi-dimensional data set, a solution is to perform Principal Component Analysis (PCA) and to plot data points according to the first two principal component coordinates.<\/p>\n<\/div>\n<p>The function <em>fviz_cluster<\/em>() [<em>factoextra<\/em> package] can be used to easily visualize k-means clusters. It takes the k-means results and the original data as arguments. In the resulting plot, observations are represented by points, using principal components if the number of variables is greater than 2. It\u2019s also possible to draw concentration ellipses around each cluster.<\/p>\n<\/p>\n<div class=\"error\">The contents\/codes here are hidden from non-premium members. 
Sign up now to read all of our premium content and to be awarded a certificate of course completion.<br \/>\n<a href='https:\/\/www.datanovia.com\/en\/pricing\/' target='_self'  class='dt-sc-button   medium  '  style=\"background-color:#FF6600;border-color:#FF6600;color:#ffffff;\">Claim Your Membership Now<\/a>.<\/div>\n<p>\n<p><img decoding=\"async\" src=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/dn-tutorials\/002-partitional-clustering\/figures\/006b-kmeans-clustering-k-means-plot-ggplot2-factoextra-1.png\" width=\"672\" \/><\/p>\n<\/div>\n<\/div>\n<div id=\"k-means-clustering-advantages-and-disadvantages\" class=\"section level2\">\n<h2>K-means clustering advantages and disadvantages<\/h2>\n<p>K-means clustering is a very simple and fast algorithm. It can efficiently deal with very large data sets. However, there are some weaknesses, including:<\/p>\n<div class=\"warning\">\n<ol style=\"list-style-type: decimal;\">\n<li>It assumes prior knowledge of the data and requires the analyst to choose the appropriate number of clusters (k) in advance<\/li>\n<li>The final results obtained are sensitive to the initial random selection of cluster centers. Why is this a problem? Because, for every different run of the algorithm on the same dataset, you may choose a different set of initial centers. This may lead to different clustering results on different runs of the algorithm.<\/li>\n<li>It\u2019s sensitive to outliers.<\/li>\n<li>If you rearrange your data, it\u2019s very possible that you\u2019ll get a different solution every time you change the ordering of your data.<\/li>\n<\/ol>\n<\/div>\n<p>Possible solutions to these weaknesses include:<\/p>\n<div class=\"success\">\n<ol style=\"list-style-type: decimal;\">\n<li>Solution to issue 1: Compute k-means for a range of k values, for example by varying k between 2 and 10. 
Then, choose the best k by comparing the clustering results obtained for the different k values.<\/li>\n<li>Solution to issue 2: Compute the k-means algorithm several times with different initial cluster centers. The run with the lowest total within-cluster sum of squares is selected as the final clustering solution.<\/li>\n<li>To avoid distortions caused by excessive outliers, it\u2019s possible to use the PAM algorithm, which is less sensitive to outliers.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<div id=\"alternative-to-k-means-clustering\" class=\"section level2\">\n<h2>Alternative to k-means clustering<\/h2>\n<p>A robust alternative to k-means is PAM, which is based on medoids. As discussed in the next chapter, PAM clustering can be computed using the function <em>pam<\/em>() [<em>cluster<\/em> package]. The function <em>pamk<\/em>() [<em>fpc<\/em> package] is a wrapper for PAM that also prints the suggested number of clusters based on the optimum average silhouette width.<\/p>\n<\/div>\n<div id=\"summary\" class=\"section level2\">\n<h2>Summary<\/h2>\n<p>K-means clustering can be used to classify observations into k groups, based on their similarity. Each group is represented by the mean value of the points in the group, known as the cluster centroid.<\/p>\n<p>The k-means algorithm requires users to specify the number of clusters to generate. The R function <em>kmeans<\/em>() [<em>stats<\/em> package] can be used to compute the k-means algorithm. The simplified format is kmeans(x, centers), where \u201cx\u201d is the data and centers is the number of clusters to be produced.<\/p>\n<p>After computing k-means clustering, the R function <em>fviz_cluster<\/em>() [<em>factoextra<\/em> package] can be used to visualize the results. 
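If factoextra is not available, a comparable (if less polished) plot can be sketched in base R by projecting the data onto the first two principal components, as described earlier (df and km.res as computed above):

```r
# Base-R alternative: color observations by cluster on the first two PCs
df <- scale(USArrests)
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
pca <- prcomp(df)
plot(pca$x[, 1:2], col = km.res$cluster, pch = 19,
     main = 'K-means clusters (k = 4)')
# Project the cluster centers into the same PCA space and mark them
points(predict(pca, newdata = km.res$centers)[, 1:2], pch = 8, cex = 2)
```
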
The format is fviz_cluster(km.res, data), where km.res is k-means results and data corresponds to the original data sets.<\/p>\n<\/div>\n<div id=\"references\" class=\"section level2 unnumbered\">\n<h2>References<\/h2>\n<div id=\"refs\" class=\"references\">\n<div id=\"ref-hartigan1979\">\n<p>Hartigan, JA, and MA Wong. 1979. \u201cAlgorithm AS 136: A K-means clustering algorithm.\u201d <em>Applied Statistics<\/em>. Royal Statistical Society, 100\u2013108.<\/p>\n<\/div>\n<div id=\"ref-macqueen1967\">\n<p>MacQueen, J. 1967. \u201cSome Methods for Classification and Analysis of Multivariate Observations.\u201d In <em>Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics<\/em>, 281\u201397. Berkeley, Calif.: University of California Press. <a class=\"uri\" href=\"http:\/\/projecteuclid.org:443\/euclid.bsmsp\/1200512992\">http:\/\/projecteuclid.org:443\/euclid.bsmsp\/1200512992<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><!--end rdoc--><\/p>\n","protected":false},"excerpt":{"rendered":"<p>K-means clustering is one of the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups. 
In this tutorial, you will learn: 1) the basic steps of k-means algorithm; 2) How to compute k-means in R software using practical examples; and 3) Advantages and disavantages of k-means clustering<\/p>\n","protected":false},"author":1,"featured_media":7968,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","class_list":["post-7674","dt_lessons","type-dt_lessons","status-publish","has-post-thumbnail","hentry"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>K-Means Clustering in R: Algorithm and Practical Examples - Datanovia<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.datanovia.com\/en\/lessons\/k-means-clustering-in-r-algorith-and-practical-examples\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"K-Means Clustering in R: Algorithm and Practical Examples - Datanovia\" \/>\n<meta property=\"og:description\" content=\"K-means clustering is one of the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups. 
In this tutorial, you will learn: 1) the basic steps of k-means algorithm; 2) How to compute k-means in R software using practical examples; and 3) Advantages and disavantages of k-means clustering\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.datanovia.com\/en\/lessons\/k-means-clustering-in-r-algorith-and-practical-examples\/\" \/>\n<meta property=\"og:site_name\" content=\"Datanovia\" \/>\n<meta property=\"article:modified_time\" content=\"2018-10-21T06:28:42+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030315.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/k-means-clustering-in-r-algorith-and-practical-examples\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/lessons\/k-means-clustering-in-r-algorith-and-practical-examples\/\",\"name\":\"K-Means Clustering in R: Algorithm and Practical Examples - 