{"id":8067,"date":"2018-10-21T12:14:52","date_gmt":"2018-10-21T10:14:52","guid":{"rendered":"https:\/\/www.datanovia.com\/en\/?post_type=dt_lessons&#038;p=8067"},"modified":"2018-10-21T14:07:47","modified_gmt":"2018-10-21T12:07:47","slug":"computing-p-value-for-hierarchical-clustering","status":"publish","type":"dt_lessons","link":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/","title":{"rendered":"Computing P-value for Hierarchical Clustering"},"content":{"rendered":"<p>&nbsp;<\/p>\n<div id=\"rdoc\">\n<p>Clusters can be found in a data set by chance due to clustering noise or sampling error. This article describes the R package <strong>pvclust<\/strong> <span class=\"citation\">(Suzuki and Shimodaira 2015)<\/span>, which uses bootstrap resampling techniques to <strong>compute a p-value<\/strong> for each <strong>hierarchical cluster<\/strong>.<\/p>\n<p>Contents:<\/p>\n<div id=\"TOC\">\n<ul>\n<li><a href=\"#algorithm\">Algorithm<\/a><\/li>\n<li><a href=\"#required-packages\">Required packages<\/a><\/li>\n<li><a href=\"#data-preparation\">Data preparation<\/a><\/li>\n<li><a href=\"#compute-p-value-for-hierarchical-clustering\">Compute p-value for hierarchical clustering<\/a>\n<ul>\n<li><a href=\"#description-of-pvclust-function\">Description of pvclust() function<\/a><\/li>\n<li><a href=\"#usage-of-pvclust-function\">Usage of pvclust() function<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"#references\">References<\/a><\/li>\n<\/ul>\n<\/div>\n<div class='dt-sc-hr-invisible-medium  '><\/div>\n<div class='dt-sc-ico-content type1'><div class='custom-icon' ><a href='https:\/\/www.datanovia.com\/en\/product\/practical-guide-to-cluster-analysis-in-r\/' target='_blank'><span class='fa fa-book'><\/span><\/a><\/div><h4><a href='https:\/\/www.datanovia.com\/en\/product\/practical-guide-to-cluster-analysis-in-r\/' target='_blank'> Related Book <\/a><\/h4>Practical Guide to Cluster Analysis in R<\/div>\n<div class='dt-sc-hr-invisible-medium 
 '><\/div>\n<div id=\"algorithm\" class=\"section level2\">\n<h2>Algorithm<\/h2>\n<ol style=\"list-style-type: decimal;\">\n<li>Generate thousands of bootstrap samples by randomly sampling elements of the data<\/li>\n<li>Compute hierarchical clustering on each bootstrap copy<\/li>\n<li>For each cluster:\n<ul>\n<li>Compute the <em>bootstrap probability<\/em> (<em>BP<\/em>) value, which corresponds to the frequency with which the cluster is identified in the bootstrap copies.<\/li>\n<li>Compute the <em>approximately unbiased<\/em> (<em>AU<\/em>) probability values (p-values) by multiscale bootstrap resampling.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<div class=\"success\">\n<p>Clusters with AU &gt;= 95% are considered to be strongly supported by the data.<\/p>\n<\/div>\n<\/div>\n<div id=\"required-packages\" class=\"section level2\">\n<h2>Required packages<\/h2>\n<ol style=\"list-style-type: decimal;\">\n<li>Install <strong>pvclust<\/strong>:<\/li>\n<\/ol>\n<pre class=\"r\"><code>install.packages(\"pvclust\")<\/code><\/pre>\n<ol style=\"list-style-type: decimal;\" start=\"2\">\n<li>Load <strong>pvclust<\/strong>:<\/li>\n<\/ol>\n<pre class=\"r\"><code>library(pvclust)<\/code><\/pre>\n<\/div>\n<div id=\"data-preparation\" class=\"section level2\">\n<h2>Data preparation<\/h2>\n<p>We\u2019ll use the <em>lung<\/em> data set [in the <em>pvclust<\/em> package]. It contains the gene expression profiles of 916 genes in 73 lung tissues, including 67 tumors. 
Columns are samples and rows are genes.<\/p>\n<pre class=\"r\"><code>library(pvclust)\r\n# Load the data\r\ndata(\"lung\")\r\nhead(lung[, 1:4])<\/code><\/pre>\n<pre><code>##               fetal_lung 232-97_SCC 232-97_node 68-96_Adeno\r\n## IMAGE:196992       -0.40       4.28        3.68       -1.35\r\n## IMAGE:587847       -2.22       5.21        4.75       -0.91\r\n## IMAGE:1049185      -1.35      -0.84       -2.88        3.35\r\n## IMAGE:135221        0.68       0.56       -0.45       -0.20\r\n## IMAGE:298560          NA       4.14        3.58       -0.40\r\n## IMAGE:119882       -3.23      -2.84       -2.72       -0.83<\/code><\/pre>\n<pre class=\"r\"><code># Dimension of the data\r\ndim(lung)<\/code><\/pre>\n<pre><code>## [1] 916  73<\/code><\/pre>\n<p>We\u2019ll use only a subset of the data set for the clustering analysis. The R function <em>sample<\/em>() can be used to extract a random subset of 30 samples:<\/p>\n<pre class=\"r\"><code>set.seed(123)\r\nss &lt;- sample(1:73, 30) # extract 30 samples out of 73\r\ndf &lt;- lung[, ss]<\/code><\/pre>\n<\/div>\n<div id=\"compute-p-value-for-hierarchical-clustering\" class=\"section level2\">\n<h2>Compute p-value for hierarchical clustering<\/h2>\n<div id=\"description-of-pvclust-function\" class=\"section level3\">\n<h3>Description of pvclust() function<\/h3>\n<p>The function <em>pvclust<\/em>() can be used as follows:<\/p>\n<pre class=\"r\"><code>pvclust(data, method.hclust = \"average\",\r\n        method.dist = \"correlation\", nboot = 1000)<\/code><\/pre>\n<p>Note that the computation time can be substantially reduced by using the parallel version, <em>parPvclust<\/em>(). 
(See ?parPvclust for more information.)<\/p>\n<pre class=\"r\"><code>parPvclust(cl=NULL, data, method.hclust = \"average\",\r\n           method.dist = \"correlation\", nboot = 1000,\r\n           iseed = NULL)<\/code><\/pre>\n<div class=\"block\">\n<ul>\n<li><strong>data<\/strong>: numeric data matrix or data frame.<\/li>\n<li><strong>method.hclust<\/strong>: the agglomerative method used in hierarchical clustering. Possible values are \u201caverage\u201d, \u201cward\u201d, \u201csingle\u201d, \u201ccomplete\u201d, \u201cmcquitty\u201d, \u201cmedian\u201d or \u201ccentroid\u201d. The default is \u201caverage\u201d. See the method argument in <strong>?hclust<\/strong>.<\/li>\n<li><strong>method.dist<\/strong>: the distance measure to be used. Possible values are \u201ccorrelation\u201d, \u201cuncentered\u201d, \u201cabscor\u201d or those allowed for the <strong>method<\/strong> argument of the <strong>dist()<\/strong> function, such as \u201ceuclidean\u201d and \u201cmanhattan\u201d.<\/li>\n<li><strong>nboot<\/strong>: the number of bootstrap replications. The default is 1000.<\/li>\n<li><strong>iseed<\/strong>: an integer used as the random seed. Use the iseed argument to obtain reproducible results.<\/li>\n<\/ul>\n<\/div>\n<p>The function <em>pvclust<\/em>() returns an object of class <em>pvclust<\/em> containing many elements, including <em>hclust<\/em>, which holds the hierarchical clustering result for the original data generated by the function <em>hclust<\/em>().<\/p>\n<\/div>\n<div id=\"usage-of-pvclust-function\" class=\"section level3\">\n<h3>Usage of pvclust() function<\/h3>\n<p><em>pvclust<\/em>() performs clustering on the columns of the data set, which correspond to samples in our case. 
If you want to perform the clustering on the rows (here, genes), you have to transpose the data set using the function <em>t<\/em>().<\/p>\n<p>The R code below runs <em>pvclust<\/em>() using 10 as the number of bootstrap replications (for speed):<\/p>\n<pre class=\"r\"><code>library(pvclust)\r\nset.seed(123)\r\nres.pv &lt;- pvclust(df, method.dist=\"correlation\", \r\n                  method.hclust=\"average\", nboot = 10)<\/code><\/pre>\n<pre class=\"r\"><code># Default plot\r\nplot(res.pv, hang = -1, cex = 0.5)\r\npvrect(res.pv)<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/dn-tutorials\/004-cluster-validation\/figures\/018-p-value-for-hierarchical-clustering-pvclust-p-value-hierarchical-clustering-1.png\" width=\"518.4\" \/><\/p>\n<div class=\"success\">\n<p>Values on the dendrogram are <em>AU p-values<\/em> (red, left), <em>BP values<\/em> (green, right), and <em>cluster labels<\/em> (grey, bottom). Clusters with AU &gt;= 95% are indicated by the rectangles and are considered to be strongly supported by the data.<\/p>\n<\/div>\n<p>To extract the objects from the significant clusters, use the function <em>pvpick<\/em>():<\/p>\n<pre class=\"r\"><code>clusters &lt;- pvpick(res.pv)\r\nclusters<\/code><\/pre>\n<p>Parallel computation can be applied as follows:<\/p>\n<pre class=\"r\"><code># Create a parallel socket cluster\r\nlibrary(parallel)\r\ncl &lt;- makeCluster(2, type = \"PSOCK\")\r\n# parallel version of pvclust\r\nres.pv &lt;- parPvclust(cl, df, nboot=1000)\r\nstopCluster(cl)<\/code><\/pre>\n<\/div>\n<\/div>\n<div id=\"references\" class=\"section level2 unnumbered\">\n<h2>References<\/h2>\n<div id=\"refs\" class=\"references\">\n<div id=\"ref-suzuki2015\">\n<p>Suzuki, Ryota, and Hidetoshi Shimodaira. 2015. 
<em>Pvclust: Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling<\/em>. <a class=\"uri\" href=\"https:\/\/CRAN.R-project.org\/package=pvclust\">https:\/\/CRAN.R-project.org\/package=pvclust<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><!--end rdoc--><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article describes the R package pvclust, which uses bootstrap resampling techniques to compute p-value for each hierarchical clusters.<\/p>\n","protected":false},"author":1,"featured_media":7839,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","class_list":["post-8067","dt_lessons","type-dt_lessons","status-publish","has-post-thumbnail","hentry"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Computing P-value for Hierarchical Clustering - Datanovia<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Computing P-value for Hierarchical Clustering - Datanovia\" \/>\n<meta property=\"og:description\" content=\"This article describes the R package pvclust, which uses bootstrap resampling techniques to compute p-value for each hierarchical clusters.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/\" \/>\n<meta property=\"og:site_name\" content=\"Datanovia\" \/>\n<meta property=\"article:modified_time\" content=\"2018-10-21T12:07:47+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030149.jpg\" \/>\n\t<meta 
property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/\",\"name\":\"Computing P-value for Hierarchical Clustering - Datanovia\",\"isPartOf\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030149.jpg\",\"datePublished\":\"2018-10-21T10:14:52+00:00\",\"dateModified\":\"2018-10-21T12:07:47+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#primaryimage\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030149.jpg\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030149.jpg\",\"width\":1024,\"height\":512},
{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.datanovia.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Lessons\",\"item\":\"https:\/\/www.datanovia.com\/en\/lessons\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Computing P-value for Hierarchical Clustering\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"name\":\"Datanovia\",\"description\":\"Data Mining and Statistics for Decision Support\",\"publisher\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.datanovia.com\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\",\"name\":\"Datanovia\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"width\":98,\"height\":99,\"caption\":\"Datanovia\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Computing P-value for Hierarchical Clustering - Datanovia","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/","og_locale":"en_US","og_type":"article","og_title":"Computing P-value for Hierarchical Clustering - Datanovia","og_description":"This article describes the R package pvclust, which uses bootstrap resampling techniques to compute p-value for each hierarchical clusters.","og_url":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/","og_site_name":"Datanovia","article_modified_time":"2018-10-21T12:07:47+00:00","og_image":[{"width":1024,"height":512,"url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030149.jpg","type":"image\/jpeg"}],"twitter_card":"summary_large_image","twitter_misc":{"Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/","url":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/","name":"Computing P-value for Hierarchical Clustering - Datanovia","isPartOf":{"@id":"https:\/\/www.datanovia.com\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#primaryimage"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#primaryimage"},"thumbnailUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030149.jpg","datePublished":"2018-10-21T10:14:52+00:00","dateModified":"2018-10-21T12:07:47+00:00","breadcrumb":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#primaryimage","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030149.jpg","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/P1030149.jpg","width":1024,"height":512},{"@type":"BreadcrumbList","@id":"https:\/\/www.datanovia.com\/en\/lessons\/computing-p-value-for-hierarchical-clustering\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.datanovia.com\/en\/"},{"@type":"ListItem","position":2,"name":"Lessons","item":"https:\/\/www.datanovia.com\/en\/lessons\/"},{"@type":"ListItem","position":3,"name":"Computing P-value for Hierarchical 
Clustering"}]},{"@type":"WebSite","@id":"https:\/\/www.datanovia.com\/en\/#website","url":"https:\/\/www.datanovia.com\/en\/","name":"Datanovia","description":"Data Mining and Statistics for Decision Support","publisher":{"@id":"https:\/\/www.datanovia.com\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.datanovia.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.datanovia.com\/en\/#organization","name":"Datanovia","url":"https:\/\/www.datanovia.com\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","width":98,"height":99,"caption":"Datanovia"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/"}}]}},"multi-rating":{"mr_rating_results":[]},"_links":{"self":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons\/8067","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons"}],"about":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/types\/dt_lessons"}],"author":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/comments?post=8067"}],"version-history":[{"count":0,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons\/8067\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/media\/7839"}],"wp:attachment":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/media?pare
nt=8067"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}