{"id":7235,"date":"2018-10-01T21:24:11","date_gmt":"2018-10-01T21:24:11","guid":{"rendered":"https:\/\/www.datanovia.com\/en\/?post_type=dt_lessons&#038;p=7235"},"modified":"2018-10-19T07:20:34","modified_gmt":"2018-10-19T05:20:34","slug":"identify-and-remove-duplicate-data-in-r","status":"publish","type":"dt_lessons","link":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/","title":{"rendered":"Identify and Remove Duplicate Data in R"},"content":{"rendered":"<div id=\"rdoc\">\n<p>This tutorial describes how to <strong>identify<\/strong> and <strong>remove duplicate<\/strong> data in R.<\/p>\n<p>You will learn how to use the following R base and <strong>dplyr<\/strong> functions:<\/p>\n<ol style=\"list-style-type: decimal;\">\n<li><strong>R<\/strong> base functions\n<ul>\n<li><code>duplicated()<\/code>: for identifying duplicated elements and<\/li>\n<li><code>unique()<\/code>: for extracting unique elements,<\/li>\n<\/ul>\n<\/li>\n<li><strong>distinct()<\/strong> [<strong>dplyr<\/strong> package] to remove duplicate rows in a data frame.<\/li>\n<\/ol>\n<p><img decoding=\"async\" src=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/dn-tutorials\/data-manipulation-in-r\/images\/remove-duplicate-data-r.png\" alt=\"Identify and Remove Duplicate Data in R\" \/><\/p>\n<p>Contents:<\/p>\n<div id=\"TOC\">\n<ul>\n<li><a href=\"#required-packages\">Required packages<\/a><\/li>\n<li><a href=\"#demo-dataset\">Demo dataset<\/a><\/li>\n<li><a href=\"#find-and-drop-duplicate-elements\">Find and drop duplicate elements<\/a><\/li>\n<li><a href=\"#extract-unique-elements\">Extract unique elements<\/a><\/li>\n<li><a href=\"#remove-duplicate-rows-in-a-data-frame\">Remove duplicate rows in a data frame<\/a><\/li>\n<li><a href=\"#summary\">Summary<\/a><\/li>\n<\/ul>\n<\/div>\n<div id=\"required-packages\" class=\"section level2\">\n<h2>Required packages<\/h2>\n<p>Load the <code>tidyverse<\/code> packages, which include 
<code>dplyr<\/code>:<\/p>\n<pre class=\"r\"><code>library(tidyverse)<\/code><\/pre>\n<\/div>\n<div id=\"demo-dataset\" class=\"section level2\">\n<h2>Demo dataset<\/h2>\n<p>We\u2019ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.<\/p>\n<pre class=\"r\"><code>my_data &lt;- as_tibble(iris)\r\nmy_data<\/code><\/pre>\n<pre><code>## # A tibble: 150 x 5\r\n##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species\r\n##          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  \r\n## 1          5.1         3.5          1.4         0.2 setosa \r\n## 2          4.9         3            1.4         0.2 setosa \r\n## 3          4.7         3.2          1.3         0.2 setosa \r\n## 4          4.6         3.1          1.5         0.2 setosa \r\n## 5          5           3.6          1.4         0.2 setosa \r\n## 6          5.4         3.9          1.7         0.4 setosa \r\n## # ... with 144 more rows<\/code><\/pre>\n<\/div>\n<div id=\"find-and-drop-duplicate-elements\" class=\"section level2\">\n<h2>Find and drop duplicate elements<\/h2>\n<p>The R function <code>duplicated()<\/code> returns a logical vector where TRUE specifies which elements of a vector or data frame are duplicates.<\/p>\n<p>Given the following vector:<\/p>\n<pre class=\"r\"><code>x &lt;- c(1, 1, 4, 5, 4, 6)<\/code><\/pre>\n<ul>\n<li>To find the position of duplicate elements in x, use this:<\/li>\n<\/ul>\n<pre class=\"r\"><code>duplicated(x)<\/code><\/pre>\n<pre><code>## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE<\/code><\/pre>\n<ul>\n<li>Extract duplicate elements:<\/li>\n<\/ul>\n<pre class=\"r\"><code>x[duplicated(x)]<\/code><\/pre>\n<pre><code>## [1] 1 4<\/code><\/pre>\n<ul>\n<li>If you want to remove duplicated elements, use !duplicated(), where ! 
is a logical negation:<\/li>\n<\/ul>\n<pre class=\"r\"><code>x[!duplicated(x)]<\/code><\/pre>\n<pre><code>## [1] 1 4 5 6<\/code><\/pre>\n<ul>\n<li>In the same way, you can remove duplicate rows from a data frame based on the values of a column, as follows:<\/li>\n<\/ul>\n<pre class=\"r\"><code># Remove duplicates based on the Sepal.Width column\r\nmy_data[!duplicated(my_data$Sepal.Width), ]<\/code><\/pre>\n<pre><code>## # A tibble: 23 x 5\r\n##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species\r\n##          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  \r\n## 1          5.1         3.5          1.4         0.2 setosa \r\n## 2          4.9         3            1.4         0.2 setosa \r\n## 3          4.7         3.2          1.3         0.2 setosa \r\n## 4          4.6         3.1          1.5         0.2 setosa \r\n## 5          5           3.6          1.4         0.2 setosa \r\n## 6          5.4         3.9          1.7         0.4 setosa \r\n## # ... with 17 more rows<\/code><\/pre>\n<div class=\"warning\">\n<p><strong>!<\/strong> is a logical negation. 
<strong>!duplicated<\/strong>() selects the elements or rows that are not flagged as duplicates, so only the first occurrence of each is kept.<\/p>\n<\/div>\n<\/div>\n<div id=\"extract-unique-elements\" class=\"section level2\">\n<h2>Extract unique elements<\/h2>\n<p>Given the following vector:<\/p>\n<pre class=\"r\"><code>x &lt;- c(1, 1, 4, 5, 4, 6)<\/code><\/pre>\n<p>You can extract unique elements as follows:<\/p>\n<pre class=\"r\"><code>unique(x)<\/code><\/pre>\n<pre><code>## [1] 1 4 5 6<\/code><\/pre>\n<p>You can also apply <strong>unique<\/strong>() to a data frame to remove duplicated rows, as follows:<\/p>\n<pre class=\"r\"><code>unique(my_data)<\/code><\/pre>\n<pre><code>## # A tibble: 149 x 5\r\n##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species\r\n##          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  \r\n## 1          5.1         3.5          1.4         0.2 setosa \r\n## 2          4.9         3            1.4         0.2 setosa \r\n## 3          4.7         3.2          1.3         0.2 setosa \r\n## 4          4.6         3.1          1.5         0.2 setosa \r\n## 5          5           3.6          1.4         0.2 setosa \r\n## 6          5.4         3.9          1.7         0.4 setosa \r\n## # ... with 143 more rows<\/code><\/pre>\n<\/div>\n<div id=\"remove-duplicate-rows-in-a-data-frame\" class=\"section level2\">\n<h2>Remove duplicate rows in a data frame<\/h2>\n<p>The function <code>distinct()<\/code> [<em>dplyr<\/em> package] can be used to keep only unique\/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. 
It\u2019s an efficient version of the R base function <code>unique()<\/code>.<\/p>\n<p>Remove duplicate rows based on all columns:<\/p>\n<pre class=\"r\"><code>my_data %&gt;% distinct()<\/code><\/pre>\n<pre><code>## # A tibble: 149 x 5\r\n##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species\r\n##          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  \r\n## 1          5.1         3.5          1.4         0.2 setosa \r\n## 2          4.9         3            1.4         0.2 setosa \r\n## 3          4.7         3.2          1.3         0.2 setosa \r\n## 4          4.6         3.1          1.5         0.2 setosa \r\n## 5          5           3.6          1.4         0.2 setosa \r\n## 6          5.4         3.9          1.7         0.4 setosa \r\n## # ... with 143 more rows<\/code><\/pre>\n<p>Remove duplicate rows based on certain columns (variables):<\/p>\n<pre class=\"r\"><code># Remove duplicated rows based on Sepal.Length\r\nmy_data %&gt;% distinct(Sepal.Length, .keep_all = TRUE)<\/code><\/pre>\n<pre><code>## # A tibble: 35 x 5\r\n##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species\r\n##          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  \r\n## 1          5.1         3.5          1.4         0.2 setosa \r\n## 2          4.9         3            1.4         0.2 setosa \r\n## 3          4.7         3.2          1.3         0.2 setosa \r\n## 4          4.6         3.1          1.5         0.2 setosa \r\n## 5          5           3.6          1.4         0.2 setosa \r\n## 6          5.4         3.9          1.7         0.4 setosa \r\n## # ... 
with 29 more rows<\/code><\/pre>\n<pre class=\"r\"><code># Remove duplicated rows based on \r\n# Sepal.Length and Petal.Width\r\nmy_data %&gt;% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)<\/code><\/pre>\n<pre><code>## # A tibble: 110 x 5\r\n##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species\r\n##          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  \r\n## 1          5.1         3.5          1.4         0.2 setosa \r\n## 2          4.9         3            1.4         0.2 setosa \r\n## 3          4.7         3.2          1.3         0.2 setosa \r\n## 4          4.6         3.1          1.5         0.2 setosa \r\n## 5          5           3.6          1.4         0.2 setosa \r\n## 6          5.4         3.9          1.7         0.4 setosa \r\n## # ... with 104 more rows<\/code><\/pre>\n<div class=\"notice\">\n<p>The option <code>.keep_all = TRUE<\/code> keeps all variables in the data. Without it, only the columns named in <code>distinct()<\/code> are returned.<\/p>\n<\/div>\n<\/div>\n<div id=\"summary\" class=\"section level2\">\n<h2>Summary<\/h2>\n<p>In this chapter, we described key functions for identifying and removing duplicate data:<\/p>\n<ul>\n<li>Remove duplicate rows based on one or more column values: <code>my_data %&gt;% dplyr::distinct(Sepal.Length, .keep_all = TRUE)<\/code><\/li>\n<li>R base function to extract unique elements from vectors and data frames: <code>unique(my_data)<\/code><\/li>\n<li>R base function to determine duplicate elements: <code>duplicated(my_data)<\/code><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<p><!--end rdoc--><\/p>\n","protected":false},"excerpt":{"rendered":"<p>You will learn how to identify and to remove duplicate data using R base and dplyr functions.<\/p>\n","protected":false},"author":1,"featured_media":7730,"parent":0,"menu_order":3,"comment_status":"open","ping_status":"closed","template":"","class_list":["post-7235","dt_lessons","type-dt_lessons","status-publish","has-post-thumbnail","hentry","lesson_complexity-easy"],"yoast_head":"<!-- This site is optimized with the Yoast 
SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Identify and Remove Duplicate Data in R - Datanovia<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Identify and Remove Duplicate Data in R - Datanovia\" \/>\n<meta property=\"og:description\" content=\"You will learn how to identify and to remove duplicate data using R base and dplyr functions.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/\" \/>\n<meta property=\"og:site_name\" content=\"Datanovia\" \/>\n<meta property=\"article:modified_time\" content=\"2018-10-19T05:20:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0300.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/\",\"name\":\"Identify and Remove Duplicate Data in R - Datanovia\",\"isPartOf\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0300.jpg\",\"datePublished\":\"2018-10-01T21:24:11+00:00\",\"dateModified\":\"2018-10-19T05:20:34+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#primaryimage\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0300.jpg\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0300.jpg\",\"width\":1024,\"height\":512},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.datanovia.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Lessons\",\"item\
":\"https:\/\/www.datanovia.com\/en\/lessons\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Identify and Remove Duplicate Data in R\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"name\":\"Datanovia\",\"description\":\"Data Mining and Statistics for Decision Support\",\"publisher\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.datanovia.com\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\",\"name\":\"Datanovia\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"width\":98,\"height\":99,\"caption\":\"Datanovia\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Identify and Remove Duplicate Data in R - Datanovia","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/","og_locale":"en_US","og_type":"article","og_title":"Identify and Remove Duplicate Data in R - Datanovia","og_description":"You will learn how to identify and to remove duplicate data using R base and dplyr functions.","og_url":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/","og_site_name":"Datanovia","article_modified_time":"2018-10-19T05:20:34+00:00","og_image":[{"width":1024,"height":512,"url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0300.jpg","type":"image\/jpeg"}],"twitter_card":"summary_large_image","twitter_misc":{"Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/","url":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/","name":"Identify and Remove Duplicate Data in R - 
Datanovia","isPartOf":{"@id":"https:\/\/www.datanovia.com\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#primaryimage"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#primaryimage"},"thumbnailUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0300.jpg","datePublished":"2018-10-01T21:24:11+00:00","dateModified":"2018-10-19T05:20:34+00:00","breadcrumb":{"@id":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#primaryimage","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0300.jpg","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/10\/IMG_0300.jpg","width":1024,"height":512},{"@type":"BreadcrumbList","@id":"https:\/\/www.datanovia.com\/en\/lessons\/identify-and-remove-duplicate-data-in-r\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.datanovia.com\/en\/"},{"@type":"ListItem","position":2,"name":"Lessons","item":"https:\/\/www.datanovia.com\/en\/lessons\/"},{"@type":"ListItem","position":3,"name":"Identify and Remove Duplicate Data in R"}]},{"@type":"WebSite","@id":"https:\/\/www.datanovia.com\/en\/#website","url":"https:\/\/www.datanovia.com\/en\/","name":"Datanovia","description":"Data Mining and Statistics for Decision 
Support","publisher":{"@id":"https:\/\/www.datanovia.com\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.datanovia.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.datanovia.com\/en\/#organization","name":"Datanovia","url":"https:\/\/www.datanovia.com\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","width":98,"height":99,"caption":"Datanovia"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/"}}]}},"multi-rating":{"mr_rating_results":[]},"_links":{"self":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons\/7235","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons"}],"about":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/types\/dt_lessons"}],"author":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/comments?post=7235"}],"version-history":[{"count":0,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/dt_lessons\/7235\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/media\/7730"}],"wp:attachment":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/media?parent=7235"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}