Data Manipulation in R

Identify and Remove Duplicate Data in R

This tutorial describes how to identify and remove duplicate data in R.

You will learn how to use the following R base and dplyr functions:

  1. R base functions
    • duplicated(): for identifying duplicated elements and
    • unique(): for extracting unique elements,
  2. distinct() [dplyr package] to remove duplicate rows in a data frame.

Required packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

Demo dataset

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 144 more rows

Find and drop duplicate elements

The R function duplicated() returns a logical vector in which TRUE marks the elements of a vector (or the rows of a data frame) that repeat an earlier element. The first occurrence of each value is marked FALSE.

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)
  • To flag the duplicate elements in x, use this:
duplicated(x)
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
  • Extract duplicate elements:
x[duplicated(x)]
## [1] 1 4
  • If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:
x[!duplicated(x)]
## [1] 1 4 5 6
  • In the same way, you can remove duplicate rows from a data frame based on the values of a column, as follows:
# Remove duplicates based on the Sepal.Width column
my_data[!duplicated(my_data$Sepal.Width), ]
## # A tibble: 23 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 17 more rows

! is the logical negation operator, so !duplicated() selects the rows that are not flagged as duplicates.
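
A few base R variations on the calls above are often handy. The following is a minimal sketch rather than part of the original examples; the data frame df and its columns id, grp and val are made up for illustration.

```r
x <- c(1, 1, 4, 5, 4, 6)

# Integer positions of the elements flagged as duplicates
dup_pos <- which(duplicated(x))
dup_pos
## [1] 2 5

# Scan from the end instead, so the last occurrence is the one kept
duplicated(x, fromLast = TRUE)
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE

# Flag every copy of a repeated value, including the first occurrence
all_copies <- duplicated(x) | duplicated(x, fromLast = TRUE)
x[all_copies]
## [1] 1 1 4 4

# Hypothetical data frame: drop rows that repeat an (id, grp) combination.
# Note the trailing comma: rows are selected and all columns are kept.
df <- data.frame(id = c(1, 1, 2, 2), grp = c("a", "a", "a", "b"), val = 1:4)
df_unique <- df[!duplicated(df[, c("id", "grp")]), ]
df_unique  # rows 1, 3 and 4 remain
```

The comma matters only for data frames: x is a plain vector, so x[!duplicated(x)] needs no comma, whereas a data frame is indexed as [rows, columns].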

Extract unique elements

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

You can extract unique elements as follows:

unique(x)
## [1] 1 4 5 6

It’s also possible to apply unique() to a data frame to remove duplicated rows, as follows:

unique(my_data)
## # A tibble: 149 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 143 more rows
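
For a data frame, unique() keeps the same rows as subsetting with !duplicated(). The short sketch below checks this on the built-in iris data and pulls out the row that gets dropped; the object names dup_rows, a and b are illustrative.

```r
# Rows flagged as exact repeats of an earlier row
dup_rows <- iris[duplicated(iris), ]
nrow(dup_rows)
## [1] 1

# unique() and !duplicated() keep the same rows
a <- unique(iris)
b <- iris[!duplicated(iris), ]
identical(a, b)
## [1] TRUE
```

Inspecting dup_rows before deleting anything is a cheap way to confirm that the flagged rows really are unwanted repeats.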

Remove duplicate rows in a data frame

The function distinct() [dplyr package] keeps only the unique/distinct rows of a data frame. If there are duplicate rows, only the first one is preserved. It is a faster equivalent of the R base function unique() for data frames.

Remove duplicate rows based on all columns:

my_data %>% distinct()
## # A tibble: 149 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 143 more rows

Remove duplicate rows based on certain columns (variables):

# Remove duplicated rows based on Sepal.Length
my_data %>% distinct(Sepal.Length, .keep_all = TRUE)
## # A tibble: 35 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 29 more rows
# Remove duplicated rows based on 
# Sepal.Length and Petal.Width
my_data %>% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)
## # A tibble: 110 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 104 more rows

The option .keep_all = TRUE keeps all variables in the data. Without it, distinct() returns only the columns used to determine uniqueness.
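
To make the effect of .keep_all concrete, here is a small sketch on a made-up data frame (df, id and value are hypothetical names, not part of the iris example). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data: two rows share id 1, and the two id 2 rows are exact copies
df <- data.frame(id = c(1, 1, 2, 2, 3),
                 value = c("a", "b", "c", "c", "d"))

# Without .keep_all: only the id column is returned
ids_only <- df %>% distinct(id)

# With .keep_all = TRUE: the first row of each id, all columns kept
first_per_id <- df %>% distinct(id, .keep_all = TRUE)

# No columns given: only exact duplicate rows are dropped
no_exact_dups <- df %>% distinct()

dim(ids_only)       # 3 rows, 1 column
dim(first_per_id)   # 3 rows, 2 columns
dim(no_exact_dups)  # 4 rows, 2 columns
```

Because only the first row per key is kept, the row order of the input decides which values survive in the other columns.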

Summary

In this chapter, we describe key functions for identifying and removing duplicate data:

  • Remove duplicate rows based on one or more column values: my_data %>% dplyr::distinct(Sepal.Length, .keep_all = TRUE)
  • R base function to extract unique elements from vectors and data frames: unique(my_data)
  • R base function to determine duplicate elements: duplicated(my_data)

Comments ( 11 )

  • Abouelela

    you are missing a comma here after the row x[duplicated(x)]. It should be like this x[duplicated(x), ]

    • x is a vector, so you don’t need to add a comma

    • Sergio Vicente

      Any way, your comment was very useful to me, ’cause I am working with a data frame (in my case). Tks a lot.

  • Gal Inbar

    I am using data.table, and this is also very useful!!
    Thanks !!

    • Thank you for your positive feedback. Highly appreciated!

  • Phat

    Error: Length of logical index vector for `[` must equal number of columns (or 1):
    * `.data` has 1348 columns
    * Index vector has length 1191

    • please, clarify your question and provide reproducible example

  • Julián

    Thanks, it is a simple and useful tutorial.

  • What an excellent tutorial, simple and straightforward, yet to the point. I have used it several times.

Teacher
Alboukadel Kassambara
Role : Founder of Datanovia