Data Manipulation in R

Identify and Remove Duplicate Data in R

This tutorial describes how to identify and remove duplicate data in R.

You will learn how to use the following R base and dplyr functions:

  1. R base functions
    • duplicated(): for identifying duplicated elements, and
    • unique(): for extracting unique elements.
  2. distinct() [dplyr package]: for removing duplicate rows in a data frame.

Required packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

Demo dataset

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 144 more rows

Find and drop duplicate elements

The R function duplicated() returns a logical vector in which TRUE marks each element of a vector (or each row of a data frame) that repeats an earlier one. The first occurrence of a value is marked FALSE.

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)
  • To flag which elements of x are duplicates, use this:
duplicated(x)
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
  • Extract duplicate elements:
x[duplicated(x)]
## [1] 1 4
  • If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:
x[!duplicated(x)]
## [1] 1 4 5 6
  • In the same way, you can remove duplicate rows from a data frame based on the values of one column, as follows:
# Remove duplicates based on the Sepal.Width column
my_data[!duplicated(my_data$Sepal.Width), ]
## # A tibble: 23 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 17 more rows

! is a logical negation. !duplicated() means that we don’t want duplicate rows.
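
As a small complement, here is a minimal sketch of two related uses of duplicated() in base R: wrapping it in which() to get integer positions instead of TRUE/FALSE flags, and calling it on a whole data frame so that rows are compared across all columns. The variable names dup_positions and dup_rows are only illustrative.

```r
x <- c(1, 1, 4, 5, 4, 6)

# Integer positions of the elements that repeat an earlier value
dup_positions <- which(duplicated(x))
dup_positions
## [1] 2 5

# On a data frame, duplicated() compares entire rows (all columns)
dup_rows <- which(duplicated(iris))
dup_rows
## [1] 143
```

In iris, row 143 is flagged because it repeats an earlier row in every column.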

Extract unique elements

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

You can extract unique elements as follows:

unique(x)
## [1] 1 4 5 6

It’s also possible to apply unique() to a data frame to remove duplicated rows, as follows:

unique(my_data)
## # A tibble: 149 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 143 more rows
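
For a plain base R data frame, unique() and subsetting with !duplicated() should give the same rows. The sketch below checks this on iris; the names u1 and u2 are only illustrative.

```r
# Two ways to drop fully duplicated rows from a base data frame
u1 <- unique(iris)
u2 <- iris[!duplicated(iris), ]

# Both keep the first occurrence of each distinct row
identical(u1, u2)
nrow(iris)
## [1] 150
nrow(u1)
## [1] 149
```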

Remove duplicate rows in a data frame

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().

Remove duplicate rows based on all columns:

my_data %>% distinct()
## # A tibble: 149 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 143 more rows

Remove duplicate rows based on certain columns (variables):

# Remove duplicated rows based on Sepal.Length
my_data %>% distinct(Sepal.Length, .keep_all = TRUE)
## # A tibble: 35 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 29 more rows
# Remove duplicated rows based on 
# Sepal.Length and Petal.Width
my_data %>% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)
## # A tibble: 110 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 104 more rows

The option .keep_all = TRUE is used to keep all variables in the data. Without it, distinct() returns only the columns named in the call.
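
To illustrate the effect of .keep_all, the following sketch compares the dimensions of the two results on iris. It assumes dplyr is installed; the names only_key and all_cols are only illustrative.

```r
library(dplyr)

# Default: only the column used for de-duplication is returned
only_key <- iris %>% distinct(Sepal.Length)

# .keep_all = TRUE: the first row for each Sepal.Length, with every column
all_cols <- iris %>% distinct(Sepal.Length, .keep_all = TRUE)

dim(only_key)
## [1] 35  1
dim(all_cols)
## [1] 35  5
```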

Summary

In this chapter, we describe key functions for identifying and removing duplicate data:

  • Remove duplicate rows based on one or more column values: my_data %>% dplyr::distinct(Sepal.Length, .keep_all = TRUE)
  • R base function to extract unique elements from vectors and data frames: unique(my_data)
  • R base function to determine duplicate elements: duplicated(my_data)




Comments ( 20 )

  • Abouelela

    you are missing a comma here after the row x[duplicated(x)]. It should be like this x[duplicated(x), ]

    • Kassambara

      x is a vector, so you don’t need to add a comma

    • Sergio Vicente

      Any way, your comment was very useful to me, ’cause I am working with a data frame (in my case). Tks a lot.

  • Gal Inbar

    I am using Data table, and this is also very useful!!
    Thanks !!

    • Kassambara

      Thank you for your positive feedback. Highly appreciated!

  • Phat

    Error: Length of logical index vector for `[` must equal number of columns (or 1):
    * `.data` has 1348 columns
    * Index vector has length 1191

    • Kassambara

      Please clarify your question and provide a reproducible example.

  • Julián

    Thanks, it is a simple and useful tutorial.

    • Kassambara

      Thank you Julián for your feedback!

  • Stonemonroy

    What an excellent tutorial, simple and straightforward, yet to the point. I have used it several times.

    • Kassambara

      Thank you for your positive feedback!

  • Robyn

    Hi, I’m trying to KEEP ONLY duplicate rows based on a column. I first tested for unique rows:
    unique(Jan_19)
    # A tibble: 178,492 x 22

    then the number of duplicates based on my CON column:
    Jan_19[duplicated(Jan_19$CON), ]
    # A tibble: 251 x 22

    then tried to drop the rows where CON was not duplicated:
    Jan_19 %>% !distinct(CON, .keep_all = TRUE)
    Error in distinct(CON, .keep_all = TRUE) : object ‘CON’ not found

    Any advice? Thanks for the code, quite useful.

    • Kassambara

      You can use the following R code:

      library(dplyr)
      Jan_19 %>% distinct(CON, .keep_all = TRUE)
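
Note that distinct() keeps one row per CON value, so it does not by itself return only the duplicated rows. One possible way to keep every row whose CON value occurs more than once is to combine duplicated() with its fromLast argument in base R. The sketch below uses a small made-up data frame (toy, with a hypothetical value column) in place of Jan_19, which is not available here.

```r
# Hypothetical stand-in for Jan_19
toy <- data.frame(CON = c("a", "b", "a", "c", "b", "d"),
                  value = 1:6)

# duplicated() flags repeats after the first occurrence;
# fromLast = TRUE flags repeats before the last occurrence.
# Together they flag every member of a duplicated group.
in_dup_group <- duplicated(toy$CON) | duplicated(toy$CON, fromLast = TRUE)

only_dups <- toy[in_dup_group, ]
only_dups
##   CON value
## 1   a     1
## 2   b     2
## 3   a     3
## 5   b     5
```

Negating the index (toy[!in_dup_group, ]) would instead keep only the rows whose CON value appears exactly once.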
      
  • Andreas Rybicki

    Kassambara,

    the lesson “Identify and Remove Duplicate Data in R” was extremely helpful for my task,

    Question:
    I have two data frames like “iris”, say iris for country A and country B.
    The data frames are quite large, up to 1 million rows and more than 10 columns.
    I’d like to check whether a row in B has the same values as a row in A.
    E.g. in ‘iris’, row 102 == row 143;
    let’s assume row 102 is in iris country_A and row 143 in iris…._B. How could I identify any duplicates across these two data frames?
    I searched on Stack Exchange but didn’t find a helpful solution.
    Thanks
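
One possible approach to this question, sketched under the assumption that both data frames have the same columns and that dplyr is installed, is a semi-join on all columns: it returns the rows of one data frame that have an exact match in the other. The split of iris into country_A and country_B below is hypothetical and only serves to place row 102 and row 143 in different data frames.

```r
library(dplyr)

# Hypothetical split of iris into two data frames
country_A <- iris[1:110, ]    # contains row 102
country_B <- iris[111:150, ]  # contains row 143

# Rows of country_B that have an identical row in country_A,
# matching on every column
shared <- semi_join(country_B, country_A, by = names(country_B))
nrow(shared)
## [1] 1
```

Swapping the two arguments would return the matching rows as they appear in country_A instead.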

  • Zbig

    Now I have a slightly harder task:
    what to do if I want to remove only subsequent, immediate duplicates, but if they are divided by something I want to preserve them.
    Example: you have a data frame with object id, time and the place where it happened:
    df <- data.frame(id=c(1,1,1,2,2,2), time=rep(1:3, 2), place=c(1,2,1,1,1,2))
    and I would like to extract the paths of these objects. For example, object 1 was at place 1, then 2, then back to 1, and I would like to preserve that in the data so that later I can see that it moved from 1 to 2 and then from 2 to 1.
    any ideas?

    • Kassambara

      If you want to keep distinct rows based on multiple columns, you can proceed as follows:

      library(dplyr)
      df <- data.frame(id=c(1,1,1,2,2,2), time=rep(1:3, 2), place=c(1,2,1,1,1,2))
      df %>% distinct(id, time, place, .keep_all = TRUE)
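
Because every (id, time, place) combination in this example is already unique, distinct() returns all six rows here. To drop only immediately repeated places within each object, one option is to compare each row with the previous one using dplyr::lag(). This is a minimal sketch that assumes place contains no missing values; the name paths is only illustrative.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 time = rep(1:3, 2),
                 place = c(1, 2, 1, 1, 1, 2))

# Within each id, ordered by time, keep a row only if it is the first one
# or if its place differs from the place on the previous row
paths <- df %>%
  group_by(id) %>%
  arrange(time, .by_group = TRUE) %>%
  filter(is.na(dplyr::lag(place)) | place != dplyr::lag(place)) %>%
  ungroup()

paths$place
## [1] 1 2 1 1 2
```

Object 1 keeps its full path 1, 2, 1, while the repeated visit of object 2 to place 1 is collapsed into a single row.
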
  • Moses

    You are always on point! A quick one…
    What are the major check points in data management? I know there are duplicates, missing data, ….

  • Hasan Ayouby

    How can I permanently remove the duplicates? Once I used this function, it acted only like a filter, and the original table stayed intact.

    • Kassambara

      To overwrite your original data, assign the result back to the same name:

      my_data = my_data %>% 
        distinct(Sepal.Length, .keep_all = TRUE)


Teacher
Alboukadel Kassambara
Role : Founder of Datanovia