Data Wrangling with dplyr

Introduction

dplyr is one of the core packages in the tidyverse that makes data manipulation in R both fast and intuitive. With its straightforward syntax and powerful verbs, dplyr enables you to filter, select, mutate, group, and summarize your data with minimal code. In this tutorial, you will learn how to transform and summarize datasets using dplyr, with practical examples based on R's built-in mtcars dataset.

Key dplyr Functions

dplyr provides a set of functions—often referred to as ‘verbs’—that form the foundation of data wrangling in R. Here are some of the most important ones:

filter(): Subset rows based on conditions.
select(): Choose columns based on names or patterns.
mutate(): Create new columns or modify existing ones.
group_by(): Group the data for summary operations.
summarize(): Collapse each group (or the whole data frame, if ungrouped) into a single row of summary statistics.
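
To make the "names or patterns" distinction for select() concrete, here is a minimal sketch using mtcars. The object names by_name and by_pattern are illustrative only; starts_with() is one of the selection helpers that dplyr re-exports.

```r
library(dplyr)

# Select columns by listing their names
by_name <- mtcars %>% select(mpg, cyl)

# Select columns by pattern: every column whose name starts with "d"
by_pattern <- mtcars %>% select(starts_with("d"))

names(by_name)     # "mpg" "cyl"
names(by_pattern)  # "disp" "drat"
```

Other helpers such as ends_with() and contains() follow the same idea and can be mixed with plain column names in one select() call.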

Practical Examples

Example 1: Filtering and Selecting Data

Let’s use the built-in mtcars dataset to filter for cars with more than 6 cylinders (in mtcars, these are the 8-cylinder cars) and select only the miles-per-gallon (mpg), number of cylinders (cyl), and horsepower (hp) columns.

library(dplyr)

# Filter the dataset for cars with more than 6 cylinders and select specific columns
filtered_data <- mtcars %>%
  filter(cyl > 6) %>%
  select(mpg, cyl, hp)

print(filtered_data)

                     mpg cyl  hp
Hornet Sportabout   18.7   8 175
Duster 360          14.3   8 245
Merc 450SE          16.4   8 180
Merc 450SL          17.3   8 180
Merc 450SLC         15.2   8 180
Cadillac Fleetwood  10.4   8 205
Lincoln Continental 10.4   8 215
Chrysler Imperial   14.7   8 230
Dodge Challenger    15.5   8 150
AMC Javelin         15.2   8 150
Camaro Z28          13.3   8 245
Pontiac Firebird    19.2   8 175
Ford Pantera L      15.8   8 264
Maserati Bora       15.0   8 335
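
filter() also accepts several conditions at once; conditions separated by commas are combined with a logical AND. The sketch below narrows the result above to cars with more than 200 horsepower. The name powerful_v8 and the 200 hp threshold are illustrative choices, not part of the original example.

```r
library(dplyr)

# Conditions separated by commas must all be TRUE for a row to be kept
powerful_v8 <- mtcars %>%
  filter(cyl > 6, hp > 200) %>%
  select(mpg, cyl, hp)

nrow(powerful_v8)  # 7 of the 14 rows shown above
```

Writing filter(cyl > 6 & hp > 200) gives the same rows; use | inside a single condition when you need a logical OR instead.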

Example 2: Mutating and Summarizing Data

In this example, we’ll add a new column that calculates the power-to-weight ratio (hp divided by wt), group the data by number of cylinders, and then compute the average mpg and average power-to-weight ratio for each group.

library(dplyr)

# Add a power-to-weight column, then summarize average mpg and power-to-weight by cylinder count
summary_data <- mtcars %>%
  mutate(power_to_weight = hp / wt) %>%
  group_by(cyl) %>%
  summarize(
    avg_mpg = mean(mpg),
    avg_power_to_weight = mean(power_to_weight)
  )

print(summary_data)

# A tibble: 3 × 3
    cyl avg_mpg avg_power_to_weight
  <dbl>   <dbl>               <dbl>
1     4    26.7                37.9
2     6    19.7                39.9
3     8    15.1                53.9
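
Group averages are easier to interpret when you also know how many rows each group contains. As a small sketch, n() can be added inside summarize() to count the rows per group; the names group_sizes and n_cars are illustrative.

```r
library(dplyr)

# Count the cars in each cylinder group alongside the average mpg
group_sizes <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    n_cars = n(),
    avg_mpg = mean(mpg)
  )

group_sizes$n_cars  # 11 7 14
```

Here the 6-cylinder average rests on only 7 cars, which is worth keeping in mind when comparing the groups.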

Example 3: Chaining Multiple dplyr Verbs

This example demonstrates how to chain multiple dplyr operations to perform a comprehensive data transformation.

library(dplyr)

# Chain multiple operations: filter, select, and mutate
transformed_data <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl, disp, hp) %>%
  mutate(efficiency = mpg / disp)

print(transformed_data)

                mpg cyl  disp  hp efficiency
Mazda RX4      21.0   6 160.0 110 0.13125000
Mazda RX4 Wag  21.0   6 160.0 110 0.13125000
Datsun 710     22.8   4 108.0  93 0.21111111
Hornet 4 Drive 21.4   6 258.0 110 0.08294574
Merc 240D      24.4   4 146.7  62 0.16632584
Merc 230       22.8   4 140.8  95 0.16193182
Fiat 128       32.4   4  78.7  66 0.41168996
Honda Civic    30.4   4  75.7  52 0.40158520
Toyota Corolla 33.9   4  71.1  65 0.47679325
Toyota Corona  21.5   4 120.1  97 0.17901749
Fiat X1-9      27.3   4  79.0  66 0.34556962
Porsche 914-2  26.0   4 120.3  91 0.21612635
Lotus Europa   30.4   4  95.1 113 0.31966351
Volvo 142E     21.4   4 121.0 109 0.17685950
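
For comparison, the same transformation can be written without the pipe by nesting the function calls. This sketch (with the illustrative names piped and nested) checks that both forms produce the same result; the nested version has to be read from the inside out, which is the main reason the piped form is usually preferred.

```r
library(dplyr)

# Piped form: steps read top to bottom in the order they run
piped <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl, disp, hp) %>%
  mutate(efficiency = mpg / disp)

# Nested form: the same steps, but read from the innermost call outward
nested <- mutate(
  select(filter(mtcars, mpg > 20), mpg, cyl, disp, hp),
  efficiency = mpg / disp
)

all.equal(piped, nested)  # TRUE
```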

Best Practices

Use the Pipe Operator %>%: This operator chains multiple operations together, making your code more readable.
Write Clear, Descriptive Code: Use meaningful variable names and add comments where necessary.
Test Incrementally: Build your data transformations step by step and check intermediate outputs to ensure your code works as expected.
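
One way to test incrementally is to store each stage of a pipeline in its own object and inspect it before adding the next verb. The sketch below rebuilds Example 3 this way; the names step1, step2, and step3 are illustrative, and the checks shown are only one possible choice.

```r
library(dplyr)

# Step 1: filter, then confirm the row count before going further
step1 <- mtcars %>% filter(mpg > 20)
nrow(step1)   # 14

# Step 2: keep only the columns needed, then confirm they are present
step2 <- step1 %>% select(mpg, cyl, disp, hp)
names(step2)  # "mpg" "cyl" "disp" "hp"

# Step 3: add the derived column and inspect the first few rows
step3 <- step2 %>% mutate(efficiency = mpg / disp)
head(step3, 3)
```

Once every stage looks right, the steps can be merged back into a single piped chain.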

Conclusion

dplyr simplifies the process of data wrangling in R, allowing you to transform and summarize datasets with minimal, intuitive code. By mastering the core functions (filter, select, mutate, group_by, and summarize), you can streamline your data preparation workflows and get your data ready for further analysis or visualization.
