Web Scraping with rvest

Extract Web Data Using rvest in R – Expanded Guide

Learn how to extract data from websites using the rvest package in R. This expanded tutorial covers not only the basics of HTML parsing and data extraction but also advanced techniques such as handling pagination, using sessions, and implementing error handling for robust web scraping.

Programming

Author: Alboukadel Kassambara
Published: February 10, 2024
Modified: March 11, 2025

Keywords: rvest tutorial, web scraping in R, HTML parsing R, extract data with rvest, advanced web scraping R

Introduction

Web scraping is an essential technique for data scientists and analysts who need to extract data from websites for analysis or research. In this tutorial, you’ll learn how to use rvest, a popular R package, to parse HTML, extract data, and handle common web scraping challenges. We will start with basic examples and then expand into more advanced topics like pagination, using sessions, and error handling.



Setting Up

Before you begin, ensure that the rvest package is installed and loaded:

#| label: install-rvest
# Install the package (only needed once)
install.packages("rvest")

# Load the package
library(rvest)
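
If you rerun the script often, you can guard the installation step so it only runs when the package is missing. A minimal sketch:

#| label: conditional-install
# Install rvest only if it is not already available
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest")
}
library(rvest)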

Basic Web Scraping Example

Let’s start with a simple example: fetching a webpage and extracting its title.

#| label: basic-scraping
# Define the URL to scrape
url <- "https://www.worldometers.info/world-population/population-by-country/"

# Read the HTML content from the URL
page <- read_html(url)

# Extract the page title using a CSS selector
page_title <- page %>% html_node("title") %>% html_text()
print(paste("Page Title:", page_title))

Output:

[1] "Page Title: Population by Country (2025) - Worldometer"

Extracting Tables

Many websites display data in tables. Use rvest to extract and convert tables into data frames:

#| label: extract-tables
# Find the first table on the page
table_node <- page %>% html_node("table")
# Convert the table to a data frame (the fill argument is deprecated in rvest >= 1.0)
table_data <- table_node %>% html_table()
print(head(table_data))

Output:

[Extracted table: first rows of the country population data]
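
Extracted tables usually need light cleaning before analysis. The sketch below is hypothetical: the column positions and the comma-formatted population values are assumptions about the Worldometer table, so inspect names(table_data) first.

#| label: clean-table
library(dplyr)

# Hypothetical cleanup; check names(table_data) for the real column names
clean_data <- table_data %>%
  rename(country = 2, population = 3) %>%  # rename columns by position (assumed layout)
  mutate(population = as.numeric(gsub(",", "", population)))  # "1,417,173,173" -> numeric
print(head(clean_data))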

Handling Pagination

For websites that split data across multiple pages, you can automate pagination. The following code extracts blog post titles and URLs from a paginated website by iterating through a set of page numbers, then combines the results into a data frame.

#| label: pagination-example
library(rvest)
library(dplyr)
library(purrr)
library(tibble)

# Define a function to scrape a page given its URL
scrape_page <- function(url) {
  tryCatch({
    read_html(url)
  }, error = function(e) {
    message("Error accessing URL: ", url)
    return(NULL)
  })
}

# Note: "#listing-listing-page=" is a URL fragment, not a query parameter;
# fragments are resolved client-side, so this only works if the server
# returns different HTML for each value.
base_url <- "https://quarto.org/docs/blog/#listing-listing-page="
page_numbers <- 1:2  # Example: scrape first 2 pages

# Initialize an empty tibble to store posts
all_posts <- tibble(title = character(), url = character())

# Loop over each page number
for (page_number in page_numbers) {
  url <- paste0(base_url, page_number)
  page <- scrape_page(url)
  
  if (!is.null(page)) {
    # Extract blog posts: each title and URL is within an <h3> tag with class "no-anchor listing-title"
    posts <- page %>%
      html_nodes("h3.no-anchor.listing-title") %>%
      map_df(function(h3) {
        a_tag <- h3 %>% html_node("a")
        # html_node() returns an xml_missing object (not NULL) when there is no match
        if (!inherits(a_tag, "xml_missing")) {
          title <- a_tag %>% html_text(trim = TRUE)
          link <- a_tag %>% html_attr("href")
          tibble(title = title, url = link)
        } else {
          tibble(title = NA_character_, url = NA_character_)
        }
      })
    # Append the posts from this page to the overall list
    all_posts <- bind_rows(all_posts, posts)
  }
  
  # Respectful delay between requests
  Sys.sleep(1)
}

# Display the first few rows of the collected posts
print(head(all_posts))

Output:

[Extracted blog posts: first rows of titles and URLs]
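
For sites that paginate with a real query parameter (e.g., ?page=N) rather than a fragment, the same loop applies directly. A minimal sketch reusing scrape_page(); the URL and selector below are placeholders:

#| label: query-param-pagination
# Hypothetical site paginated with a real query parameter
base_url <- "https://example.com/blog?page="  # placeholder URL
pages <- 1:3

all_titles <- purrr::map(pages, function(i) {
  page <- scrape_page(paste0(base_url, i))
  if (is.null(page)) return(character(0))
  Sys.sleep(1)  # respectful delay between requests
  page %>% html_nodes("h2.post-title") %>% html_text(trim = TRUE)  # selector assumed
}) %>% unlist()

print(head(all_titles))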

Using Sessions

Using sessions can help maintain state (e.g., cookies and headers) across multiple requests, improving efficiency when scraping multiple pages from the same site.

#| label: using-sessions
url <- "https://www.worldometers.info/world-population/population-by-country/"
# Create a session object
session <- session("https://www.worldometers.info")
# Use the session to navigate and scrape
page <- session %>% session_jump_to(url)
page_title <- page %>% read_html() %>% html_node("title") %>% html_text()
print(paste("Session-based Page Title:", page_title))

Error Handling

Integrating error handling ensures that your script can handle unexpected issues gracefully.

#| label: error-handling
# Use tryCatch to handle errors during scraping
safe_scrape <- function(url) {
  tryCatch({
    read_html(url)
  }, error = function(e) {
    message("Error: ", e$message)
    return(NULL)
  })
}

page <- safe_scrape("https://example.com/nonexistent")
if (is.null(page)) {
  print("Failed to retrieve the page. Please check the URL or try again later.")
}

Output:

Error: HTTP error 404.
[1] "Failed to retrieve the page. Please check the URL or try again later."

Best Practices for Web Scraping

  • Respect Website Policies:
    Always check the website’s robots.txt file and terms of service to ensure compliance with their scraping policies.
  • Implement Rate Limiting:
    Use delays (e.g., Sys.sleep()) between requests to avoid overwhelming the server; the polite-based sketch after this list automates both robots.txt checks and crawl delays.
  • Monitor for Changes:
    Websites can change their structure over time. Regularly update your selectors and error handling to accommodate these changes.
  • Document Your Code:
    Comment your scripts and structure them clearly for easy maintenance and reproducibility.
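
One convenient way to follow robots.txt and rate limits automatically is the polite package, which wraps rvest sessions. The sketch below is optional and assumes polite is installed; bow() reads the host's robots.txt and negotiates a crawl delay, and scrape() honors it:

#| label: polite-scraping
# install.packages("polite")  # if not already installed
library(polite)

# bow() checks robots.txt and sets a per-host crawl delay
session <- bow("https://www.worldometers.info",
               user_agent = "rvest-tutorial-bot (contact@example.com)")
page <- scrape(session)  # respects the negotiated delay
page %>% html_node("title") %>% html_text()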

Conclusion

Expanding on the basics, this tutorial covers advanced techniques for web scraping with rvest, including handling pagination, using sessions, and implementing error handling. With these tools and best practices, you can build robust data extraction workflows for your data science projects.

Happy coding, and enjoy exploring the web with rvest!



Citation

BibTeX citation:
@online{kassambara2024,
  author = {Kassambara, Alboukadel},
  title = {Web {Scraping} with Rvest},
  date = {2024-02-10},
  url = {https://www.datanovia.com/learn/programming/r/tools/web-scraping-with-rvest.html},
  langid = {en}
}
For attribution, please cite this work as:
Kassambara, Alboukadel. 2024. “Web Scraping with Rvest.” February 10, 2024. https://www.datanovia.com/learn/programming/r/tools/web-scraping-with-rvest.html.