Introduction
Web scraping is a powerful technique to extract data from websites for analysis, research, or other data-driven projects. In this tutorial, you’ll learn how to use BeautifulSoup—a popular Python library—to parse HTML and extract meaningful information from web pages. We’ll cover not only basic examples like extracting page titles and links, but also advanced techniques such as extracting tables, handling pagination, and using sessions to maintain state across multiple requests.
Importing Required Packages
To keep our code organized and avoid repetition, we start by importing the necessary packages. This ensures that all subsequent code chunks have access to the required libraries.
#| label: import-packages
# This code imports the essential packages for web scraping and for handling tabular data.
import requests
import pandas as pd
from bs4 import BeautifulSoup
Basic Web Scraping Example
First, let’s demonstrate how to fetch a webpage, parse its HTML, and extract the page title.
#| label: basic-scraping
# Define the URL to scrape
url = "https://www.worldometers.info/world-population/population-by-country/"
# Fetch the webpage
response = requests.get(url)
# Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")
# Extract and print the page title
page_title = soup.title.string
print("Page Title:", page_title)
Results:
Page Title: Population by Country (2025) - Worldometer
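The introduction also mentions extracting links. As a minimal sketch that reuses the soup object from the chunk above, you can collect every anchor's href attribute like this (the preview length is illustrative):
#| label: extract-links
# Collect the destination (href) of every link on the page
links = [a.get("href") for a in soup.find_all("a", href=True)]
print("Number of links found:", len(links))
print(links[:5])  # preview the first five links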
Extracting Tables
Many websites present data in tables. You can use BeautifulSoup to extract table data and convert it into a structured format.
#| label: extract-tables
# Find a table by its tag (or with additional attributes, e.g., class or id)
table = soup.find("table")
# Extract table headers
headers = [header.text.strip() for header in table.find_all("th")]
# Extract table rows
rows = []
for row in table.find_all("tr")[1:]:
    cells = [cell.text.strip() for cell in row.find_all("td")]
    if cells:
        rows.append(cells)
# Create a `DataFrame` using the extracted headers and rows
df = pd.DataFrame(rows, columns=headers)
df.head()
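The comment above notes that you can also match on attributes such as a class or id, which is more robust when a page contains several tables. Here is a hedged sketch; the id value is hypothetical and only for illustration:
#| label: find-table-by-attribute
# "main-table" is a hypothetical id used purely for illustration
specific_table = soup.find("table", attrs={"id": "main-table"})
if specific_table is None:
    # Fall back to the first table on the page
    specific_table = soup.find("table")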
Handling Pagination
For websites that display data over multiple pages, you can automate the process of iterating through pages.
The following script scrapes the Quarto blog listing, extracting each post’s title (from the anchor text) and URL (from the anchor’s href attribute):
#| label: pagination-example
import time
def scrape_page(url):
    response = requests.get(url)
    return BeautifulSoup(response.text, "html.parser")
# The website paginates its blog listing using the URL fragment `#listing-listing-page=`
base_url = "https://quarto.org/docs/blog/#listing-listing-page="
page_number = 1
all_data = []
# Example: scrape first 2 pages
while page_number <= 2:
    url = base_url + str(page_number)
    soup = scrape_page(url)
    # Extract blog posts:
    # Each title and URL is within an `<h3 class='no-anchor listing-title'>`
    posts = []
    for h3 in soup.find_all("h3", class_="no-anchor listing-title"):
        a_tag = h3.find("a")
        if a_tag:
            title = a_tag.text.strip()
            link = a_tag.get("href")
            posts.append({"title": title, "url": link})
    all_data.extend(posts)
    page_number += 1
    # Respectful delay between requests
    time.sleep(1)
# Convert the output to a DataFrame and print the data
df = pd.DataFrame(all_data)
df.head()
How It Works
- Pagination: The script builds the URL for each page by appending the fragment value (e.g., #listing-listing-page=1, #listing-listing-page=2); see the query-parameter sketch after this list for sites that paginate differently.
- Scraping the Page: For each page, the scrape_page function downloads and parses the HTML.
- Extracting Posts: It looks for <h3> tags with the class 'no-anchor listing-title', finds the child <a> tag, and extracts the title (using .text) and the URL (using .get('href')).
- Collecting Data: Each post’s data is stored as a dictionary in the all_data list.
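Keep in mind that a URL fragment (the part after #) is handled client-side rather than sent to the server, so many sites instead paginate with a true query parameter. As a hedged sketch with a hypothetical URL and parameter name, the same loop structure applies:
#| label: query-parameter-pagination
# Hypothetical site that paginates with a ?page= query parameter
base_url = "https://example.com/blog/?page="
titles = []
for page_number in range(1, 3):  # first two pages
    soup = scrape_page(base_url + str(page_number))
    titles.extend(h3.text.strip() for h3 in soup.find_all("h3"))
    time.sleep(1)  # respectful delay between requests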
Using Sessions
When scraping multiple pages from the same website, using a session can improve performance by reusing settings (like headers and cookies) and maintaining state.
#| label: using-sessions
# Create a session object
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
# Use the session to make requests
url = "https://www.worldometers.info/world-population/population-by-country/"
response = session.get(url)
soup = BeautifulSoup(response.text, "html.parser")
print("Session-based Page Title:", soup.title.string)
Best Practices for Web Scraping
- Respect Website Policies: Always check a site’s robots.txt and terms of service before scraping.
- Rate Limiting: Use delays (e.g., time.sleep()) between requests to avoid overwhelming the website.
- Error Handling: Incorporate robust error handling to manage connection issues and unexpected HTML structures (a short sketch follows below).
Ensure that your web scraping activities comply with legal and ethical guidelines.
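As a minimal sketch of the rate-limiting and error-handling points above (the function name, retry count, and timeout are illustrative assumptions, not part of the original tutorial):
#| label: robust-request-sketch
import time
import requests
from bs4 import BeautifulSoup

def fetch_soup(session, url, retries=3, delay=1):
    """Fetch a URL and return parsed HTML, retrying on connection errors."""
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
            return BeautifulSoup(response.text, "html.parser")
        except requests.RequestException as err:
            print(f"Attempt {attempt + 1} failed: {err}")
            time.sleep(delay)  # wait before retrying
    return None
A helper like fetch_soup(session, url) can stand in for the bare requests.get() calls in the earlier chunks, so a single failed request does not stop the whole run.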
Conclusion
Web scraping with BeautifulSoup is an effective way to extract and process data from websites. By combining basic techniques with advanced strategies like table extraction, pagination handling, and session management, you can build robust data collection pipelines for your data science projects. Experiment with these examples, adjust them to fit your specific needs, and always practice ethical scraping.
Further Reading
- Python Automation: Scheduling and Task Automation
- Building REST APIs with FastAPI: A Modern Python Framework
- Unit Testing in Python with pytest: A Comprehensive Guide
Happy coding, and enjoy exploring the power of web scraping with BeautifulSoup!
