Extract Text from PDF in R



This article describes how to extract text from a PDF file in R, and how to render PDF pages as images, using the pdftools package.



Installation

On macOS and Windows, you can install the package directly from CRAN:

install.packages("pdftools")

On Linux, you may first need to install the poppler library on your computer. macOS users normally need it only when building the package from source. Use the shell command that matches your operating system:

  1. On Debian/Ubuntu: sudo apt-get install libpoppler-cpp-dev
  2. On Fedora or CentOS: sudo yum install poppler-cpp-devel
  3. On macOS (Homebrew): brew install poppler

Load the package

library("pdftools")

Extract the PDF text content

# Download a demo pdf file
pdf.file <- "https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/book-preview/clustering_en_preview.pdf"
download.file(pdf.file, destfile = "clustering.pdf", mode = "wb")
# Extract the text for all pages
pdf.text <- pdf_text("clustering.pdf")
# Display the third page text
cat(pdf.text[[3]])
## 0.1. PREFACE                                                                            3
## 0.1       Preface
## Large amounts of data are collected every day from satellite images, bio-medical,
## security, marketing, web search, geo-spatial or other automatic equipment. Mining
## knowledge from these big data far exceeds human’s abilities.
## Clustering is one of the important data mining methods for discovering knowledge
## in multidimensional data. The goal of clustering is to identify pattern or groups of
## similar objects within a data set of interest.
## In the litterature, it is referred as “pattern recognition” or “unsupervised machine
## learning” - “unsupervised” because we are not guided by a priori ideas of which
## variables or samples belong in which clusters. “Learning” because the machine
## algorithm “learns” how to cluster.
## Cluster analysis is popular in many fields, including:
##    • In cancer research for classifying patients into subgroups according their gene
##        expression profile. This can be useful for identifying the molecular profile of
##        patients with good or bad prognostic, as well as for understanding the disease.
##    • In marketing for market segmentation by identifying subgroups of customers with
##        similar profiles and who might be receptive to a particular form of advertising.
##    • In City-planning for identifying groups of houses according to their type, value
##        and location.
##    This book provides a practical guide to unsupervised machine learning or cluster
##    analysis using R software. Additionally, we developped an R package named factoextra
##    to create, easily, a ggplot2-based elegant plots of cluster analysis results. Factoextra
##    official online documentation: http://www.sthda.com/english/rpkgs/factoextra
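
pdf_text() returns a character vector with one string per page, and the lines within a page are separated by newline characters. The sketch below shows one way that vector might be split into lines and searched for a keyword. The helper name pages_to_lines() and the two-page sample vector are hypothetical stand-ins so that the code runs without a PDF; in practice you would pass pdf.text instead.

```r
# Hypothetical helper: turn a pdf_text()-style character vector
# (one string per page) into a data frame with one row per line.
pages_to_lines <- function(pages) {
  # Split each page on newlines; strsplit() returns one vector per page
  lines_per_page <- strsplit(pages, "\n", fixed = TRUE)
  data.frame(
    page = rep(seq_along(pages), lengths(lines_per_page)),
    line = sequence(lengths(lines_per_page)),
    text = trimws(unlist(lines_per_page)),
    stringsAsFactors = FALSE
  )
}

# Made-up sample standing in for the output of pdf_text()
sample.pages <- c(
  "0.1 Preface\nClustering is a data mining method.",
  "1 Introduction\nCluster analysis is popular in many fields."
)

page.lines <- pages_to_lines(sample.pages)

# Keep only the lines that mention "cluster", ignoring case
hits <- page.lines[grepl("cluster", page.lines$text, ignore.case = TRUE), ]
print(hits)
```

Keeping the page and line numbers next to the text makes it easy to trace a match back to its position in the document.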

Render the PDF pages as images

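# Render page 3 of the PDF to a bitmap array
bitmap <- pdf_render_page("clustering.pdf", page = 3)

# Create the output directory if it does not exist yet
dir.create("images", showWarnings = FALSE)

# Save the bitmap image in PNG, JPEG and WebP formats
# (requires the png, jpeg and webp packages)
png::writePNG(bitmap, "images/clustering-page-3.png")
jpeg::writeJPEG(bitmap, "images/clustering-page.jpeg")
webp::write_webp(bitmap, "images/clustering-page.webp")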

[Figure: rendered pages of the clustering PDF]
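
Two variations on the rendering step may be useful: raising the resolution with the dpi argument of pdf_render_page(), and writing every page to an image file in one call with pdf_convert(). The sketch below is only an illustration. It assumes pdftools is installed, and it builds a small two-page demo PDF with base graphics (the file names under tempdir() are made up) so that it does not depend on the downloaded file.

```r
library("pdftools")

# Build a small two-page demo PDF (a stand-in for clustering.pdf)
demo.pdf <- file.path(tempdir(), "demo.pdf")
pdf(demo.pdf)
plot(1:10, main = "Page 1")
plot(10:1, main = "Page 2")
invisible(dev.off())

# Render page 1 at 150 dpi instead of the default 72 dpi
bitmap <- pdf_render_page(demo.pdf, page = 1, dpi = 150)

# Convert every page to a PNG file in one call;
# pdf_convert() returns the names of the files it wrote
out.files <- pdf_convert(
  demo.pdf,
  format = "png",
  dpi = 150,
  filenames = file.path(tempdir(), sprintf("demo-page-%d.png", 1:2))
)
print(out.files)
```

A higher dpi gives sharper images at the cost of larger files, so the value is worth tuning to how the images will be used.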

Summary

This article describes how to extract text from a PDF file and how to render the PDF pages as images.







Comment ( 1 )

  • Madison

    Is it possible to extract text from just a specific area on the pdf? For example, just pulling the text from the blue area of the page in the example. I’m working on a project where I only want to extract the text from a specific colored area on the page.
