{"id":10222,"date":"2019-10-23T23:32:42","date_gmt":"2019-10-23T21:32:42","guid":{"rendered":"https:\/\/www.datanovia.com\/en\/?p=10222"},"modified":"2019-12-25T10:38:52","modified_gmt":"2019-12-25T08:38:52","slug":"extract-text-from-pdf-in-r","status":"publish","type":"post","link":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/","title":{"rendered":"Extract Text from PDF in R"},"content":{"rendered":"<div id=\"rdoc\">\n<p>This article describes how to <strong>extract text from PDF in R<\/strong> using the <strong>pdftools<\/strong> package.<\/p>\n<p>Contents:<\/p>\n<div id=\"TOC\">\n<ul>\n<li><a href=\"#installation\">Installation<\/a><\/li>\n<li><a href=\"#load-the-package\">Load the package<\/a><\/li>\n<li><a href=\"#extract-the-pdf-text-content\">Extract the PDF text content<\/a><\/li>\n<li><a href=\"#render-the-pdf-pages-as-images\">Render the pdf pages as images<\/a><\/li>\n<li><a href=\"#summary\">Summary<\/a><\/li>\n<\/ul>\n<\/div>\n<div id=\"installation\" class=\"section level2\">\n<h2>Installation<\/h2>\n<p>On macOS and Windows, you can install the package directly from CRAN:<\/p>\n<pre class=\"r\"><code>install.packages(\"pdftools\")<\/code><\/pre>\n<p>For Linux\/Unix systems, you may first need to install the <code>poppler<\/code> library on your computer. 
Use the shell command that matches your operating system:<\/p>\n<ol style=\"list-style-type: decimal;\">\n<li>On Debian\/Ubuntu: <code>sudo apt-get install libpoppler-cpp-dev<\/code><\/li>\n<li>On Fedora or CentOS: <code>sudo yum install poppler-cpp-devel<\/code><\/li>\n<li>On macOS: <code>brew install poppler<\/code><\/li>\n<\/ol>\n<\/div>\n<div id=\"load-the-package\" class=\"section level2\">\n<h2>Load the package<\/h2>\n<pre class=\"r\"><code>library(\"pdftools\")<\/code><\/pre>\n<\/div>\n<div id=\"extract-the-pdf-text-content\" class=\"section level2\">\n<h2>Extract the PDF text content<\/h2>\n<pre class=\"r\"><code># Download a demo PDF file\r\npdf.file &lt;- \"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/dn-tutorials\/book-preview\/clustering_en_preview.pdf\"\r\ndownload.file(pdf.file, destfile = \"clustering.pdf\", mode = \"wb\")<\/code><\/pre>\n<pre class=\"r\"><code># Extract the text for all pages (one string per page)\r\npdf.text &lt;- pdf_text(\"clustering.pdf\")\r\n# Display the text of the third page\r\ncat(pdf.text[[3]])<\/code><\/pre>\n<pre><code>## 0.1. PREFACE                                                                            3\r\n## 0.1       Preface\r\n## Large amounts of data are collected every day from satellite images, bio-medical,\r\n## security, marketing, web search, geo-spatial or other automatic equipment. Mining\r\n## knowledge from these big data far exceeds human\u2019s abilities.\r\n## Clustering is one of the important data mining methods for discovering knowledge\r\n## in multidimensional data. The goal of clustering is to identify pattern or groups of\r\n## similar objects within a data set of interest.\r\n## In the litterature, it is referred as \u201cpattern recognition\u201d or \u201cunsupervised machine\r\n## learning\u201d - \u201cunsupervised\u201d because we are not guided by a priori ideas of which\r\n## variables or samples belong in which clusters. 
\u201cLearning\u201d because the machine\r\n## algorithm \u201clearns\u201d how to cluster.\r\n## Cluster analysis is popular in many fields, including:\r\n##    \u2022 In cancer research for classifying patients into subgroups according their gene\r\n##        expression profile. This can be useful for identifying the molecular profile of\r\n##        patients with good or bad prognostic, as well as for understanding the disease.\r\n##    \u2022 In marketing for market segmentation by identifying subgroups of customers with\r\n##        similar profiles and who might be receptive to a particular form of advertising.\r\n##    \u2022 In City-planning for identifying groups of houses according to their type, value\r\n##        and location.\r\n##    This book provides a practical guide to unsupervised machine learning or cluster\r\n##    analysis using R software. Additionally, we developped an R package named factoextra\r\n##    to create, easily, a ggplot2-based elegant plots of cluster analysis results. 
Factoextra\r\n##    official online documentation: http:\/\/www.sthda.com\/english\/rpkgs\/factoextra<\/code><\/pre>\n<\/div>\n<div id=\"render-the-pdf-pages-as-images\" class=\"section level2\">\n<h2>Render the pdf pages as images<\/h2>\n<pre class=\"r\"><code># Render page 3 of the PDF to a bitmap array\r\nbitmap &lt;- pdf_render_page(\"clustering.pdf\", page = 3)\r\n\r\n# Save the bitmap in several image formats (the images\/ directory must already exist)\r\npng::writePNG(bitmap, \"images\/clustering-page-3.png\")\r\njpeg::writeJPEG(bitmap, \"images\/clustering-page.jpeg\")\r\nwebp::write_webp(bitmap, \"images\/clustering-page.webp\")<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/dn-tutorials\/r-tutorial\/images\/clustering-page-3.png\" alt=\"clustering pdf pages\" \/><\/p>\n<\/div>\n<div id=\"summary\" class=\"section level2\">\n<h2>Summary<\/h2>\n<p>This article describes how to extract text from a PDF file and how to render PDF pages as images.<\/p>\n<\/div>\n<\/div>\n<p><!--end rdoc--><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article describes how to extract text from PDF in R using the pdftools package. 
Contents: Installation Load the package Extract the PDF text content Render the pdf pages as [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":9210,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"rating_form_position":"","rating_results_position":"","mr_structured_data_type":"","footnotes":""},"categories":[292],"tags":[],"class_list":["post-10222","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-text-mining"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Extract Text from PDF in R - Datanovia<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Extract Text from PDF in R - Datanovia\" \/>\n<meta property=\"og:description\" content=\"This article describes how to extract text from PDF in R using the pdftools package. 
Contents: Installation Load the package Extract the PDF text content Render the pdf pages as [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/\" \/>\n<meta property=\"og:site_name\" content=\"Datanovia\" \/>\n<meta property=\"article:published_time\" content=\"2019-10-23T21:32:42+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-12-25T08:38:52+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Alboukadel\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Alboukadel\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/\"},\"author\":{\"name\":\"Alboukadel\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/person\/7767cf2bd5c91a1610c6eb53a0ff069e\"},\"headline\":\"Extract Text from PDF in R\",\"datePublished\":\"2019-10-23T21:32:42+00:00\",\"dateModified\":\"2019-12-25T08:38:52+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/\"},\"wordCount\":125,\"commentCount\":1,\"publisher\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg\",\"articleSection\":[\"Text Mining\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/\",\"name\":\"Extract Text from PDF in R - 
Datanovia\",\"isPartOf\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg\",\"datePublished\":\"2019-10-23T21:32:42+00:00\",\"dateModified\":\"2019-12-25T08:38:52+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#primaryimage\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg\",\"width\":1024,\"height\":512,\"caption\":\"SONY DSC\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.datanovia.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Extract Text from PDF in R\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#website\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"name\":\"Datanovia\",\"description\":\"Data Mining and Statistics for Decision 
Support\",\"publisher\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.datanovia.com\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#organization\",\"name\":\"Datanovia\",\"url\":\"https:\/\/www.datanovia.com\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"contentUrl\":\"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png\",\"width\":98,\"height\":99,\"caption\":\"Datanovia\"},\"image\":{\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/person\/7767cf2bd5c91a1610c6eb53a0ff069e\",\"name\":\"Alboukadel\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.datanovia.com\/en\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/ed3108646c5c7c3d188324ab972f96ad7d9975b41b94014d7f68257791be395a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/ed3108646c5c7c3d188324ab972f96ad7d9975b41b94014d7f68257791be395a?s=96&d=mm&r=g\",\"caption\":\"Alboukadel\"},\"url\":\"https:\/\/www.datanovia.com\/en\/blog\/author\/kassambara\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Extract Text from PDF in R - Datanovia","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/","og_locale":"en_US","og_type":"article","og_title":"Extract Text from PDF in R - Datanovia","og_description":"This article describes how to extract text from PDF in R using the pdftools package. Contents: Installation Load the package Extract the PDF text content Render the pdf pages as [&hellip;]","og_url":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/","og_site_name":"Datanovia","article_published_time":"2019-10-23T21:32:42+00:00","article_modified_time":"2019-12-25T08:38:52+00:00","og_image":[{"width":1024,"height":512,"url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg","type":"image\/jpeg"}],"author":"Alboukadel","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Alboukadel","Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#article","isPartOf":{"@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/"},"author":{"name":"Alboukadel","@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/person\/7767cf2bd5c91a1610c6eb53a0ff069e"},"headline":"Extract Text from PDF in R","datePublished":"2019-10-23T21:32:42+00:00","dateModified":"2019-12-25T08:38:52+00:00","mainEntityOfPage":{"@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/"},"wordCount":125,"commentCount":1,"publisher":{"@id":"https:\/\/www.datanovia.com\/en\/#organization"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#primaryimage"},"thumbnailUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg","articleSection":["Text Mining"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/","url":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/","name":"Extract Text from PDF in R - 
Datanovia","isPartOf":{"@id":"https:\/\/www.datanovia.com\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#primaryimage"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#primaryimage"},"thumbnailUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg","datePublished":"2019-10-23T21:32:42+00:00","dateModified":"2019-12-25T08:38:52+00:00","breadcrumb":{"@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#primaryimage","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2019\/05\/Balbuzard.pe_.cheur_.24.jpg","width":1024,"height":512,"caption":"SONY DSC"},{"@type":"BreadcrumbList","@id":"https:\/\/www.datanovia.com\/en\/blog\/extract-text-from-pdf-in-r\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.datanovia.com\/en\/"},{"@type":"ListItem","position":2,"name":"Extract Text from PDF in R"}]},{"@type":"WebSite","@id":"https:\/\/www.datanovia.com\/en\/#website","url":"https:\/\/www.datanovia.com\/en\/","name":"Datanovia","description":"Data Mining and Statistics for Decision 
Support","publisher":{"@id":"https:\/\/www.datanovia.com\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.datanovia.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.datanovia.com\/en\/#organization","name":"Datanovia","url":"https:\/\/www.datanovia.com\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/","url":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","contentUrl":"https:\/\/www.datanovia.com\/en\/wp-content\/uploads\/2018\/09\/datanovia-logo.png","width":98,"height":99,"caption":"Datanovia"},"image":{"@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/person\/7767cf2bd5c91a1610c6eb53a0ff069e","name":"Alboukadel","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.datanovia.com\/en\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/ed3108646c5c7c3d188324ab972f96ad7d9975b41b94014d7f68257791be395a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/ed3108646c5c7c3d188324ab972f96ad7d9975b41b94014d7f68257791be395a?s=96&d=mm&r=g","caption":"Alboukadel"},"url":"https:\/\/www.datanovia.com\/en\/blog\/author\/kassambara\/"}]}},"multi-rating":{"mr_rating_results":[]},"_links":{"self":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/posts\/10222","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"h
ttps:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/comments?post=10222"}],"version-history":[{"count":9,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/posts\/10222\/revisions"}],"predecessor-version":[{"id":10975,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/posts\/10222\/revisions\/10975"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/media\/9210"}],"wp:attachment":[{"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/media?parent=10222"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/categories?post=10222"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.datanovia.com\/en\/wp-json\/wp\/v2\/tags?post=10222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}