Node.js Read or Extract Content from PDF File Tutorial

In this quick post, you will learn how to read the text content and metadata of a PDF file in a Node.js application.

PDF stands for Portable Document Format, a widely used file format developed by Adobe.

It provides a simple, reliable way to present and exchange documents, regardless of the software, hardware, or operating system.

To read the text from a PDF file, we will use the pdf-parse package from npm.

pdf-parse is a JavaScript module that works cross-platform and extracts text from PDF files.

How to Read PDF File Content in a Node.js App

  • Step 1: Create Project Folder
  • Step 2: Make Package.json File
  • Step 3: Create App File
  • Step 4: Install PDF Parse Module
  • Step 5: Read PDF Data in Node
  • Step 6: Test Application

Create Project Folder

Open a terminal and run the following command to create a folder for the Node app.

mkdir node-sco

Next, move into the root of your project.

cd node-sco

Make Package.json File

Next, create the package.json file for the project by running the following npm command from the terminal.

npm init

Create App File

Now create a server.js file; it will hold the logic for reading content from a PDF file in Node.

Then register server.js in the scripts section of package.json.


...
...
"scripts": {
   "start": "node server.js"
},
...
...

Install PDF Parse Module

In the console, run the following command to install the pdf-parse package.

npm install pdf-parse
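To confirm that the install worked before writing any parsing code, a small check like the one below can help. It is only a sketch: isInstalled is a hypothetical helper name, not part of pdf-parse, and it relies solely on Node's built-in require.resolve.

```javascript
// Returns true when the module `name` can be resolved from the current project.
// `isInstalled` is an illustrative helper, not a pdf-parse function.
function isInstalled(name) {
  try {
    require.resolve(name)
    return true
  } catch (err) {
    // require.resolve throws when the module cannot be found
    return false
  }
}

console.log('pdf-parse installed:', isInstalled('pdf-parse'))
```

If this prints false, run the install command again from the project root, where package.json lives.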

Read PDF Data in Node

In this post, we will be using the following PDF file that you can download from the given URL.

If you have your own PDF file, save it at the root of your project folder as sample.pdf, which is the name the code in this step expects.

Download PDF File
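Before parsing, it can be worth checking that the file really is where the script expects it and that it looks like a PDF. The sketch below assumes the file is named sample.pdf and uses only Node's fs module; looksLikePdf and checkPdfFile are hypothetical helper names introduced for illustration. It relies on the fact that a PDF file normally begins with the marker %PDF-.

```javascript
const fs = require('fs')

// A PDF file normally starts with the ASCII marker "%PDF-" followed by a version.
function looksLikePdf(buffer) {
  return buffer.length >= 5 && buffer.subarray(0, 5).toString('latin1') === '%PDF-'
}

// Illustrative helper: returns 'missing', 'not-a-pdf', or 'ok' for the given path.
function checkPdfFile(path) {
  if (!fs.existsSync(path)) return 'missing'
  return looksLikePdf(fs.readFileSync(path)) ? 'ok' : 'not-a-pdf'
}

console.log('sample.pdf:', checkPdfFile('./sample.pdf'))
```

This is a quick sanity check only; a file can pass it and still be damaged further in.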

Then open the server.js file and paste the following code into it.

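const fs = require('fs')
const pdfParse = require('pdf-parse')

// Read a PDF from disk and print its text, page count, and document info
const extractPDF = async (file) => {
  // Load the whole file into a Buffer, which is what pdfParse receives
  const dataBuffer = fs.readFileSync(file)
  const parsed = await pdfParse(dataBuffer)

  console.log('Content: ', parsed.text)
  console.log('PDF pages: ', parsed.numpages)
  console.log('File content: ', parsed.info)
}

const pdfRead = './sample.pdf'
extractPDF(pdfRead).catch((err) => {
  // Report read or parse failures instead of leaving the rejection unhandled
  console.error('Failed to read PDF:', err.message)
  process.exitCode = 1
})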

Test Application

Now let's test the app and see how the code reads data from the PDF file.

Open the console and run the following command.

node server.js

If all goes well, the terminal will show the text, page count, and document info of the PDF file.

Content:  

 A Simple PDF File 
 This is a small demonstration .pdf file - 
 just for use in the Virtual Mechanics tutorials. More text. And more 
 text. And more text. And more text. And more text. 
 And more text. And more text. And more text. And more text. And more 
 text. And more text. Boring, zzzzz. And more text. And more text. And 
 more text. And more text. And more text. And more text. And more text. 
 And more text. And more text. 
 And more text. And more text. And more text. And more text. And more 
 text. And more text. And more text. Even more. Continued on page 2 ...

 Simple PDF File 2 
 ...continued from page 1. Yet more text. And more text. And more text. 
 And more text. And more text. And more text. And more text. And more 
 text. Oh, how boring typing this stuff. But not as boring as watching 
 paint dry. And more text. And more text. And more text. And more text. 
 Boring.  More, a little more text. The end, and just as well. 
PDF pages:  2
File content:  {
  PDFFormatVersion: '1.3',
  IsAcroFormPresent: false,
  IsXFAPresent: false,
  Creator: 'Rave (http://www.nevrona.com/rave)',
  Producer: 'Nevrona Designs',
  CreationDate: 'D:20060301072826'
}
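The CreationDate value in the output above uses the PDF date format, which starts with D: followed by year, month, day, hour, minute, and second digits. If you need it as a JavaScript Date, a small conversion along these lines may work. This is a sketch under stated assumptions: parsePdfDate is a hypothetical helper, it ignores any time-zone suffix, and it treats the time as UTC, which the PDF itself does not guarantee.

```javascript
// Convert a PDF date string such as "D:20060301072826" into a JavaScript Date.
// Assumption: any time-zone suffix is ignored and the time is treated as UTC.
function parsePdfDate(value) {
  const match = /^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?/.exec(value)
  if (!match) return null
  // Fields after the year are optional, so fall back to the earliest value
  const [, year, month = '01', day = '01', hour = '00', minute = '00', second = '00'] = match
  return new Date(Date.UTC(+year, +month - 1, +day, +hour, +minute, +second))
}

console.log(parsePdfDate('D:20060301072826').toISOString())
// prints 2006-03-01T07:28:26.000Z
```

The function returns null for strings that do not start with D: and a four-digit year, so check the result before using it.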

Summary

In this guide, we explained how to extract text content from PDF files in a Node.js application using the third-party pdf-parse library.

You should now have a better understanding of how to work with PDF files in a Node.js application.