In this quick post, we will show you how to read the text content and metadata of a PDF file in a Node.js application.
PDF stands for Portable Document Format, a universal file format developed by Adobe.
It provides a simple, reliable way to present and exchange documents, regardless of software, hardware, or operating system.
To read the text from a PDF file, we will use the pdf-parse package.
pdf-parse is a JavaScript module that works cross-platform and helps you extract text from PDF files.
How to Read PDF Files Content in Node Js App
- Step 1: Create Project Folder
- Step 2: Make Package.json File
- Step 3: Create App File
- Step 4: Install PDF Parse Module
- Step 5: Read PDF Data in Node
- Step 6: Test Application
Create Project Folder
Open the terminal, type the following command, and press Enter to create the folder for the Node app.
mkdir node-sco
Next, move into the project folder.
cd node-sco
Make Package.json File
Next, create the package.json file by running the following npm command from the terminal.
npm init
Create App File
Now create a server.js file. This file holds the logic for reading content from a PDF file in Node.
Register server.js in the scripts section of package.json so that npm start runs it.
...
...
"scripts": {
"start": "node server.js"
},
...
...
Install PDF Parse Module
In the console, run the following command to install the pdf-parse package.
npm install pdf-parse
Read PDF Data in Node
In this post, we will be using the following PDF file that you can download from the given URL.
If you use your own PDF file, place it at the root of your project folder and name it sample.pdf, or update the path in the code.
Then open server.js and paste the following code into it.
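const fs = require('fs')
// pdf-parse 1.x exports a single function that takes a Buffer and returns a promise.
const pdfParse = require('pdf-parse')

// Read a PDF from disk and print its text, page count, and metadata.
const extractPDF = async (file) => {
  // Load the whole file into a Buffer before handing it to pdf-parse.
  const dataBuffer = fs.readFileSync(file)
  const data = await pdfParse(dataBuffer)
  console.log('Content: ', data.text)
  console.log('PDF pages: ', data.numpages)
  console.log('File content: ', data.info)
}

const pdfRead = './sample.pdf'

// Report a missing file or a parse failure instead of leaving the rejection unhandled.
extractPDF(pdfRead).catch((err) => {
  console.error('Could not read the PDF:', err.message)
  process.exitCode = 1
})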
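If you swap in your own file, it can help to confirm that it really is a PDF before parsing it. The following is a minimal sketch, not part of pdf-parse: isPdfFile is a hypothetical helper name, and the check assumes a well-formed PDF whose first five bytes are the "%PDF-" marker.

```javascript
const fs = require('fs')

// Hypothetical helper (not part of pdf-parse): returns true when the file
// begins with the "%PDF-" marker that a well-formed PDF starts with.
const isPdfFile = (filePath) => {
  const header = Buffer.alloc(5)
  const fd = fs.openSync(filePath, 'r')
  try {
    // Read only the first five bytes instead of loading the whole file.
    const bytesRead = fs.readSync(fd, header, 0, 5, 0)
    return bytesRead === 5 && header.toString('latin1') === '%PDF-'
  } finally {
    // Always release the file descriptor, even if the read throws.
    fs.closeSync(fd)
  }
}
```

In server.js you could call isPdfFile(pdfRead) first and only run extractPDF(pdfRead) when it returns true. Files that carry extra bytes before the marker would be rejected by this simple check.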
Test Application
Now test the app and see how the code reads data from the PDF file.
Open the console and run the following command.
node server.js
If all goes well, the terminal prints the text, page count, and metadata of the PDF file.
Content:
A Simple PDF File
This is a small demonstration .pdf file -
just for use in the Virtual Mechanics tutorials. More text. And more
text. And more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. And more text. Boring, zzzzz. And more text. And more text. And
more text. And more text. And more text. And more text. And more text.
And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. And more text. And more text. Even more. Continued on page 2 ...
Simple PDF File 2
...continued from page 1. Yet more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. Oh, how boring typing this stuff. But not as boring as watching
paint dry. And more text. And more text. And more text. And more text.
Boring. More, a little more text. The end, and just as well.
PDF pages: 2
File content: {
PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Rave (http://www.nevrona.com/rave)',
Producer: 'Nevrona Designs',
CreationDate: 'D:20060301072826'
}
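The CreationDate value in the output above uses the PDF date format, a D: prefix followed by year, month, day, hour, minute, and second. If you need it as a JavaScript Date, a small converter along these lines may help. This is only a sketch: parsePdfDate is a hypothetical helper, not part of pdf-parse, and it assumes the value can be treated as UTC, ignoring any time zone suffix.

```javascript
// Hypothetical helper (not part of pdf-parse): converts a PDF date string
// such as 'D:20060301072826' into a JavaScript Date.
const parsePdfDate = (value) => {
  // Only the year is required; the remaining fields are optional.
  const match = /^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?/.exec(value)
  if (!match) return null
  // Missing fields fall back to the first month/day and to midnight.
  const [, year, month = '01', day = '01', hour = '00', minute = '00', second = '00'] = match
  // Assumption: any time zone suffix is ignored and the value is read as UTC.
  return new Date(Date.UTC(+year, +month - 1, +day, +hour, +minute, +second))
}

console.log(parsePdfDate('D:20060301072826').toISOString())
// prints 2006-03-01T07:28:26.000Z
```

The helper returns null for strings that do not start with D: and a four-digit year, so check the result before using it.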
Summary
That's it. In this guide, we explained how to extract text content from PDF files in a Node.js application using the third-party pdf-parse library.
We hope you now have a better understanding of working with PDF files in Node.js.