I had some trouble parsing PDFs in a Next.js app. After a few hours of trying different libraries, I settled on a simple solution that needs neither multer nor formidable. I created a GitHub template you can use to get up and running quickly.
You can find the template here: nextjs-pdf-parser.
This blog post is intended to walk you through how the template works, so you can implement this functionality in your own project. We will parse the text of a PDF. We will not implement any OCR to extract the text from images, as that is beyond the scope of this post.
Libraries
FilePond: For the file uploading process, I've found FilePond to be an elegant, user-friendly solution that takes away the complexity of file uploads.
pdf2json: Once we've got our PDF, we turn to pdf2json for parsing. This library offers a multitude of features, making it useful for various applications beyond just parsing the text.
Prerequisites and Important Considerations
As of writing this post, there are some crucial configurations and technical details to be aware of:
1. Configuration in next.config.js
To make the pdf2json library compatible with Next.js, a specific configuration is required:
const nextConfig = {
  experimental: {
    serverComponentsExternalPackages: ['pdf2json'],
  },
};
module.exports = nextConfig;
Without this configuration, you may encounter the error:
Error [ReferenceError]: nodeUtil is not defined
See here for more info.
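If you are on a newer Next.js release, note that (as far as I can tell) Next.js 15 moved this option out of experimental and renamed it serverExternalPackages. A minimal sketch of the equivalent setting, assuming Next.js 15 or later and a TypeScript config file named next.config.ts:

```typescript
// next.config.ts (assumption: Next.js 15+, where the option is no longer experimental)
const nextConfig = {
  // Opt pdf2json out of server bundling so Node.js loads it at runtime.
  serverExternalPackages: ['pdf2json'],
};

export default nextConfig;
```

On older versions, keep the experimental key shown above.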
2. Issues with pdfParser.getRawTextContent()
While working with the pdf2json library, I came across an issue: calling pdfParser.getRawTextContent() in route.ts on a parsed PDF returned a blank string. As far as I can tell, the parser only collects raw text when its constructor receives 1 as the second argument, and the TypeScript types shipped with the library (as of writing this post) do not allow any constructor arguments.
If you face this issue, here are two potential solutions:
Fix the TypeScript Definition:
Modify the constructor in the type definition for PDFParser as follows:
declare class Pdfparser extends EventEmitter {
  constructor(context: any, value: number);
  parseBuffer(buffer: Buffer): void;
  loadPDF(pdfFilePath: string, verbosity?: number): Promise<void>;
  createParserStream(): ParserStream;
  on<K extends keyof EventMap>(eventName: K, listener: EventMap[K]): this;
}
You can open the file that contains this type by right-clicking on PDFParser and selecting 'Go to Type Definition' in VS Code. It's worth noting that these types weren't written by the library's maintainer but contributed by another developer through a PR, so some inaccuracies are to be expected.
Bypass Type Checking:
When declaring the PDFParser (which we will do later in api/upload/route.ts), you can bypass type checking:
const pdfParser = new (PDFParser as any)(null, 1);
The Code
1. File Upload Component
In our template, the file upload functionality is managed by the FileUpload component:
import { FilePond } from 'react-filepond';
import 'filepond/dist/filepond.min.css';
export default function FileUpload() {
  return (
    <FilePond
      server={{
        process: '/api/upload',
        fetch: null,
        revert: null,
      }}
    />
  );
}
Here, we're using the FilePond component from react-filepond. The server prop defines where the uploaded file will be processed, in this case the /api/upload endpoint (we implement this in the route.ts file).
2. Sending a POST Request
When a user uploads a file using the FileUpload component, the component sends a POST request to the server (specifically to the /api/upload route). This is all managed under the hood by FilePond. See the FilePond docs for more info.
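FilePond's server.process option can also be an object rather than a URL string, which gives you a hook for reading the server's response. The sketch below is an assumption about how you might capture the parsed text on the client, not part of the template: latestParsedText and handleParsedText are hypothetical names, and serverConfig would be passed to the server prop in place of the object shown earlier.

```typescript
// Hypothetical holder for the text returned by /api/upload.
let latestParsedText = '';

function handleParsedText(text: string): void {
  latestParsedText = text;
}

// Same endpoint as before, written as a process object so we can hook onload.
const serverConfig = {
  process: {
    url: '/api/upload',
    method: 'POST' as const,
    // FilePond calls onload with the response body. Returning the body
    // unchanged keeps the default behavior of using it as the server id.
    onload: (response: string): string => {
      handleParsedText(response);
      return response;
    },
  },
  fetch: null,
  revert: null,
};
```

In a React component you would more likely store the text with a state setter than with a module-level variable.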
3. Processing the Upload:
We process the uploaded PDF and extract its content in /api/upload/route.ts.
import { NextRequest, NextResponse } from 'next/server'; // To handle the request and response
import { promises as fs } from 'fs'; // To save the file temporarily
import { v4 as uuidv4 } from 'uuid'; // To generate a unique filename
import PDFParser from 'pdf2json'; // To parse the pdf
export async function POST(req: NextRequest) {
  const formData: FormData = await req.formData();
  const uploadedFiles = formData.getAll('filepond');
  let fileName = '';
  let parsedText = '';
  if (uploadedFiles && uploadedFiles.length > 0) {
    const uploadedFile = uploadedFiles[1];
    console.log('Uploaded file:', uploadedFile);
    // Check if uploadedFile is of type File
    if (uploadedFile instanceof File) {
      // Generate a unique filename
      fileName = uuidv4();
      // Convert the uploaded file into a temporary file
      const tempFilePath = `/tmp/${fileName}.pdf`;
      // Convert ArrayBuffer to Buffer
      const fileBuffer = Buffer.from(await uploadedFile.arrayBuffer());
      // Save the buffer as a file
      await fs.writeFile(tempFilePath, fileBuffer);
      // Parse the pdf using pdf2json. See pdf2json docs for more info.
      // The reason I am bypassing type checks is because
      // the default type definitions for pdf2json in the npm install
      // do not allow for any constructor arguments.
      // You can either modify the type definitions or bypass the type checks.
      // I chose to bypass the type checks.
      const pdfParser = new (PDFParser as any)(null, 1);
      // See pdf2json docs for more info on how the below works.
      pdfParser.on('pdfParser_dataError', (errData: any) =>
        console.log(errData.parserError)
      );
      pdfParser.on('pdfParser_dataReady', () => {
        console.log((pdfParser as any).getRawTextContent());
        parsedText = (pdfParser as any).getRawTextContent();
      });
      pdfParser.loadPDF(tempFilePath);
    } else {
      console.log('Uploaded file is not in the expected format.');
    }
  } else {
    console.log('No files found.');
  }
  const response = new NextResponse(parsedText);
  response.headers.set('FileName', fileName);
  return response;
}
How does it work?
Now, let us go through each part of the above code step by step to understand exactly what's going on.
Importing Dependencies
import { NextRequest, NextResponse } from 'next/server'; // To handle the request and response
import { promises as fs } from 'fs'; // To save the file temporarily
import { v4 as uuidv4 } from 'uuid'; // To generate a unique filename
import PDFParser from 'pdf2json'; // To parse the pdf
We're importing essential modules for our function:
NextRequest and NextResponse help handle incoming requests and craft responses.
The fs module allows us to work with the filesystem, which we'll use to temporarily save the uploaded PDF.
uuidv4 generates unique identifiers, which will help in naming our PDF files uniquely.
PDFParser from pdf2json allows us to parse the content of PDFs. This library also has a lot more functionality, which you can take advantage of as you see fit.
Handling the POST Request
export async function POST(req: NextRequest) {
This line begins our asynchronous function, which handles POST requests. This is how we implement API routes in Next.js.
Extracting Uploaded Files
const formData: FormData = await req.formData();
const uploadedFiles = formData.getAll('filepond');
FilePond uploads the file example.pdf as multipart/form-data using a POST request. FormData uses the same format a form would use if the encoding type were set to "multipart/form-data". Therefore, we are able to extract the FormData from our request.
Why do we have .getAll('filepond')?
If we log the formData like so:
console.log('Form data:', formData);
We will see the following:
Form data: FormData {
  [Symbol(state)]: [
    { name: 'filepond', value: '{}' },
    { name: 'filepond', value: [File] }
  ]
}
We want to get all the entries with the name 'filepond'.
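To see why getAll is needed rather than get, here is a small sketch that rebuilds a FormData with the same two entries by hand. It assumes a runtime with the standard FormData and Blob globals (Node.js 18+ or a browser); demoForm and the placeholder file contents are made up for illustration.

```typescript
// Recreate the two entries FilePond sends under the same field name.
const demoForm = new FormData();
demoForm.append('filepond', '{}'); // file metadata
demoForm.append(
  'filepond',
  new Blob(['placeholder'], { type: 'application/pdf' }),
  'example.pdf'
);

// get() returns only the first entry, which is the metadata string.
const firstEntry = demoForm.get('filepond');

// getAll() returns every entry with that name, including the file.
const allEntries = demoForm.getAll('filepond');

console.log(firstEntry); // '{}'
console.log(allEntries.length); // 2
```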
Processing the Uploaded File
let fileName = '';
let parsedText = '';
if (uploadedFiles && uploadedFiles.length > 0) {
  const uploadedFile = uploadedFiles[1];
  console.log('Uploaded file:', uploadedFile);
We initialize two variables: fileName and parsedText.
We then check if there are uploaded files. If so, we proceed to process the uploaded file.
Why do we have uploadedFiles[1]?
If we log uploadedFiles like so:
console.log('Uploaded files:', uploadedFiles);
We will see the following:
Uploaded files: [ '{}', File { size: 152864, type: 'application/pdf', name: 'example.pdf', lastModified: 1691154708425 } ]
The first element in the array is the (empty) file metadata, and we need the File object in the second position, which is why we index the array with uploadedFiles[1]. There are two elements in the uploadedFiles array because that is what FilePond sends over. From the FilePond docs: "Along with the file object, FilePond also sends the file metadata to the server, both these objects are given the same name."
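Indexing with [1] depends on the metadata arriving before the file. If you would rather not depend on the position, one option is to pick the first entry that is not a string. This is a sketch, not part of the template: pickUploadedFile is a hypothetical helper name, and the example assumes the standard FormData and Blob globals (Node.js 18+).

```typescript
// Return the first File stored under the given field, skipping string entries
// such as FilePond's metadata.
function pickUploadedFile(formData: FormData, field = 'filepond'): File | undefined {
  for (const entry of formData.getAll(field)) {
    if (typeof entry !== 'string') {
      return entry;
    }
  }
  return undefined;
}

// Usage with a hand-built FormData that mirrors what FilePond sends.
const sampleForm = new FormData();
sampleForm.append('filepond', '{}');
sampleForm.append(
  'filepond',
  new Blob(['placeholder'], { type: 'application/pdf' }),
  'example.pdf'
);

const pickedFile = pickUploadedFile(sampleForm);
console.log(pickedFile?.name); // 'example.pdf'
```

In the route, this would stand in for uploadedFiles[1] and the instanceof File check.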
Parsing the PDF File
if (uploadedFile instanceof File) {
  fileName = uuidv4();
  // Convert the uploaded file into a temporary file
  const tempFilePath = `/tmp/${fileName}.pdf`;
  // Convert ArrayBuffer to Buffer
  const fileBuffer = Buffer.from(await uploadedFile.arrayBuffer());
  // Save the buffer as a file
  await fs.writeFile(tempFilePath, fileBuffer);
We confirm that our uploaded file is an instance of File.
We generate a unique filename using uuidv4.
We build a temporary file path from that filename.
We then convert the uploaded file's contents (read as an ArrayBuffer) into a Node.js Buffer.
We save this buffer as a temporary file.
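The hard-coded /tmp path assumes a Unix-like system, and the file is never deleted. As a hedged alternative, the sketch below builds the path from os.tmpdir() and removes the file once it has been used. It swaps uuid for Node's built-in crypto.randomUUID, and tempPdfPath and withTempPdf are hypothetical helper names, not part of the template.

```typescript
import { promises as fs } from 'fs';
import { randomUUID } from 'crypto';
import * as os from 'os';
import * as path from 'path';

// Build a unique path inside the platform's temp directory.
function tempPdfPath(): string {
  return path.join(os.tmpdir(), `${randomUUID()}.pdf`);
}

// Write the buffer to a temp file, run the callback, then always clean up.
async function withTempPdf<T>(
  fileBuffer: Buffer,
  use: (filePath: string) => Promise<T>
): Promise<T> {
  const filePath = tempPdfPath();
  await fs.writeFile(filePath, fileBuffer);
  try {
    return await use(filePath);
  } finally {
    // force: true avoids an error if the file is already gone.
    await fs.rm(filePath, { force: true });
  }
}
```

The callback is where the parsing step would go, so the temporary file only lives for as long as the parser needs it.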
Using pdf2json to Extract Text
const pdfParser = new (PDFParser as any)(null, 1);
pdfParser.on('pdfParser_dataError', (errData: any) =>
  console.log(errData.parserError)
);
pdfParser.on('pdfParser_dataReady', () => {
  console.log((pdfParser as any).getRawTextContent());
  parsedText = (pdfParser as any).getRawTextContent();
});
pdfParser.loadPDF(tempFilePath);
We initialize our PDF parser. Due to type definition constraints, we bypass type checks (see the prerequisites section above).
We set up two event listeners: one for errors and one for when the PDF data is ready.
Finally, we load our PDF into the parser.
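One thing to keep in mind: the two listeners fire asynchronously, after loadPDF has been called, so a handler that returns right away may build its response before parsedText is set. A common pattern is to wrap the events in a Promise and await it. The sketch below is an assumption, not the template's code: RawTextParser is a minimal hypothetical interface covering only the three methods used above, and parsePdfText is a hypothetical helper name.

```typescript
// Minimal shape of the parser methods used in this post (hypothetical interface).
interface RawTextParser {
  on(eventName: string, listener: (payload?: any) => void): unknown;
  loadPDF(pdfFilePath: string): unknown;
  getRawTextContent(): string;
}

// Resolve with the raw text once parsing finishes, or reject on a parser error.
function parsePdfText(parser: RawTextParser, pdfFilePath: string): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    parser.on('pdfParser_dataError', (errData) =>
      reject(errData?.parserError ?? errData)
    );
    parser.on('pdfParser_dataReady', () => resolve(parser.getRawTextContent()));
    parser.loadPDF(pdfFilePath);
  });
}
```

Inside the route, this would be used as parsedText = await parsePdfText(pdfParser, tempFilePath), with pdfParser created exactly as shown above.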
Sending the Response
    } else {
      console.log('Uploaded file is not in the expected format.');
    }
  } else {
    console.log('No files found.');
  }
  const response = new NextResponse(parsedText);
  response.headers.set('FileName', fileName);
  return response;
}
If the uploadedFile isn't of type File, we log a message saying it wasn't in the expected format.
If uploadedFiles is empty, we log a message saying there were no files found.
Finally, we craft our response, adding our parsed text as the body and setting the filename in the headers.
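The handler always answers with a 200 status, even when no file was found. If you want the client to be able to tell the cases apart, one option is to vary the status code. This is a sketch under assumptions: buildUploadResponse is a hypothetical helper, the 400 status is my choice, and it uses the standard Response class (which NextResponse extends) so that it also runs outside Next.js.

```typescript
// Build the upload response; an empty fileName means no usable file was received.
function buildUploadResponse(parsedText: string, fileName: string): Response {
  if (fileName === '') {
    return new Response('No PDF file found in the upload.', { status: 400 });
  }
  return new Response(parsedText, {
    status: 200,
    headers: { FileName: fileName },
  });
}
```

In the route, fileName is only set once a File has been found, so an empty string is a reasonable signal that nothing was parsed.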
Wrapping Up
This is a simple approach to PDF uploading and parsing, where we abstract away many of the complexities. With FilePond and pdf2json at its core, it provides a robust and battle-tested solution to integrate PDF parsing into an application.
I hope you've found this guide useful! If you have any questions or suggestions, feel free to drop a comment, contribute to the GitHub repository, or follow me on Twitter. Happy coding!