How to upload and parse a PDF in Next.js

I had some trouble parsing PDFs in a Next.js app. After a few hours of trying different libraries, I settled on a simple solution that doesn't need multer or formidable. I created a GitHub template you can use to get up and running quickly 🚀.

You can find the template here: nextjs-pdf-parser.

This blog post is intended to walk you through how the template works, so you can implement this functionality in your own project. We will parse the text of a PDF. We will not implement any OCR to extract the text from images, as that is beyond the scope of this post.

📚 Libraries

FilePond: For the file uploading process, I've found FilePond to be an elegant, user-friendly solution that takes away the complexity of file uploads.
pdf2json: Once we've got our PDF, we turn to pdf2json for parsing. This library offers a multitude of features, making it useful for various applications beyond just parsing the text.

📌 Prerequisites and Important Considerations

As of writing this post, there are some crucial configurations and technical details to be aware of:

1. Configuration in `next.config.js`

To make the pdf2json library compatible with Next.js, a specific configuration is required:

const nextConfig = {
  experimental: {
    serverComponentsExternalPackages: ['pdf2json'],
  },
};

module.exports = nextConfig;

Without the above configuration, you may encounter the error:

Error [ReferenceError]: nodeUtil is not defined.

See here for more info.
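
If you are on a newer Next.js release, the option may live somewhere else. As far as I know, Next.js 15 moved it out of `experimental` and renamed it to `serverExternalPackages`. The sketch below assumes that version and a TypeScript config file (`next.config.ts`), so check the docs for the release you are actually running:

```typescript
// next.config.ts (assumed file name; the same object also works
// with module.exports in a next.config.js file)
const nextConfig = {
  // Assumption: Next.js 15 or later, where the option is no longer experimental
  serverExternalPackages: ['pdf2json'],
};

export default nextConfig;
```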

2. Issues with `pdfParser.getRawTextContent()`

While working with the pdf2json library, I ran into an issue: calling pdfParser.getRawTextContent() in route.ts on a parsed PDF returned a blank string. The parser only keeps the raw text when it is constructed with a second argument of 1, but the TypeScript types bundled with the library (as of writing this post) don't declare any constructor arguments, so TypeScript rejects that call.

If you face this issue, here are two potential solutions:

Fix the TypeScript Definition:
- Modify the constructor in the type definition for the PDFParser as follows:
```
  declare class Pdfparser extends EventEmitter{
      constructor(context: any, value: number);
      parseBuffer(buffer: Buffer): void;
      loadPDF(pdfFilePath: string, verbosity?: number):Promise<void>
      createParserStream():ParserStream
      on<K extends keyof EventMap>(eventName: K, listener: EventMap[K]): this
  }
```
  You can open the file containing this type by right-clicking on PDFParser and selecting 'Go to Type Definition' in VS Code. Keep in mind that edits inside node_modules are lost on the next install. It's worth noting that these types weren't written by the library's maintainer but contributed by another developer through a PR, so some inaccuracies are to be expected.

Bypass Type Checking:

When declaring the PDFParser (which we will do later in api/upload/route.ts), you can bypass type checking:
```
  const pdfParser = new (PDFParser as any)(null, 1);
```

The Code 🛠

1. File Upload Component

In our template, the file upload functionality is managed by the FileUpload component:

import { FilePond } from 'react-filepond';
import 'filepond/dist/filepond.min.css';

export default function FileUpload() {
  return (
    <FilePond
      server={{
        process: '/api/upload',
        fetch: null,
        revert: null,
      }}
    />
  );
}

Here, we're using the FilePond component from react-filepond. The server prop defines where the uploaded file will be processed: in this case, the /api/upload endpoint, which we implement in the route.ts file.

2. Sending a POST Request

When a user uploads a file using the FileUpload component, the component sends a POST request to the server (specifically to the /api/upload route). This is all managed under the hood by FilePond. See the FilePond docs for more info.

3. Processing the Upload

We process the uploaded PDF and extract its content in /api/upload/route.ts.

import { NextRequest, NextResponse } from 'next/server'; // To handle the request and response
import { promises as fs } from 'fs'; // To save the file temporarily
import { v4 as uuidv4 } from 'uuid'; // To generate a unique filename
import PDFParser from 'pdf2json'; // To parse the pdf

export async function POST(req: NextRequest) {
  const formData: FormData = await req.formData();
  const uploadedFiles = formData.getAll('filepond');
  let fileName = '';
  let parsedText = '';

  if (uploadedFiles && uploadedFiles.length > 0) {
    const uploadedFile = uploadedFiles[1];
    console.log('Uploaded file:', uploadedFile);

    // Check if uploadedFile is of type File
    if (uploadedFile instanceof File) {
      // Generate a unique filename
      fileName = uuidv4();

      // Convert the uploaded file into a temporary file
      const tempFilePath = `/tmp/${fileName}.pdf`;

      // Convert ArrayBuffer to Buffer
      const fileBuffer = Buffer.from(await uploadedFile.arrayBuffer());

      // Save the buffer as a file
      await fs.writeFile(tempFilePath, fileBuffer);

      // Parse the pdf using pdf2json. See pdf2json docs for more info.

      // The reason I am bypassing type checks is because
      // the default type definitions for pdf2json in the npm install
      // do not allow for any constructor arguments.
      // You can either modify the type definitions or bypass the type checks.
      // I chose to bypass the type checks.
      const pdfParser = new (PDFParser as any)(null, 1);

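      // See pdf2json docs for more info on how the below works.
      // pdf2json reports its result through events, so we wrap them in a
      // Promise and await it. Otherwise the response would be sent before
      // parsing finishes and parsedText would still be empty.
      parsedText = await new Promise<string>((resolve) => {
        pdfParser.on('pdfParser_dataError', (errData: any) => {
          console.log(errData.parserError);
          resolve(''); // On failure, fall back to an empty body
        });
        pdfParser.on('pdfParser_dataReady', () => {
          resolve(pdfParser.getRawTextContent());
        });

        pdfParser.loadPDF(tempFilePath);
      });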
    } else {
      console.log('Uploaded file is not in the expected format.');
    }
  } else {
    console.log('No files found.');
  }

  const response = new NextResponse(parsedText);
  response.headers.set('FileName', fileName);
  return response;
}

How does it work?

Now, let us go through each part of the above code step by step to understand exactly what's going on.

Importing Dependencies

import { NextRequest, NextResponse } from 'next/server'; // To handle the request and response
import { promises as fs } from 'fs'; // To save the file temporarily
import { v4 as uuidv4 } from 'uuid'; // To generate a unique filename
import PDFParser from 'pdf2json'; // To parse the pdf

We're importing essential modules for our function:
- NextRequest and NextResponse help handle incoming requests and craft responses.
- The fs module allows us to work with the filesystem, which we'll use to temporarily save the uploaded PDF.
- uuidv4 generates unique identifiers, which will help in naming our PDF files uniquely.
- PDFParser from pdf2json parses the content of PDFs. The library can do a lot more than extract text, which you can take advantage of as you see fit.

Handling the POST Request

export async function POST(req: NextRequest) {

This line begins our asynchronous function, which handles POST requests. In the Next.js App Router, exporting a function named after an HTTP method from a route.ts file is how you define a Route Handler.

Extracting Uploaded Files

  const formData: FormData = await req.formData();
  const uploadedFiles = formData.getAll('filepond');

FilePond uploads file example.pdf as multipart/form-data using a POST request. FormData uses the same format a form would use if the encoding type were set to "multipart/form-data". Therefore, we are able to extract the FormData from our request.
Why do we have .getAll('filepond')?
- If we log the formData like so:
```
  console.log('Form data:', formData);
```

We will see the following:

    Form data: FormData {
      [Symbol(state)]: [
        { name: 'filepond', value: '{}' },
        { name: 'filepond', value: [File] }
      ]
    }

We want to get all the objects with the name 'filepond'.
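
To see why `get` would not be enough here, the following sketch rebuilds a similar form by hand. It is not what FilePond does internally, just an approximation using the standard `FormData` and `Blob` globals (assuming a runtime that provides them, such as a recent Node.js); the names `demoForm`, `firstEntry` and `allEntries` are made up for this example.

```typescript
// Rebuild a form shaped like the one FilePond sends: two entries, same name.
const demoForm = new FormData();
demoForm.append('filepond', '{}'); // the metadata, as a JSON string
demoForm.append(
  'filepond',
  new Blob(['%PDF-1.4'], { type: 'application/pdf' }), // stand-in for a real PDF
  'example.pdf'
);

// get() returns only the first entry with that name (the metadata string)
const firstEntry = demoForm.get('filepond');

// getAll() returns every entry with that name, in insertion order
const allEntries = demoForm.getAll('filepond');

console.log(typeof firstEntry, allEntries.length);
```

With `get` we would only ever see the metadata string, which is why the route uses `getAll`.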

Processing the Uploaded File

  let fileName = '';
  let parsedText = '';

  if (uploadedFiles && uploadedFiles.length > 0) {
    const uploadedFile = uploadedFiles[1];
    console.log('Uploaded file:', uploadedFile);

We initialize two variables: fileName and parsedText.
We then check whether the form data contains any 'filepond' entries. If so, we take the second entry, which is the uploaded file.
Why do we have uploadedFiles[1]?
- If we log the uploadedFiles like so:
```
  console.log('Uploaded files:', uploadedFiles);
```
  We will see the following:
```
  Uploaded files: [
    '{}',
    File {
      size: 152864,
      type: 'application/pdf',
      name: 'example.pdf',
      lastModified: 1691154708425
    }
  ]
```
  The first element in the array is the file metadata, serialized as a JSON string (here an empty object, '{}'). The File object we need is the second element, which is why we use uploadedFiles[1]. There are two elements in the uploadedFiles array because that is what FilePond sends over. From the FilePond docs: "Along with the file object, FilePond also sends the file metadata to the server, both these objects are given the same name."
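
If you would rather not depend on the position of the file in the array, one option is to pick the first entry that is not a string. This is a sketch of that idea and not part of the template; `findUploadedFile` is a hypothetical helper name.

```typescript
// Hypothetical helper: return the first non-string entry for a field,
// or null when the field holds no file at all.
function findUploadedFile(formData: FormData, fieldName: string) {
  for (const entry of formData.getAll(fieldName)) {
    // FormData entries are either strings or File objects
    if (typeof entry !== 'string') {
      return entry;
    }
  }
  return null;
}
```

In the route, `findUploadedFile(formData, 'filepond')` could then stand in for `uploadedFiles[1]`, with a null check replacing the length check.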

Parsing the PDF File

    if (uploadedFile instanceof File) {
      fileName = uuidv4();

      // Convert the uploaded file into a temporary file
      const tempFilePath = `/tmp/${fileName}.pdf`;

      // Convert ArrayBuffer to Buffer
      const fileBuffer = Buffer.from(await uploadedFile.arrayBuffer());

      // Save the buffer as a file
      await fs.writeFile(tempFilePath, fileBuffer);

We confirm that our uploaded file is an instance of File.
We generate a unique filename using uuidv4.
We build a path for a temporary file under /tmp.
We read the uploaded file's contents as an ArrayBuffer and convert it into a Node.js Buffer.
We save this buffer to the temporary file.
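
The template never deletes the temporary file, and /tmp does not exist on every platform. Here is a rough sketch of one way to handle both, using only Node.js built-ins. Note that it swaps uuid for `randomUUID` from `node:crypto` and uses `os.tmpdir()` instead of a hard-coded path; `withTempPdf` is a hypothetical helper name, not something from the template.

```typescript
import { promises as fs } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { randomUUID } from 'node:crypto';

// Hypothetical helper: write the buffer to a uniquely named temp file,
// hand its path to the callback, and always remove the file afterwards.
async function withTempPdf<T>(
  data: Buffer,
  use: (filePath: string) => Promise<T>
): Promise<T> {
  const filePath = join(tmpdir(), `${randomUUID()}.pdf`);
  await fs.writeFile(filePath, data);
  try {
    return await use(filePath);
  } finally {
    // force: true means a missing file is not treated as an error
    await fs.rm(filePath, { force: true });
  }
}
```

The parsing step would then run inside the callback, so the file is cleaned up whether parsing succeeds or throws.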

Using pdf2json to Extract Text

      const pdfParser = new (PDFParser as any)(null, 1);

      parsedText = await new Promise<string>((resolve) => {
        pdfParser.on('pdfParser_dataError', (errData: any) => {
          console.log(errData.parserError);
          resolve('');
        });
        pdfParser.on('pdfParser_dataReady', () => {
          resolve(pdfParser.getRawTextContent());
        });

        pdfParser.loadPDF(tempFilePath);
      });

We initialize our PDF parser. Due to type definition constraints, we bypass type checks (see the prerequisites section above).
We set up two event listeners: one for errors and one for when the PDF data is ready.
Because pdf2json delivers its result through these events, we wrap them in a Promise and await it, so the text is available before the response is built.
Finally, we load our PDF into the parser.
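
Since the type definition shown earlier also lists a `parseBuffer` method, it may be possible to skip the temporary file entirely. The sketch below shows the general shape of that idea under a few assumptions: that `parseBuffer` emits the same two events as `loadPDF`, and that the parser was constructed with raw text enabled. `RawTextParser`, `parseBufferToText` and `FakeParser` are names invented for this example; the fake parser only exists so the snippet runs without pdf2json installed.

```typescript
import { EventEmitter } from 'node:events';

// The parts of the parser this sketch relies on (assumed, not the full API)
interface RawTextParser {
  on(event: string, listener: (...args: any[]) => void): unknown;
  parseBuffer(buffer: Buffer): void;
  getRawTextContent(): string;
}

// Turn the event-based parser into a Promise that resolves with the text
function parseBufferToText(parser: RawTextParser, buffer: Buffer): Promise<string> {
  return new Promise((resolve, reject) => {
    parser.on('pdfParser_dataError', (errData: any) =>
      reject(errData?.parserError ?? errData)
    );
    parser.on('pdfParser_dataReady', () => resolve(parser.getRawTextContent()));
    parser.parseBuffer(buffer);
  });
}

// Stand-in for demonstration only: "parses" by decoding the buffer as UTF-8
class FakeParser extends EventEmitter implements RawTextParser {
  private text = '';

  parseBuffer(buffer: Buffer): void {
    this.text = buffer.toString('utf8');
    // Emit asynchronously, the way a real parser would
    setImmediate(() => this.emit('pdfParser_dataReady'));
  }

  getRawTextContent(): string {
    return this.text;
  }
}
```

In the route, you would pass the real pdfParser and fileBuffer instead of the fake, and wrap the call in try/catch because this version rejects on a parse error.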

Sending the Response

    } else {
      console.log('Uploaded file is not in the expected format.');
    }
  } else {
    console.log('No files found.');
  }

  const response = new NextResponse(parsedText);
  response.headers.set('FileName', fileName);
  return response;
}

If the uploadedFile isn't of type File, we log a message saying it wasn't in the expected format.
If the uploadedFiles is empty, we log a message saying there were no files found.
Finally, we craft our response, adding our parsed text and setting the filename in the headers.
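
As written, the route answers with a 200 status even when no file was found, which makes failures hard to spot on the client. One possible refinement is to map each outcome to a status code. This is only a sketch: `UploadResult` and `buildUploadResponse` are hypothetical names, the choice of 400 and 422 is mine, and it uses the standard `Response` class (which NextResponse extends) so that it runs outside Next.js.

```typescript
// Hypothetical result type covering the outcomes the route logs today
type UploadResult =
  | { ok: true; fileName: string; text: string }
  | { ok: false; reason: 'no-file' | 'not-a-file' | 'parse-error' };

function buildUploadResponse(result: UploadResult): Response {
  if (result.ok) {
    // Same shape as the template: text in the body, filename in a header
    return new Response(result.text, {
      status: 200,
      headers: {
        'Content-Type': 'text/plain; charset=utf-8',
        FileName: result.fileName,
      },
    });
  }

  // Bad input gets 400; a file we could not parse gets 422
  const status = result.reason === 'parse-error' ? 422 : 400;
  return new Response(JSON.stringify({ error: result.reason }), {
    status,
    headers: { 'Content-Type': 'application/json' },
  });
}
```

FilePond treats a non-2xx response as a failed upload, so the user gets feedback instead of a silent empty result.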

Wrapping Up

This is a simple approach to PDF uploading and parsing that abstracts away many of the complexities. With FilePond handling the upload and pdf2json handling the parsing, there is very little code left to write yourself.

I hope you've found this guide useful! If you have any questions or suggestions, feel free to drop a comment, contribute to the GitHub repository, or follow me on Twitter. Happy coding! 🚀