How to upload and parse a PDF in Next.js

I had some trouble parsing PDFs in a Next.js app. After a few hours of trying different libraries, I settled on a simple solution that needs neither multer nor formidable. I created a GitHub template you can use to get up and running quickly 🚀.

You can find the template here: nextjs-pdf-parser.

This blog post is intended to walk you through how the template works, so you can implement this functionality in your own project. We will parse the text of a PDF. We will not implement any OCR to extract the text from images, as that is beyond the scope of this post.

📚 Libraries

  • FilePond: For the file uploading process, I've found FilePond to be an elegant, user-friendly solution that takes away the complexity of file uploads.

  • pdf2json: Once we've got our PDF, we turn to pdf2json for parsing. This library offers a multitude of features, making it useful for various applications beyond just parsing the text.

📌 Prerequisites and Important Considerations

As of writing this post, there are some crucial configurations and technical details to be aware of:

1. Configuration in next.config.js

To make the pdf2json library compatible with Next.js, a specific configuration is required:

const nextConfig = {
  experimental: {
    serverComponentsExternalPackages: ['pdf2json'],
  },
};

module.exports = nextConfig;

Without the above configuration, you may encounter the error:

Error [ReferenceError]: nodeUtil is not defined.

See here for more info.
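
On newer Next.js releases the option has a different name: as of Next.js 15, experimental.serverComponentsExternalPackages was renamed to the top-level serverExternalPackages. A minimal sketch of the equivalent setting, assuming Next.js 15 or later and a TypeScript config file (next.config.ts), neither of which the template itself uses:

```typescript
// next.config.ts (assumes Next.js 15+; older versions use the
// experimental.serverComponentsExternalPackages key shown above)
import type { NextConfig } from 'next';

const nextConfig: NextConfig = {
  // Keep pdf2json out of the server bundle so it is loaded with Node's require.
  serverExternalPackages: ['pdf2json'],
};

export default nextConfig;
```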

2. Issues with pdfParser.getRawTextContent()

While working with the pdf2json library, I found that calling pdfParser.getRawTextContent() in route.ts returned a blank string. Raw text is only collected when the parser is constructed with 1 as its second argument, as in new PDFParser(null, 1), but the TypeScript types shipped with the library (as of writing this post) declare a constructor that takes no arguments, so that call does not type-check.

If you face this issue, here are two potential solutions:

  1. Fix the TypeScript Definition:

    • Modify the constructor in the type definition for the PDFParser as follows:

        declare class Pdfparser extends EventEmitter{
            constructor(context: any, value: number);
            parseBuffer(buffer: Buffer): void;
            loadPDF(pdfFilePath: string, verbosity?: number):Promise<void>
            createParserStream():ParserStream
            on<K extends keyof EventMap>(eventName: K, listener: EventMap[K]): this
        }
      

      You can access the above file which contains the type by right-clicking on PDFParser and selecting 'Go to Type Definition' in VS Code. It's worth noting that these types weren't set by the library's maintainer but by another developer through a PR, so some inaccuracies are to be expected.

  2. Bypass Type Checking:

  • When declaring the PDFParser (which we will do later in api/upload/route.ts), you can bypass type checking:

      const pdfParser = new (PDFParser as any)(null, 1);
    

The Code 🛠

1. File Upload Component

In our template, the file upload functionality is managed by the FileUpload component:

import { FilePond } from 'react-filepond';
import 'filepond/dist/filepond.min.css';

export default function FileUpload() {
  return (
    <FilePond
      server={{
        process: '/api/upload',
        fetch: null,
        revert: null,
      }}
    />
  );
}

Here, we're using the FilePond component from react-filepond. The server prop defines where the uploaded file will be processed: in this case, the /api/upload endpoint (we implement this in the route.ts file).

2. Sending a POST Request

When a user uploads a file using the FileUpload component, the component sends a POST request to the server (specifically to the /api/upload route). This is all managed under the hood by FilePond. See the FilePond docs for more info.

3. Processing the Upload:

We process the uploaded PDF and extract its content in /api/upload/route.ts.

import { NextRequest, NextResponse } from 'next/server'; // To handle the request and response
import { promises as fs } from 'fs'; // To save the file temporarily
import { v4 as uuidv4 } from 'uuid'; // To generate a unique filename
import PDFParser from 'pdf2json'; // To parse the pdf

export async function POST(req: NextRequest) {
  const formData: FormData = await req.formData();
  const uploadedFiles = formData.getAll('filepond');
  let fileName = '';
  let parsedText = '';

  if (uploadedFiles && uploadedFiles.length > 0) {
    const uploadedFile = uploadedFiles[1];
    console.log('Uploaded file:', uploadedFile);

    // Check if uploadedFile is of type File
    if (uploadedFile instanceof File) {
      // Generate a unique filename
      fileName = uuidv4();

      // Convert the uploaded file into a temporary file
      const tempFilePath = `/tmp/${fileName}.pdf`;

      // Convert ArrayBuffer to Buffer
      const fileBuffer = Buffer.from(await uploadedFile.arrayBuffer());

      // Save the buffer as a file
      await fs.writeFile(tempFilePath, fileBuffer);

      // Parse the pdf using pdf2json. See pdf2json docs for more info.

      // The reason I am bypassing type checks is because
      // the default type definitions for pdf2json in the npm install
      // do not allow for any constructor arguments.
      // You can either modify the type definitions or bypass the type checks.
      // I chose to bypass the type checks.
      const pdfParser = new (PDFParser as any)(null, 1);

      // See pdf2json docs for more info on how the below works.
      pdfParser.on('pdfParser_dataError', (errData: any) =>
        console.log(errData.parserError)
      );
      pdfParser.on('pdfParser_dataReady', () => {
        console.log((pdfParser as any).getRawTextContent());
        parsedText = (pdfParser as any).getRawTextContent();
      });

      pdfParser.loadPDF(tempFilePath);
    } else {
      console.log('Uploaded file is not in the expected format.');
    }
  } else {
    console.log('No files found.');
  }

  const response = new NextResponse(parsedText);
  response.headers.set('FileName', fileName);
  return response;
}

How does it work?

Now, let us go through each part of the above code step by step to understand exactly what's going on.

Importing Dependencies

import { NextRequest, NextResponse } from 'next/server'; // To handle the request and response
import { promises as fs } from 'fs'; // To save the file temporarily
import { v4 as uuidv4 } from 'uuid'; // To generate a unique filename
import PDFParser from 'pdf2json'; // To parse the pdf
  • We're importing essential modules for our function:

    • NextRequest and NextResponse help handle incoming requests and craft responses.

    • The fs module allows us to work with the filesystem, which we'll use to temporarily save the uploaded PDF.

    • uuidv4 generates unique identifiers, which will help in naming our PDF files uniquely.

    • PDFParser from pdf2json lets us parse the content of PDFs. This library also has a lot more functionality, which you can take advantage of as you see fit.

Handling the POST Request

export async function POST(req: NextRequest) {

This line begins our asynchronous function for handling POST requests. Exporting a function named after the HTTP method from route.ts is how API routes (Route Handlers) are implemented in the Next.js App Router.

Extracting Uploaded Files

  const formData: FormData = await req.formData();
  const uploadedFiles = formData.getAll('filepond');
  • FilePond uploads file example.pdf as multipart/form-data using a POST request. FormData uses the same format a form would use if the encoding type were set to "multipart/form-data". Therefore, we are able to extract the FormData from our request.

  • Why do we have .getAll('filepond')?

    • If we log the formData like so:

        console.log('Form data:', formData);
      

We will see the following:

    Form data: FormData {
      [Symbol(state)]: [
        { name: 'filepond', value: '{}' },
        { name: 'filepond', value: [File] }
      ]
    }

Both entries share the name 'filepond', so we use getAll to retrieve them together.

Processing the Uploaded File

  let fileName = '';
  let parsedText = '';

  if (uploadedFiles && uploadedFiles.length > 0) {
    const uploadedFile = uploadedFiles[1];
    console.log('Uploaded file:', uploadedFile);
  • We initialize two variables: fileName and parsedText.

  • We then check if there are uploaded files. If so, we proceed to process the uploaded file, which is the second entry in the array.

  • Why do we have uploadedFiles[1]?

    • If we log the uploadedFiles like so:

        console.log('Uploaded files:', uploadedFiles);
      

      We will see the following:

        Uploaded files: [
          '{}',
          File {
            size: 152864,
            type: 'application/pdf',
            name: 'example.pdf',
            lastModified: 1691154708425
          }
        ]
      

      The first element in the array is the file metadata, sent as the string '{}' (empty here), and the File object we need comes second, which is why we index the array with uploadedFiles[1]. There are two elements in the uploadedFiles array because that is what FilePond sends over. From the FilePond docs: "Along with the file object, FilePond also sends the file metadata to the server, both these objects are given the same name."
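
If relying on a fixed index feels brittle, one alternative is to look for the first entry that is not a string. This is a minimal sketch, not what the template does; findUploadedFile is a hypothetical helper name, and the FormData built at the bottom only imitates the two entries from the log above.

```typescript
// Hypothetical helper (not part of the template): return the first entry
// stored under `field` that is a file rather than a string.
function findUploadedFile(formData: FormData, field = 'filepond') {
  for (const entry of formData.getAll(field)) {
    // The metadata arrives as a string; the upload itself is a File object.
    if (typeof entry !== 'string') {
      return entry;
    }
  }
  return null;
}

// Imitate what FilePond sends: a metadata string and a file, same field name.
const demoForm = new FormData();
demoForm.append('filepond', '{}');
demoForm.append(
  'filepond',
  new Blob(['%PDF-1.4'], { type: 'application/pdf' }),
  'example.pdf'
);

const demoFile = findUploadedFile(demoForm);
console.log(demoFile?.name); // example.pdf
```

The trade-off is a few extra lines in exchange for not depending on the order in which the two entries arrive.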

Parsing the PDF File

    if (uploadedFile instanceof File) {
      fileName = uuidv4();

      // Convert the uploaded file into a temporary file
      const tempFilePath = `/tmp/${fileName}.pdf`;

      // Convert ArrayBuffer to Buffer
      const fileBuffer = Buffer.from(await uploadedFile.arrayBuffer());

      // Save the buffer as a file
      await fs.writeFile(tempFilePath, fileBuffer);
  • We confirm that our uploaded file is an instance of File.

  • We generate a unique filename using uuidv4.

  • We build a temporary file path under /tmp from that filename.

  • We then read the uploaded file's contents as an ArrayBuffer and convert them into a Node.js Buffer.

  • We save this buffer as a temporary file.
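
The snippet above writes to /tmp and leaves the file there once parsing is done. If you would rather clean up after each request, one option is to wrap the write in a helper that deletes the file afterwards. This is a sketch under my own assumptions: withTempPdf is a hypothetical name, and it uses os.tmpdir() and crypto.randomUUID() from Node's standard library in place of the hard-coded /tmp path and the uuid package.

```typescript
import { promises as fs } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { randomUUID } from 'node:crypto';

// Hypothetical helper: write `data` to a uniquely named temp file, hand the
// path to `use`, and remove the file afterwards, even if `use` throws.
async function withTempPdf<T>(
  data: Uint8Array,
  use: (tempFilePath: string) => Promise<T>
): Promise<T> {
  const tempFilePath = join(tmpdir(), `${randomUUID()}.pdf`);
  await fs.writeFile(tempFilePath, data);
  try {
    return await use(tempFilePath);
  } finally {
    // force: true avoids an error if the file is already gone.
    await fs.rm(tempFilePath, { force: true });
  }
}
```

In the route handler, the callback passed as use would be the place to run the parser against tempFilePath and return the extracted text.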

Using pdf2json to Extract Text

      const pdfParser = new (PDFParser as any)(null, 1);
      pdfParser.on('pdfParser_dataError', (errData: any) =>
        console.log(errData.parserError)
      );
      pdfParser.on('pdfParser_dataReady', () => {
        console.log((pdfParser as any).getRawTextContent());
        parsedText = (pdfParser as any).getRawTextContent();
      });

      pdfParser.loadPDF(tempFilePath);
  • We initialize our PDF parser. Due to type definition constraints, we bypass type checks (see the prerequisites section above).

  • We set up two event listeners: one for errors and one for when the PDF data is ready.

  • Finally, we load our PDF into the parser.
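
One thing to keep in mind: loadPDF reports its result through events, so code that runs immediately after the call can execute before pdfParser_dataReady has fired, while parsedText is still an empty string. A common way to handle this is to wrap the two listeners in a Promise and await it. The sketch below is one possible approach rather than the template's code; waitForRawText and the RawTextParser interface are hypothetical names, and the interface only describes the parts of the parser used here.

```typescript
import { EventEmitter } from 'node:events';

// Structural type for the parser features used below (an assumption made for
// this sketch, not pdf2json's published type definitions).
interface RawTextParser extends EventEmitter {
  getRawTextContent(): string;
}

// Hypothetical helper: resolve with the raw text once parsing finishes,
// or reject if the parser reports an error.
function waitForRawText(
  parser: RawTextParser,
  start: () => void
): Promise<string> {
  return new Promise((resolve, reject) => {
    parser.once('pdfParser_dataError', (errData: { parserError?: unknown }) =>
      reject(errData?.parserError ?? errData)
    );
    parser.once('pdfParser_dataReady', () =>
      resolve(parser.getRawTextContent())
    );
    // Register the listeners first, then kick off parsing.
    start();
  });
}

// Possible usage inside the route handler:
// parsedText = await waitForRawText(pdfParser, () => pdfParser.loadPDF(tempFilePath));
```

With this in place, the response is only built after the text is available, and a parse failure surfaces as a rejected Promise that the handler can catch.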

Sending the Response

    } else {
      console.log('Uploaded file is not in the expected format.');
    }
  } else {
    console.log('No files found.');
  }

  const response = new NextResponse(parsedText);
  response.headers.set('FileName', fileName);
  return response;
}
  • If the uploadedFile isn't of type File, we log a message saying it wasn't in the expected format.

  • If the uploadedFiles is empty, we log a message saying there were no files found.

  • Finally, we craft our response, adding our parsed text and setting the filename in the headers.
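
As written, the handler answers with a 200 status even when no file was found. If you want the client to be able to tell the two cases apart, a variation is to return an error status instead. The sketch below uses the standard Web Response, which NextResponse extends and which Route Handlers can also return; buildUploadResponse and the 400 status are my own additions, not part of the template.

```typescript
// Hypothetical helper: build the handler's response from the parse result.
function buildUploadResponse(fileName: string, parsedText: string): Response {
  if (!fileName) {
    // No file was processed, so report a client error instead of an empty 200.
    return new Response('No PDF file found in the request.', { status: 400 });
  }
  return new Response(parsedText, {
    status: 200,
    headers: {
      FileName: fileName,
      'Content-Type': 'text/plain; charset=utf-8',
    },
  });
}
```

On the client, the parsed text is the response body and the generated name can be read with response.headers.get('FileName').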

Wrapping Up

This is a simple approach to PDF uploading and parsing, where we abstract away many of the complexities. With FilePond and pdf2json at its core, it offers a straightforward way to integrate PDF parsing into an application.

I hope you've found this guide useful! If you have any questions or suggestions, feel free to drop a comment, contribute to the GitHub repository, or follow me on Twitter. Happy coding! 🚀
