How to Split PDF Files into Separate Documents Using Java

PDF files are commonly used for sharing and storing documents due to their consistent formatting and widespread compatibility. However, sometimes you may need to split a large PDF file into separate documents. This can be useful for extracting specific pages, creating individual reports, or managing document workflow.

In this article, we will explore different approaches to splitting PDF files using Java. We will cover libraries such as Apache PDFBox and iText, providing comprehensive coding examples for each. By the end of this guide, you will have a clear understanding of how to implement PDF splitting in Java.

Prerequisites

Before diving into the code, ensure you have the following:

Java Development Kit (JDK) installed
Maven or Gradle for dependency management
An IDE such as IntelliJ IDEA or Eclipse

Now, let’s explore the methods using different libraries.

Using Apache PDFBox

Introduction to Apache PDFBox

Apache PDFBox is an open-source Java library for working with PDF documents. It provides functionalities to create, manipulate, extract, and split PDF files.

Adding PDFBox Dependency

To use Apache PDFBox in your project, add the following dependency to your pom.xml (for Maven users):

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.27</version>
</dependency>

Splitting a PDF File with Apache PDFBox

Here’s a Java program that splits a PDF file into individual pages:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.multipdf.Splitter;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class PDFSplitter {
    public static void main(String[] args) {
        String sourceFile = "sample.pdf";
        String outputFolder = "output/";
        splitPDF(sourceFile, outputFolder);
    }

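    public static void splitPDF(String filePath, String outputDir) {
        // Create the output folder if it does not exist yet; save() would fail otherwise.
        File outDir = new File(outputDir);
        if (!outDir.isDirectory() && !outDir.mkdirs()) {
            System.err.println("Could not create output folder: " + outputDir);
            return;
        }

        // try-with-resources closes the source document even if splitting fails.
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            // By default, Splitter produces one document per page.
            List<PDDocument> pages = new Splitter().split(document);

            int pageNum = 1;
            for (PDDocument page : pages) {
                // Close each single-page document once it has been written.
                try (PDDocument single = page) {
                    single.save(new File(outDir, "split_page_" + pageNum + ".pdf"));
                }
                pageNum++;
            }
            System.out.println("PDF split successfully!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }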
}

Explanation

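PDDocument.load() loads the PDF file. This is the PDFBox 2.x API; in PDFBox 3.x, loading moved to Loader.loadPDF().
The Splitter class splits the document into a list of single-page PDDocument objects.
Each single-page document is saved as a new PDF file and then closed. The source document must stay open until all pages have been saved.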
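
The program above writes one file per page. When several pages should go into each output file, it helps to work out first which page range each file will cover. The sketch below is plain Java and does not depend on PDFBox; the class name PageRanges and the method chunk are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: computes 1-based, inclusive page ranges for chunked splitting. */
final class PageRanges {
    private PageRanges() {}

    /**
     * Divides pages 1..totalPages into consecutive chunks of at most chunkSize pages.
     * Each int[] in the result holds {firstPage, lastPage}.
     */
    static List<int[]> chunk(int totalPages, int chunkSize) {
        if (totalPages < 0) {
            throw new IllegalArgumentException("totalPages must not be negative");
        }
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        List<int[]> ranges = new ArrayList<>();
        // long arithmetic avoids int overflow for very large chunk sizes
        for (long first = 1; first <= totalPages; first += chunkSize) {
            int last = (int) Math.min(first + chunkSize - 1, totalPages);
            ranges.add(new int[] {(int) first, last});
        }
        return ranges;
    }
}
```

For a 10-page document and a chunk size of 4, this yields the ranges 1-4, 5-8, and 9-10. PDFBox's Splitter also has a setSplitAtPage(int) setter that performs this grouping itself, so a helper like this is mainly useful for naming output files or previewing a split before running it.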

Using iText Library

Introduction to iText

iText is another powerful Java library for PDF manipulation. It allows splitting, merging, and modifying PDF files efficiently.

Adding iText Dependency

To include iText in your project, add the following Maven dependency:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext7-core</artifactId>
    <version>7.1.16</version>
    <type>pom</type>
</dependency>

Splitting a PDF File with iText

Here’s an example of how to split a PDF file using iText:

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;

import java.io.File;
import java.io.IOException;

public class iTextPDFSplitter {
    public static void main(String[] args) {
        String inputFilePath = "sample.pdf";
        String outputFolder = "output/";
        splitPDF(inputFilePath, outputFolder);
    }

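    public static void splitPDF(String sourceFile, String outputDir) {
        // Create the output folder up front; the writer needs an existing directory.
        File outDir = new File(outputDir);
        if (!outDir.isDirectory() && !outDir.mkdirs()) {
            System.err.println("Could not create output folder: " + outputDir);
            return;
        }

        // try-with-resources closes the source document even if a copy fails.
        try (PdfDocument pdfDoc = new PdfDocument(new PdfReader(sourceFile))) {
            int pageCount = pdfDoc.getNumberOfPages();
            for (int i = 1; i <= pageCount; i++) {
                String target = new File(outDir, "page_" + i + ".pdf").getPath();
                // Closing the target document writes the copied page to disk.
                try (PdfDocument newPdf = new PdfDocument(new PdfWriter(target))) {
                    pdfDoc.copyPagesTo(i, i, newPdf);
                }
            }
            System.out.println("PDF split successfully!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }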
}

Explanation

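PdfReader opens the input PDF file, and PdfDocument wraps it for page-level access.
PdfDocument.copyPagesTo() copies the pages in the given range (here a single page, from i to i) into the target document.
Each target document is written to disk when it is closed.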
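
Splitting is often driven by a page selection such as "1-3, 5" rather than by every page. The following sketch parses such a selection into a list of page numbers and validates it against the page count. It is plain Java; the class PageSelection, the method parse, and the selection syntax are assumptions made for illustration, not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns a selection like "1-3, 5" into 1-based page numbers. */
final class PageSelection {
    private PageSelection() {}

    /**
     * Parses comma-separated pages and ranges, in the order given.
     * Throws IllegalArgumentException for ranges outside 1..totalPages or reversed ranges
     * (NumberFormatException, a subclass, for tokens that are not numbers).
     */
    static List<Integer> parse(String spec, int totalPages) {
        List<Integer> pages = new ArrayList<>();
        for (String part : spec.split(",")) {
            String token = part.trim();
            if (token.isEmpty()) {
                continue; // tolerate stray commas
            }
            int dash = token.indexOf('-');
            int first;
            int last;
            if (dash < 0) {
                first = Integer.parseInt(token);
                last = first;
            } else {
                first = Integer.parseInt(token.substring(0, dash).trim());
                last = Integer.parseInt(token.substring(dash + 1).trim());
            }
            if (first < 1 || last > totalPages || first > last) {
                throw new IllegalArgumentException("Invalid page range: " + token);
            }
            for (int p = first; p <= last; p++) {
                pages.add(p);
            }
        }
        return pages;
    }
}
```

Each parsed page number can be passed to copyPagesTo(i, i, newPdf) as in the loop above. To my understanding, iText 7 also offers a copyPagesTo overload that accepts a list of page numbers; check the API documentation for your version before relying on it.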

Comparing Apache PDFBox and iText

Feature	Apache PDFBox	iText
Open-source	Yes	Yes (dual-licensed)
Ease of use	Simple API	Slightly more complex
Performance	Good	Often cited as faster for large files
License	Apache 2.0	AGPL or commercial

Best Practices for Splitting PDFs

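Choose the right library: PDFBox's Apache 2.0 license allows use in closed-source products at no cost. iText is free only under the AGPL, so closed-source or commercial distribution generally requires a paid license.
Handle exceptions and resources properly: Catch IOException around file operations and use try-with-resources so that documents are closed even when an error occurs.
Ensure correct file paths: Relative paths resolve against the working directory, which depends on how the program is launched. Resolve paths explicitly and make sure the output folder exists before writing.
Optimize performance: For large PDFs or batches of files, consider processing independent files in parallel. Avoid sharing a single document object between threads, since document objects are generally not designed for concurrent access.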
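
The file-path advice can be made concrete with java.nio.file. Both programs above build output names by string concatenation and assume the output folder already exists. The sketch below resolves the folder to an absolute path, creates it if needed, and builds zero-padded file names so the files sort in page order. OutputPaths and its two methods are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical helper for preparing the output folder and naming split files. */
final class OutputPaths {
    private OutputPaths() {}

    /** Resolves the folder to an absolute, normalized path and creates it if missing. */
    static Path prepareDirectory(String outputDir) throws IOException {
        Path dir = Paths.get(outputDir).toAbsolutePath().normalize();
        return Files.createDirectories(dir);
    }

    /**
     * Builds a name such as page_007.pdf inside dir. The padding width follows the
     * number of digits in totalPages, so names sort in page order.
     */
    static Path forPage(Path dir, String prefix, int pageNumber, int totalPages) {
        int width = String.valueOf(totalPages).length();
        String name = String.format(Locale.ROOT, "%s%0" + width + "d.pdf", prefix, pageNumber);
        return dir.resolve(name);
    }
}
```

The returned Path can be handed to PDFBox through toFile() or to iText's PdfWriter through toString().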

Conclusion

Splitting PDFs into separate documents using Java is straightforward with the right libraries. Apache PDFBox and iText are two of the most popular choices. PDFBox is easier to use and permissively licensed under Apache 2.0, while iText may offer better performance but is available only under the AGPL or a commercial license.

In this article, we demonstrated how to split PDF files using both libraries with detailed Java examples. By following the best practices mentioned, you can efficiently integrate PDF splitting functionality into your Java applications.

If you’re working on a project that requires extensive PDF manipulation, consider choosing the library that best fits your needs in terms of functionality and licensing.