Find & Delete Duplicate and Near-Duplicate PDF Pages
Introduction
This tutorial shows how to find and delete duplicate pages within a single PDF document using the AutoSplit™ plug-in for Adobe® Acrobat®. The function searches a PDF document for duplicate pages and presents a list of pages for review. The user can check the results and select or unselect pages in the list before deleting. The plug-in provides two methods for detecting duplicate or near-duplicate pages:
  1. Comparing the visual appearance of the pages as "images". This method provides a fast way to detect pages that look exactly the same. Use it to find pages that are visually identical. It compares only the visual appearance of the pages as they are displayed in the Adobe® Acrobat® document view. It works by creating smaller (sampled) copies of the page views and comparing them "as images".
  2. Comparing page text regardless of its visual appearance. This method extracts the text content from each page and compares pages as text strings. If two pages contain the same sequence of words, they are considered the same regardless of the text's visual appearance and location on the page. Note that this method completely ignores any images or graphics that might appear on the page. It also ignores text appearance properties such as font style, size, and color.
Prerequisites
You need a copy of Adobe® Acrobat® with the AutoSplit™ plug-in installed on your computer in order to use this tutorial. You can download trial versions of both Adobe® Acrobat® and the AutoSplit™ plug-in.
Method 1 - Comparing Visual Appearance
Step 1 - Open a PDF File
Start the Adobe® Acrobat® application and open a PDF file using the “File > Open…” menu.
Step 2 - Open the "Find Duplicate Pages" Dialog
Select "Plug-Ins > Split Documents > Find and Delete Duplicate Pages…" to open the "Find Duplicate Pages" dialog.
Step 3 - Specify Settings
Check the "Compare visual appearance for exact match (can be used to compare images)" option. Click "OK" to start searching for duplicate pages.
This method searches for visually identical pages and does not perform any text comparison. It works by creating smaller (sampled) copies of the page views and comparing them "as images". Identical pages can contain text, images, or both, as in the example below:
The following example contains identical pages that do not contain any searchable text:
If pages are not visually identical, then the software does not detect them as duplicates:
If the color or style of the text is different, then the pages are also not considered identical:
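The sampled-copy comparison described in this step can be illustrated with a short Python sketch. It is not the plug-in's actual implementation: it assumes each page view is already available as a grid of grayscale pixel values (rendering PDF pages is out of scope here), and the names downsample, find_visual_duplicates, and make_page are hypothetical.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # assumed representation: rows of grayscale values 0..255


def downsample(image: Image, size: int = 4) -> Tuple[int, ...]:
    """Shrink an image to size x size cells by averaging blocks of pixels."""
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the sampling grid")
    cells = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) // len(block))
    return tuple(cells)


def find_visual_duplicates(pages: List[Image], size: int = 4) -> List[Tuple[int, int]]:
    """Return (duplicate_page, first_seen_page) pairs using 1-based page numbers."""
    seen: Dict[tuple, int] = {}
    duplicates = []
    for number, image in enumerate(pages, start=1):
        # Pages match only if their dimensions and sampled copies are equal.
        key = (len(image), len(image[0]), downsample(image, size))
        if key in seen:
            duplicates.append((number, seen[key]))
        else:
            seen[key] = number
    return duplicates


def make_page(shade: int) -> Image:
    """Build a hypothetical 8x8 white page with a 4x4 square of the given shade."""
    return [[shade if 2 <= x < 6 and 2 <= y < 6 else 255 for x in range(8)]
            for y in range(8)]


pages = [make_page(0), make_page(128), make_page(0)]
print(find_visual_duplicates(pages))  # [(3, 1)]
```

Pages 1 and 3 are reported as duplicates, while page 2 is not, because its square has a different shade. This mirrors the rule above that a change in color alone is enough to make pages differ.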
Step 4 - Inspect Duplicate Pages
The "Delete Duplicate Pages" dialog shows a list of the duplicate pages detected. Click on a page record to display the corresponding page in the viewer. Check or uncheck a box to select or unselect a page for deletion. Click "Delete Checked Pages" to delete the selected duplicate pages from the PDF document.
Optionally, click "Save Report..." to save a report as an *.htm file. Here is an example of a duplicate pages report:
Step 5 - Delete Duplicate Pages
Click "OK" in the dialog to delete selected duplicate pages from the PDF document.
Method 2 - Comparing Page Text
Step 1 - Open a PDF File
Start the Adobe® Acrobat® application and open a PDF file using the “File > Open…” menu.
Step 2 - Open the "Find Duplicate Pages" Dialog
Select "Plug-Ins > Split Documents > Find and Delete Duplicate Pages…" to open the "Find Duplicate Pages" dialog.
Step 3 - Specify Settings
Check the "Compare only page text (ignore visual appearance of the pages)" option to compare only text content. This method can also find pages with similar, but not identical, content: specify a maximum allowed difference between two pages (in characters), and all pages that differ by less will be considered identical (near-duplicate). Check the "Ignore text case" box to compare text regardless of case. Check the "Ignore text lines separation" box to compare text without taking line breaks into account. Check the "Ignore punctuation" box to compare text while ignoring the following 5 symbols: ", . ! ?-". Click "OK" to start searching for duplicate pages.
This method only compares searchable text that is present in the document. This method ignores text appearance such as font style, size and color. In the example below, pages are considered identical despite the difference in text color:
This method ignores the location of the text within the page. In the example below, pages are considered identical despite text and image being arranged differently on the page:
This method ignores any images or graphics that might appear on the page. In the example below, pages are considered identical even though the image is missing from the second page:
The page text comparison method can be used with additional options that ignore text case, line breaks, and punctuation:
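As a rough illustration of what these three options do to the extracted text before comparison, here is a minimal Python sketch. The function name normalize and its parameters are hypothetical. It also assumes that ignoring line separation means treating line breaks as ordinary spaces and collapsing runs of whitespace, which the tutorial does not specify.

```python
PUNCTUATION = ",.!?-"  # the five symbols listed for the "Ignore punctuation" option


def normalize(text: str,
              ignore_case: bool = False,
              ignore_line_breaks: bool = False,
              ignore_punctuation: bool = False) -> str:
    """Return page text with the selected differences removed."""
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        # Delete every occurrence of the five punctuation symbols.
        text = text.translate(str.maketrans("", "", PUNCTUATION))
    if ignore_line_breaks:
        # Assumption: line breaks become spaces and whitespace runs collapse.
        text = " ".join(text.split())
    return text


page_a = "Hello,\nWorld!"
page_b = "hello world"
options = dict(ignore_case=True, ignore_line_breaks=True, ignore_punctuation=True)
print(normalize(page_a, **options) == normalize(page_b, **options))  # True
print(normalize(page_a) == normalize(page_b))                        # False
```

With all three options enabled the two sample pages compare as equal. With no options enabled they differ in case, punctuation, and line breaks.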
It is possible to use this method to find pages with similar, but not identical content by specifying a maximum allowed difference between two pages (in characters). In the example below, the allowed difference between pages is 5 characters.
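The tutorial does not say how the difference in characters is measured. One common choice is the Levenshtein edit distance, and the sketch below uses it purely as an assumption. The function names edit_distance and find_near_duplicates are hypothetical, and a page is treated as a near-duplicate when its distance to an earlier page is at most the allowed maximum.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def find_near_duplicates(page_texts: List[str],
                         max_difference: int = 0) -> List[Tuple[int, int]]:
    """Return (later_page, earlier_page) pairs using 1-based page numbers."""
    pairs = []
    kept = []  # indexes of pages that were not flagged as duplicates
    for i, text in enumerate(page_texts):
        for j in kept:
            if edit_distance(text, page_texts[j]) <= max_difference:
                pairs.append((i + 1, j + 1))
                break
        else:
            kept.append(i)
    return pairs


texts = ["Invoice 1001 total 50.00",
         "Invoice 1002 total 50.00",
         "Completely different text"]
print(find_near_duplicates(texts, max_difference=5))  # [(2, 1)]
print(find_near_duplicates(texts, max_difference=0))  # []
```

The first two sample pages differ by a single character, so they are reported as near-duplicates when the allowed difference is 5 but not when it is 0. Comparing every page against every kept page is quadratic in the number of pages, which is acceptable for a sketch but may be slow on very long documents.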
Step 4 - Inspect Duplicate Pages
The "Delete Duplicate Pages" dialog shows a list of duplicate or near-duplicate pages. Click on a page record to display the corresponding page in the viewer. Check or uncheck a box to mark a page for deletion. Click "Delete Checked Pages" to delete the selected duplicate pages from the PDF document.
Optionally, click "Save Report..." to save a report as an *.htm file. Here is an example of a duplicate pages report:
Step 5 - Delete Duplicate Pages
Click "OK" in the dialog to delete selected duplicate pages from the PDF document.