Deduplicating PDF Files (Emails) Using the AutoPortfolio™ Plug-in For Adobe® Acrobat®
What is Email/Document De-Duplication?
Emails are one of the most important types of litigation documents. It is often necessary to compile hundreds or even thousands of emails for a single court case. Typically, a significant number of these emails belong to email "threads" and are redundant, because email replies almost always include the content of the previous emails. It is sufficient to keep only the last email from each "thread" and discard the intermediate emails. The process of finding unique documents (emails) is often referred to as "de-duplication". Detecting and discarding redundant documents can greatly reduce the number of documents/emails that need to be prepared during the electronic discovery process.
Introduction
The AutoPortfolio™ plug-in provides functionality for de-duplication of PDF documents. These can be PDF files created from emails or any other kind of text document. The process is specifically fine-tuned for handling emails. The emails need to be converted into PDF format in order to be used in the de-duplication. This allows using both emails and their attachments in the de-duplication process. The conversion into PDF format is provided by both the Adobe® Acrobat® and the AutoPortfolio™ plug-in.
What is a Duplicate File?
Any PDF file whose text is either identical to or fully contained in the text of another PDF file is considered a duplicate. Note that only text content is compared. This process cannot be used for scanned PDF files that have not been run through text recognition (OCR). The de-duplication process also does not compare images. There are other types of processing available for finding duplicate pages where the comparison is performed "visually" without using the actual text. The de-duplication can also instantly detect files that are completely identical on the "binary level".
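The definition above can be illustrated with a minimal Python sketch. It is not the plug-in's actual algorithm: it assumes the text has already been extracted from each PDF file, and the whitespace and letter-case normalization, as well as the function names, are assumptions made for illustration only.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace runs and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_text_duplicate(candidate, other):
    """True if candidate's text is identical to, or fully contained in, other's text."""
    return normalize(candidate) in normalize(other)


def is_binary_identical(data_a, data_b):
    """True if two files have the same bytes (compared via SHA-256 digests)."""
    return hashlib.sha256(data_a).digest() == hashlib.sha256(data_b).digest()


original = "Can we meet next week?"
reply = "Sounds good, see you Monday.\n\nCan we meet next week?"

print(is_text_duplicate(original, reply))  # the original is contained in the reply
print(is_text_duplicate(reply, original))  # the reply adds new text
```

Note that under this simple rule a document with no text at all would count as contained in every other document, which is one reason scanned files without text recognition are not suitable input.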
Email Handling
The algorithm is specialized for processing email text: it avoids comparing email headers, which may differ even when the email text is the same. This can happen when the same email is collected from multiple recipients, or when it was emailed to a group of people and was received by the same person more than once. Note that a single email thread can contain multiple unique emails if the original email or any of the replies contain attachments.
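As a rough illustration of why headers are excluded, the following Python sketch drops the leading header block before two copies of an email are compared. The list of header fields and the function name are hypothetical, and the plug-in's real header handling is not documented here. The sketch also ignores quoted headers that appear deeper inside a thread.

```python
# Hypothetical list of header labels; real exports may use other labels.
HEADER_FIELDS = ("from:", "to:", "cc:", "bcc:", "sent:", "date:", "subject:")


def strip_leading_headers(text):
    """Drop the leading run of header lines and blank lines, keep the body."""
    lines = text.splitlines()
    start = 0
    for index, line in enumerate(lines):
        stripped = line.strip().lower()
        if stripped == "" or stripped.startswith(HEADER_FIELDS):
            start = index + 1
            continue
        break  # first line of the actual message body
    return "\n".join(lines[start:]).strip()


# The same message collected from two different recipients.
copy_a = "From: Ann\nTo: Bob\nSubject: Budget\n\nPlease review the draft."
copy_b = "From: Ann\nTo: Carol\nSubject: Budget\n\nPlease review the draft."

print(copy_a == copy_b)
print(strip_leading_headers(copy_a) == strip_leading_headers(copy_b))
```

The raw texts differ only in the "To" line, so they compare as equal once the header block is removed.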
Workflow Outline
  1. Export email messages (or whole folders) from the Microsoft® Outlook® (or any other email app) into a PDF Portfolio format. This is a standard functionality provided by the Adobe® Acrobat®. The output is a single PDF Portfolio file with emails converted into PDF, but all attachments remain in the native file format. PDF Portfolio is an archive of other files, not a regular PDF document.
  2. Extract individual emails with attachments as separate PDF documents by using the AutoPortfolio™ plug-in for the Adobe® Acrobat®. Each email is exported from PDF Portfolio as a separate PDF file with attachments converted into PDF and appended right after the email text.
  3. Run the de-duplication process to find redundant documents. Unique documents can be copied into another folder and duplicate files can be discarded.
  4. Optionally, check additional sets of files for duplicates. The de-duplication process computes a special "fingerprint" file for every PDF document. A fingerprint takes some time to create, but it is computed only once, and after that checking new files for duplicates is very fast.
  5. Combine unique documents into a single PDF document or into a PDF Portfolio.
Input Documents
In this tutorial we are going to use a sample email folder that contains 4 threads with 5 email replies each. The goal is to find emails that contain text from other emails and discard messages that are redundant. After that we will show how to combine unique documents into a single PDF document or a PDF Portfolio file.
Prerequisites
To follow this tutorial, you need Microsoft® Outlook® (or any other email application) and the Adobe® Acrobat® with the AutoPortfolio™ plug-in installed on your computer. You can download trial versions of both the Adobe® Acrobat® and the AutoPortfolio™.
Exporting Email Messages into a PDF Portfolio
Step 1 - Export an Outlook® Email Folder to a PDF Portfolio File
Start the Microsoft® Outlook® application. Select an email folder (for example "Inbox") you want to convert and click the right mouse button, then select "Convert "Inbox" to Adobe PDF" from the pop-up menu.
Export email messages
Step 2 - Specify Output File Name and Location
Specify output file name and location in the "Save Adobe PDF File As" dialog that will appear on the screen. Press the "Save" button to start conversion.
Specify output file name
Step 3 - Inspect the Conversion Results
Once the conversion is finished, the output PDF Portfolio is going to be automatically opened in the Adobe® Acrobat®. Inspect the results and close the tab with PDF Portfolio file.
Inspect the conversion results
Extract Individual Emails as Separate PDF Documents
Step 4 - Open the "Extract Files From PDF Portfolio" Dialog
Select "Plug-Ins > AutoPortfolio Plug-in > Extract Files From Portfolio(s)..." from the main Adobe® Acrobat® menu.
Start extracting individual emails
Step 5 - Select an Input PDF Portfolio
Press the "Add Files..." button to specify an input PDF Portfolio file.
Add files
Select a PDF Portfolio file that contains emails. Click "Open".
Select a PDF Portfolio
Step 6 - Specify Email Sorting Order
The "Specify Sorting Order" dialog appears on the screen. Click on column headers to arrange the emails in the desired order. The sorting order determines how the extracted files are named, so that the order of the emails is preserved in the output filenames. For example, if the emails were sorted by "Date" prior to extraction, then sorting the output files by name gives the same order as sorting by the "Date" metadata field.
Sort Records
Step 7 - Select Records For Extraction
Either all emails or only a few specific ones can be selected for extraction. In the following example the records have been sorted by date and only 20 entries have been selected for processing. Click "OK" once done selecting records.
Select records for extraction
Step 8 - Specify Output Options
Click "Browse" and specify an output folder. Check output options if you want to extract and merge file attachments.
Specify output folder
Click "File Naming Options..." to specify output file naming scheme.
Click File Naming Options
The software allows adding auto-incrementing prefixes to all extracted PDF files and attachments. This provides a way to preserve a specific order of the files and their file-attachment relationships in the file names.
Check the "Add auto-incrementing prefix to all filenames and attachments" option to maintain the original sorting order and preserve file-attachment relationships. Specify the desired prefixes for top-level files (emails) and attachments. Leave these fields blank if no prefixes are required. For example, if you enter the FILE prefix for files and ATT for attachments, the output files will be named as follows:
  • 1_FILE_File1.pdf
  • 1_1_ATT_AttachmentA.pdf
  • 1_2_ATT_AttachmentB.pdf
  • 2_FILE_File2.pdf
  • 2_1_ATT_AttachmentC.pdf
  • 2_2_ATT_AttachmentD.pdf
  • 3_FILE_File3.pdf
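The naming pattern in the list above can be reproduced with a short Python sketch. The function name and its parameters are hypothetical, and the behavior for blank prefixes (counters are kept, only the prefix text is omitted) is an assumption rather than documented plug-in behavior.

```python
def prefixed_names(items, file_prefix="FILE", att_prefix="ATT"):
    """Build output names for (file_name, [attachment_names]) pairs.

    Files get "<n>_<prefix>_<name>"; attachments get "<n>_<m>_<prefix>_<name>",
    where n is the position of the parent file and m the attachment position.
    """
    names = []
    for file_no, (file_name, attachments) in enumerate(items, start=1):
        parts = [str(file_no)]
        if file_prefix:  # assumption: a blank prefix is simply left out
            parts.append(file_prefix)
        names.append("_".join(parts + [file_name]))
        for att_no, att_name in enumerate(attachments, start=1):
            parts = [str(file_no), str(att_no)]
            if att_prefix:
                parts.append(att_prefix)
            names.append("_".join(parts + [att_name]))
    return names


emails = [
    ("File1.pdf", ["AttachmentA.pdf", "AttachmentB.pdf"]),
    ("File2.pdf", ["AttachmentC.pdf", "AttachmentD.pdf"]),
    ("File3.pdf", []),
]
for name in prefixed_names(emails):
    print(name)
```

Because every attachment name starts with the number of its parent file, the relationship between an email and its attachments can be read directly from the file names.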
Optionally, the software provides a way to name files and attachments stored inside PDF Portfolio using a custom combination of static text and metadata fields. It is a common requirement to name files using date and time information (to enable alphabetical sorting while preserving email dates) or using content of "To" or "From" metadata fields. It is possible to combine multiple metadata fields and text to form a file name.
Click "OK" to save and exit the dialog.
Specify file naming options
Step 9 - Start the Extraction Process
Click "OK" to start the extraction process.
Start the extraction
Step 10 - Inspect the Processing Report
Once the processing is completed, click "OK" to display the detailed report. The report is in HTML format and will be opened by a default web browser installed on your computer.
Read the report message
The report lists the file name, description, file creation and modification dates, file size in bytes, number of attachments, and MD5 hash value for each email/document and attachment extracted from the portfolio.
Inspect the processing report
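The MD5 values in the report can be checked independently of the plug-in. The following Python sketch computes the MD5 hash of a file with the standard hashlib module; the function name and the chunk size are illustrative choices.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDF files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file on an extracted PDF file should give the same value as the one listed for that file in the report, provided the file has not been modified since extraction.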
Running the Deduplication Process
Step 11 - Open the "PDF Document Deduplication" Menu
Select "Plug-Ins > AutoPortfolio Plug-in > Deduplicate PDF Files..." from the main Adobe® Acrobat® menu.
Open the PDF Document Deduplication Menu
Step 12 - Select PDF Documents For Deduplication
Click "Select All Files From Folder".
Click Select all files from folder
Select the input folder that contains extracted PDF files. Click "OK" once done.
Select the input folder
Step 13 - Start Deduplication
The "Find Duplicate and Near-Duplicate Documents" dialog will be opened. It contains the list of input PDF files.
Press the "Deduplicate..." button to start the process. This operation computes a special "fingerprint" file for each input file. The "fingerprint" file provides a way to quickly compare two documents and check whether the text from one document is contained in another file.
Start deduplication
Step 14 - Inspect the Results
The dialog reports the number of duplicate documents. Click "OK" to proceed.
Read the message
Once the deduplication process is completed, all duplicate files will be marked in red. The user can now use "File", "Select" and "Edit" menus to perform various operations on the results. Files can be either copied to another folder (use "File" menu selections) or saved as a load file (use "Save File List As..." button) or as an Excel-ready CSV spreadsheet. Note that if some PDF files cannot be opened or processed (due to password protection or document access rights), they will be highlighted in yellow and show "Processing Error" status in the "Is Duplicate" column.
Inspect the results
The plug-in creates a special "fingerprint" file for each input document. If a file already has a corresponding "fingerprint" file, then the existing file is used. The "fingerprint" file contains a text "map" of the document that allows a fast comparison of two files without the need to compare every byte of each file to every possible location in another file. Creating a "fingerprint" file takes some time, but since it is saved to disk it is a one-time processing. Once a file has a "fingerprint" computed, the comparison between two files is extremely fast. Do not delete "fingerprint" files if you want to run de-duplication multiple times.
Fingerprint files have been created
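The compute-once, reuse-later behavior described above can be sketched in Python. The actual format and contents of the plug-in's "fingerprint" files are not described in this tutorial, so everything below is an assumption: the fingerprint is modeled as a set of hashes of overlapping 5-word windows, the cache is a JSON file saved next to the input, and the input is a plain text file holding text already extracted from a PDF. A subset test on such hashes only approximates text containment, and texts shorter than the window size are not handled well.

```python
import hashlib
import json
import re
from pathlib import Path

WINDOW_WORDS = 5  # hypothetical window size


def compute_fingerprint(text):
    """Hash every run of WINDOW_WORDS consecutive words (the slow, one-time step)."""
    words = re.findall(r"\w+", text.lower())
    last_start = max(len(words) - WINDOW_WORDS, 0)
    windows = [" ".join(words[i:i + WINDOW_WORDS]) for i in range(last_start + 1)]
    return sorted({hashlib.sha1(w.encode("utf-8")).hexdigest() for w in windows})


def load_or_create_fingerprint(text_path):
    """Reuse the saved fingerprint if it exists, otherwise compute and save it."""
    text_path = Path(text_path)
    cache_path = text_path.with_name(text_path.name + ".fingerprint.json")
    if cache_path.exists():
        return set(json.loads(cache_path.read_text(encoding="utf-8")))
    fingerprint = compute_fingerprint(text_path.read_text(encoding="utf-8"))
    cache_path.write_text(json.dumps(fingerprint), encoding="utf-8")
    return set(fingerprint)


def is_contained(small, large):
    """Approximate check: every window of the small document occurs in the large one."""
    return small <= large
```

In this sketch, deleting the saved JSON files forces the slow step to run again, which mirrors the advice to keep "fingerprint" files between de-duplication runs.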
If there is no need to add more files to the deduplication process, go to Step 19 - "Copy Unique Files to Folder".
Adding Files For the Deduplication Process
Step 15 - Add More Files For Deduplication
The plug-in allows adding new files to the deduplication process at any time. Click "Add Files" to add more documents to the deduplication process.
Add more files
Step 16 - Select Additional PDF Files
Select new PDF files for deduplication. Click "Open" once done.
Select more files
The dialog reports the number of files that have been added. Click "OK" to proceed.
The dialog reports the number of files
Step 17 - Start the Deduplication Process
Click "Deduplicate" to run the process again. Note that this time the deduplication process will run much faster, because the existing "fingerprint" files are used.
Start deduplication again
Step 18 - Inspect the Results
The dialog reports the number of duplicate documents that have been detected. Click "OK" to proceed.
Read the message
Once the deduplication process is completed, all duplicate files are highlighted in red. Note that this time the deduplication process produces different results, because the newly added files contain the full text of all previous messages.
Inspect the results
Step 19 - Copy Unique Files to Folder
 Select "File > Copy Unique Files To Folder" from the "Find Duplicate and Near-Duplicate Documents" dialog menu to copy unique documents into another folder.
Copy unique files to folder
Specify a destination folder. Click "OK" once done.
Specify a folder
Step 20 - Inspect the Results
The dialog shows the number of files that have been copied to the destination folder. Click "OK" to proceed.
Read the report
All unique files have been copied into another folder for further processing.
All unique files have been copied
Unique files contain only the last email that includes all the previous emails with replies. Note that if any of the emails contained attachments, then there will be more unique emails for each email thread.
Unique file example
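The outcome of this step can be illustrated with a small Python sketch that keeps only the documents whose text is not contained in any other document. It compares plain strings directly and checks every pair, so it is a simplified model of the result and not the plug-in's implementation; the function name and sample file names are hypothetical.

```python
def find_unique(documents):
    """documents maps a file name to its text; returns the names worth keeping."""
    names = sorted(documents)
    unique = []
    for name in names:
        text = documents[name]
        redundant = False
        for other in names:
            if other == name:
                continue
            other_text = documents[other]
            if text in other_text:
                # Exact copies: keep the first name in sort order, drop the rest.
                if text == other_text and name < other:
                    continue
                redundant = True
                break
        if not redundant:
            unique.append(name)
    return unique


# A three-message thread in which each reply includes the earlier messages.
first = "Can we meet Monday?"
second = "Yes, 10am works.\n" + first
third = "Booked room 4.\n" + second

print(find_unique({"1_msg.pdf": first, "2_msg.pdf": second, "3_msg.pdf": third}))
```

Only the last message of the thread is kept, because its text includes both earlier messages. A message with an attachment appended would carry text that later replies do not repeat, so it would also be kept.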
Combining PDF Files into a Single PDF or PDF Portfolio
Combine Files into a Single PDF
Select "File > Create > Combine Files into a Single PDF..." from the main Adobe® Acrobat® menu.
Start to combine files
Press the "Add Files" button.
Press add files
Select PDF files to combine them and click "Open".
Select files
The selected files are added to the list. Click "Options" in the "Combine Files" toolbar to specify combining options.
Click Options
Select the output file size and specify the combining options. Optionally, check the "Save as PDF Portfolio" option. Click "OK" to save and close the dialog.
Specify combining options
Click "Combine" in the "Combine Files" toolbar to start merging the files.
Combine files into a single PDF
The selected files are combined into a single PDF document if the "Save as PDF Portfolio" option has not been checked. Save the created PDF document.
Save created PDF document
Create a PDF Portfolio
Alternatively, it is possible to combine files into a PDF Portfolio. A PDF Portfolio is not a single PDF document; it is an archive of separate files. Each file is stored inside the portfolio as a separate entity.
Select "File > Create > PDF Portfolio..." from the main Adobe® Acrobat® menu.
Start creating PDF Portfolio
Select "Add Files... > Add Files..." in the "Create PDF Portfolio" dialog.
Add files
Select PDF files to add into PDF Portfolio and click "Open".
Select files
Selected files have been added to the list. Click "Create" in the "Create PDF Portfolio" dialog to start the process.
Create PDF Portfolio
The PDF Portfolio is created from the selected files. Save the created PDF Portfolio. Now you can easily print one or multiple PDF files from the PDF Portfolio.
Save created PDF document