Deduplicating PDF Files (Emails) Using the AutoPortfolio™ Plug-in For Adobe® Acrobat®
- What is Email/Document De-Duplication?
- Emails are one of the most important types of litigation documents. It is often necessary to compile hundreds or even thousands of emails for a single court case. Typically, a significant number of these emails belong to email "threads" and are redundant, because email replies almost always include the content of the previous emails. In most cases it is sufficient to keep only the last email from each thread and discard the intermediate emails. The process of finding unique documents (emails) is often referred to as "de-duplication". Detecting and discarding redundant documents can greatly reduce the number of documents/emails that need to be prepared during the electronic discovery process.
- Introduction
- The AutoPortfolio™ plug-in provides functionality for de-duplication of PDF documents. These can be PDF files created from emails or from any other kind of text document. The process is specifically fine-tuned for handling emails. The emails need to be converted into PDF format in order to be used in the de-duplication. This allows both emails and their attachments to be included in the de-duplication process. Conversion into PDF format is provided by both Adobe® Acrobat® and the AutoPortfolio™ plug-in.
- What is a Duplicate File?
- Any PDF file whose text is either identical to or fully contained in the text of another PDF file is considered a duplicate. Note that only text content is compared. It is not possible to use this process for scanned PDF files that have not been run through text recognition (OCR). The de-duplication process also does not compare images. There are other types of processing available for finding duplicate pages where comparison is performed "visually" without using the actual text. The de-duplication can also instantly detect files that are completely identical on the "binary level".
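- The containment rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the plug-in's actual algorithm: it assumes the text of each PDF has already been extracted into a string, and the names `normalize` and `find_duplicates` are hypothetical. Collapsing whitespace before comparison is also an assumption.

```python
import re


def normalize(text):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def find_duplicates(docs):
    """Return the names of documents whose text is identical to, or fully
    contained in, the text of another document.

    docs maps a file name to its already-extracted text (an assumption).
    """
    norm = {name: normalize(text) for name, text in docs.items()}
    # Visit the longest texts first: a text can only be contained in a
    # text that is at least as long. Ties are broken by file name.
    order = sorted(norm, key=lambda n: (-len(norm[n]), n))
    kept = []
    duplicates = set()
    for name in order:
        if any(norm[name] in norm[other] for other in kept):
            duplicates.add(name)
        else:
            kept.append(name)
    return duplicates


docs = {
    "reply.pdf": "Thanks!\nHello   Bob",
    "original.pdf": "Hello Bob",
    "other.pdf": "Unrelated",
}
print(find_duplicates(docs))
```

- In this sketch the original message is reported as a duplicate because its text is fully contained in the reply. A plain substring search like this compares every pair of documents, which is slow for large collections.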
- Email Handling
- The algorithm is specialized for processing email text: it avoids comparing email headers, which may differ even when the email text is the same. This can happen when copies of the same email are collected from multiple recipients, or when an email was sent to a group of people and the same person received it more than once. There can be multiple unique emails in a single email thread if the original email or any of the replies contain attachments.
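- The idea of ignoring headers can be illustrated as follows. This is only a sketch under stated assumptions: the list of header names and the function `strip_headers` are hypothetical, and the plug-in's real header handling is not documented here.

```python
import re

# Hypothetical list of header labels to ignore; a real tool may use others.
HEADER_LINE = re.compile(
    r"^\s*(from|to|cc|bcc|sent|date|subject)\s*:.*$", re.IGNORECASE
)


def strip_headers(text):
    """Drop lines that look like email headers, then collapse whitespace."""
    body_lines = [ln for ln in text.splitlines() if not HEADER_LINE.match(ln)]
    return " ".join(" ".join(body_lines).split())


copy_a = "From: Alice\nTo: Bob\nSubject: Lunch\nSee you at noon."
copy_b = "From: Alice\nTo: Carol\nSubject: Lunch\nSee you at noon."
print(strip_headers(copy_a) == strip_headers(copy_b))
```

- The two copies differ only in the "To" header, so they compare as equal once the headers are removed. A line-based filter like this would also drop a body line that happens to start with one of the labels, so it is a simplification.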
- Workflow Outline
- Export email messages (or whole folders) from Microsoft® Outlook® (or any other email app) into PDF Portfolio format. This is standard functionality provided by Adobe® Acrobat®. The output is a single PDF Portfolio file in which the emails are converted into PDF, but all attachments remain in their native file formats. A PDF Portfolio is an archive of other files, not a regular PDF document.
- Extract individual emails with attachments as separate PDF documents by using the AutoPortfolio™ plug-in for Adobe® Acrobat®. Each email is exported from the PDF Portfolio as a separate PDF file, with attachments converted into PDF and appended right after the email text.
- Run the de-duplication process to find redundant documents. Unique documents can be copied into another folder and duplicate files can be discarded.
- Optionally, check an additional set of files for duplicates. The de-duplication process computes a special "fingerprint" file for every PDF document. A fingerprint takes some time to create, but once it is computed, checking new files for duplicates is very fast. Fingerprint files are computed only once.
- Combine unique documents into a single PDF document or into a PDF Portfolio.
- Input Documents
- In this tutorial we use a sample email folder that contains 4 threads with 5 email replies each. The goal is to find emails that contain text from other emails and discard messages that are redundant. After that we will show how to combine unique documents into a single PDF document or a PDF Portfolio file.
- Prerequisites
- To follow this tutorial, you need Microsoft® Outlook® (or any other email application) and Adobe® Acrobat® with the AutoPortfolio™ plug-in installed on your computer. You can download trial versions of both Adobe® Acrobat® and AutoPortfolio™.
- Exporting Email Messages into a PDF Portfolio
- Step 1 - Export an Outlook® Email Folder to a PDF Portfolio File
- Start the Microsoft® Outlook® application. Right-click the email folder you want to convert (for example "Inbox"), then select "Convert "Inbox" to Adobe PDF" from the pop-up menu.
- Step 2 - Specify Output File Name and Location
- Specify output file name and location in the "Save Adobe PDF File As" dialog that will appear on the screen. Press the "Save" button to start conversion.
- Step 3 - Inspect the Conversion Results
- Once the conversion is finished, the output PDF Portfolio opens automatically in Adobe® Acrobat®. Inspect the results and close the tab with the PDF Portfolio file.
- Extract Individual Emails as Separate PDF Documents
- Step 4 - Open the "Extract Files From PDF Portfolio" Dialog
- Select "Plug-Ins > AutoPortfolio Plug-in > Extract Files From Portfolio(s)..." from the main Adobe® Acrobat® menu.
- Step 5 - Select an Input PDF Portfolio
- Press the "Add Files..." button to specify an input PDF Portfolio file.
- Select a PDF Portfolio file that contains emails. Click "Open".
- Step 6 - Specify Email Sorting Order
- The "Specify Sorting Order" dialog appears on the screen. Click on column headers to arrange the emails into the desired order. The sorting order determines how extracted files are named, so that the order of the emails is preserved in the output filenames. For example, if the emails were sorted by "Date" prior to extraction, sorting the output files by name gives the same order as sorting by the "Date" metadata field.
- Step 7 - Select Records For Extraction
- All emails or only a few specific ones can be selected for extraction. In the following example the records have been sorted by date and only 20 entries have been selected for processing. Click "OK" once you have finished selecting records.
- Step 8 - Specify Output Options
- Click "Browse" and specify an output folder. Check output options if you want to extract and merge file attachments.
- Click "File Naming Options..." to specify output file naming scheme.
- The software allows adding auto-incrementing prefixes to all extracted PDF files and attachments. This provides a way to preserve a specific order of the files and their file-attachment relationships in the file names.
- Check the "Add auto-incrementing prefix to all filenames and attachments" option to maintain the original sorting order and preserve file-attachment relationships. Specify the desired prefixes for top-level files (emails) and attachments. Leave these fields blank if no prefixes are required. For example, if you enter the prefix FILE for files and ATT for attachments, the output files will be named as follows:
- 1_FILE_File1.pdf
- 1_1_ATT_AttachmentA.pdf
- 1_2_ATT_AttachmentB.pdf
- 2_FILE_File2.pdf
- 2_1_ATT_AttachmentC.pdf
- 2_2_ATT_AttachmentD.pdf
- 3_FILE_File3.pdf
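- The naming pattern shown above can be reproduced with a short Python sketch. The function `prefixed_names` and its input structure are hypothetical and only mirror the example list; how the plug-in formats names when a prefix field is left blank is an assumption here.

```python
def prefixed_names(emails, file_prefix="FILE", att_prefix="ATT"):
    """Build output names for emails and their attachments.

    emails is a list of (email_filename, [attachment_filenames]) pairs,
    already in the desired sort order.
    """
    names = []
    for i, (email, attachments) in enumerate(emails, start=1):
        # Empty prefixes are simply skipped (assumed behavior).
        parts = (str(i), file_prefix, email)
        names.append("_".join(p for p in parts if p))
        for j, attachment in enumerate(attachments, start=1):
            parts = (str(i), str(j), att_prefix, attachment)
            names.append("_".join(p for p in parts if p))
    return names


emails = [
    ("File1.pdf", ["AttachmentA.pdf", "AttachmentB.pdf"]),
    ("File2.pdf", ["AttachmentC.pdf", "AttachmentD.pdf"]),
    ("File3.pdf", []),
]
for name in prefixed_names(emails):
    print(name)
```

- The first number identifies the email and the second number identifies the attachment within that email, which is how the file-attachment relationship survives in the file names.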
- Optionally, the software provides a way to name files and attachments stored inside PDF Portfolio using a custom combination of static text and metadata fields. It is a common requirement to name files using date and time information (to enable alphabetical sorting while preserving email dates) or using content of "To" or "From" metadata fields. It is possible to combine multiple metadata fields and text to form a file name.
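- To illustrate why date-based names sort correctly, here is a small Python sketch that builds a file name from a date, a sender, and a subject. The function `dated_name`, the timestamp layout, and the character replacement rule are all hypothetical; the plug-in's own naming templates may differ.

```python
import re
from datetime import datetime


def dated_name(sent, sender, subject):
    """Combine a sortable timestamp with metadata fields into a file name."""
    # Year-month-day order makes alphabetical order match date order.
    stamp = sent.strftime("%Y-%m-%d_%H%M%S")
    # Replace characters that are not allowed in Windows file names.
    safe = re.sub(r'[\\/:*?"<>|]+', "_", f"{sender}_{subject}").strip()
    return f"{stamp}_{safe}.pdf"


first = dated_name(datetime(2021, 3, 5, 9, 7, 0), "alice@example.com", "Re: Budget?")
second = dated_name(datetime(2021, 11, 2, 14, 30, 0), "bob@example.com", "Budget")
print(sorted([second, first]))
```

- Because the timestamp starts with the year and uses fixed-width fields, sorting the names alphabetically lists the March email before the November one.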
- Click "OK" to save and exit the dialog.
- Step 9 - Start the Extraction Process
- Click "OK" to start the extraction process.
- Step 10 - Inspect the Processing Report
- Once the processing is completed, click "OK" to display the detailed report. The report is in HTML format and will be opened by a default web browser installed on your computer.
- The report lists the file name, description, file creation and modification dates, file size in bytes, number of attachments, and MD5 hash value for each email/document and attachment extracted from the portfolio.
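- An MD5 hash value like the one listed in the report can be used to spot files that are identical byte for byte. The following Python sketch shows the general idea using the standard `hashlib` module; the function names and the folder layout are assumptions, and this is not the plug-in's own code.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(folder):
    """Return lists of PDF file names in folder that share the same MD5."""
    groups = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        groups.setdefault(md5_of_file(pdf), []).append(pdf.name)
    return [names for names in groups.values() if len(names) > 1]
```

- Calling `group_identical` on the extraction output folder returns one list per set of byte-identical files. Files with the same text but different bytes (for example, different metadata) will not be grouped by this check.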
- Running the Deduplication Process
- Step 11 - Open the "PDF Document Deduplication" Menu
- Select "Plug-Ins > AutoPortfolio Plug-in > Deduplicate PDF Files..." from the main Adobe® Acrobat® menu.
- Step 12 - Select PDF Documents For Deduplication
- Click "Select All Files From Folder".
- Select the input folder that contains extracted PDF files. Click "OK" once done.
- Step 13 - Start Deduplication
- The "Find Duplicate and Near-Duplicate Documents" dialog will be opened. It contains the list of input PDF files.
- Press the "Deduplicate..." button to start the process. This operation computes a special "fingerprint" file for each input file. The "fingerprint" file provides a way to quickly compare two documents and check whether the text from one document is contained in the other.
- Step 14 - Inspect the Results
- The dialog reports the number of duplicate documents. Click "OK" to proceed.
- Once the deduplication process is completed, all duplicate files will be marked in red. The user can now use "File", "Select" and "Edit" menus to perform various operations on the results. Files can be either copied to another folder (use "File" menu selections) or saved as a load file (use "Save File List As..." button) or as an Excel-ready CSV spreadsheet. Note that if some PDF files cannot be opened or processed (due to password protection or document access rights), they will be highlighted in yellow and show "Processing Error" status in the "Is Duplicate" column.
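- For readers who post-process such results outside of Acrobat®, a CSV file of this kind can be produced or re-created with Python's standard `csv` module. The column layout and the "Yes"/"No" status values below are hypothetical; only the "Is Duplicate" column name comes from the dialog described above.

```python
import csv
import io


def results_to_csv(results, out):
    """Write (file name, status) pairs as CSV rows to a text stream."""
    writer = csv.writer(out)
    writer.writerow(["File", "Is Duplicate"])
    for name, status in results:
        writer.writerow([name, status])


buffer = io.StringIO()
results_to_csv([("a.pdf", "Yes"), ("b, final.pdf", "No")], buffer)
print(buffer.getvalue())
```

- The `csv` module quotes fields that contain commas, so file names such as "b, final.pdf" stay in a single spreadsheet cell.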
- The plug-in creates a special "fingerprint" file for each input document. If a file already has a corresponding "fingerprint" file, then the existing file is used. The "fingerprint" file contains a text "map" of the document that allows a fast comparison of two files without the need to compare every byte of each file to every possible location in another file. Creating a "fingerprint" file takes some time, but since it is saved to disk, this is a one-time operation. Once a file has a "fingerprint" computed, the comparison between two files is extremely fast. Do not delete "fingerprint" files if you want to run de-duplication multiple times.
- If there is no need to add more files to the deduplication process, go to Step 19 - "Copy Unique Files to Folder".
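- The compute-once, reuse-later idea behind fingerprint files can be sketched in Python as follows. The actual fingerprint format used by the plug-in is not described in this tutorial, so this example swaps in a simple, well-known technique (hashed word "shingles") purely for illustration. All names, the `.fp.json` suffix, and the use of plain-text input files are assumptions.

```python
import hashlib
import json
from pathlib import Path


def shingles(text, k=5):
    """Return the set of all runs of k consecutive words in the text."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=5):
    """Return a sorted list of short hashes, one per shingle."""
    return sorted(
        hashlib.sha1(s.encode("utf-8")).hexdigest()[:16] for s in shingles(text, k)
    )


def load_or_create(text_path, cache_suffix=".fp.json"):
    """Load a cached fingerprint, or compute it and save it next to the file."""
    text_path = Path(text_path)
    cache = text_path.with_name(text_path.name + cache_suffix)
    if cache.exists():
        return set(json.loads(cache.read_text()))
    fp = fingerprint(text_path.read_text(encoding="utf-8"))
    cache.write_text(json.dumps(fp))
    return set(fp)


def probably_contained(fp_small, fp_large):
    """True if every shingle hash of one document also occurs in the other."""
    return fp_small <= fp_large
```

- With fingerprints cached on disk, checking a newly added file only requires computing one new fingerprint and running fast set comparisons against the existing ones. This sketch is approximate: it does not detect a changed source file behind a stale cache, and texts shorter than `k` words are not matched against longer ones.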
- Adding Files For the Deduplication Process
- Step 15 - Add More Files For Deduplication
- The plug-in allows adding new files to the deduplication process at any time. Click "Add Files" to add more documents.
- Step 16 - Select Additional PDF Files
- Select new PDF files for deduplication. Click "Open" once done.
- The dialog reports the number of files that have been added. Click "OK" to proceed.
- Step 17 - Start the Deduplication Process
- Click "Deduplicate" to run the process again. Note that this time the deduplication process will run much faster, because the existing "fingerprint" files are used.
- Step 18 - Inspect the Results
- The dialog reports the number of duplicate documents that have been detected. Click "OK" to proceed.
- Once the deduplication process is completed, all duplicate files are highlighted in red. Note that this time the deduplication process produces different results, because the newly added files contain the full text of all previous messages.
- Step 19 - Copy Unique Files to Folder
- Select "File > Copy Unique Files To Folder" from the "Find Duplicate and Near-Duplicate Documents" dialog menu to copy unique documents into another folder.
- Specify a destination folder. Click "OK" once done.
- Step 20 - Inspect the Results
- The dialog shows the number of files that have been copied to the destination folder. Click "OK" to proceed.
- All unique files have been copied into another folder for further processing.
- For each thread, the unique files contain only the last email, which includes all the previous emails and replies. Note that if any of the emails contained attachments, then there will be more than one unique email for that thread.
- Combining PDF Files into a Single PDF or PDF Portfolio
- Combine Files into a Single PDF
- Select "File > Create > Combine Files into a Single PDF..." from the main Adobe® Acrobat® menu.
- Press the "Add Files" button.
- Select the PDF files to combine and click "Open".
- The selected files are added to the list. Click "Options" in the "Combine Files" toolbar to specify combining options.
- Select the output file size and specify the combining options. Optionally, check the "Save as PDF Portfolio" option. Click "OK" to save and close the dialog.
- Click "Combine" in the "Combine Files" toolbar to start merging the files.
- The selected files are combined into a single PDF document if the "Save as PDF Portfolio" option has not been checked. Save the created PDF document.
- Create a PDF Portfolio
- Alternatively, it is possible to combine files into a PDF Portfolio. A PDF Portfolio is not a single PDF document; it is an archive of separate files. Each file is stored inside the portfolio as a separate entity.
- Select "File > Create > PDF Portfolio..." from the main Adobe® Acrobat® menu.
- Select "Add Files... > Add Files..." in the "Create PDF Portfolio" dialog.
- Select PDF files to add into PDF Portfolio and click "Open".
- Selected files have been added to the list. Click "Create" in the "Create PDF Portfolio" dialog to start the process.
- The PDF Portfolio is created from the selected files. Save the created PDF Portfolio. Now you can easily print one or multiple PDF files from the PDF Portfolio.