Extracting Text from PDF Documents By Search
Introduction
This tutorial shows how to extract text from PDF documents by text search using the AutoDocSearch™ plug-in for Adobe® Acrobat®. The AutoDocSearch software provides functionality for searching PDF files and extracting matching text into a spreadsheet or a text file. It can be used to search for and extract the following types of text:
  • Custom text strings such as as “John Smith” or “Monthly Statement”
  • Text patterns such as social security (SSN), phone and account numbers, emails, and dates etc.
  • Text that matches a keyword from a user-defined list, for example day of the week (Monday, Tuesday,..) or a month (January, February,…).

The search results can be formatted by adding a custom text and extracting only specific parts of the search output. The output can be formatted into many popular formats such as CSV, XML, and HTML etc.
Prerequisites
You need a copy of Adobe® Acrobat® along with the AutoDocSearch™ plug-in installed on your computer in order to use this tutorial. You can download trial versions of both the Adobe® Acrobat® and the AutoDocSearch™ plug-in.
Contents
The tutorial contains the following examples:
Extracting a Single Text Pattern
Overview ↑overview
The first example illustrates how to extract invoice numbers from a set of individual PDF documents (each file represents an invoice). The same method can be used to extract invoice numbers from a single PDF document that contains multiple invoices. The extracted text is saved into a spreadsheet-ready CSV file or a plain text file. Here is a screenshot of the sample PDF document with an invoice number highlighted:
Sample invoice page with extraction text highlighted
Specify a Search Pattern ↑overview
Select "Plug-ins > AutoDocSearch Plug-in > Search PDF Files..." from the main Adobe Acrobat menu to open the "Search PDF Files" dialog. Next, type the following search expression into the text box:
Invoice #: \d+
Enter a search expression
The AutoDocSearch uses regular expression (regex for short) syntax to perform a text search. Regular expressions are a powerful tool for finding and extracting complex patterns from text documents. This is common language for performing text search in many applications and programming languages with a large amount of examples available online.
Let's see what this search expression actually means. The Invoice #: part is searching for the text "as-is", while the \d+ part of the expression is a special regex character class \d that matches any digit [0-9]. The + is a quantifier that means "match a previous character one or more times". The \d+ will match any number of digits, for example: 326436 or 000001223. The Invoice #: \d+ expession finds all occurences of the "Invoice #: " text that is followed by one or more digit. For example: Invoice #: 19361 or Invoice #: 001.
Press "Next>>" button to advance to the next step.
Select PDF Files to Search ↑overview
Press "Add Files" button on the "Select Input Files" dialog to select files to search.
Add files for processing
Use "Open" dialog to select one or more PDF files. You can select multiple files by holding Shift key and clicking on the first and last files in the list.
Select files
Press "Open" button to accept selected files.
Files selected for processing
Now press OK button to start searching the files.
Examine the Search Results ↑overview
The matching text is displayed in the "Search Results" dialog. Resize the window for a better view and/or adjust the width of the Text Match column. Optionally, click on the results in the list to display a corresponding page location.
Inspect search results
Save Resuts as a Spreadsheet ↑overview
You can save search results into either a spreadsheet-ready CSV file or into a plain text file. Click "Extract Search Result" link at the bottom of the screen and select a desired output format. Let's examine the spreadsheet format first.
Select output file format for the extracted text
Select a spreadsheet option and press OK to select the output file location and name.
Select output file location
The output CSV file will be automatically open in Excel (or any other application that is designated to handle CSV files on your computer).
Output results in a spreadsheet format
Each text match is stored as a separate record. The spreadsheet contains the following columns:
  1. Full path to the PDF file that contains the text match
  2. Page number where the text is located
  3. Matching text
  4. Page coordinates that describe the location of the text on the page (Left, Right, Top, Bottom). Coordinates are provided in the PDF coordinate system with origin in the lower-left corner of the page, measurement units are points (72 points = 1 inch).
  5. Search rule number (1, 2, 3..). This is helpful in case if multiple search rules/expressions have been used.
Save Search Results into a Text File ↑overview
Alternatively, the matching text can be saved into a plain text file. Click "Extract Search Result..." link at the bottom of the "Search Results" dialog and select a text output option and press OK button:
Select text format option
Select output file location an name:
Select output file location
The output text file will be automatically open in Notepad (or any other application that is designated to handle text files on your computer).
Extract text
Each text match appears on a separate line. Unlike the spreadsheet option, the text export saves only matching text and does not provide any additonal information about the search results.
Extracting Multiple Text Patterns with Formatting
Overview
The second example illustrates how to extract multiple text patterns from a single PDF page. We are going to extract an invoice number and "amount due" from a set of individual PDF documents (each file represents an invoice). The extracted text is saved into a spreadsheet-ready CSV file or a plain text file. Here is a screenshot of the sample PDF document with an invoice number and "amount due" fields highlighted:
Sample invoice page with extraction text highlighted
The process is essentially the same as for the single text pattern extraction. The difference is in the more advanced search expression.
Specify Search Pattern ↑overview
Select "Plug-ins > AutoDocSearch Plug-in > Search PDF Files..." from the main Adobe Acrobat menu to open "Search PDF Files" dialog. Next, type the following search expression into the text box:
(Invoice #: \d+)[\s\S]+(Amount due: [$]\d+[,]?\d+[.]\d{2})(?#format=\1 \2)
Enter a search expression
This regular expression looks complicated at first. Let's examine it in detail. This expression consists of 4 logical parts:
  1. First capturing group that extracts an invoice number (for example: Invoice #: 37603)
  2. Text that appears between an invoice number and "amount due" field. It is matching one or more "space" and "non-space" characters.
  3. Second capturing group that extracts an "amount due" field (for example: Amount due $1,203.55).
  4. Special formatting expression that outputs only capturing groups 1 and 2 and ignores the rest of the text.
Regular expression diagram
Parentheses are used in regular expression to group a part of the expression together into a capturing group. It is used to refer to the specific part of the matching text after the match has happend. By using capturing groups we are able to refer to an invoice number and "amount due" field in the format expression such as \1 and \2 correspondingly. This allows to extract only these two text fields and ignore all other text that may appear between them.
IMPORTANT: Note that PDF documents are not "linear" text files. The text location on the page is totally unrelated to its actual positon in the text content. The order of the text can be different from an assumed reading order. Most often, the text does follow a natural top-to-bottom reading order, but it is totally up to the application that generated a specific PDF file. It is always a good idea to inspect the actual text order prior to writing any search expressions. You can do that by selecting text on the page and copy/pasting it into another text editor. Knowing the actual text order is cricual for writing a correct search expression.
Formatting the Search Results
The regular expression in this example is using an optional (?#format=....) expression. If this expression is omitted, then the search result will contain all text that is matched by the search pattern. In our case, this will include all text that appears between an invoice number and "amount due". The format expression can be used not only to refer to the specific parts of the search results, but also for changing the text order (for example, placing "amount due" field before the invoice number) and inserting a custom text into the output.
Let's see what we can do to the search results with different formatting expressions:
Format Expression Sample Output
(?#format=\1 \2) Invoice #: 37452 Amount due: $1,503.00
(?#format=\2 \1) Amount due: $1,503.00 Invoice #: 37452
(?#format=Accounts Payable: \1 (\2)) Accounts Payable: Invoice #: 37452 (Amount due: $1,503.00)
What if we need to extract only the invoice numbers and amounts, but without the labels (Invoice #: and Amount due:)? This can be easily accomplished by changing the capturing groups and placing them around the parts that matches the numbers and leaving the labels out. Here are some examples:
Regular Expression Sample Output
Invoice #: (\d+)[\s\S]+Amount due: ([$]\d+[,]?\d+[.]\d{2})(?#format=\1 \2) 37452 $1,503.00
Invoice #: (\d+)[\s\S]+Amount due: ([$]\d+[,]?\d+[.]\d{2})(?#format=Pay \2 for invoice \1) Pay $1,503.00 for invoice 37452
(Invoice #: \d+)[\s\S]+Amount due: ([$]\d+[,]?\d+[.]\d{2})(?#format=\1 (\2)) Invoice #: 37452 ($1,503.00)
See Advanced: Exporting Results as XML Data File section below for more examples of data formatting and structural output.
Select PDF Files to Search ↑overview
Press "Next>>" button (on "Search PDF Files" dialog) to advance to the next step. The rest of the step are identical to the first example. We are going to repeat them here for convinience.
Press "Add Files" button on the "Select Input Files" dialog to select files to search.
Add files for processing
Use "Open" dialog to select one or more PDF files. You can select multiple files by holding Shift key and clicking on the first and last files in the list.
Select files
Press "Open" button to accept selected files.
Files selected for processing
Now press OK button to start searching the files.
Examine the Search Results ↑overview
The matching text is displayed in the "Search Results" dialog. Resize the window for a better view and/or adjust the width of the Text Match column. Optionally, click on the results in the list to display a corresponding page location.
Inspect search results
Save Resuts as a Spreadsheet ↑overview
You can save search results into either a spreadsheet-ready CSV file or into a plain text file. Click "Extract Search Result" link at the bottom of the screen and select a desired output format. Let's examine the spreadsheet format first.
Select output file format for the extracted text
Select a spreadsheet option and press OK to select the output file location and name.
Select output file location
The output CSV file will be automatically open in Excel (or any other application that is designated to handle CSV files on your computer).
Output results in a spreadsheet format
Each text match is stored as a separate record.
Save Search Results into a Text File ↑overview
Alternatively, the matching text can be saved into a plain text file. Click "Extract Search Result..." link at the bottom of the "Search Results" dialog and select a text output option and press OK button:
Select text format option
Select output file location and name.
Select output file location
Please note that this dialog provides 3 different output file formats in the "Save as type" drop-down menu:
  1. Text File (*.txt)
  2. XML File (*.xml)
  3. HTML File (*.htm)
If XML or HTML options are selected, then the software will automatically add the appropriate file headers, however you are fully responsible for formating output results into the desired format. See the Advanced formatting topic for details on how to format output as XML or HTML.
The output text file will be automatically open in Notepad (or any other application that is designated to handle text files on your computer).
Extract text
Each text match appears on a separate line. Unlike the spreadsheet option, the text export saves only matching text and does not provide any additonal information about the search results.
Advanced: Exporting Results into Multi-Field Data Files
Exporting as XML Data File
The format expression can be used to export search results into multi-field data files such XML or CSV and placing each particular text field into its own data field or column. The output can be formatted as an XML data record. Use a plain text output format to save the results and use format expression to create individual data records. Make sure to select "Save as type" -> "XML File (*.xml)" in the output file selection dialog.  The search results can be formated into XML records just by using a format expression:
Regular expression:
Invoice #: (\d+)[\s\S]+Amount due: ([$]\d+[,]?\d+[.]\d{2})(?#format=<invoice><id>\1\</id><amount>\2\</amount></invoice>)
Sample Output:
XML output
Note that the extra \ symbol in from of < symbol in the regular expression. This is necessary to output < "as-is", not as a special character. Both < and > symbols are used to after the capturing group reference to change the case of the inserted text. For example, \1> will be replaced by an upper-case version of the capturing group 1, while \1< will be replaced with a lower case. The extra \ in front of the < is used to avoid interpreting it as case modifier.
Exporting as CSV (Excel) File
The following example shows how to output invoice numbers and dollar amounts as a 2-column CSV (Comma-separated values) file. Each field will be saved into a separate column to simplify further data processing. The CSV files can be directly opened by Excel or any other spreadsheet application. Make sure to use a plain text export option when saving the search results.
The search results from the "invoice" example can be formated into CSV records by using the following expression:
Regular expression:
Invoice #: (\d+)[\s\S]+Amount due: ([$]\d+[,]?\d+[.]\d{2})(?#format="\1","\2")
Sample Output:
Excel output
Exporting as HTML file
The following example shows how to format search results (invoice numbers and dollars amounts) as unordered list in HTML file.
Regular expression:
Invoice #: (\d+)[\s\S]+Amount due: ([$]\d+[,]?\d+[.]\d{2})(?#format=<hr><ul><li>Invoice \1\</li><li>Amount \2\</li></ul>)
Sample Output:
HTML output