Using Scripting to Format Extracted Text
Introduction
Acrobat JavaScript provides access to many properties and elements of PDF documents that can be used for data extraction. Custom scripts can also format field values after they are extracted. This tutorial demonstrates how to use custom scripting to “post-process” text extracted from input documents: for example, converting field values to all caps (see steps 1 - 3 below), or searching and replacing field text using regular expressions (see steps 4 & 5 below).
sample formatted values
The extraction process shown here uses sample PDF invoices. Data will be extracted from specific page locations - the image below shows how text appears in the input:
sample input PDF
What is JavaScript?
JavaScript is the scripting language supported by Adobe Acrobat's built-in scripting engine. Custom scripts can be used for data formatting, for assigning field values based on a document's metadata properties, or for custom processing logic. Each data field can optionally have a user-supplied script that is executed after the data value is extracted from the document. Please refer to the Adobe Acrobat documentation for details on using JavaScript in Acrobat.
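As an illustration of assigning a field value from document metadata, here is a minimal sketch that falls back to the document title when nothing was extracted. The helper name `valueOrTitle` is hypothetical. `info.Title` is a property of the Acrobat JavaScript Doc object, but whether `this` refers to the current document inside a field script is an assumption that should be checked against the plug-in documentation.

```javascript
// Hypothetical helper: returns the extracted value, or the document title
// taken from the metadata when the extracted value is empty.
function valueOrTitle(extracted, doc) {
    var text = (extracted === null || extracted === undefined) ? "" : String(extracted);
    // Keep the extracted text if it contains anything other than whitespace.
    if (text.replace(/\s+/g, "") !== "") {
        return text;
    }
    // doc.info.Title is the Title entry of the document information.
    return (doc && doc.info && doc.info.Title) ? String(doc.info.Title) : "";
}

// Assumed usage in a field script, if `this` refers to the current document:
// event.value = valueOrTitle(event.value, this);
```

Passing the document in as a parameter keeps the helper easy to try outside Acrobat with a plain object such as `{ info: { Title: "Invoice" } }`.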
Prerequisites
You need a copy of Adobe® Acrobat® along with the AutoExtract™ plug-in installed on your computer in order to use this tutorial. Both are available as trial versions.
Objective: Capitalize Field Values
Step 1 - Edit Data Field Settings
With the AutoExtract plug-in dialog open in Acrobat (Plug-Ins > Extract Data > Extract Data Records From Document Text…), either add a new data field definition or edit an existing one. To edit data field settings, double-click a field in the list, or select it and press "Edit Field...".
Here, we will modify the "Name" field so that customer names extracted from some sample invoices are fully capitalized in the output spreadsheet.
add/edit a field
Step 2 - Add a Script
Check "Set or change field value by running JavaScript code" and press "Edit Script...".
add a script
Type the desired JavaScript code. The code used here converts each field value to FULL CAPS after it is extracted from the input documents:
event.value = event.value.toUpperCase();
Press "OK" to proceed.
type javascript code
Press "OK" again to save changes.
save changes
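If extracted names may carry stray leading or trailing spaces, or may be missing entirely, a slightly more defensive variant of the script can help. The following is a minimal sketch. The helper name `toCleanUpperCase` is hypothetical, and it assumes that `event.value` holds the extracted text as in the one-line script above.

```javascript
// Hypothetical helper: trims the extracted text and converts it to upper case.
// Returns an empty string when no value was extracted.
function toCleanUpperCase(value) {
    if (value === null || value === undefined) {
        return "";
    }
    // String() guards against non-string values; the regex trims both ends.
    return String(value).replace(/^\s+|\s+$/g, "").toUpperCase();
}

// In the field script (assumes event.value holds the extracted text):
// event.value = toCleanUpperCase(event.value);
```

The sketch trims with a regular expression and declares variables the older way, in case the scripting engine in your Acrobat version does not support newer JavaScript features.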
Step 3 - Inspect Output Data
Proceed to extract data from input documents using these settings. Open the output spreadsheet(s) and inspect the data field modified in the steps above. Here, field values in the "Name" column have become capitalized:
inspect capitalized values
Objective: Search & Replace Characters
Step 4 - Add a Search & Replace Script
As with steps 1 & 2 above, either add a new field or edit an existing one to add a search and replace post-processing script. We will demonstrate this by modifying the "Address" field so that any commas within extracted addresses are replaced with a space.
add/edit a field
Check "Set or change field value by running JavaScript code" and press "Edit Script...".
add a script
Type the desired JavaScript code. The code used here finds every comma in the extracted address text and replaces it with a space:
event.value = event.value.replace(/,/g, " ");
Press "OK" to proceed.
type javascript code
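Note that replacing only the comma leaves two consecutive spaces wherever the comma was already followed by a space (for example, "Main St, Springfield"). If that matters for your output, the sketch below replaces each comma together with any surrounding whitespace by a single space. The helper name `replaceCommas` is hypothetical, and `event.value` is assumed to hold the extracted text.

```javascript
// Hypothetical helper: replaces each comma, along with any whitespace
// around it, with a single space, then trims the result.
function replaceCommas(value) {
    return String(value)
        .replace(/\s*,\s*/g, " ")    // comma plus surrounding whitespace -> one space
        .replace(/^\s+|\s+$/g, "");  // remove leading and trailing whitespace
}

// In the field script (assumes event.value holds the extracted text):
// event.value = replaceCommas(event.value);
```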
Here is another useful script that replaces newline characters with spaces, converting multiline text into a single line:
event.value = event.value.replace(/\n/g, " ");
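Depending on how the source PDF was produced, line breaks in extracted text might also appear as a carriage return or as a carriage return followed by a line feed. That is an assumption about the input, not something the plug-in is known to emit. A minimal sketch that handles all three forms and collapses the resulting runs of spaces, using the hypothetical helper name `joinLines`:

```javascript
// Hypothetical helper: converts multiline text into a single line.
function joinLines(value) {
    return String(value)
        .replace(/\r\n|\r|\n/g, " ")  // CRLF, CR, or LF line break -> one space
        .replace(/ {2,}/g, " ");      // collapse runs of spaces into one
}

// In the field script (assumes event.value holds the extracted text):
// event.value = joinLines(event.value);
```

The alternation lists `\r\n` first so that a carriage return and line feed pair is replaced by one space rather than two.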
Step 5 - Inspect Output Data
Proceed to extract data from input documents using these settings. Open the output spreadsheet(s) and inspect the relevant data field. Here, field values in the "Address" column no longer contain commas:
check replaced characters