Using Scripting to Format Extracted Text
- Introduction
- Acrobat JavaScript provides access to many properties and elements of PDF documents that can be used for data extraction. Custom scripts can also format output field values after they are extracted. This tutorial demonstrates how to use custom scripting to “post-process” text extracted from input documents. For example, a script can convert field values to all caps (see steps 1 - 3 below), or search and replace field text using regular expressions (see steps 4 & 5 below).
- The extraction process shown here uses sample PDF invoices. Data will be extracted from specific page locations - the image below shows how text appears in the input:
- What is JavaScript?
- JavaScript is the scripting language built into Adobe Acrobat. Custom scripts can be used for data formatting, for assigning field values based on a document's metadata properties, or for custom processing logic. Each data field can optionally have a user-supplied script that runs after the data value is extracted from the document. Please refer to the Adobe Acrobat documentation for details on Acrobat's JavaScript programming language.
- Prerequisites
- You need a copy of Adobe® Acrobat® along with the AutoExtract™ plug-in installed on your computer in order to use this tutorial. Both are available as trial versions.
- Objective: Capitalize Field Values
- Step 1 - Edit Data Field Settings
- With the AutoExtract plug-in dialog open in Acrobat (Plug-Ins > Extract Data > Extract Data Records From Document Text…), either add a new data field definition or edit an existing one. To edit a data field's settings, double-click it in the list, or select it and press "Edit Field...".
- Here, we will modify the "Name" field so that customer names extracted from some sample invoices are fully capitalized in the output spreadsheet.
- Step 2 - Add a Script
- Check "Set or change field value by running JavaScript code" and press "Edit Script...".
- Type the desired JavaScript code. The code used here converts each field value to FULL CAPS after it is extracted from the input documents:
event.value = event.value.toUpperCase();
- Press "OK" to proceed.
- Press "OK" again to save changes.
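- To preview what the script does before running a full extraction, the same statement can be tried outside Acrobat. The sketch below is only an illustration: `applyUpperCase` and the stand-in `event` object are hypothetical names, and it assumes, as the script above does, that the extracted text is passed to the script in `event.value`.

```javascript
// Hypothetical helper: applies the tutorial's one-line script to a stand-in
// object shaped like the `event` that the field script receives (assumption).
function applyUpperCase(event) {
    // String() guards against a non-string value; this is a precaution,
    // not documented plug-in behavior.
    event.value = String(event.value).toUpperCase();
    return event.value;
}

// Made-up stand-in for an extracted "Name" value.
var sample = { value: "Acme Supplies Ltd." };
console.log(applyUpperCase(sample)); // ACME SUPPLIES LTD.
```

- Inside the plug-in's script editor only the single assignment to `event.value` is needed; the wrapper function exists here purely so the behavior can be checked in an ordinary JavaScript console.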
- Step 3 - Inspect Output Data
- Proceed to extract data from input documents using these settings. Open the output spreadsheet(s) and inspect the data field modified in the steps above. Here, field values in the "Name" column have become capitalized:
- Objective: Search & Replace Characters
- Step 4 - Add a Search & Replace Script
- As with steps 1 & 2 above, either add a new field or edit an existing one to add a search and replace post-processing script. We will demonstrate this by modifying the "Address" field so that any commas within extracted addresses are replaced with a space.
- Check "Set or change field value by running JavaScript code" and press "Edit Script...".
- Type the desired JavaScript code. The code used here replaces every comma with a space after the address text is extracted from the input documents:
event.value = event.value.replace(/,/g, " ");
- Press "OK" to proceed.
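- The `g` (global) flag in the pattern is what makes the script replace every comma rather than only the first one. The difference can be sketched in any JavaScript console; the sample address and the function names below are made up for illustration:

```javascript
// Illustrative only: the sample address and function names are hypothetical.
function replaceFirstComma(text) {
    return text.replace(/,/, " ");   // no g flag: only the first match is replaced
}

function replaceAllCommas(text) {
    return text.replace(/,/g, " ");  // g flag: every match is replaced
}

var address = "12 Main St,Springfield,IL";
console.log(replaceFirstComma(address)); // 12 Main St Springfield,IL
console.log(replaceAllCommas(address));  // 12 Main St Springfield IL
```

- Note that if the extracted text already has a space after each comma, replacing the comma with a space leaves two consecutive spaces at that position.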
- Here is another useful snippet that replaces newline characters with spaces, converting multiline text into a single line:
event.value = event.value.replace(/\n/g, " ");
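- The comma and newline replacements can also be combined into one script. The sketch below is an example under stated assumptions rather than part of the plug-in: `cleanAddress` is a hypothetical helper name, and the carriage-return handling assumes that some input documents may yield Windows-style line breaks (`\r\n`), of which the `\n` pattern alone would remove only the second character.

```javascript
// Hypothetical helper combining the tutorial's two replacements.
function cleanAddress(text) {
    return String(text)
        .replace(/\r\n|\r|\n/g, " ")   // any line-break style becomes one space
        .replace(/,/g, " ")            // commas become spaces
        .replace(/\s+/g, " ")          // collapse runs of whitespace into one space
        .replace(/^\s+|\s+$/g, "");    // strip leading and trailing whitespace
}

// In a field script this would be used as follows (assuming the extracted
// text is available in event.value, as in the scripts above):
// event.value = cleanAddress(event.value);

console.log(cleanAddress("12 Main St,\r\nSpringfield,  IL\n"));
// 12 Main St Springfield IL
```

- Collapsing whitespace after the replacements avoids the double spaces that appear when a comma or line break was already next to a space.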
- Step 5 - Inspect Output Data
- Proceed to extract data from input documents using these settings. Open the output spreadsheet(s) and inspect the relevant data field. Here, field values in the "Address" column no longer contain commas: