Using Scripting to Redact Extracted Text
Introduction
In this tutorial, we will look at ways to redact sensitive data while using the AutoExtract plug-in to extract text from PDF files. Sensitive content can either be removed entirely (see step 4) or replaced with placeholder characters such as XXXX (see step 5). The first example below shows how to redact social security numbers (SSNs) from sample PDF payslips, where the SSN appears at the same location on the page in each file. The redaction is performed by a custom Acrobat JavaScript script.
sample redacted SSN
What is Acrobat JavaScript?
Acrobat JavaScript is the scripting language built into Adobe Acrobat. Custom scripts can be used for formatting data, assigning field values based on a document's metadata properties, or implementing custom processing logic. Each data field extracted from PDF files with AutoExtract can optionally have a user-supplied script that is executed after the data value is extracted from the document. Please refer to the Adobe Acrobat documentation for details on Acrobat's JavaScript programming language.
Prerequisites
You need a copy of Adobe® Acrobat® along with the AutoExtract™ plug-in installed on your computer in order to use this tutorial. Both are available as trial versions.
Objective: Redact Extracted Social Security Numbers
Step 1 - Edit Data Field Settings
With the AutoExtract plug-in dialog open in Acrobat (Plug-Ins > Extract Data > Extract Data Records From Document Text…), either add a new data field definition or edit an existing one. To edit a field's settings, double-click the field in the list, or select it and press "Edit Field...".
Here, we will modify the "SSN" field so that social security numbers extracted from sample PDF payslips are partly redacted in the output spreadsheet and follow an "XXX-XX-nnnn" format.
add/edit a field
Step 2 - Add a Script
Check "Set or change field value by running JavaScript code" and press "Edit Script...".
add a script
Type the desired JavaScript code. The script used here searches the extracted text for the SSN pattern and replaces the first five digits and their delimiters with "XXX-XX-", leaving the last four digits intact. The script uses a regular expression to search for the pattern. Note that regular expressions are case-sensitive by default (add the i flag for case-insensitive matching). Any online JavaScript tutorial can be used to learn more about search and replace syntax.
event.value = event.value.replace(/\b\d{3}\W\d{2}\W(?=\d{4}\b)/g, "XXX-XX-");
The script operates on event.value, which holds the extracted text. The search pattern uses the \W (non-word character) metacharacter to match whatever non-word symbol separates the groups of digits. As of October 2021, the JavaScript engine that comes with Adobe Acrobat does not appear to support the regular expression Unicode flag (u), so Unicode property escapes such as \p{Dash} or \p{Pd} cannot be used to match all possible dash symbols. This matters because a dash can be represented by several different characters (for example hyphen-minus, en dash, em dash, or minus sign) that are very hard to distinguish visually, which is a common source of problems when working with text that contains dashes.
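If \W turns out to be too permissive for a given data set (it also matches periods, slashes, and spaces), one alternative is to list the accepted delimiters explicitly using \uXXXX escapes, which do not require the Unicode flag. The following is a minimal sketch of that idea, not part of AutoExtract: the helper name redactSsn and the particular set of dash characters are assumptions chosen for illustration and may need adjusting for real documents.

```javascript
// Hypothetical helper for illustration; not an AutoExtract or Acrobat API.
// Delimiter class: ASCII hyphen-minus, U+2010..U+2015 (hyphen, non-breaking
// hyphen, figure dash, en dash, em dash, horizontal bar), U+2212 (minus sign),
// and any whitespace character.
var SSN_PATTERN = /\b\d{3}[-\u2010-\u2015\u2212\s]\d{2}[-\u2010-\u2015\u2212\s](?=\d{4}\b)/g;

function redactSsn(text) {
    // The \b inside the lookahead rejects digit runs longer than four,
    // so a value such as 123-45-67890 is left untouched.
    return text.replace(SSN_PATTERN, "XXX-XX-");
}
```

Inside a field script, this would be applied as event.value = redactSsn(event.value); with the function definition placed in the same script.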
Press "OK" to proceed.
type javascript code
Press "OK" again to save changes.
save changes
Step 3 - Inspect Output Data
Proceed to extract data from input documents using these settings. Open the output spreadsheet(s) and inspect the data field modified in the steps above. Here, field values in the "SSN" column have been partly redacted:
inspect redacted SSN's
Objective: Redact Extracted Text Within [Brackets]
Step 4 - Add a Script
The following example shows how to redact any text that is placed within [...] brackets in the input document. This is a popular way to designate certain text for redaction. We'll demonstrate this below by using sample payslips where address street numbers are placed inside brackets:
text in [brackets]
As in steps 1 and 2 above, either add a new field or edit an existing one to add a search and replace post-processing script. Here we will modify the "Address" field so that street numbers in the extracted addresses are redacted.
add/edit a field
Check "Set or change field value by running JavaScript code" and press "Edit Script...".
add a script
Type the desired JavaScript code. The code used here searches for text enclosed in [...] brackets and removes it, together with the brackets themselves, from the value written to the output spreadsheet:
event.value = event.value.replace(/\[[^\]]+\]/g, "");
remove bracketed text
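Removing a bracketed span can leave a stray leading space or a double space behind. The following is a minimal sketch of one way to tidy the result; the helper name removeBracketed is hypothetical, and whether this clean-up is wanted depends on how the addresses should look in the spreadsheet.

```javascript
// Hypothetical helper for illustration; not an AutoExtract or Acrobat API.
function removeBracketed(text) {
    return text
        .replace(/\[[^\]]+\]/g, "")    // drop each [...] span, brackets included
        .replace(/ {2,}/g, " ")        // collapse runs of spaces left behind
        .replace(/^\s+|\s+$/g, "");    // trim both ends without String.trim
}
```

Inside a field script, this would be applied as event.value = removeBracketed(event.value);. Only runs of spaces are collapsed, so line breaks inside a multi-line address are preserved.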
Step 5 - Replace Text with XXXXs
It's also possible to put placeholder text (for example: XXXX) in place of the removed characters, by using the following variation of this script:
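event.value = event.value.replace(/\[[^\]]+\]/g, "XXXX");
This code replaces each bracketed span, including the brackets, with the replacement string given as the second argument ("XXXX" here). The placeholder is the same regardless of how long the original text was; change the string to use a different number of X's.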
replace bracketed text
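If the placeholder should instead have the same length as the text it replaces, standard JavaScript allows a function to be passed as the second argument of replace. The following is a minimal sketch, assuming Acrobat's engine accepts a replacer function as standard JavaScript does; the helper name maskBracketed is hypothetical.

```javascript
// Hypothetical helper for illustration; not an AutoExtract or Acrobat API.
function maskBracketed(text) {
    // The capture group holds the text between the brackets.
    return text.replace(/\[([^\]]+)\]/g, function (match, inner) {
        // Joining an array of inner.length + 1 empty slots with "X" yields
        // a string of exactly inner.length X characters.
        return new Array(inner.length + 1).join("X");
    });
}
```

Inside a field script, this would be applied as event.value = maskBracketed(event.value);. Note that a length-preserving mask reveals how long the redacted text was, so the fixed "XXXX" placeholder may be preferable when that is a concern.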