Using Scripting to Extract Document Metadata
Introduction
Acrobat JavaScript provides access to many properties and elements of PDF documents that can be used for data extraction purposes. A document's metadata, bookmarks, file attachments, page information and text can all be accessed via custom scripting. In this tutorial, we will demonstrate how to use custom scripting to extract a document’s metadata properties (for example: “Title”, “Author”, “Keywords”) and assign them as field values in an output *.csv spreadsheet.
sample extracted metadata
What is Document Metadata?
A document's metadata is information about one or more aspects of the document. Standard PDF metadata includes: "Title", "Author", "Subject", "Keywords", "Application", "PDF Producer", "Created", and "Modified" etc. The metadata can be viewed and edited via "File > Properties..." in Adobe® Acrobat® (see the "Description" tab). Only 4 standard metadata fields are directly editable by the user in Adobe® Acrobat®: "Title", "Subject", "Author", and "Keywords". The rest of the fields are either updated automatically ("Created", "Modified") or set at the time the document is created ("PDF Producer" & "Application").
What is JavaScript?
JavaScript is Adobe Acrobat's built-in scripting engine. Custom JavaScript scripts can be used for: data formatting; assigning field values based on a document's metadata properties; or custom processing logic. Each data field can optionally have a user-supplied script that is executed after the data value is extracted from the document. Please refer to Adobe Acrobat documentation for details on using Acrobat's JavaScript programming language.
The goal is to create data records containing metadata extracted from each PDF document in an output *.csv spreadsheet file.
Prerequisites
You need a copy of Adobe® Acrobat® along with the AutoExtract™ plug-in installed on your computer in order to use this tutorial. Both are available as trial versions.
Step 1 - Open AutoExtract
Select "Plug-Ins > Extract Data > Extract Data Records From Document Text…" to open the "AutoExtract Plug-in" dialog.
open autoextract
Step 2 - Add a Data Field
Press the "Add Field..." button to add a field to the settings configuration.
add data field
Enter a name for the data field into the "Field name:" box. This will become the field header in the output spreadsheet(s).
name data field
Check the "Set or change field value by running JavaScript code" option and press "Edit Script...".
use javascript
Type the desired JavaScript code - the code shown here is an example of assigning data field values using a document's "Title" metadata property. Use the event.value variable to assign a new value to the data field. This variable holds text that has been extracted from the document. You can modify it or assign a new value.
Press "OK" to proceed.
type javascript code
Step 3 - Confirm Extraction Settings
Repeat step 2 to define multiple data fields in the output.
Here, we've defined fields to extract each metadata item. The JavaScript code used for each individual field definition is:
Extract "Title" field:
event.value = this.info.Title;
Extract "Author" field:
event.value = this.info.Author;
Extract "Subject" field:
event.value = this.info.Subject;
Extract "Keywords" field:
event.value = this.info.Keywords;
Enter an output filename template - the output spreadsheet in this example will be titled "Invoice_Metadata.csv". Check "Create single data file..." to store extracted data from all input PDFs in one spreadsheet file.
Press "OK" to proceed.
confirm extraction settings
Step 4 - Extract Metadata
Proceed through the next dialogs by selecting the desired input PDF documents. Open the ouput spreadsheet to inspect the extracted data. Every row is a record for each input PDF - the extracted metadata properties will be presented under corresponding data field headers:
inspect metadata
Click here for a list of all step-by-step tutorials available.