Using Scripting to Extract Document Metadata

Introduction: Acrobat JavaScript provides access to many properties and elements of PDF documents that can be used for data extraction purposes. A document's metadata, bookmarks, file attachments, page information and text can all be accessed via custom scripting. In this tutorial, we will demonstrate how to use custom scripting to extract a document’s metadata properties (for example: “Title”, “Author”, “Keywords”) and assign them as field values in an output *.csv spreadsheet.
What is Document Metadata?: A document's metadata is information about one or more aspects of the document. Standard PDF metadata includes: "Title", "Author", "Subject", "Keywords", "Application", "PDF Producer", "Created", and "Modified" etc. The metadata can be viewed and edited via "File > Properties..." in Adobe® Acrobat® (see the "Description" tab). Only 4 standard metadata fields are directly editable by the user in Adobe® Acrobat®: "Title", "Subject", "Author", and "Keywords". The rest of the fields are either updated automatically ("Created", "Modified") or set at the time the document is created ("PDF Producer" & "Application").
What is JavaScript?: JavaScript is Adobe Acrobat's built-in scripting engine. Custom JavaScript scripts can be used for: data formatting; assigning field values based on a document's metadata properties; or custom processing logic. Each data field can optionally have a user-supplied script that is executed after the data value is extracted from the document. Please refer to Adobe Acrobat documentation for details on using Acrobat's JavaScript programming language.; The goal is to create data records containing metadata extracted from each PDF document in an output *.csv spreadsheet file.
Prerequisites: You need a copy of Adobe® Acrobat® along with the AutoExtract™ plug-in installed on your computer in order to use this tutorial. Both are available as trial versions.

Step 1 - Open AutoExtract: Select "Plug-Ins > Extract Data > Extract Data Records From Document Text…" to open the "AutoExtract Plug-in" dialog.
Step 2 - Add a Data Field: Press the "Add Field..." button to add a field to the settings configuration.; Enter a name for the data field into the "Field name:" box. This will become the field header in the output spreadsheet(s).; Check the "Set or change field value by running JavaScript code" option and press "Edit Script...".; Type the desired JavaScript code - the code shown here is an example of assigning data field values using a document's "Title" metadata property. Use the event.value variable to assign a new value to the data field. This variable holds text that has been extracted from the document. You can modify it or assign a new value.; Press "OK" to proceed.
Step 3 - Confirm Extraction Settings: Repeat step 2 to define multiple data fields in the output.; Here, we've defined fields to extract each metadata item. The JavaScript code used for each individual field definition is:; Extract "Title" field:
event.value = this.info.Title;; Extract "Author" field:
event.value = this.info.Author;; Extract "Subject" field:
event.value = this.info.Subject;; Extract "Keywords" field:
event.value = this.info.Keywords;; Enter an output filename template - the output spreadsheet in this example will be titled "Invoice_Metadata.csv". Check "Create single data file..." to store extracted data from all input PDFs in one spreadsheet file.; Press "OK" to proceed.
Step 4 - Extract Metadata: Proceed through the next dialogs by selecting the desired input PDF documents. Open the ouput spreadsheet to inspect the extracted data. Every row is a record for each input PDF - the extracted metadata properties will be presented under corresponding data field headers:; Click here for a list of all step-by-step tutorials available.