Using Scripting to Extract Document Metadata
- Introduction
- Acrobat JavaScript provides access to many properties and elements of PDF documents that can be used for data extraction purposes. A document's metadata, bookmarks, file attachments, page information and text can all be accessed via custom scripting. In this tutorial, we will demonstrate how to use custom scripting to extract a document’s metadata properties (for example: “Title”, “Author”, “Keywords”) and assign them as field values in an output *.csv spreadsheet.
- What is Document Metadata?
- A document's metadata is information about one or more aspects of the document. Standard PDF metadata includes "Title", "Author", "Subject", "Keywords", "Application", "PDF Producer", "Created", and "Modified". The metadata can be viewed and edited via "File > Properties..." in Adobe® Acrobat® (see the "Description" tab). Only four standard metadata fields are directly editable by the user in Adobe® Acrobat®: "Title", "Subject", "Author", and "Keywords". The remaining fields are either updated automatically ("Created", "Modified") or set when the document is created ("PDF Producer" and "Application").
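- As a rough illustration of how these labels relate to scripting, the sketch below maps each label from the "Description" tab to the property name that the Acrobat JavaScript info object is generally understood to use. The mapping table, the readInfo helper, and the sampleDoc stand-in are assumptions made for this example and are not part of AutoExtract; check the property names against the Acrobat JavaScript reference before relying on them.

```javascript
// Assumed correspondence between the labels shown in "File > Properties..."
// and the property names on the Acrobat JavaScript info object.
var INFO_KEYS = {
  "Title": "Title",
  "Author": "Author",
  "Subject": "Subject",
  "Keywords": "Keywords",
  "Application": "Creator",
  "PDF Producer": "Producer",
  "Created": "CreationDate",
  "Modified": "ModDate"
};

// Hypothetical helper: returns the metadata value for a dialog label,
// or an empty string when the label is unknown or the property is not set.
function readInfo(doc, label) {
  var key = INFO_KEYS.hasOwnProperty(label) ? INFO_KEYS[label] : null;
  if (!key || !doc || !doc.info) {
    return "";
  }
  var value = doc.info[key];
  return (value === undefined || value === null) ? "" : value;
}

// Stand-in for an open PDF so the sketch can be tried outside Acrobat.
var sampleDoc = { info: { Title: "Sample", Producer: "Example Producer" } };
var producer = readInfo(sampleDoc, "PDF Producer");
var author = readInfo(sampleDoc, "Author"); // not set in sampleDoc, so ""
```

- Inside Acrobat, the same helper would presumably be called with the current document, for example readInfo(this, "PDF Producer").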
- What is JavaScript?
- JavaScript is the scripting language built into Adobe Acrobat. Custom JavaScript scripts can be used for data formatting, for assigning field values based on a document's metadata properties, or for custom processing logic. Each data field can optionally have a user-supplied script that is executed after the data value is extracted from the document. Please refer to the Adobe Acrobat documentation for details on using JavaScript in Acrobat.
- The goal of this tutorial is to create an output *.csv spreadsheet file that contains one data record of extracted metadata for each PDF document.
- Prerequisites
- You need a copy of Adobe® Acrobat® along with the AutoExtract™ plug-in installed on your computer in order to use this tutorial. Both are available as trial versions.
- Step 1 - Open AutoExtract
- Select "Plug-Ins > Extract Data > Extract Data Records From Document Text…" to open the "AutoExtract Plug-in" dialog.
- Step 2 - Add a Data Field
- Press the "Add Field..." button to add a field to the settings configuration.
- Enter a name for the data field into the "Field name:" box. This will become the field header in the output spreadsheet(s).
- Check the "Set or change field value by running JavaScript code" option and press "Edit Script...".
- Type the desired JavaScript code. For example, the statement event.value = this.info.Title; assigns the document's "Title" metadata property to the data field. The event.value variable initially holds the text that has been extracted from the document; you can modify it or assign a new value to it.
- Press "OK" to proceed.
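- Many PDF files have an empty "Title" property, so a slightly more defensive field script may be useful. The sketch below is one possible approach, not something prescribed by AutoExtract: the titleOrFileName helper is hypothetical, and it assumes the document object exposes its file name through the documentFileName property, as the Acrobat JavaScript Doc object is generally understood to do.

```javascript
// Hypothetical helper: prefer the "Title" metadata property and fall back
// to the file name when the title is missing or contains only whitespace.
function titleOrFileName(doc) {
  var title = doc.info ? doc.info.Title : "";
  if (typeof title === "string") {
    // Strip leading and trailing whitespace without relying on String.trim,
    // which may be unavailable in older scripting engines.
    title = title.replace(/^\s+|\s+$/g, "");
    if (title !== "") {
      return title;
    }
  }
  // Assumed property: the base file name of the document.
  return doc.documentFileName || "";
}

// Inside a field script the helper would be used as:
//   event.value = titleOrFileName(this);
```

- The fallback keeps every row in the spreadsheet identifiable even when a document was saved without a title.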
- Step 3 - Confirm Extraction Settings
- Repeat Step 2 for each additional data field you want in the output.
- Here, we've defined fields to extract each metadata item. The JavaScript code used for each individual field definition is:
- Extract "Title" field:
event.value = this.info.Title;
- Extract "Author" field:
event.value = this.info.Author;
- Extract "Subject" field:
event.value = this.info.Subject;
- Extract "Keywords" field:
event.value = this.info.Keywords;
- Enter an output filename template - the output spreadsheet in this example will be titled "Invoice_Metadata.csv". Check "Create single data file..." to store extracted data from all input PDFs in one spreadsheet file.
- Press "OK" to proceed.
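- The field definitions in Step 3 cover the four text properties. The same pattern could plausibly be extended to the "Created" and "Modified" dates, which benefit from explicit formatting before they are written to a spreadsheet. The sketch below assumes that this.info.CreationDate and this.info.ModDate are returned as JavaScript Date objects; formatInfoDate is a hypothetical helper introduced only for this example.

```javascript
// Hypothetical helper: format a Date as yyyy-mm-dd in local time,
// or return an empty string when the value is not a valid Date.
function formatInfoDate(value) {
  if (!(value instanceof Date) || isNaN(value.getTime())) {
    return "";
  }
  function pad(n) {
    return (n < 10 ? "0" : "") + n;
  }
  return value.getFullYear() + "-" +
    pad(value.getMonth() + 1) + "-" + // getMonth() is zero-based
    pad(value.getDate());
}

// Inside a field script the helper would be used as:
//   event.value = formatInfoDate(this.info.CreationDate);  // "Created"
//   event.value = formatInfoDate(this.info.ModDate);       // "Modified"
```

- A fixed yyyy-mm-dd layout sorts correctly as plain text, which is convenient when the *.csv file is opened in a spreadsheet application.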
- Step 4 - Extract Metadata
- Proceed through the next dialogs by selecting the desired input PDF documents. Open the output spreadsheet to inspect the extracted data. Each row is the record for one input PDF, and the extracted metadata properties appear under the corresponding data field headers.