Extracting Pages Via a Text Search Using a Command-line BAT File
Introduction
Manually extracting PDF pages from a document can be slow. AutoSplit™ can automatically extract pages that contain specific text from input files when it is run from a command-line BAT file. This is a script file containing instructions for searching the pages of a document for specific text (or a text pattern) and extracting them. The first step is to create a custom "Extract Pages by Text Search" configuration in AutoSplit, which the BAT file then references. The BAT file instructs AutoSplit to run this search on any input file, extract the matching pages, and place them in a specified output location.
Input Files and Extraction Method
The input file used to demonstrate this method contains a collection of invoices. Some invoices contain the text "PAID" or "TOTAL DUE: 0.00".
The goal is to extract these pages, so that the output file contains only the invoices with this text.
Prerequisites
To follow this tutorial, you need a copy of Adobe® Acrobat® with the AutoSplit plug-in installed on your computer. Both are available as trial versions.
Step 1 - Open the "Extract Pages by Text Search" Dialog
With the file to be processed open in Acrobat, select "Plug-Ins > Split Documents > Extract Pages By Text Search" from the main menu.
Step 2 - Specify Text Search Options
Use this dialog to configure the text search. In this example, the goal is to extract pages that contain the text "PAID" or "TOTAL DUE: 0.00". Type the text to search for in the entry box, one item per line.
Pages found to contain any of these search items will be extracted. See the separate tutorial on how to extract PDF pages via a text search for detailed help with configuring these settings.
Press "File Naming..." to configure output file naming options. With the default settings, "Extracted from" is added before the original filename, so the output for "Invoices.pdf" is named "Extracted from Invoices.pdf".
Press "Save..." to save these settings as a text search settings file.
Step 3 - Save the Text Search Settings
Choose a folder and enter a name for the file, which is saved with a *.textsearch extension. This example is saved as "Settings.textsearch".
Press "Save" to proceed.
Step 4 - Create the BAT File
See the separate tutorial for detailed help on running an operation from a command-line BAT file.
Create a BAT file using any plain text editor (such as Notepad). Begin with a blank text file, then add the following lines, replacing the example file paths and filenames with the ones you are using:
SET AUTOSPLIT_CONFIG_FILE=C:\Data\Settings.textsearch
SET AUTOSPLIT_BAT_ENABLE=ON
SET AUTOSPLIT_MODE=ExtractPages
SET AUTOSPLIT_INPUT_FILE=C:\Data\Input\Invoices.pdf
SET AUTOSPLIT_OUTPUT_FOLDER=C:\Data\Output
SET AUTOSPLIT_LOG_FILE=C:\Data\ExtractedPagesLog.txt
"C:\Program Files (x86)\Adobe\Acrobat DC\Acrobat\Acrobat.exe" /n /h
AUTOSPLIT_CONFIG_FILE specifies the full file path to the text search settings file created in steps 2 and 3.
AUTOSPLIT_MODE specifies the processing type, which here is an "ExtractPages" operation.
AUTOSPLIT_INPUT_FILE specifies a full file path to the input file.
AUTOSPLIT_OUTPUT_FOLDER specifies the folder (C:\Data\Output) in which the output file(s) are saved. Input files are not overwritten. If the output folder already contains a file with the same name, regular Windows-style duplicate filename resolution is applied.
Overall, the BAT file needs to specify three paths: the settings file, an input PDF file or folder, and an output folder.
Use the AUTOSPLIT_LOG_FILE variable to specify a log file location, which is useful for troubleshooting and record keeping. If the log file does not exist, it is created automatically. If it already exists, new records are appended to it.
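The variables described above can be combined with a few pre-flight checks so that a mistyped path is reported before Acrobat starts. The following is a minimal sketch, not part of the original BAT file: the AUTOSPLIT_* names and paths are the tutorial's examples, while ACROBAT_EXE, the labels, and the messages are hypothetical helpers introduced here for illustration. Creating the output folder in advance is an added precaution, and the exit code reflects only these checks, not the result of the extraction.

```bat
@echo off
rem Sketch only: the Step 4 settings plus pre-flight checks.
rem ACROBAT_EXE is a helper variable introduced for this example;
rem the AUTOSPLIT_* names and all paths are the tutorial's examples.
setlocal

set "AUTOSPLIT_CONFIG_FILE=C:\Data\Settings.textsearch"
set "AUTOSPLIT_BAT_ENABLE=ON"
set "AUTOSPLIT_MODE=ExtractPages"
set "AUTOSPLIT_INPUT_FILE=C:\Data\Input\Invoices.pdf"
set "AUTOSPLIT_OUTPUT_FOLDER=C:\Data\Output"
set "AUTOSPLIT_LOG_FILE=C:\Data\ExtractedPagesLog.txt"
set "ACROBAT_EXE=C:\Program Files (x86)\Adobe\Acrobat DC\Acrobat\Acrobat.exe"

rem Stop early with a readable message if a required file is missing.
if not exist "%AUTOSPLIT_CONFIG_FILE%" goto :missing_config
if not exist "%AUTOSPLIT_INPUT_FILE%" goto :missing_input
if not exist "%ACROBAT_EXE%" goto :missing_acrobat

rem Create the output folder if it does not exist yet.
if not exist "%AUTOSPLIT_OUTPUT_FOLDER%\" mkdir "%AUTOSPLIT_OUTPUT_FOLDER%"

rem Same launch line as in Step 4.
"%ACROBAT_EXE%" /n /h
exit /b 0

:missing_config
echo Settings file not found: "%AUTOSPLIT_CONFIG_FILE%"
goto :fail

:missing_input
echo Input file not found: "%AUTOSPLIT_INPUT_FILE%"
goto :fail

:missing_acrobat
echo Acrobat not found: "%ACROBAT_EXE%"
goto :fail

:fail
rem Keep the window open so the message can be read after a double-click.
pause
exit /b 1
```

The checks use goto labels rather than parenthesized blocks, and every path is quoted, so the parentheses in "Program Files (x86)" cannot interfere with how the script is parsed.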
Step 5 - Save the BAT File
Select "File > Save As..." to save the text as a BAT file.
Notepad prompts you to save the text as a *.txt file. Choose a folder and use the "Save as type:" list to select "All Files". Name the file and manually add a *.bat file extension, then press "Save".
Step 6 - Run the BAT File
Double-click on the BAT file to run it.
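As an alternative to double-clicking, the BAT file can be started from a Command Prompt window, which stays open afterwards so that any messages remain visible. A short sketch, assuming the BAT file was saved in Step 5 as "ExtractInvoices.bat" in C:\Data (both the name and the location are hypothetical):

```bat
rem Typed at a Command Prompt. ExtractInvoices.bat is a hypothetical name
rem for the BAT file saved in Step 5, assumed here to be stored in C:\Data.
cd /d "C:\Data"
ExtractInvoices.bat
```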
Step 7 - Inspect the Results
Open the output folder to view the new file. Note that the log file has also been created.
Open the output file.
All pages containing the text specified in step 2 have been extracted from the input document.
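The same inspection can be done from a script. The sketch below lists the output folder and prints the log file; it assumes the example paths from Step 4 and the default file naming from Step 2, so adjust both if your settings differ. The echoed messages are illustrative only.

```bat
@echo off
rem Sketch: list the output folder and show the log after a run.
rem Paths follow the tutorial's example; the filename assumes default naming.
dir /b "C:\Data\Output"

if exist "C:\Data\Output\Extracted from Invoices.pdf" (
    echo Output file found.
) else (
    echo Output file not found, check the log below.
)

rem Print the log only if it has been created.
if exist "C:\Data\ExtractedPagesLog.txt" type "C:\Data\ExtractedPagesLog.txt"
pause
```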