Pdfbatchqa is a simple tool for automated checking of digitisation batches of PDF files against a user-defined technical profile. Internally it wraps around the pdfimages tool from the Poppler library, which is used to extract the image-related properties for each PDF. The pdfimages output is then validated against a set of Schematron schemas that define the required technical characteristics.
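The extraction step described above can be illustrated with a minimal sketch. It is not pdfbatchqa's actual code: the helper names `run_pdfimages` and `parse_pdfimages_list` are hypothetical, and the parser assumes the whitespace-separated column layout that recent Poppler versions print for `pdfimages -list` (a header row, a dashed separator row, then one row per image). The Schematron validation step is not shown.

```python
import subprocess


def run_pdfimages(pdf_path):
    """Return the raw text printed by `pdfimages -list` for one PDF.

    Assumes the Poppler `pdfimages` binary is available on the PATH.
    """
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_pdfimages_list(text):
    """Turn the table printed by `pdfimages -list` into a list of dicts.

    Assumption: the first line holds the column names, the second line is
    a dashed separator, and every following line has one value per column.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        return []
    header = lines[0].split()
    images = []
    for row in lines[2:]:
        values = row.split()
        if len(values) != len(header):
            # Skip rows that do not match the assumed layout.
            continue
        images.append(dict(zip(header, values)))
    return images
```

Each resulting dictionary maps a column name such as `enc` or `x-ppi` to the string value reported for one image.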
The easiest method to install pdfbatchqa is to use the pip package manager.
This will work on any platform for which Python is available. You need a recent version of pip (version 9.0 or more recent). To install pdfbatchqa for a single user, use the following command:
pip install pdfbatchqa --user
To install pdfbatchqa for all users, use the following command:
pip install pdfbatchqa
You need local admin (Windows) / superuser (Linux) privileges to do this. On Windows, you can do this by running the above command in a Command Prompt window that was opened as Administrator. On Linux, use this:
sudo pip install pdfbatchqa
usage: pdfbatchqa batchDir prefixOut -p PROFILE
batchDir: root directory of batch
prefixOut: prefix that is used for writing output files
PROFILE: name of profile that defines the validation schemas
To list all available profiles, use a value of l or list for PROFILE.
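As an illustration of the command-line interface described above, here is a minimal `argparse` sketch. It is an assumption about how such an interface could be declared, not a copy of pdfbatchqa's own argument handling; `build_parser` and `wants_profile_list` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented usage line."""
    parser = argparse.ArgumentParser(prog="pdfbatchqa")
    parser.add_argument("batchDir", help="root directory of batch")
    parser.add_argument(
        "prefixOut", help="prefix that is used for writing output files"
    )
    parser.add_argument(
        "-p",
        dest="profile",
        required=True,
        metavar="PROFILE",
        help="name of profile that defines the validation schemas",
    )
    return parser


def wants_profile_list(profile):
    """Return True if the PROFILE value asks for a listing of profiles."""
    return profile in ("l", "list")
```

With a parser declared like this, the `-p PROFILE` option may appear before or after the two positional arguments.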
Pdfbatchqa was designed for processing digitisation batches that are delivered to the KB by external suppliers as part of the DBNL stream. For each digitised publication, these batches typically contain two PDF files:
- A high quality PDF with images in JPEG format that are encoded at 85% JPEG quality
- A lower quality PDF with images in JPEG format that are encoded at 50% JPEG quality
TODO: describe how we can distinguish between the high quality and the lower quality PDF (folder name, file name?).
A profile is an XML-formatted file that simply defines which schemas are used to validate the extracted properties of the high and low quality PDFs, respectively. Here's an example:
<?xml version="1.0"?>
<profile>
<!-- Profile for DBNL full-text digitisation batches -->
<schemaLowQuality>pdf-dbnl-generic.sch</schemaLowQuality>
<schemaHighQuality>pdf-dbnl-generic.sch</schemaHighQuality>
</profile>
Note that each entry only contains the name of a schema, not its full path! All schemas are located in the schemas directory in the installation folder.
Also note that in the above example, the same schema is used for both low and high quality PDFs!
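To make the profile structure concrete, the following sketch reads a profile with Python's standard `xml.etree.ElementTree` module and resolves each schema name against a schemas directory. The function name `read_profile` and the directory path used in the usage note are hypothetical; this is one possible way to read such a file, not necessarily how pdfbatchqa does it.

```python
import os
import xml.etree.ElementTree as ET


def read_profile(xml_text, schemas_dir):
    """Return the full schema paths for low and high quality PDFs.

    The profile only stores schema names, so each name is joined with
    the schemas directory to obtain a usable path.
    """
    root = ET.fromstring(xml_text)
    low_name = root.findtext("schemaLowQuality")
    high_name = root.findtext("schemaHighQuality")
    if low_name is None or high_name is None:
        raise ValueError("profile must define both schema elements")
    return {
        "low": os.path.join(schemas_dir, low_name.strip()),
        "high": os.path.join(schemas_dir, high_name.strip()),
    }
```

For the example profile above, calling `read_profile(profile_text, "/path/to/schemas")` would return the same schema path for both the `low` and the `high` key.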
The following profiles are included by default:
Name | Description |
---|---|
dbnl-fulltext.xml | Profile for DBNL full-text digitisation batches |
It is possible to create custom-made profiles. Just add them to the profiles directory in the installation folder.
The quality assessment is based on a number of rules/tests that are defined in a set of Schematron schemas. These are located in the schemas folder in the installation directory. In principle any property that is reported by pdfimages can be used here, and new tests can be added by editing the schemas.
Name | Description |
---|---|
pdf-dbnl-generic.sch | Generic schema for DBNL full-text digitisation batches |
It is possible to create custom-made schemas. Just add them to the schemas directory in the installation folder.
The following table gives a general overview of the technical profile that the current schemas represent:
Parameter | Value |
---|---|
Image format | JPEG |
Image resolution | (295, 305) |
Number of color components | 3 |
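The requirements in this table can be expressed as a small Python check, shown here only to make the rules concrete; the real checks live in the Schematron schemas. The sketch makes several assumptions: the property keys (`enc`, `x-ppi`, `y-ppi`, `comp`) follow the column names of `pdfimages -list`, the resolution is measured in pixels per inch, and `(295, 305)` is read as an open interval. The function name `check_image` is hypothetical.

```python
def check_image(props, min_ppi=295, max_ppi=305):
    """Return a list of human-readable problems for one image.

    `props` maps property names to the string values reported for a
    single image. An empty list means the image meets all requirements.
    """
    problems = []
    if props.get("enc") != "jpeg":
        problems.append("image format is not JPEG")
    for axis in ("x-ppi", "y-ppi"):
        ppi = float(props[axis])
        # Assumption: both bounds are exclusive.
        if not (min_ppi < ppi < max_ppi):
            problems.append(f"{axis} outside expected range: {ppi:g}")
    if int(props["comp"]) != 3:
        problems.append("number of color components is not 3")
    return problems
```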
pdfbatchqa d:\myBatch mybatch -p list
This results in a list of all available profiles (these are stored in the installation folder's profiles directory):
Available profiles:
dbnl-fulltext.xml
pdfbatchqa -p dbnl-fulltext.xml d:\myBatch mybatch
TODO: update remaining documentation.
- PDFs that have names containing square brackets ("[" and "]") are ignored (a limitation of Python's glob module that will be solved in the future).
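The square bracket limitation arises because `glob` treats `[` and `]` as the delimiters of a character class. The sketch below shows one possible workaround using `glob.escape` from the standard library (Python 3.4 and later). It is an illustration only: `find_pdfs` is a hypothetical helper, and how pdfbatchqa itself builds its glob patterns is not documented here.

```python
import glob
import os


def find_pdfs(directory):
    """Return all PDF files directly inside `directory`, sorted by name.

    The literal directory part is escaped so that square brackets in it
    are matched as ordinary characters instead of a character class.
    """
    pattern = os.path.join(glob.escape(directory), "*.pdf")
    return sorted(glob.glob(pattern))
```

Without the call to `glob.escape`, a directory named `batch[1]` would be interpreted as the pattern `batch1`, and its files would not be found.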
Pdfbatchqa is released under the Apache License, Version 2.0.