KBNLresearch/pdfbatchqa

Pdfbatchqa

What is pdfbatchqa?

Pdfbatchqa is a simple tool for automated checking of digitisation batches of PDF files against a user-defined technical profile. Internally it wraps around the pdfimages tool from the Poppler library, which is used to extract the image-related properties for each PDF. The pdfimages output is then validated against a set of Schematron schemas that define the required technical characteristics.
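As a rough illustration of the first step, the sketch below turns the tabular output of `pdfimages -list` into one dictionary per image. It is not pdfbatchqa's actual code: the column names are assumed from the header that recent Poppler versions print, and `run_pdfimages`, `parse_pdfimages_list` and the sample text are hypothetical.

```python
import subprocess

# Column names assumed from the header printed by `pdfimages -list`
# (recent Poppler versions); "object ID" spans two numeric columns.
COLUMNS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
           "enc", "interp", "object", "ID", "x-ppi", "y-ppi", "size", "ratio"]


def run_pdfimages(pdf_path):
    """Return the raw text that `pdfimages -list` prints for one PDF."""
    result = subprocess.run(["pdfimages", "-list", pdf_path],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_pdfimages_list(text):
    """Parse `pdfimages -list` output into one dict per image."""
    images = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator and any blank lines
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        images.append(dict(zip(COLUMNS, fields)))
    return images


# Hypothetical sample output for a two-page PDF
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2480  3508  rgb     3   8  jpeg   no         7  0   300   300  912K 3.6%
   2     1 image    2480  3508  rgb     3   8  jpeg   no        12  0   300   300  874K 3.4%
"""

images = parse_pdfimages_list(SAMPLE)
```

All values are kept as strings here. A real implementation would probably convert them and serialise them to XML so that Schematron rules can be applied.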

Installation

The easiest method to install pdfbatchqa is to use the pip package manager.

Installation with pip (single user)

This will work on any platform for which Python is available. You need a recent version of pip (version 9.0 or more recent). To install pdfbatchqa for a single user, use the following command:

pip install pdfbatchqa --user

Installation with pip (all users)

To install pdfbatchqa for all users, use the following command:

pip install pdfbatchqa

You need local admin (Windows) / superuser (Linux) privileges to do this. On Windows, run the above command in a Command Prompt window that was opened as Administrator. On Linux, use this:

sudo pip install pdfbatchqa

Command-line syntax

usage: pdfbatchqa batchDir prefixOut -p PROFILE

Positional arguments

batchDir: root directory of batch

prefixOut: prefix that is used for writing output files

PROFILE: name of the profile that defines the validation schemas (passed with the -p option)

To list all available profiles, use a value of l or list for PROFILE.
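The syntax above could be modelled with Python's `argparse` roughly as follows. This is only a sketch of the documented interface, not the tool's own parser. The `build_parser` helper and the `profile` attribute name are made up for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line syntax."""
    parser = argparse.ArgumentParser(prog="pdfbatchqa")
    parser.add_argument("batchDir", help="root directory of batch")
    parser.add_argument("prefixOut",
                        help="prefix that is used for writing output files")
    parser.add_argument("-p", dest="profile", metavar="PROFILE", required=True,
                        help="name of profile that defines the validation "
                             "schemas (use l or list to list profiles)")
    return parser


# The option may come after or before the positional arguments
args_list = build_parser().parse_args(["d:\\myBatch", "mybatch", "-p", "list"])
args_run = build_parser().parse_args(
    ["-p", "dbnl-fulltext.xml", "d:\\myBatch", "mybatch"])

# A value of "l" or "list" requests the profile listing instead of a run
list_only = args_list.profile in ("l", "list")
```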

Batch structure

Pdfbatchqa was designed for processing digitisation batches that are delivered to the KB by external suppliers as part of the DBNL stream. For each digitised publication, these batches typically contain two PDF files:

  1. A high quality PDF with images in JPEG format that are encoded at 85% JPEG quality
  2. A lower quality PDF with images in JPEG format that are encoded at 50% JPEG quality

TODO: describe how we can distinguish between 1. and 2. (folder name, file name?).

Profiles

A profile is an XML-formatted file that simply defines which schemas are used to validate the extracted properties of the high and low quality PDFs, respectively. Here's an example:

<?xml version="1.0"?>

<profile>

<!-- Profile for DBNL full-text digitisation batches -->

<schemaLowQuality>pdf-dbnl-generic.sch</schemaLowQuality>
<schemaHighQuality>pdf-dbnl-generic.sch</schemaHighQuality>

</profile>

Note that each entry only contains the name of a schema, not its full path! All schemas are located in the schemas directory in the installation folder.

Also note that in the above example, the same schema is used for both low and high quality PDFs!
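To make the name-versus-path point concrete, here is a minimal sketch that reads a profile with the standard library and resolves each schema name against a schemas directory. The `read_profile` function and the relative `"schemas"` path are assumptions for illustration. The tool itself may resolve paths differently.

```python
import os
import xml.etree.ElementTree as ET

# The example profile from above, embedded as a string for the sketch
PROFILE_XML = """<?xml version="1.0"?>
<profile>
<!-- Profile for DBNL full-text digitisation batches -->
<schemaLowQuality>pdf-dbnl-generic.sch</schemaLowQuality>
<schemaHighQuality>pdf-dbnl-generic.sch</schemaHighQuality>
</profile>
"""


def read_profile(profile_xml, schemas_dir):
    """Return the full schema path for the low and high quality PDFs."""
    root = ET.fromstring(profile_xml)
    schemas = {}
    for key, tag in (("low", "schemaLowQuality"),
                     ("high", "schemaHighQuality")):
        name = root.findtext(tag)
        if name is None:
            raise ValueError("profile lacks element <%s>" % tag)
        # The profile only stores the file name, so add the directory here
        schemas[key] = os.path.join(schemas_dir, name.strip())
    return schemas


schemas = read_profile(PROFILE_XML, "schemas")
```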

Available profiles

The following profiles are included by default:

| Name | Description |
|:--|:--|
| dbnl-fulltext.xml | Profile for DBNL full-text digitisation batches |

It is possible to create custom-made profiles. Just add them to the profiles directory in the installation folder.

Schemas

The quality assessment is based on a number of rules/tests that are defined in a set of Schematron schemas. These are located in the schemas folder in the installation directory. In principle any property that is reported by pdfimages can be used here, and new tests can be added by editing the schemas.

Available schemas

| Name | Description |
|:--|:--|
| pdf-dbnl-generic.sch | Generic schema for DBNL full-text digitisation batches |

It is possible to create custom-made schemas. Just add them to the schemas directory in the installation folder.

Overview schemas

The following tables give a general overview of the technical profiles that the current schemas represent:

pdf-dbnl-generic

| Parameter | Value |
|:--|:--|
| Image format | JPEG |
| Image resolution | (295, 305) |
| Number of color components | 3 |
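The same three requirements can be written as a plain Python check, which may help when drafting new Schematron rules. This sketch makes several assumptions: the keys follow the `pdfimages -list` column names (`enc`, `x-ppi`, `y-ppi`, `comp`), the resolution range is treated as exclusive at both ends, and `check_image` is a hypothetical helper, not part of pdfbatchqa.

```python
def check_image(props):
    """Return a list of human-readable failures for one image."""
    errors = []
    # Image format must be JPEG
    if props["enc"] != "jpeg":
        errors.append("image format is not JPEG")
    # Resolution must fall inside (295, 305); bounds assumed exclusive
    for axis in ("x-ppi", "y-ppi"):
        if not 295 < float(props[axis]) < 305:
            errors.append("%s outside (295, 305)" % axis)
    # Exactly three colour components
    if int(props["comp"]) != 3:
        errors.append("number of color components is not 3")
    return errors


good = {"enc": "jpeg", "x-ppi": "300", "y-ppi": "300", "comp": "3"}
bad = {"enc": "jpx", "x-ppi": "150", "y-ppi": "150", "comp": "1"}

good_errors = check_image(good)
bad_errors = check_image(bad)
```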

Usage examples

List available profiles

pdfbatchqa d:\myBatch mybatch -p list

This results in a list of all available profiles (these are stored in the installation folder's profiles directory):

Available profiles:

dbnl-fulltext.xml

Analyse batch

pdfbatchqa -p dbnl-fulltext.xml d:\myBatch mybatch

TODO: update remaining documentation.

Known limitations

  • PDFs that have names containing square brackets ("[" and "]") are ignored (limitation of Python's glob module, will be solved in the future).
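One possible workaround, assuming Python 3.4 or later, is `glob.escape`, which makes the glob module treat "[" and "]" as literal characters. How pdfbatchqa builds its glob patterns is not documented here, so the sketch below only demonstrates the general behaviour on a temporary file with a made-up name.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical PDF whose name contains square brackets
    pdf_path = os.path.join(tmp, "scan[2].pdf")
    open(pdf_path, "w").close()

    # Unescaped, "[2]" is read as a character class, so the file is missed
    naive = glob.glob(pdf_path)

    # Escaped, the brackets are matched literally and the file is found
    escaped = glob.glob(glob.escape(pdf_path))
    found_names = [os.path.basename(p) for p in escaped]
```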

Licensing

Pdfbatchqa is released under the Apache License, Version 2.0.

Useful links
