Textmining literature resources, source specific process descriptions

Jen Hammock edited this page Jul 30, 2021 · 2 revisions

Every data hub from which resources have been textmined behaves slightly differently.

Smithsonian Contributions

The Smithsonian Contributions series were processed from .epub format. The steps are:

TreatmentBank

TreatmentBank resources are already parsed into species sections, so Target Text processing can be applied without any prior manipulation. The resources are also equipped with annotated sections (elements) enabling targeted applications of ontologies and target patterns specific to each treatment element.

  • in div type="description": all ontologies and patterns
  • in div type="materials_examined": Geography ontology
  • in div type="distribution":
    • Habitat ontology
    • Geography ontology
  • in div type="biology_ecology": all ontologies and patterns
  • in div type="diagnosis": Growth form ontology and size patterns
  • in tax:treatment > tax:nomenclature > tax:name: find the name and ancestry of the subject species
  • in other instances of tax:name: the name and ancestry of other taxa mentioned in the treatment can be found. This structure should be inserted into the Associates patterns, in place of the target taxon name.
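
The mapping above from treatment element to ontology can be held in a simple lookup table. This is a minimal sketch only: the names SECTION_TARGETS and targets_for are hypothetical, and the value strings are just the labels used in the list above, not identifiers from any real pipeline.

```python
# Hypothetical lookup table: TreatmentBank div type -> ontologies/patterns to apply.
# Keys are the div types listed above; values repeat the labels from that list.
ALL = "all ontologies and patterns"

SECTION_TARGETS = {
    "description": [ALL],
    "materials_examined": ["Geography ontology"],
    "distribution": ["Habitat ontology", "Geography ontology"],
    "biology_ecology": [ALL],
    "diagnosis": ["Growth form ontology", "size patterns"],
}


def targets_for(div_type):
    """Return the ontologies/patterns for a div type, or an empty list if unlisted."""
    return SECTION_TARGETS.get(div_type, [])
```

For example, targets_for("distribution") returns the Habitat and Geography ontologies, while an element type not in the list returns nothing and would be skipped.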

Biodiversity Heritage Library

Two test resources were selected: Memoirs of the American Entomological Society and North American Flora. Their text has been extracted by Optical Character Recognition (OCR) and is available from BHL data services. After taxon names are found, this text needs an extra processing step that uses an additional set of patterns to identify or eliminate non-target artifacts that might interfere with detection of target text. The steps are:

These pre-processing patterns work fairly well, but there are enough exceptions that this method will clearly require a human in the loop and will not scale quickly. It will still be helpful for priority resources that are only available as OCR text.

header/footer pattern

Page headers immediately follow page breaks and alternate between two forms, e.g. "EZRA TOWNSEND CRESSON 5" then "IO TYPES OF HYMENOPTERA". Page footers appear on every page or every second page, immediately before the page break, e.g. "MEM. AMER. ENT. SOC, 10". The page numbers at the beginning and end of the headers are very badly OCRed, but they increment anyway, so a short variable string is expected in these positions. Page numbers didn't appear in footer text in the test resources, but footers frequently also end with numbers that OCR unreliably, so they can be treated the same way.

While headers and footers are removed, actual page breaks are retained for use in further artifact cleanup.
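
A rough sketch of the header/footer removal follows. It makes several assumptions that are not from the test resources themselves: page breaks are represented as form feed characters, the running header and footer texts are supplied by hand for each resource, and the badly OCRed page number is at most four non-space characters. The function names are hypothetical.

```python
import re

PAGE_BREAK = "\f"  # assumption: pages are separated by form feed characters


def running_line_regex(texts, max_token=4):
    """Match any known running text, allowing a short OCR-damaged
    page-number token (1..max_token non-space characters) on either side."""
    token = r"\S{1,%d}" % max_token
    body = "|".join(re.escape(t) for t in texts)
    return re.compile(r"^\s*(?:%s\s+)?(?:%s)(?:\s+%s)?\s*$" % (token, body, token))


def strip_headers_footers(text, headers, footers):
    header_re = running_line_regex(headers)
    footer_re = running_line_regex(footers)
    cleaned = []
    for page in text.split(PAGE_BREAK):
        lines = page.split("\n")
        filled = [i for i, line in enumerate(lines) if line.strip()]
        # Header: the first non-empty line after the page break.
        if filled and header_re.match(lines[filled[0]]):
            del lines[filled[0]]
            filled = [i for i, line in enumerate(lines) if line.strip()]
        # Footer: the last non-empty line before the next page break.
        if filled and footer_re.match(lines[filled[-1]]):
            del lines[filled[-1]]
        cleaned.append("\n".join(lines))
    # Page breaks themselves are kept for later artifact cleanup.
    return PAGE_BREAK.join(cleaned)
```

With the example strings above, headers would be given as ["EZRA TOWNSEND CRESSON", "TYPES OF HYMENOPTERA"] and footers as ["MEM. AMER. ENT. SOC,"]; the leading "IO" and trailing "5" or "10" are absorbed by the short variable token.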

contextual section patterns

Because line breaks behave differently within a Species Section synonymy, a List, or a Key, than they do elsewhere in a document, these sections must be detected, so that line breaks can be interpreted and processed reliably in each context. Line breaks can be processed the same way in either a synonymy or a List. They do not need to be processed in a Key, as Keys are not target text, but the whole Key must be detected so that it can be excluded.

list/synonymy upstream pattern

Considering only lines that are not empty and disregarding leading spaces in a line:

  • no more than 2 lines in sequence that don’t begin with a taxon name
  • no more than 3 lines in sequence that don’t begin with a species name
  • at least two lines beginning with species names

Process: Place a line break before each taxon name. Remove all other line breaks between the first taxon name and the last taxon name.
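
The criteria and process above can be sketched as follows. Taxon and species name detection happens upstream, so it is represented here by caller-supplied predicates (starts_with_taxon, starts_with_species), which are placeholders. The sketch also assumes that taxon names of interest occur at the start of a line; all function names are hypothetical.

```python
def longest_gap(flags):
    """Length of the longest run of False values in a list of booleans."""
    longest = current = 0
    for flag in flags:
        current = 0 if flag else current + 1
        longest = max(longest, current)
    return longest


def looks_like_list(lines, starts_with_taxon, starts_with_species):
    """Apply the three criteria to a candidate block of lines."""
    # Only non-empty lines count, and leading spaces are disregarded.
    content = [line.lstrip() for line in lines if line.strip()]
    taxon = [starts_with_taxon(line) for line in content]
    species = [starts_with_species(line) for line in content]
    return (
        longest_gap(taxon) <= 2
        and longest_gap(species) <= 3
        and sum(species) >= 2
    )


def rewrap_list(lines, starts_with_taxon):
    """Keep a break before each taxon-initial line and remove the other
    breaks between the first and the last taxon-initial line."""
    idx = [i for i, line in enumerate(lines) if starts_with_taxon(line.lstrip())]
    if len(idx) < 2:
        return list(lines)
    first, last = idx[0], idx[-1]
    entries = []
    for line in lines[first:last]:
        text = line.strip()
        if not text:
            continue  # blank lines inside the span are dropped
        if starts_with_taxon(text):
            entries.append(text)
        else:
            entries[-1] += " " + text  # wrapped continuation of the entry
    return list(lines[:first]) + entries + list(lines[last:])
```

Lines before the first taxon name and from the last taxon name onward are returned unchanged, which follows the "between the first taxon name and the last taxon name" wording.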

key pattern

sample document Considering only lines that are not empty:

  • A line beginning with "Key" followed within 12 lines by a line beginning or ending with a numeral.
  • no more than 12 lines in sequence that don’t either begin or end with a numeral
  • at least 3 in every 20 sequential lines must either begin or end with a numeral
  • sequence: considering only lines that begin with a numeral, the numeral must increment by 1 most of the time. (no more than 3 deviations per 5, no more than 5 per 20)
  • The Key pattern ends after the last line with a sequential beginning numeral

Process: All Key text should be removed.
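
A partial sketch of Key detection is given below. It covers only the start rule, the 12-line gap rule and the end rule; the density check (3 in every 20 lines) and the increment check are deliberately left out and would be needed to avoid running on into numbered body text. The input is assumed to be a list of non-empty lines, and find_key_span is a hypothetical name.

```python
import re

STARTS_NUM = re.compile(r"^\s*\d+")
ENDS_NUM = re.compile(r"\d+\s*$")


def has_numeral(line):
    """True if the line begins or ends with a numeral."""
    return bool(STARTS_NUM.search(line) or ENDS_NUM.search(line))


def find_key_span(lines, max_gap=12):
    """Return (start, end) indexes, end exclusive, of the first Key found
    in a list of non-empty lines, or None if there is no Key."""
    for start, line in enumerate(lines):
        if not line.lstrip().startswith("Key"):
            continue
        last_leading = None  # last line seen that begins with a numeral
        gap = 0              # current run of lines without a numeral
        for i in range(start + 1, len(lines)):
            if has_numeral(lines[i]):
                gap = 0
                if STARTS_NUM.match(lines[i]):
                    last_leading = i
            else:
                gap += 1
                if gap > max_gap:
                    break
        if last_leading is not None:
            # The Key ends after the last line with a beginning numeral.
            return start, last_leading + 1
    return None
```

The returned span can then be deleted from the line list, since all Key text is removed.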

line break patterns

Line breaks appear every time text is wrapped, in addition to between paragraphs. Different behavior appears in certain contexts, so this pattern must be processed AFTER the List/Synonymy Upstream Pattern.

  • 2 line breaks followed by a space => space
  • 4 line breaks => 2 line breaks
  • dash followed by 2 line breaks followed by a space => remove all
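
The three rules above translate directly into string replacements. The order used here is an assumption: the dash rule is applied first, because the plain "2 line breaks followed by a space" rule would otherwise consume the same breaks and leave the dash behind. Line breaks are assumed to be "\n".

```python
def fix_line_breaks(text):
    """Apply the three line break rules to OCR text that uses "\\n" breaks."""
    # Hyphenated wrap: dash + 2 breaks + space => remove all, rejoining the word.
    text = text.replace("-\n\n ", "")
    # Plain wrap: 2 breaks + space => a single space.
    text = text.replace("\n\n ", " ")
    # Paragraph break: 4 breaks => 2 breaks.
    text = text.replace("\n\n\n\n", "\n\n")
    return text
```

For instance, fix_line_breaks("exam-\n\n ple") gives "example", and "one\n\n two" becomes "one two".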

list/synonymy downstream pattern

After processing the line break pattern, additional lists and synonymies may become detectable, so the same process should be run again.

non paragraph text patterns

Artifacts including plates, in-line figures, tables and keys are detected only in order to discard this text. No attempt is made to target any data in this form.

Table displayed sideways pattern

  • at least 200 total lines in a page
  • at least 100 blank lines in a page
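
As a minimal sketch, the two thresholds above can be checked per page like this. The function name is hypothetical, and the page is assumed to be a single string with "\n" line breaks.

```python
def looks_like_sideways_table(page_text, min_lines=200, min_blank=100):
    """True if a page has at least min_lines lines, at least min_blank of them blank."""
    lines = page_text.split("\n")
    blank = sum(1 for line in lines if not line.strip())
    return len(lines) >= min_lines and blank >= min_blank
```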

Caption pattern

A new line beginning with "Fig.", "Figure" or "Map" that is not immediately preceded by a line beginning with a species name.

False positives might be reduced by recognizing this pattern only if the text content of the page is low (fewer than 1000 characters).
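
The caption rule, with the optional low-content filter, could be sketched as below. The species check is a caller-supplied placeholder predicate, since species names are found upstream; find_captions and max_page_chars are hypothetical names, and the word-boundary test on "Map" (so that "Maple" is not matched) is an added assumption.

```python
import re

# "Fig.", "Figure" or "Map" at the start of a line, ignoring leading spaces.
CAPTION_START = re.compile(r"^\s*(?:Fig\.|Figure|Map\b)")


def find_captions(page_lines, starts_with_species, max_page_chars=None):
    """Return indexes of lines on one page that look like captions."""
    if max_page_chars is not None:
        # Optional filter: only look for captions on pages with little text.
        if sum(len(line) for line in page_lines) >= max_page_chars:
            return []
    hits = []
    for i, line in enumerate(page_lines):
        if not CAPTION_START.match(line):
            continue
        # Skip lines that directly follow a species name, e.g. figure
        # citations inside a synonymy.
        if i > 0 and starts_with_species(page_lines[i - 1].lstrip()):
            continue
        hits.append(i)
    return hits
```

Passing max_page_chars=1000 applies the low-content filter; leaving it as None checks every page.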

Footnote pattern

A new line starting one paragraph, or a sequence of paragraphs, each beginning with a numeral and immediately preceding a page break.
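
A minimal sketch of footnote removal, assuming it runs on one page at a time (so the end of the string is the page break) and that paragraphs are separated by blank lines once the line break patterns have been applied. The function name is hypothetical.

```python
import re


def strip_footnotes(page_text):
    """Drop trailing paragraphs that begin with a numeral from one page of text."""
    paragraphs = re.split(r"\n\s*\n", page_text.strip("\n"))
    # Work backwards from the page break, removing numeral-initial paragraphs.
    while paragraphs and re.match(r"\s*\d", paragraphs[-1]):
        paragraphs.pop()
    return "\n\n".join(paragraphs)
```

A paragraph that begins with a numeral but is followed by ordinary text on the same page is left alone, because only the run directly before the page break is treated as footnotes.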