Textmining literature resources, section patterns

Jen Hammock edited this page Jul 30, 2021 · 1 revision

Patterns are kept as generic as possible while still excluding non-target text. The details of each pattern vary considerably among resources, and among documents within some resources.

species section start pattern

newline newline [species binomial detected in a line of six words or less*] newline newline [target text]

*excluding common addenda: "new status", "nov. sp.", "new combination", "Figures", "Figs.", and alphanumeric IDs (3 characters or fewer, including a numeral, separated by commas or hyphens)
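The start pattern could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the `genera` set is a hypothetical stand-in for a real taxon name finder, the function names are invented here, and the ID rule is read as "tokens of one to three letters or digits that contain a numeral".

```python
import re

# Phrases from the footnote that do not count toward the six-word limit.
ADDENDA = re.compile(r"new status|nov\. sp\.|new combination|Figures|Figs\.")

# Assumed reading of "alphanumeric IDs": 1-3 letters/digits containing a
# numeral, optionally joined by hyphens, e.g. "12", "3a", "12-14".
ID_TOKEN = re.compile(r"(?=\S*\d)[A-Za-z0-9]{1,3}(?:-[A-Za-z0-9]{1,3})*[,;.]?")

# Candidate binomial: a capitalised word followed by a lowercase word.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) [a-z]{2,}\b")


def has_binomial(line, genera):
    """Naive stand-in for a name finder: the genus must be in a known set."""
    return any(m.group(1) in genera for m in CANDIDATE.finditer(line))


def is_species_heading(line, genera, max_words=6):
    """True if the line holds a binomial and at most max_words other words."""
    if not has_binomial(line, genera):
        return False
    tokens = ADDENDA.sub(" ", line).split()
    words = [t for t in tokens if not ID_TOKEN.fullmatch(t)]
    return len(words) <= max_words


def find_section_starts(text, genera):
    """Return (heading, first block of target text) pairs."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text.strip())]
    starts = []
    # Skip block 0 (no blank line precedes it) and the last block
    # (no target text follows it).
    for i in range(1, len(blocks) - 1):
        if "\n" not in blocks[i] and is_species_heading(blocks[i], genera):
            starts.append((blocks[i], blocks[i + 1]))
    return starts
```

Splitting on blank lines first keeps the "newline newline ... newline newline" framing explicit; a single regular expression over the whole text would also work but makes the word-count exclusions harder to read.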

species section stop patterns

These may represent items that appear near the end of a species section but contain potentially misinterpretable text (e.g., a Remarks section), or items that are not part of a species section.

newline AFFINITIES.—

newline NOTE.—

newline DISCUSSION.—

newline REMARKS.—


newline VARIATION.—



newline MATERIAL.—

newline ADULT.—

newline newline Acknowledgments newline

newline newline Intermediates newline

newline newline Literature Cited newline

newline newline Bibliography newline

newline newline Dubious Binomials newline

newline newline Excluded Species newline

newline newline Descriptive Biogeography newline

newline newline Appendix

newline newline Figures newline

newline newline Table

newline newline General Conclusions newline

newline newline Nomenclatorial Considerations.

newline newline [any combination of rank name and/or taxon name w/rank above species, and 0-3 additional words and/or 0-3 numerals] newline

newline newline [12 words or less beginning with "Key"] newline

species section append pattern

These are species description subsections that often follow a stop pattern but should still be included in the species section.
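As an illustration of the stop patterns, the fixed-string cases and the "Key" heading can be matched with one combined regular expression. This is a sketch under assumptions: the function name `trim_at_stop` is hypothetical, the dash after each label is taken to be U+2014, and the rank/taxon-name stop pattern is left out because it needs a taxon name lookup.

```python
import re

EM_DASH = "\u2014"  # the dash that follows labels such as "REMARKS."

LABELS = ["AFFINITIES", "NOTE", "DISCUSSION", "REMARKS",
          "VARIATION", "MATERIAL", "ADULT"]

# Headings that must fill the whole line.
FULL_LINE = ["Acknowledgments", "Intermediates", "Literature Cited",
             "Bibliography", "Dubious Binomials", "Excluded Species",
             "Descriptive Biogeography", "Figures", "General Conclusions"]

# Headings that only need to begin the line.
LINE_PREFIX = ["Appendix", "Table", "Nomenclatorial Considerations."]

BLANK = r"\n[ \t]*\n"  # "newline newline", tolerating trailing spaces

STOP = re.compile("|".join([
    r"\n(?:" + "|".join(LABELS) + r")\." + EM_DASH,
    BLANK + "(?:" + "|".join(map(re.escape, FULL_LINE)) + r")[ \t]*\n",
    BLANK + "(?:" + "|".join(map(re.escape, LINE_PREFIX)) + ")",
    # A line of at most 12 words beginning with "Key".
    BLANK + r"Key\b\S*(?:[ \t]+\S+){0,11}[ \t]*\n",
]))


def trim_at_stop(section_text):
    """Cut the section at the earliest stop pattern, if there is one."""
    match = STOP.search(section_text)
    return section_text if match is None else section_text[:match.start()]
```

Because `search` returns the leftmost match across all alternatives, the section is cut at whichever stop pattern appears first.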

list patterns

List header:

newline newline Header [12 words or less, including List, and at least one of taxon name, vernacular name, habitat term and/or geographic term] newline [lots of non-target text] newline

newline newline Header [12 words or less, including species, and at least one of taxon name, vernacular name, habitat term and/or geographic term] newline

[NO HEADER] (List patterns detected without a header may still be supported; the difficulty will be distinguishing them from synonymies.)

Up to 15 lines of non-target text may separate the list header from the list body
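A minimal sketch of the header test and the 15-line allowance follows. The `vocabulary` set is a hypothetical lookup of lowercase taxon names, vernacular names, habitat terms, and geographic terms; matching "List" and "species" case-insensitively is an assumption, and both function names are invented for this example.

```python
def is_list_header(line, vocabulary, max_words=12):
    """True if the line looks like a list header.

    The line must have at most max_words words, contain "list" or
    "species", and contain at least one term from vocabulary.
    """
    tokens = [t.strip(".,;:()[]").lower() for t in line.split()]
    if not tokens or len(tokens) > max_words:
        return False
    has_keyword = "list" in tokens or "species" in tokens
    has_term = any(t in vocabulary for t in tokens)
    return has_keyword and has_term


def body_search_window(header_index, n_lines, max_gap=15):
    """Line indices where the list body may begin after a header.

    Up to max_gap lines of non-target text may sit between the header
    and the first body line.
    """
    first = header_index + 1
    last = header_index + 1 + max_gap  # inclusive
    return range(first, min(n_lines, last + 1))
```

A caller would check that the header line is preceded by a blank line, then look for the first list body pattern only inside the returned window.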

List body:

Line [including species name] newline Line [including species name] newline

Line [including species name and geographic and/or habitat terms] newline Line [including species name and geographic and/or habitat terms] newline

Line [including genus name] newline Line [including species epithet and geographic term] newline Line [including species epithet and geographic term] newline

Line [including species name] newline Line [including geographic and/or habitat term] newline [additional text, up to 4 newlines] Line [including species name] newline Line [including geographic and/or habitat term] newline [additional text, up to 4 newlines]

Line [including species name] newline 1 or 2 Lines [including geographic and habitat terms] newline Line [including species name] newline 1 or 2 Lines [including geographic and habitat terms] newline

Line [including genus name] newline Line [including full species name] newline Line [including species epithet and geographic term] newline Line [including species epithet and geographic term] newline
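The simplest body pattern, consecutive lines that each include a species name, can be sketched as a run detector. The `has_species_name` callable is a hypothetical stand-in for whatever name detection is used, and the minimum run length of two lines is an assumption taken from the way each pattern above is written as a repeated pair.

```python
def find_simple_list_runs(lines, has_species_name, min_run=2):
    """Return (start, end) index pairs for runs of species-name lines.

    end is exclusive. A run is a stretch of consecutive non-blank lines
    that each satisfy has_species_name.
    """
    runs = []
    start = None
    # A trailing empty string acts as a sentinel that closes any open run.
    for i, line in enumerate(list(lines) + [""]):
        if line.strip() and has_species_name(line):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs
```

The other body patterns alternate line types (genus line, epithet line, geographic line), so they would need a small state machine rather than a single predicate per line.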

List subheaders:

Single or back-to-back pairs of subheaders may interrupt a list pattern if the same list pattern appears before and after each subheader or pair of subheaders.

newline Subheader [up to 8 words including higher taxon name] newline