HunFlair - Data Sets

This page gives an overview of the biomedical NER data sets integrated in HunFlair.

Content: Overview | HUNER Data Sets | BioBERT Evaluation Splits

Overview

HunFlair integrates 31 biomedical named entity recognition (NER) data sets and provides them in a unified format to foster the development and evaluation of new NER models. All data set implementations can be found in flair.datasets.biomedical.

| Corpus | Data Set Class | Entity Types | Reference |
| --- | --- | --- | --- |
| AnatEM | ANAT_EM | Anatomical entities | Paper, Website |
| Arizona Disease | AZDZ | Disease | Website |
| BioCreative II GM | BC2GM | Gene | Paper |
| BioCreative V CDR task | CDR | Chemical, Disease | Paper, Website |
| BioInfer | BIO_INFER | Gene/Protein | Paper |
| BioNLP'2013 Cancer Genetics (ST) | BIONLP2013_CG | Chemical, Disease, Gene/Protein, Species | Paper |
| BioNLP'2013 Pathway Curation (ST) | BIONLP2013_PC | Chemical, Gene/Protein | Paper |
| BioSemantics* | BIOSEMANTICS | Chemical, Disease | Paper, Website |
| CellFinder | CELL_FINDER | Cell line, Gene, Species | Paper |
| CEMP | CEMP | Chemical | Website |
| CHEBI | CHEBI | Chemical, Gene, Species | Paper |
| CHEMDNER | CHEMDNER | Chemical | Paper |
| CLL | CLL | Cell line | Paper |
| DECA | DECA | Gene | Paper |
| FSU | FSU | Gene | Paper |
| GPRO | GPRO | Gene | Website |
| CRAFT (v2.0) | CRAFT | Chemical, Gene, Species | Paper |
| CRAFT (v4.0.1) | CRAFT_V4 | Chemical, Gene, Species | Website |
| GELLUS | GELLUS | Cell line | Paper |
| IEPA | IEPA | Gene | Paper |
| JNLPBA | JNLPBA | Cell line, Gene | Paper |
| LINNAEUS | LINNEAUS | Species | Paper |
| LocText | LOCTEXT | Gene, Species | Paper |
| miRNA | MIRNA | Disease, Gene, Species | Paper |
| NCBI Disease | NCBI_DISEASE | Disease | Paper |
| Osiris v1.2 | OSIRIS | Gene | Paper |
| Plant-Disease-Relations | PDR | Disease | Paper, Website |
| S800 | S800 | Species | Paper |
| SCAI Chemicals | SCAI_CHEMICALS | Chemical | Paper |
| SCAI Disease | SCAI_DISEASE | Disease | Paper |
| Variome | VARIOME | Gene, Disease, Species | Paper |

Note: The table only gives an overview of the entity types annotated in the individual corpora. Please refer to the original publications for annotation details.

* The corpus is currently not available, but will be re-published online soon.
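
As a rough illustration of how the data set classes listed in the table might be used, the following sketch loads a corpus by its class name and counts the sentences in each split. The helper names `load_biomedical_corpus` and `split_sizes` are hypothetical and not part of Flair. The sketch assumes that Flair is installed, that each class can be instantiated without arguments, and that the resulting corpus exposes `train`, `dev`, and `test` splits.

```python
import importlib


def load_biomedical_corpus(class_name: str, **kwargs):
    """Instantiate a corpus class from flair.datasets.biomedical by name.

    Hypothetical helper: Flair is imported lazily, so the import (and the
    data download triggered by the constructor) only happens when called.
    """
    module = importlib.import_module("flair.datasets.biomedical")
    corpus_class = getattr(module, class_name)  # e.g. "NCBI_DISEASE"
    return corpus_class(**kwargs)


def split_sizes(corpus) -> dict:
    """Return the number of sentences per split (0 if a split is missing)."""
    sizes = {}
    for split in ("train", "dev", "test"):
        data = getattr(corpus, split, None)
        sizes[split] = len(data) if data is not None else 0
    return sizes
```

For example, `split_sizes(load_biomedical_corpus("NCBI_DISEASE"))` would download the corpus on first use and report the size of each split.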

HUNER Data Sets

In addition to the integrated biomedical data sets, HunFlair provides the fixed splits used by HUNER (Weber et al.) to improve the comparability of evaluations.

| Entity Type | Data Set Class | Contained Data Sets |
| --- | --- | --- |
| Cell Line | HUNER_CELL_LINE | HUNER_CELL_LINE_CELL_FINDER, HUNER_CELL_LINE_CLL, HUNER_CELL_LINE_GELLUS, HUNER_CELL_LINE_JNLPBA |
| Chemical | HUNER_CHEMICAL | HUNER_CHEMICAL_CDR, HUNER_CHEMICAL_CEMP, HUNER_CHEMICAL_CHEBI, HUNER_CHEMICAL_CHEMDNER, HUNER_CHEMICAL_CRAFT_V4, HUNER_CHEMICAL_SCAI |
| Disease | HUNER_DISEASE | HUNER_DISEASE_CDR, HUNER_DISEASE_MIRNA, HUNER_DISEASE_NCBI, HUNER_DISEASE_SCAI, HUNER_DISEASE_VARIOME |
| Gene/Protein | HUNER_GENE | HUNER_GENE_BC2GM, HUNER_GENE_BIO_INFER, HUNER_GENE_CELL_FINDER, HUNER_GENE_CHEBI, HUNER_GENE_CRAFT_V4, HUNER_GENE_DECA, HUNER_GENE_FSU, HUNER_GENE_GPRO, HUNER_GENE_IEPA, HUNER_GENE_JNLPBA, HUNER_GENE_LOCTEXT, HUNER_GENE_MIRNA, HUNER_GENE_OSIRIS, HUNER_GENE_VARIOME |
| Species | HUNER_SPECIES | HUNER_SPECIES_CELL_FINDER, HUNER_SPECIES_CHEBI, HUNER_SPECIES_CRAFT_V4, HUNER_SPECIES_LINNEAUS, HUNER_SPECIES_LOCTEXT, HUNER_SPECIES_MIRNA, HUNER_SPECIES_S800, HUNER_SPECIES_VARIOME |
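
The class names in the HUNER table follow a regular pattern: `HUNER_<ENTITY_TYPE>` for the combined data set and `HUNER_<ENTITY_TYPE>_<CORPUS>` for a single contained data set. The sketch below builds such a name and loads the matching class from flair.datasets.biomedical. The helpers `huner_class_name` and `load_huner_corpus` are hypothetical and not part of Flair; the sketch assumes that Flair is installed and that the classes can be instantiated without arguments. It does not check that a given corpus suffix actually exists for an entity type.

```python
import importlib

# Entity-type prefixes as they appear in the "Data Set Class" column.
HUNER_ENTITY_TYPES = ("CELL_LINE", "CHEMICAL", "DISEASE", "GENE", "SPECIES")


def huner_class_name(entity_type: str, corpus: str = None) -> str:
    """Build a HUNER class name such as HUNER_GENE or HUNER_GENE_BC2GM."""
    entity = entity_type.strip().upper().replace(" ", "_")
    if entity not in HUNER_ENTITY_TYPES:
        raise ValueError(f"Unknown HUNER entity type: {entity_type!r}")
    if corpus is None:
        return f"HUNER_{entity}"  # combined data set for the entity type
    return f"HUNER_{entity}_{corpus.strip().upper()}"


def load_huner_corpus(entity_type: str, corpus: str = None):
    """Hypothetical helper: load the HUNER split by entity type and corpus.

    Raises AttributeError if the constructed class name does not exist.
    """
    module = importlib.import_module("flair.datasets.biomedical")
    return getattr(module, huner_class_name(entity_type, corpus))()
```

For example, `load_huner_corpus("gene", "bc2gm")` would resolve to `HUNER_GENE_BC2GM`, while `load_huner_corpus("cell line")` would resolve to the combined `HUNER_CELL_LINE` data set.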

BioBERT Evaluation Splits

To ease comparison with BioBERT, HunFlair provides the splits used by Lee et al.: BIOBERT_CHEMICAL_BC4CHEMD, BIOBERT_GENE_BC2GM, BIOBERT_GENE_JNLPBA, BIOBERT_CHEMICAL_BC5CDR, BIOBERT_DISEASE_BC5CDR, BIOBERT_DISEASE_NCBI, BIOBERT_SPECIES_LINNAEUS, and BIOBERT_SPECIES_S800.

Note: To download and use the BioBERT corpora, you need to install the package googledrivedownloader, since the files are hosted on Google Drive:

pip install googledrivedownloader
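
The dependency check described above could be sketched as follows: before a BioBERT corpus is loaded, the code verifies that the googledrivedownloader distribution is installed and fails with a hint otherwise. The helpers `is_installed` and `load_biobert_corpus` are hypothetical and not part of Flair; the sketch assumes that Flair is installed and that the BioBERT classes can be instantiated without arguments.

```python
from importlib import metadata


def is_installed(distribution: str) -> bool:
    """Return True if the given pip distribution is installed."""
    try:
        metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return False
    return True


def load_biobert_corpus(class_name: str):
    """Hypothetical helper: load a BIOBERT_* corpus after a dependency check."""
    if not class_name.startswith("BIOBERT_"):
        raise ValueError(f"Not a BioBERT split: {class_name!r}")
    if not is_installed("googledrivedownloader"):
        raise RuntimeError(
            "The BioBERT corpora are hosted on Google Drive; "
            "run 'pip install googledrivedownloader' first."
        )
    # Imported lazily so the checks above run even without Flair installed.
    import flair.datasets.biomedical as biomedical

    return getattr(biomedical, class_name)()
```

For example, `load_biobert_corpus("BIOBERT_DISEASE_NCBI")` would load the NCBI Disease split used by Lee et al. once the dependency is available.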