Heavily improve automatic model card generation + Patch XLM-R (#28)

* Uncomment pushing to the Hub
* Initial version to improve automatic model card generation
* Simplify label normalization
* Automatically select some eval sentences for the widget
* Improve language card
* Add automatic evaluation results
* Use dash instead of underscore in model name
* Add extra TODOs
* model.predict text as the first example
* Automatically set model name based on encoder & dataset
* Remove accidental Dataset import
* Rename examples to widget examples
* Add table with label examples
  Also use fields instead of __dict__
* Ensure complete metadata
* Add tokenizer warning if punct must be split from words
* Remove dead code
* Rename poor variable names
* Fix incorrect warning
* Add " in the model labels
* Set model_id based on args if possible
* Add training set metrics
* Randomly select 100 samples for the widget examples
  Instead of taking the first 100
* Prevent duplicate widget examples
* Remove completed TODO
* Use title case throughout model card
* Add useful comments if values not provided
  Also prevent crash if dataset_id is not provided
* Add environmental impact with codecarbon
* Ensure that the model card template is included in the install
* Add training hardware section
* Add Python version
* Make everything title case
* Add missing docstring
* Add docstring for SpanMarkerModelCardData
* Update CHANGELOG
* Add SpanMarkerModelCardData to dunder init
* Add SpanMarkerModelCardData to snippets
* Resolve breaking error if hub_model_id is set
* gpu_model -> hardware_used
  To better match what HF expects
* Add "base_model" to metadata
* Increment datasets min version to 2.14.0
  Required for sorting on multiple columns at once
* Update trainer evaluate tests
* Skip old model card test for now
* Fix edge case: less than 5 examples
* pytest.skip -> pytest.mark.skip
* Try to infer the language from the dataset
* Add citations and hidden sections
* Refactor inferring language
* Remove unused import
* Add comment explaining version
* Override default Trainer create_model_card
* Update model card template slightly
* Add newline to model card template
* Remove incorrect space
* Add model card tests
* Improve Trainer tests regarding model card
* Remove commented out breakpoint
* Add codecarbon to CI
* Rename integration extra to codecarbon
* Make hardware_used optional (if no GPU present)
* Apply suggestions to model_card_template
  Co-authored-by: Daniel van Strien <[email protected]>
* Update model card test pattern alongside template changes
* Don't include hardware_used when no GPU present
* Set "No GPU used" for GPU Model if hardware_used is None
* Don't store None in yaml
* Ensure that emissions is a regular float
* kgs to g
* support e-05 notation
* Add small test case for model cards
* Update model tables in docs
* Link to the spaCy integration in the tokenizer warning
* Update README snippet
* Update outdated docs: entity_max_length default is 8
* Remove /models from URL, caused 404s
* Fix outdated type hint
* 🎉 Apply XLM-R patch
* Remove /models from test
* Remove tokenizer warning after patch
* Update training docs with model card data etc.
* Pad token embeddings to multiple of 8
  Removes a warning since transformers 4.32.0
* Always attach list directly to header
* Tackle edge case where dataset card has no metadata
* Allow installing nltk for detokenizing model card examples
* Add model card docs
* Mention codecarbon install in docstring
* overwrite the default codecarbon log level to "error"
* Update CHANGELOG
* Fix issue with inference example containing full quotes
* Update CHANGELOG
* Never print a model when printing SpanMarkerModelCardData
* Try to infer the dataset_id from the training set
  Thanks @cakiki
* Update the main docs landing page

---------

Co-authored-by: Daniel van Strien <[email protected]>
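
Note on the widget example items above ("Randomly select 100 samples", "Prevent duplicate widget examples", "Fix edge case: less than 5 examples"): the selection logic they describe might look roughly like the sketch below. This is not the code from this commit. The function name `select_widget_examples`, its parameters, and the assumption that the card shows up to five examples are all hypothetical and only illustrate the idea.

```python
import random


def select_widget_examples(sentences, max_candidates=100, num_examples=5, seed=None):
    """Pick up to `num_examples` unique sentences from a random candidate pool.

    Hypothetical helper for illustration; not part of span_marker.
    """
    rng = random.Random(seed)
    # Drop duplicates while keeping the first occurrence of each sentence.
    unique = list(dict.fromkeys(sentences))
    # Sample a random pool instead of taking the first `max_candidates` rows.
    pool = rng.sample(unique, k=min(max_candidates, len(unique)))
    # With fewer than `num_examples` sentences, slicing returns all of them.
    return pool[:num_examples]
```

Capping `k` at `len(unique)` is what keeps small datasets from raising a `ValueError` in `random.sample`.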
tomaarsen · Sep 29, 2023 · 509d5f4
1 parent 506c25b
Showing 29 changed files with 1,777 additions and 626 deletions.
diff --git a/.github/workflows/tests.yaml b/.github/workflows/tests.yaml
@@ -38,7 +38,7 @@ jobs:
       - name: Install external dependencies on cache miss
         run: |
           python -m pip install --no-cache-dir --upgrade pip
-          python -m pip install --no-cache-dir ".[dev]"
+          python -m pip install --no-cache-dir ".[dev, codecarbon]"
           python -m spacy download en_core_web_sm
         if: steps.restore-cache.outputs.cache-hit != 'true'
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -19,8 +19,25 @@ Types of changes
 
 ### Added
 
+- Added `SpanMarkerModel.generate_model_card()` method to get a model card string.
+- Added `SpanMarkerModelCardData` that should be passed to `SpanMarkerModel.from_pretrained` with additional information like
+  - `language`, `license`, `model_name`, `model_id`, `encoder_name`, `encoder_id`, `dataset_name`, `dataset_id`, `dataset_revision`.
 - Added `transformers` `pipeline` support, e.g. `pipeline(task="span-marker", model="tomaarsen/span-marker-mbert-base-multinerd")`.
 
+### Changed
+
+- Heavily improved automatic model card generation.
+- Evaluating outside of training now returns per-label outputs instead of only "overall" F1, precision and recall.
+- Warn if the used tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space.
+  - If so, then inference of that model will require the punctuation to be split from the words.
+- Improve label normalization speed.
+- Allow you to call SpanMarkerModel.from_pretrained with a pre-initialized SpanMarkerConfig.
+
+### Fixed
+
+- Fixed tokenization mismatch between training and inference for XLM-RoBERTa models: allows for normal inference of those models.
+- Resolve niche bug when TrainingArguments are not provided.
+
 ## [1.3.0]
 
 ### Added
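
Note on the tokenizer warning in the CHANGELOG hunk above: the check it describes amounts to comparing how a tokenizer handles punctuation attached to a word versus punctuation separated by a space. The sketch below shows that comparison with two toy tokenizers. It is an assumption about the general idea, not the library's implementation, and `splits_punctuation_differently`, `regex_tokenize` and `space_marking_tokenize` are hypothetical names.

```python
import re
from typing import Callable, List


def splits_punctuation_differently(tokenize: Callable[[str], List[str]]) -> bool:
    """Return True if "Paris." and "Paris ." tokenize to different sequences."""
    attached = tokenize("Paris.")
    separated = tokenize("Paris .")
    return attached != separated


def regex_tokenize(text: str) -> List[str]:
    # Toy tokenizer that ignores whitespace entirely.
    return re.findall(r"\w+|[^\w\s]", text)


def space_marking_tokenize(text: str) -> List[str]:
    # Toy tokenizer that prefixes tokens preceded by a space, SentencePiece-style.
    tokens = []
    for match in re.finditer(r"( ?)(\w+|[^\w\s])", text):
        prefix = "▁" if match.group(1) else ""
        tokens.append(prefix + match.group(2))
    return tokens
```

A tokenizer for which the helper returns `True` is the kind the changelog entry warns about: inference input would need its punctuation split from the words to match the word-level training data.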

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1 @@
+include span_marker/model_card_template.md
diff --git a/README.md b/README.md
@@ -44,32 +44,47 @@ Please have a look at our [Getting Started](notebooks/getting_started.ipynb) not
 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb)                       | [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb)                       | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb)                       | [![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb)                       |
 
 ```python
+from pathlib import Path
 from datasets import load_dataset
 from transformers import TrainingArguments
-from span_marker import SpanMarkerModel, Trainer
+from span_marker import SpanMarkerModel, Trainer, SpanMarkerModelCardData
 
 
 def main() -> None:
     # Load the dataset, ensure "tokens" and "ner_tags" columns, and get a list of labels
-    dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
+    dataset_id = "DFKI-SLT/few-nerd"
+    dataset_name = "FewNERD"
+    dataset = load_dataset(dataset_id, "supervised")
     dataset = dataset.remove_columns("ner_tags")
     dataset = dataset.rename_column("fine_ner_tags", "ner_tags")
     labels = dataset["train"].features["ner_tags"].feature.names
+    # ['O', 'art-broadcastprogram', 'art-film', 'art-music', 'art-other', ...
 
     # Initialize a SpanMarker model using a pretrained BERT-style encoder
-    model_name = "bert-base-cased"
+    encoder_id = "bert-base-cased"
+    model_id = f"tomaarsen/span-marker-{encoder_id}-fewnerd-fine-super"
     model = SpanMarkerModel.from_pretrained(
-        model_name,
+        encoder_id,
         labels=labels,
         # SpanMarker hyperparameters:
         model_max_length=256,
         marker_max_length=128,
         entity_max_length=8,
+        # Model card arguments
+        model_card_data=SpanMarkerModelCardData(
+            model_id=model_id,
+            encoder_id=encoder_id,
+            dataset_name=dataset_name,
+            dataset_id=dataset_id,
+            license="cc-by-sa-4.0",
+            language="en",
+        ),
     )
 
     # Prepare the 🤗 transformers training arguments
+    output_dir = Path("models") / model_id
     args = TrainingArguments(
-        output_dir="models/span_marker_bert_base_cased_fewnerd_fine_super",
+        output_dir=output_dir,
         # Training Hyperparameters:
         learning_rate=5e-5,
         per_device_train_batch_size=32,
@@ -96,12 +111,13 @@ def main() -> None:
         eval_dataset=dataset["validation"],
     )
     trainer.train()
-    trainer.save_model("models/span_marker_bert_base_cased_fewnerd_fine_super/checkpoint-final")
 
     # Compute & save the metrics on the test set
     metrics = trainer.evaluate(dataset["test"], metric_key_prefix="test")
     trainer.save_metrics("test", metrics)
 
+    # Save the final checkpoint
+    trainer.save_model(output_dir / "checkpoint-final")
 
 if __name__ == "__main__":
     main()
@@ -121,8 +137,6 @@ entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B
  {'span': 'Paris', 'label': 'location-GPE', 'score': 0.9892390966415405, 'char_start_index': 78, 'char_end_index': 83}]
 ```
 
-<!-- Because this work is based on [PL-Marker](https://arxiv.org/pdf/2109.06067v5.pdf), you may expect similar results to its [Papers with Code Leaderboard](https://paperswithcode.com/paper/pack-together-entity-and-relation-extraction) results. -->
-
 ## Pretrained Models
 
 All models in this list contain `train.py` files that show the training scripts used to generate them. Additionally, all training scripts used are stored in the [training_scripts](training_scripts) directory.
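
Note on the README snippet in the diff above: `output_dir` is now derived from `model_id`, which includes the `tomaarsen/` namespace, so the checkpoint directory ends up nested one level deeper than the old hard-coded path. The sketch below reproduces only the path arithmetic. It uses `PurePosixPath` instead of the README's `Path` so that the printed separators do not depend on the operating system.

```python
from pathlib import PurePosixPath

encoder_id = "bert-base-cased"
model_id = f"tomaarsen/span-marker-{encoder_id}-fewnerd-fine-super"

# The "/" inside model_id becomes a directory separator under "models".
output_dir = PurePosixPath("models") / model_id
final_checkpoint = output_dir / "checkpoint-final"

print(final_checkpoint)
# models/tomaarsen/span-marker-bert-base-cased-fewnerd-fine-super/checkpoint-final
```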

diff --git a/docs/api/span_marker.model_card.rst b/docs/api/span_marker.model_card.rst
@@ -0,0 +1,17 @@
+
+:autogenerated:
+
+..
+    This file is autogenerated by `sphinx-api`.
+
+span_marker.model_card module
+=============================
+
+.. currentmodule:: span_marker.model_card
+
+.. automodule:: span_marker.model_card
+    :members:
+    :exclude-members: hyperparameters, eval_results_dict, eval_lines_list, metric_lines, widget, predict_example, label_example_list, tokenizer_warning, train_set_metrics_list, code_carbon_callback, pipeline_tag, library_name, version, metrics, model, set_widget_examples, set_train_set_metrics, set_label_examples, register_model, is_on_huggingface, generate_model_card
+    :undoc-members:
+    :show-inheritance:
+    :member-order: bysource
diff --git a/docs/api/span_marker.rst b/docs/api/span_marker.rst
@@ -19,6 +19,7 @@ span_marker package
        span_marker.modeling
        span_marker.trainer
        span_marker.configuration
+       span_marker.model_card
        span_marker.pipeline_component
        span_marker.data_collator
        span_marker.tokenizer