
Add support for BILO label scheme #31

Merged
merged 5 commits into main from feat/bilo_support on Oct 31, 2023
Conversation

tomaarsen
Owner

Closes #30

Hello!

Pull Request overview

  • Add BILO label scheme

Details

It might be as simple as this PR proposes, but there might also be some hidden issues that I'm not thinking of.
@david-waterworth perhaps you could install this PR and try to see if you can train with it.

pip install git+https://github.com/tomaarsen/SpanMarkerNER.git@feat/bilo_support
- Tom Aarsen

@david-waterworth

@tomaarsen I did something similar myself and it didn't work - the problem is that LabelNormalizerBILOU calls self.label_ids_by_tag["U"], and this also throws if there are no U's in the dataset

class LabelNormalizerBILOU(LabelNormalizerScheme):

As a quick hack I created a custom LabelNormalizerBILO - an alternative would be to replace self.label_ids_by_tag["U"] with self.label_ids_by_tag.get("U", set())
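
For illustration, a minimal sketch of that defensive lookup with toy data (the mapping here is a stand-in, not SpanMarker's actual internals):

# Toy stand-in for the normalizer's tag-to-label-ids mapping; the real
# attribute lives on the LabelNormalizerBILOU instance.
label_ids_by_tag = {"B": {1}, "I": {2}, "L": {3}, "O": {0}}

# label_ids_by_tag["U"] raises KeyError when the dataset contains no
# unit-length ("U") spans; .get with an empty-set default does not.
unit_ids = label_ids_by_tag.get("U", set())
print(unit_ids)  # set()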

Doing this gets it to the stage where tokenisation works (I also had to hard-code is_split_into_words = True in the tokeniser, as I ended up with some whitespace-only tokens, but that's something I'll fix later).

But it's been running for 2 hours now and the progress shows

"Tokenizing the train dataset 100%"
"Spreading data between multiple samples 100%"
"0%"

And it's been stuck on this step - there's no CPU or GPU activity, so I think it may have failed - possibly a deadlock, as I didn't set TOKENIZERS_PARALLELISM

@tomaarsen
Owner Author

tomaarsen commented Sep 18, 2023

The is_split_into_words = True might be unrelated. The current setup isn't very robust, so perhaps there's some edge case where your training text contains tokens with spaces?

Regarding the 0% - that one is odd. I can't really explain that one. I'll try to do some debugging locally.

Edit: I've created a BILO dataset locally by updating a BIO dataset, and it all seems to work for me after 89e1a4a.
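
For reference, a sketch of that kind of BIO-to-BILO update (my assumption about the transformation, not the exact script used; assumes string tags, and that single-token entities keep their B- tag since BILO has no U):

def bio_to_bilo(tags):
    # Rewrite the final I- tag of each BIO span to L-.
    out = list(tags)
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("I-") and not nxt.startswith("I-"):
            out[i] = "L-" + tag[2:]
    return out

print(bio_to_bilo(["B-ORG", "I-ORG", "I-ORG", "O"]))
# ['B-ORG', 'I-ORG', 'L-ORG', 'O']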

@tomaarsen
Owner Author

tomaarsen commented Sep 18, 2023

Another alternative approach is mapping your dataset from BILO to BIO by replacing all L-... with I-.... The resulting dataset should be equivalently expressive.
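
A sketch of that mapping, assuming string tags (adapt if your dataset stores integer label ids instead):

def bilo_to_bio(tags):
    # Rewriting every L- prefix to I- turns BILO into plain BIO.
    return ["I-" + tag[2:] if tag.startswith("L-") else tag for tag in tags]

print(bilo_to_bio(["B-EQUIP", "I-EQUIP", "L-EQUIP", "O"]))
# ['B-EQUIP', 'I-EQUIP', 'I-EQUIP', 'O']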

@david-waterworth

I've found that BIO doesn't work very well in my case. I'm trying to extract equipment identifiers from sensor point descriptions, i.e. entities like AHU-01-01; if you don't include the L, it often drops the last few characters, as AHU-01 is also common (spaCy mentions this in their introduction to spancat).

I've got it running now - I interrupted it and saw from the stack trace that the dataloader thread was waiting, so I set TOKENIZERS_PARALLELISM=false and it runs (I had to set the batch size quite small though, as I increased the entity length; I may need to use gradient accumulation).
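
For anyone hitting the same hang, one way to set that from Python (a sketch; exporting the variable in the shell works just as well):

import os

# Must be set before the fast (Rust) tokenizer is first used, otherwise
# forked DataLoader workers can still deadlock on its thread pool.
os.environ["TOKENIZERS_PARALLELISM"] = "false"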

@tomaarsen
Owner Author

For the purposes of SpanMarker, all labeling schemes get normalized into the same scheme: (label, start index, end index) tuples. It doesn't actually predict token-level labels such as I-ORG or L-ORG, but just ORG, for example. So, I think that converting BILO to BIO would still work. That said, I'd love to support BILO as well, as I prefer simplifying the work for the user.
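
Illustratively, a hypothetical helper (not SpanMarker's actual code) showing how BILOU-style tags collapse into such span tuples:

def bilou_to_spans(tags):
    # Collect (label, start, end) spans from BILOU/BILO tags, where
    # B or U opens a span and L or U closes it (end is exclusive).
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix in ("B", "U"):
            start = i
        if prefix in ("L", "U"):
            spans.append((label, start, i + 1))
            start = None
    return spans

print(bilou_to_spans(["B-EQUIP", "I-EQUIP", "L-EQUIP", "O"]))
# [('EQUIP', 0, 3)]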


@david-waterworth have you been able to make any progress?

@david-waterworth

Yeah, that's a good point. I did get training working; everything looked good, with the loss dropping nicely, and the validation accuracy was fine. But when I tried to reload the model, I couldn't get it to detect any entities using the same examples I trained/tested on.

I had to hack around a bit with tokenisation, so that's almost certainly the issue - something I did during training isn't being done when I try to predict, so I'll have to look a bit closer.

@abhayalok

@tomaarsen, is entity_max_length impacting the model performance? In my case, once I passed resume-style data, it's not detecting all the sections, because some of the entity lengths are greater than 25 characters.

@tomaarsen
Owner Author

@abhayalok entity_max_length denotes the maximum number of words that an entity can be. So a value of 8 means that it can detect entities with 8 words or less. You can always open an Issue on this repository if you have any other questions :)
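
For reference, a sketch of raising that limit when initializing a model (encoder name and labels are placeholders; the keyword argument follows the SpanMarker README):

from span_marker import SpanMarkerModel

# entity_max_length counts words, not characters: a 25-character
# identifier that tokenizes as a single word still fits easily.
model = SpanMarkerModel.from_pretrained(
    "bert-base-cased",                       # placeholder encoder
    labels=["O", "B-SECTION", "I-SECTION"],  # placeholder label set
    entity_max_length=8,                     # entities up to 8 words
)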

@tomaarsen tomaarsen merged commit 4ed4dc9 into main Oct 31, 2023
8 checks passed
@tomaarsen tomaarsen deleted the feat/bilo_support branch October 31, 2023 11:10
Successfully merging this pull request may close these issues.

Cannot train BILOU scheme with no singletons