
Add support for BILO label scheme #31

Merged
merged 5 commits into main from feat/bilo_support on Oct 31, 2023
Conversation

tomaarsen
Owner

Closes #30

Hello!

Pull Request overview

  • Add BILO label scheme

Details

It might be as simple as this PR proposes, but there might also be some hidden issues that I'm not thinking of.
@david-waterworth perhaps you could install this PR and try to see if you can train with it.

pip install git+https://github.com/tomaarsen/SpanMarkerNER.git@feat/bilo_support
- Tom Aarsen

@david-waterworth

@tomaarsen I did something similar myself and it didn't work - the problem is that LabelNormalizerBILOU calls self.label_ids_by_tag["U"], and this also throws if there are no U's in the dataset

class LabelNormalizerBILOU(LabelNormalizerScheme):

As a quick hack I created a custom LabelNormalizerBILO - an alternative would be to replace self.label_ids_by_tag["U"] with self.label_ids_by_tag.get("U", set())
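
For illustration, a minimal sketch of that defensive lookup with toy data (the mapping here is a stand-in, not SpanMarker's actual internals):

# Toy stand-in for the normalizer's tag-to-label-ids mapping; the real
# attribute lives on the LabelNormalizerBILOU instance.
label_ids_by_tag = {"B": {1}, "I": {2}, "L": {3}, "O": {0}}

# label_ids_by_tag["U"] raises KeyError when the dataset contains no
# unit-length ("U") spans; .get with an empty-set default does not.
unit_ids = label_ids_by_tag.get("U", set())
print(unit_ids)  # set()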

Doing this gets it to the stage where tokenisation works (I also had to hard-code is_split_into_words = True in the tokeniser, as I ended up with some whitespace-only tokens, but that's something I'll fix later).

But it's been running for 2 hours now and the progress shows

"Tokenizing the train dataset 100%"
"Spreading data between multiple samples 100%"
"0%"

And it's been stuck on this step - there's no CPU or GPU activity, so I think it may have failed - possibly a deadlock, as I didn't set TOKENIZERS_PARALLELISM

@tomaarsen
Owner Author

tomaarsen commented Sep 18, 2023

The is_split_into_words = True might be unrelated. The current setup isn't very robust, so perhaps there's some edge case where your training text contains tokens with spaces?

Regarding the 0% - that one is odd. I can't really explain that one. I'll try to do some debugging locally.

Edit: I've created a BILO dataset locally by updating a BIO dataset, and it all seems to work for me after 89e1a4a.
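
For reference, a sketch of that kind of BIO-to-BILO update (my assumption about the transformation, not the exact script used; assumes string tags, and that single-token entities keep their B- tag since BILO has no U):

def bio_to_bilo(tags):
    # Rewrite the final I- tag of each BIO span to L-.
    out = list(tags)
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("I-") and not nxt.startswith("I-"):
            out[i] = "L-" + tag[2:]
    return out

print(bio_to_bilo(["B-ORG", "I-ORG", "I-ORG", "O"]))
# ['B-ORG', 'I-ORG', 'L-ORG', 'O']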

@tomaarsen
Owner Author

tomaarsen commented Sep 18, 2023

Another alternative approach is mapping your dataset from BILO to BIO by replacing all L-... with I-.... The resulting dataset should be equivalently expressive.
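
A sketch of that mapping, assuming string tags (adapt if your dataset stores integer label ids instead):

def bilo_to_bio(tags):
    # Rewriting every L- prefix to I- turns BILO into plain BIO.
    return ["I-" + tag[2:] if tag.startswith("L-") else tag for tag in tags]

print(bilo_to_bio(["B-EQUIP", "I-EQUIP", "L-EQUIP", "O"]))
# ['B-EQUIP', 'I-EQUIP', 'I-EQUIP', 'O']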

@david-waterworth

I've found that BIO doesn't work very well in my case. I'm trying to extract equipment identifiers from sensor point descriptions, i.e. entities like AHU-01-01; if you don't include the L, it often drops the last few characters, as AHU-01 is also common (spaCy mentions this in their introduction to spancat).

I've got it running now - I interrupted it and saw from the stack trace that the dataloader thread was waiting, so I set TOKENIZERS_PARALLELISM=false and it runs (I had to set the batch size quite small though, as I increased the entity length; I may need to use gradient accumulation).
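
For anyone hitting the same hang, one way to set that from Python (a sketch; exporting the variable in the shell works just as well):

import os

# Must be set before the fast (Rust) tokenizer is first used, otherwise
# forked DataLoader workers can still deadlock on its thread pool.
os.environ["TOKENIZERS_PARALLELISM"] = "false"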

@tomaarsen
Owner Author

For the purposes of SpanMarker, all labeling schemes get normalized into the same scheme: (label, start index, end index) tuples. It doesn't actually predict token-level labels such as I-ORG or L-ORG, but just ORG, for example. So, I think that converting BILO to BIO would still work. That said, I'd love to support BILO as well, as I prefer simplifying the work for the user.
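
Illustratively, a hypothetical helper (not SpanMarker's actual code) showing how BILOU-style tags collapse into such span tuples:

def bilou_to_spans(tags):
    # Collect (label, start, end) spans from BILOU/BILO tags, where
    # B or U opens a span and L or U closes it (end is exclusive).
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix in ("B", "U"):
            start = i
        if prefix in ("L", "U"):
            spans.append((label, start, i + 1))
            start = None
    return spans

print(bilou_to_spans(["B-EQUIP", "I-EQUIP", "L-EQUIP", "O"]))
# [('EQUIP', 0, 3)]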


@david-waterworth have you been able to make any progress?

@david-waterworth

Yeah, that's a good point. I did get training working; everything looked good, with the loss dropping nicely, and the validation accuracy was fine. But when I tried to reload the model, I couldn't get it to detect any entities using the same examples I trained/tested on.

I had to hack around a bit with tokenisation, so that's almost certainly the issue - something I did during training isn't being done when I try to predict, so I'll have to look a bit closer.

@abhayalok

@tomaarsen, is entity_max_length impacting the model performance? In my case, once I passed resume-style data, it's not detecting all the sections, because some of the entity lengths are greater than 25 characters.

@tomaarsen
Owner Author

@abhayalok entity_max_length denotes the maximum number of words that an entity can be. So a value of 8 means that it can detect entities with 8 words or less. You can always open an Issue on this repository if you have any other questions :)
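
For reference, a sketch of raising that limit when initializing a model (encoder name and labels are placeholders; the keyword argument follows the SpanMarker README):

from span_marker import SpanMarkerModel

# entity_max_length counts words, not characters: a 25-character
# identifier that tokenizes as a single word still fits easily.
model = SpanMarkerModel.from_pretrained(
    "bert-base-cased",                       # placeholder encoder
    labels=["O", "B-SECTION", "I-SECTION"],  # placeholder label set
    entity_max_length=8,                     # entities up to 8 words
)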

@tomaarsen tomaarsen merged commit 4ed4dc9 into main Oct 31, 2023
8 checks passed
@tomaarsen tomaarsen deleted the feat/bilo_support branch October 31, 2023 11:10
Successfully merging this pull request may close these issues.

Cannot train BILOU scheme with no singletons