Add support for BILO label scheme #31
Conversation
@tomaarsen I did something similar myself and it didn't work - the problem is the

As a quick hack I created a custom

Doing this gets it to the stage where tokenisation works (I also had to hard code

But it's been running for 2 hours now and the progress shows "Tokenizing the train dataset 100%", and it's been stuck on this step - there's no CPU or GPU activity, so I think it may have failed - possibly a deadlock, as I didn't set TOKENIZERS_PARALLELISM.
The

Regarding the 0% - that one is odd. I can't really explain it. I'll try to do some debugging locally.

Edit: I've created a BILO dataset locally by updating a BIO dataset, and it all seems to work for me after 89e1a4a.
An alternative approach is to map your dataset from BILO to BIO by replacing all
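That remapping is a single pass over the tag sequences. A minimal sketch (the label names here are hypothetical, and the `L-` prefix convention for "last token of entity" is assumed):

```python
def bilo_to_bio(tags):
    """Fold BILO tags into BIO by rewriting each L- tag as the matching I- tag.

    Assumes string tags of the form "B-X", "I-X", "L-X", or "O".
    """
    return ["I-" + tag[2:] if tag.startswith("L-") else tag for tag in tags]

# "L-EQUIP" (the last token of the entity) becomes a plain "I-EQUIP".
print(bilo_to_bio(["B-EQUIP", "I-EQUIP", "L-EQUIP", "O"]))
# → ['B-EQUIP', 'I-EQUIP', 'I-EQUIP', 'O']
```

With a Hugging Face `datasets` dataset, this could be applied per-example via `dataset.map`, rewriting the tag column in place.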
I've found that BIO doesn't work very well in my case. I'm trying to extract equipment identifiers from sensor point descriptions, so for entities like AHU-01-01, if you don't include the L it often drops the last few characters, as AHU-01 is also common (spaCy mentions this in their introduction to spancat).

I've got it running now - I interrupted it and saw from the stack trace that the dataloader thread was waiting, so I set TOKENIZERS_PARALLELISM=false and it runs (I had to set the batch size quite small though, as I increased the entity length - I may need to use gradient accumulation).
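For reference, the variable needs to be set before the tokenizer is first used (or before dataloader workers fork); one way is at the very top of the training script:

```python
import os

# Disable the Rust tokenizer's internal thread pool. Forking dataloader
# workers after that pool has started is what can lead to the deadlock,
# and this setting also silences the related huggingface/tokenizers warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting it in the shell (`export TOKENIZERS_PARALLELISM=false`) before launching training has the same effect.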
For the purposes of SpanMarker, all labeling schemes get normalized into the same scheme: (label, start index, end index) tuples. It doesn't actually predict token-level labels such as

@david-waterworth have you been able to make any progress?
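To make that normalization concrete, here is a rough sketch (not SpanMarker's actual code) of collapsing token-level BIO tags into (label, start, end) tuples, with the end index exclusive:

```python
def bio_to_spans(tags):
    """Collapse token-level BIO tags into (label, start, end) span tuples.

    `end` is exclusive. Assumes string tags of the form "B-X", "I-X", or "O".
    """
    spans, label, start = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if label is not None:          # close the span in progress
                spans.append((label, start, i))
            label, start = tag[2:], i      # open a new span
        elif tag == "O":
            if label is not None:
                spans.append((label, start, i))
            label, start = None, None
    if label is not None:                  # entity running to end of sentence
        spans.append((label, start, len(tags)))
    return spans

print(bio_to_spans(["B-EQUIP", "I-EQUIP", "O", "B-LOC"]))
# → [('EQUIP', 0, 2), ('LOC', 3, 4)]
```

Once everything is in span-tuple form, the distinction between BIO, BILO, BILOU, etc. disappears, which is why supporting a new scheme mainly means supporting it at the input-parsing stage.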
Yeah, that's a good point. I did get training working - everything looked good, with the loss dropping nicely, and the validation accuracy was fine. But when I tried to reload the model, I couldn't get it to detect any entities using the same examples I trained/tested on.

I had to hack around a bit with tokenisation, so that's almost certainly the issue - something I did during training isn't being done when I try to predict, so I'll have to look a bit closer.
@tomaarsen, does entity_max_length impact the model performance? In my case, once I passed resume-style data, it's not detecting all the sections, because some of the entity lengths are greater than 25 characters.
@abhayalok |
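On the entity_max_length point: a span-based model only scores candidate spans up to that maximum length, so a longer entity can never be predicted, no matter how well the model is trained. A toy illustration of the candidate enumeration (this is not SpanMarker's implementation, and note that the limit here is counted in tokens - worth checking against the SpanMarker docs whether its limit is tokens or characters):

```python
def candidate_spans(n_tokens, entity_max_length):
    """Enumerate all (start, end) candidate spans up to entity_max_length tokens.

    `end` is exclusive, so span length is end - start.
    """
    return [
        (start, end)
        for start in range(n_tokens)
        for end in range(start + 1, min(start + entity_max_length, n_tokens) + 1)
    ]

# With a 30-token sentence and a limit of 25, a 28-token entity starting
# at token 0 is simply never among the candidates the model can score.
spans = candidate_spans(30, 25)
print((0, 25) in spans)  # → True
print((0, 28) in spans)  # → False
```

So if some sections genuinely exceed the configured maximum, raising entity_max_length (at the cost of more candidate spans, hence more memory) is the direct fix.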
…o feat/bilo_support
Closes #30
Hello!
Pull Request overview
Details
It might be as simple as this PR proposes, but there might also be some hidden issues that I'm not thinking of.
@david-waterworth perhaps you could install this PR and try to see if you can train with it.
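One way to install the PR branch directly (assuming the repository URL below; substitute the actual repo if it differs):

```shell
pip install git+https://github.com/tomaarsen/SpanMarkerNER.git@feat/bilo_support
```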