Using spaCy nlp.pipe now processes texts sentence-wise, just like for nlp(...). #41

Merged 3 commits on Oct 31, 2023

Conversation

tomaarsen
Owner

Closes #37

Hello!

Pull Request overview

  • spaCy's nlp.pipe now processes texts sentence-wise, just like nlp(...).

Details

Before this PR, nlp(...) split the document into sentences and ran prediction on each sentence separately, while nlp.pipe(...) processed each document in full. This was unexpected, and it can also perform worse. For example, consider the following scenario:

import spacy

# Load the spaCy model and add the span_marker pipeline component
nlp = spacy.load("en_core_web_sm")  # optionally: exclude=["ner"] to drop spaCy's own NER
nlp.add_pipe(
    "span_marker",
    config={"model": "tomaarsen/span-marker-roberta-large-fewnerd-fine-super"},
)

text = [
    "Leonardo da Vinci recently published a scientific paper on combatting Mitocromulent disease. Leonardo da Vinci painted the most famous painting in existence: the Mona Lisa.",
    "Leonardo da Vinci scored a critical goal towards the end of the second half. Leonardo da Vinci controversially veto'd a bill regarding public health care last friday. Leonardo da Vinci was promoted to Sergeant after his outstanding work in the war."
]
for doc in nlp.pipe(text):
    print([(entity, entity.label_) for entity in doc.ents])

Results in:

[(Leonardo da Vinci, 'person-scholar'), (Mitocromulent, 'other-disease'), (Leonardo da Vinci, 'person-artist/author'), (the Mona Lisa, 'PERSON')]
[(Leonardo da Vinci, 'person-athlete'), (the end of the second half, 'DATE'), (Leonardo da Vinci, 'person-athlete'), (last friday, 'DATE'), (Leonardo da Vinci, 'person-athlete'), (Sergeant, 'PERSON')]

Note: Leonardo da Vinci is classified as a person-athlete three times, because the full text is sent through the model in one go.


After this PR, this same script now results in:

[(Leonardo da Vinci, 'person-scholar'), (Mitocromulent, 'other-disease'), (Leonardo da Vinci, 'person-artist/author'), (the Mona Lisa, 'PERSON')]
[(Leonardo da Vinci, 'person-athlete'), (the end of the second half, 'DATE'), (Leonardo da Vinci, 'person-politician'), (last friday, 'DATE'), (Leonardo da Vinci, 'person-soldier'), (Sergeant, 'PERSON')]

Note: Leonardo da Vinci is now (correctly) classified according to the context, because each sentence is sent through the model separately, as you would expect.

The implementation is a tad messy, as there now isn't a 1-1 mapping between all entities and the docs, but it works well.
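To illustrate the bookkeeping involved, here is a minimal sketch of sentence-wise batching. It is not the code from this PR: `split_sentences`, `predict_sentencewise`, and `toy_predict_batch` are hypothetical names, the regex splitter stands in for spaCy's sentence segmentation, and the toy predictor stands in for the SpanMarker model. The idea is to flatten all sentences from all docs into one batch, remember which doc and character offset each sentence came from, and then map the predicted entities back.

```python
import re
from typing import Callable, List, Tuple

# (start_char, end_char, label), relative to the text given to the predictor
Entity = Tuple[int, int, str]


def split_sentences(text: str) -> List[Tuple[int, str]]:
    """Naive splitter (assumption: a stand-in for spaCy's sentence segmentation).

    Returns (char_offset, sentence) pairs, with surrounding whitespace stripped.
    """
    sentences = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        stripped = raw.strip()
        if stripped:
            leading = len(raw) - len(raw.lstrip())
            sentences.append((match.start() + leading, stripped))
    return sentences


def predict_sentencewise(
    texts: List[str],
    predict_batch: Callable[[List[str]], List[List[Entity]]],
) -> List[List[Entity]]:
    """Predict per sentence, then regroup the entities per original text."""
    # Flatten: one entry per sentence, remembering its doc and offset.
    flat = []
    for doc_idx, text in enumerate(texts):
        for offset, sentence in split_sentences(text):
            flat.append((doc_idx, offset, sentence))

    # A single batched call over all sentences of all docs.
    predictions = predict_batch([sentence for _, _, sentence in flat])

    # Regroup: shift sentence-relative offsets back to doc-relative offsets.
    results: List[List[Entity]] = [[] for _ in texts]
    for (doc_idx, offset, _), entities in zip(flat, predictions):
        for start, end, label in entities:
            results[doc_idx].append((start + offset, end + offset, label))
    return results


def toy_predict_batch(sentences: List[str]) -> List[List[Entity]]:
    """Toy context-dependent predictor (hypothetical, not a real model)."""
    name = "Leonardo da Vinci"
    cues = {
        "goal": "person-athlete",
        "bill": "person-politician",
        "Sergeant": "person-soldier",
    }
    output = []
    for sentence in sentences:
        entities = []
        pos = sentence.find(name)
        if pos != -1:
            label = next(
                (lab for cue, lab in cues.items() if cue in sentence),
                "person-other",
            )
            entities.append((pos, pos + len(name), label))
        output.append(entities)
    return output


texts = [
    "Leonardo da Vinci scored a goal. Leonardo da Vinci vetoed a bill.",
    "Leonardo da Vinci was promoted to Sergeant.",
]
results = predict_sentencewise(texts, toy_predict_batch)
for text, entities in zip(texts, results):
    print([(text[start:end], label) for start, end, label in entities])
```

Because the number of sentences per doc varies, the flat list of predictions has to be regrouped by doc index rather than zipped directly against the docs, which is the kind of indirection the note above refers to.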

Thank you @q-jackboylan for reporting this discrepancy.

  • Tom Aarsen

@tomaarsen tomaarsen merged commit efbbb68 into main Oct 31, 2023
8 checks passed
@tomaarsen tomaarsen deleted the spacy/pipe_sentencewise branch October 31, 2023 10:53
@q-jackboylan

Awesome! 😀 Thanks for getting on this, and so fast ❤️

@tomaarsen
Owner Author

tomaarsen commented Nov 1, 2023

Gladly! And thanks again for reporting this!

Successfully merging this pull request may close these issues.

spaCy_integration .pipe() does not behave as expected