Using spaCy `nlp.pipe` now processes texts sentence-wise, just like for `nlp(...)`. #41

tomaarsen · 2023-10-31T10:44:11Z

Closes #37

Hello!

Pull Request overview

With this PR, spaCy's `nlp.pipe` processes texts sentence-wise, just like `nlp(...)`.

Details

Before this PR, `nlp(...)` split each document into sentences and predicted on each sentence separately, while `nlp.pipe` processed each document in full. This discrepancy is unexpected and can lead to worse predictions. For example, consider the following scenario:

import spacy

# Load the spaCy model with the span_marker pipeline component
nlp = spacy.load("en_core_web_sm")  # optionally pass exclude=["ner"] to drop spaCy's own NER
nlp.add_pipe(
    "span_marker",
    config={"model": "tomaarsen/span-marker-roberta-large-fewnerd-fine-super"},
)

text = [
    "Leonardo da Vinci recently published a scientific paper on combatting Mitocromulent disease. Leonardo da Vinci painted the most famous painting in existence: the Mona Lisa.",
    "Leonardo da Vinci scored a critical goal towards the end of the second half. Leonardo da Vinci controversially veto'd a bill regarding public health care last friday. Leonardo da Vinci was promoted to Sergeant after his outstanding work in the war."
]
for doc in nlp.pipe(text):
    print([(entity, entity.label_) for entity in doc.ents])

Results in:

[(Leonardo da Vinci, 'person-scholar'), (Mitocromulent, 'other-disease'), (Leonardo da Vinci, 'person-artist/author'), (the Mona Lisa, 'PERSON')]
[(Leonardo da Vinci, 'person-athlete'), (the end of the second half, 'DATE'), (Leonardo da Vinci, 'person-athlete'), (last friday, 'DATE'), (Leonardo da Vinci, 'person-athlete'), (Sergeant, 'PERSON')]

Note: In the second document, Leonardo da Vinci is classified as a person-athlete all three times, because the full text is sent through the model in one go.

After this PR, this same script now results in:

[(Leonardo da Vinci, 'person-scholar'), (Mitocromulent, 'other-disease'), (Leonardo da Vinci, 'person-artist/author'), (the Mona Lisa, 'PERSON')]
[(Leonardo da Vinci, 'person-athlete'), (the end of the second half, 'DATE'), (Leonardo da Vinci, 'person-politician'), (last friday, 'DATE'), (Leonardo da Vinci, 'person-soldier'), (Sergeant, 'PERSON')]

Note: Leonardo da Vinci is now (correctly) classified according to the context of each sentence. This is because each sentence is sent through the model separately, as you would expect.

The implementation is a tad messy, as there now isn't a 1-1 mapping between all entities and the docs, but it works well.
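
To illustrate why the bookkeeping gets awkward, here is a minimal sketch of sentence-wise batching. It is not the code from this PR: `split_sentences`, `predict_sentencewise`, and `toy_predict_batch` are hypothetical names, the regex splitter stands in for spaCy's sentence segmentation, and the toy predictor stands in for the SpanMarker model. The idea is to flatten all sentences from all texts into one batch, predict on each sentence in isolation, and then shift the sentence-relative offsets back into their source document.

```python
import re
from typing import Callable, Iterable, List, Tuple

# (start_char, end_char, label); offsets are relative to the text that was predicted on
Entity = Tuple[int, int, str]


def split_sentences(text: str) -> List[Tuple[int, str]]:
    """Naive splitter (an assumption for this sketch): (char_offset, sentence) pairs."""
    return [(m.start(), m.group()) for m in re.finditer(r"\S[^.!?]*[.!?]?", text)]


def predict_sentencewise(
    texts: Iterable[str],
    predict_batch: Callable[[List[str]], List[List[Entity]]],
) -> List[List[Entity]]:
    texts = list(texts)
    # Flatten every sentence of every text into one batch, remembering its origin.
    flat: List[Tuple[int, int, str]] = []  # (doc_index, char_offset, sentence)
    for doc_index, text in enumerate(texts):
        for offset, sentence in split_sentences(text):
            flat.append((doc_index, offset, sentence))

    # One call over all sentences; each sentence is predicted without its neighbours.
    predictions = predict_batch([sentence for _, _, sentence in flat])

    # Regroup: shift sentence-relative offsets back to document-relative ones.
    per_doc: List[List[Entity]] = [[] for _ in texts]
    for (doc_index, offset, _), entities in zip(flat, predictions):
        for start, end, label in entities:
            per_doc[doc_index].append((start + offset, end + offset, label))
    return per_doc


def toy_predict_batch(sentences: List[str]) -> List[List[Entity]]:
    """Stand-in for a real model: labels "Ada" from a keyword in the same sentence."""
    results: List[List[Entity]] = []
    for sentence in sentences:
        entities: List[Entity] = []
        start = sentence.find("Ada")
        if start != -1:
            if "goal" in sentence:
                label = "person-athlete"
            elif "bill" in sentence:
                label = "person-politician"
            else:
                label = "person"
            entities.append((start, start + 3, label))
        results.append(entities)
    return results


texts = ["Ada scored a goal. Ada vetoed a bill.", "Ada painted."]
for entities in predict_sentencewise(texts, toy_predict_batch):
    print(entities)
```

Because the batch is indexed by sentence rather than by document, the regrouping step has to carry the document index and character offset alongside every sentence, which is the kind of extra mapping the note above refers to.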

Thank you @q-jackboylan for reporting this discrepancy.

Tom Aarsen


q-jackboylan · 2023-11-01T19:32:13Z

Awesome! 😀 Thanks for getting on this, and so fast ❤️

tomaarsen · 2023-11-01T20:02:11Z

Gladly! And thanks again for reporting this!

tomaarsen added 3 commits October 31, 2023 11:29

pipe now does sentence-wise predictions just like __call__

35be5f5

Merge branch 'main' of https://github.com/tomaarsen/SpanMarkerNER int…

20f8ab8

…o spacy/pipe_sentencewise

Update changelog

042eb12

tomaarsen merged commit efbbb68 into main Oct 31, 2023
8 checks passed

tomaarsen deleted the spacy/pipe_sentencewise branch October 31, 2023 10:53
