spaCy: Adding exceptions to sentencizer
I am not sure if I haven’t looked thoroughly enough in the docs, but I want to add abbreviation exceptions to the sentence tokenizer.
E.g. "Operating income incl. JV was SEK 2.1 b. with an operating margin of 4.0%" is split into "Operating income incl." and "JV was SEK 2.1 b. with an operating margin of 4.0%".
My experience so far with spaCy tells me that there is probably a smart way to fix it?
Posted as a bug, but it might be doc related or a feature request.
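from spacy.lang.en import English
nlp = English()
# spaCy 2.x API: create the rule-based sentencizer and add it to the pipeline
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp('Operating income incl. JV was SEK 2.1 b. with an operating margin of 4.0%')
# Expected: one sentence. This fails because of the split after "incl."
assert len(list(doc.sents)) == 1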
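To make the requested behaviour concrete, here is a minimal standalone sketch of rule-based sentence splitting with an abbreviation exception list. It is plain Python and does not use or reflect spaCy's sentencizer internals; the names ABBREVIATIONS and split_sentences are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
# Hypothetical abbreviation list (an assumption); extend it for your own domain.
ABBREVIATIONS = {"incl.", "b.", "e.g.", "i.e."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after tokens ending in '.', '!' or '?' unless the token is a
    known abbreviation. Tokens are whitespace-separated (a simplification)."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        is_abbreviation = token.lower() in abbreviations
        if token[-1] in ".!?" and not is_abbreviation:
            # Sentence-final punctuation that is not an exception: close the sentence.
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no final punctuation.
        sentences.append(" ".join(current))
    return sentences


text = "Operating income incl. JV was SEK 2.1 b. with an operating margin of 4.0%"
print(split_sentences(text))
```

With the exception list above, the example sentence stays in one piece, while ordinary sentence-final periods still trigger a split.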
Info about spaCy
- spaCy version: 2.1.8
- Platform: Linux-5.0.0-25-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.3
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (8 by maintainers)
This will be in v3.
It’s actually basically done on develop: https://github.com/explosion/spaCy/blob/8137b24928432c7c23ea66d190584336075e29ae/spacy/pipeline/pipes.pyx#L758-L921
develop is currently under pretty heavy development (mainly due to the new rewrite of thinc), but you’re welcome to try it out. You should be able to train models with spacy train -p sentrec and the JSON training format as long as you have orth for each of the tokens in each sentence. The shortcut name is probably also going to change from sentrec to senter, unless someone comes up with a better name in the meanwhile.
Wait, hmm, it looks like it hasn’t been updated for some of the very recent changes, but if you try it as of about https://github.com/explosion/spaCy/commit/d2f3a44b42bfff9773fdf3abaccdcc0e78d295f7, it should work.
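For anyone wondering what "orth for each of the tokens in each sentence" might look like, here is a rough sketch that builds a minimal document in the style of the spaCy v2 JSON training format (documents, then paragraphs, then sentences, then tokens). The helper name to_training_json is hypothetical, the whitespace tokenization is an assumption that will not match spaCy's tokenizer, and the exact fields the develop branch expects may differ.

```python
import json


def to_training_json(sentences, doc_id=0):
    """Build one document with only `id` and `orth` per token.

    `sentences` is a list of gold-standard sentence strings. Tokens are
    whitespace-separated here (an assumption, not spaCy's tokenization).
    """
    sentence_entries = []
    token_id = 0  # running index of the token within the document
    for sentence in sentences:
        tokens = []
        for word in sentence.split():
            tokens.append({"id": token_id, "orth": word})
            token_id += 1
        sentence_entries.append({"tokens": tokens})
    paragraph = {"raw": " ".join(sentences), "sentences": sentence_entries}
    return [{"id": doc_id, "paragraphs": [paragraph]}]


data = to_training_json([
    "Operating income incl. JV was SEK 2.1 b. with an operating margin of 4.0%",
    "Costs fell.",
])
print(json.dumps(data, indent=2))
```

The sentence boundaries are encoded purely by how tokens are grouped into the "sentences" list, so abbreviations like "incl." simply stay inside their sentence in the gold data.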
I’ve also worked on prodigy recipes for it, which isn’t too much work because it’s just a variant of pos, but those will have to wait until prodigy is updated for spacy v3.