spaCy: Adding exceptions to sentencizer

I am not sure whether I haven’t looked thoroughly enough in the docs, but I want to add abbreviation exceptions to the sentence tokenizer.

E.g. “Operating income incl. JV was SEK 2.1 b. with an operating margin of 4.0%” is split into “Operating income incl.” and “JV was SEK 2.1 b. with an operating margin of 4.0%”.

My experience so far with spaCy tells me that there is probably a smart way to fix it?

Posted as bug but it might be doc related or a feature request.

from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp('Operating income incl. JV was SEK 2.1 b. with an operating margin of 4.0%')

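# Expected: the whole text is one sentence. This currently fails because
# the sentencizer splits after "incl."
assert len(list(doc.sents)) == 1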
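
One possible workaround is to post-process the sentence boundaries and suppress any boundary that directly follows a known abbreviation. The sketch below shows only the decision rule, in plain Python on a list of token strings. The names `ABBREVIATIONS`, `SENTENCE_END` and `sentence_starts` are hypothetical and not part of spaCy, and the rule assumes the tokenizer splits "incl." into "incl" and "." (which may not hold for every abbreviation).

```python
# Hypothetical abbreviation list (not part of spaCy); extend it for your domain.
ABBREVIATIONS = {"incl.", "approx.", "excl."}
# Tokens treated as sentence-final punctuation in this sketch.
SENTENCE_END = {".", "!", "?"}


def sentence_starts(tokens, abbreviations=ABBREVIATIONS):
    """Return one boolean per token: True if that token starts a sentence."""
    starts = []
    for i, tok in enumerate(tokens):
        if i == 0:
            # The first token always opens a sentence.
            starts.append(True)
            continue
        prev = tokens[i - 1]
        # Only a token right after sentence-final punctuation can start a
        # sentence; runs of punctuation such as "?!" stay together.
        if prev not in SENTENCE_END or tok in SENTENCE_END:
            starts.append(False)
            continue
        # Rejoin the word before the period and check it against the exceptions.
        before = tokens[i - 2].lower() if i >= 2 else ""
        is_abbreviation = prev == "." and (before + ".") in abbreviations
        starts.append(not is_abbreviation)
    return starts


# Assumed tokenization of the example sentence.
example = ["Operating", "income", "incl", ".", "JV", "was", "SEK", "2.1",
           "b.", "with", "an", "operating", "margin", "of", "4.0", "%"]
flags = sentence_starts(example)
```

In a spaCy pipeline, a custom component placed after the sentencizer could presumably copy such flags onto each token's `is_sent_start` attribute. That wiring is an assumption here and is only expected to work when no dependency parser has already set the boundaries.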

Info about spaCy

  • spaCy version: 2.1.8
  • Platform: Linux-5.0.0-25-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.3

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

This will be in v3.

It’s actually basically done on develop:

https://github.com/explosion/spaCy/blob/8137b24928432c7c23ea66d190584336075e29ae/spacy/pipeline/pipes.pyx#L758-L921

develop is currently under pretty heavy development (mainly due to the rewrite of thinc), but you’re welcome to try it out. You should be able to train models with spacy train -p sentrec and the JSON training format as long as you have orth for each of the tokens in each sentence. The shortcut name is probably also going to change from sentrec to senter, unless someone comes up with a better name in the meantime.

Wait, hmm, it looks like it hasn’t been updated for some of the very recent changes, but if you try it as of about https://github.com/explosion/spaCy/commit/d2f3a44b42bfff9773fdf3abaccdcc0e78d295f7, it should work.
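
For anyone preparing data for this, here is a rough sketch of building training data with only orth per token, following the nested documents/paragraphs/sentences/tokens layout of the spaCy v2 JSON training format. The helper `to_training_doc` is hypothetical, and whether the develop branch accepts exactly this minimal set of fields is an assumption that should be checked against the commit linked above.

```python
import json


def to_training_doc(doc_id, sentences):
    """Build one document entry from pre-tokenized, pre-split sentences.

    `sentences` is a list of sentences, each a list of token strings.
    Only "id" and "orth" are filled in for each token (assumed minimal set).
    """
    json_sentences = []
    token_id = 0  # running token index within the document
    for words in sentences:
        tokens = []
        for word in words:
            tokens.append({"id": token_id, "orth": word})
            token_id += 1
        json_sentences.append({"tokens": tokens})
    return {"id": doc_id, "paragraphs": [{"sentences": json_sentences}]}


# One gold sentence: the abbreviation "incl." does not end it.
gold = [["Operating", "income", "incl", ".", "JV", "was", "SEK", "2.1", "b.",
         "with", "an", "operating", "margin", "of", "4.0", "%"]]
training_data = [to_training_doc(0, gold)]
serialized = json.dumps(training_data, indent=2)
```

The sentence boundaries are encoded purely by how tokens are grouped into the "sentences" list, so abbreviation handling comes down to how the gold data is split.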

I’ve also worked on Prodigy recipes for it, which isn’t too much work because it’s just a variant of pos, but those will have to wait until Prodigy is updated for spaCy v3.