spaCy: Entity extracted at evaluation doesn't show up using the imported model

Hi,

Trained a custom NER model on our own labelled dataset on financial risk. Did pretraining as well, and the model finished with a score of 97.48.

Training pipeline: ner
Starting with blank model 'en'
2511 training docs
267 evaluation docs

============================== Vocab & Vectors ==============================
ℹ 101601 total words in the data (12951 unique)
ℹ No word vectors present in the model

========================== Named Entity Recognition ==========================
ℹ 1 new label, 0 existing labels
0 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities consisting of or starting/ending with punctuation

When evaluating the model on the dev set, entities got picked up just fine, but there is one entity, US Treasury Department’s Office of Foreign Assets Control (at least, the one I’ve noticed), that doesn’t show up in the same sentences when importing and testing model-best in a notebook:

[Three screenshots, taken 2020-08-24, not reproduced here]

Then I ran a test on every single sentence (150) containing the missing entity. Only 3 returned it, and only partially, as: Department’s Office of Foreign Assets Control. The rest returned nothing.
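For anyone who wants to repeat this kind of check, here is a rough sketch of how the per-sentence test could be scripted. The helper names (`classify_hit`, `audit_sentences`) are made up for illustration and are not part of spaCy. The only assumption about the pipeline is that calling `nlp(text)` returns a doc whose `.ents` items have a `.text` attribute, as spaCy's do. The "partial" rule is a simple substring check, so an unrelated entity that happens to be a substring of the target would also count as partial.

```python
from collections import Counter

# The entity string reported missing in this thread.
TARGET = "US Treasury Department’s Office of Foreign Assets Control"


def classify_hit(ent_texts, target):
    """Return 'full', 'partial', or 'missing' for one sentence's predictions."""
    if target in ent_texts:
        return "full"
    # Partial: some predicted entity is a non-empty substring of the target.
    if any(text and text in target for text in ent_texts):
        return "partial"
    return "missing"


def audit_sentences(nlp, sentences, target=TARGET):
    """Run the pipeline on each sentence and tally how the target was recovered."""
    counts = Counter()
    failures = []
    for sentence in sentences:
        doc = nlp(sentence)
        ent_texts = [ent.text for ent in doc.ents]
        status = classify_hit(ent_texts, target)
        counts[status] += 1
        if status != "full":
            # Keep the raw predictions so the failures can be inspected later.
            failures.append((status, sentence, ent_texts))
    return counts, failures


# Hypothetical usage with a trained pipeline (not executed here):
#   import spacy
#   nlp = spacy.load("./model/model-best")
#   counts, failures = audit_sentences(nlp, sentences_with_target)
```

Feeding it exactly the same strings that went into the dev set should make it easier to tell whether the difference comes from the model or from how the text is prepared before prediction.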

There are quite a few other entities with similar wording, like Department of Justice (130), Department of State (70), and US Department of the Treasury (40). Could these potentially conflict with the missing entity, US Treasury Department’s Office of Foreign Assets Control? Even so, that still wouldn’t explain why it is present in the evaluation sample but missing in production.

Btw, there’s a variant of the missing entity, US Treasury’s Office of Foreign Assets Control, which is picked up perfectly in every sentence I tested, which puzzles me even more.

Using the latest version of spaCy. Thanks.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

Hi @svlandeg, I appreciate your help so much! Yes, this makes sense and will help us in preparing our data in such a manner that’s consistent from training to production. Can’t wait to dive in and do some refactoring.

Thanks again!

Yes, I received it, thanks!

I just run: python3 -m spacy evaluate ./model/model-best dev.json -dp ./displacy, then navigate to the ./displacy folder and open entities.html to see the entities highlighted. The dev.json is the same one I used for training.