transformers: NER pipeline: Inconsistent entity grouping

🐛 Bug

Information

Model I am using: mrm8488/bert-spanish-cased-finetuned-ner

Language I am using the model on (English, Chinese, ...): Spanish

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Create an NER pipeline
  2. Pass the grouped_entities=True flag
  3. Entities are not grouped as expected (see the sample below)
NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"
nlp_ner = pipeline("ner", model=NER_MODEL,
                   grouped_entities=True,
                   tokenizer=(NER_MODEL, {"use_fast": False}))

t = """Consuelo AraĂșjo Noguera, ministra de cultura del presidente AndrĂ©s Pastrana (1998.2002) fue asesinada por las Farc luego de haber permanecido secuestrada por algunos meses."""
ner(t)
>>> 
[ {'entity_group': 'B-PER', 'score': 0.901019960641861, 'word': 'Consuelo'}, 
 {'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Araújo Noguera'}, 
 {'entity_group': 'B-PER', 'score': 0.9998136162757874, 'word': 'Andrés'}, 
 {'entity_group': 'I-PER', 'score': 0.9996985991795858, 'word': 'Pastrana'}, 
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]

Expected behavior

Inconsistent grouping

I expect the first two items of the given sample (B-PER and I-PER) to be grouped, since they are contiguous tokens and correspond to a single entity span. It seems the current code does not take the B- and I- prefixes into account.

expected output:

[{'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Consuelo Araújo Noguera'}, 
 {'entity_group': 'I-PER', 'score': 0.9998136162757874, 'word': 'Andrés Pastrana'}, 
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Farc'}]
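In the meantime, a rough workaround sketch (my own post-processing, not the pipeline's internal grouping code): run the pipeline with grouped_entities=False and ignore_labels=[], then merge the '##' word pieces and contiguous B-/I- tokens of the same base type. The merge_tokens helper below is illustrative only; it takes the minimum score per group, whereas a real fix might average them.

from transformers import pipeline

NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"

# Ungrouped pipeline that keeps every token, including those tagged "O"
nlp_raw = pipeline("ner", model=NER_MODEL,
                   tokenizer=(NER_MODEL, {"use_fast": False}),
                   ignore_labels=[])

def merge_tokens(tokens):
    """Illustrative helper: glue '##' word pieces back together and merge
    contiguous B-/I- tokens sharing the same base entity type."""
    groups = []
    for tok in tokens:
        base = tok["entity"].split("-")[-1]      # "B-PER" / "I-PER" -> "PER"
        piece = tok["word"]
        continuation = piece.startswith("##")
        same_type = bool(groups) and groups[-1]["entity_group"] == base
        if groups and (continuation or (tok["entity"].startswith("I-") and same_type)):
            groups[-1]["word"] += piece[2:] if continuation else " " + piece
            groups[-1]["score"] = min(groups[-1]["score"], tok["score"])
        else:
            groups.append({"entity_group": base, "score": tok["score"], "word": piece})
    return [g for g in groups if g["entity_group"] != "O"]

merge_tokens(nlp_raw(t))   # t is the sample sentence from the reproduction above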

Lost tokens?

For the same input, passing grouped_entities=False generates the following output:

[
{'word': 'Cons', 'score': 0.9994944930076599, 'entity': 'B-PER', 'index': 1},
{'word': '##uelo', 'score': 0.802545428276062, 'entity': 'B-PER', 'index': 2}, 
{'word': 'Ara', 'score': 0.9993102550506592, 'entity': 'I-PER', 'index': 3}, 
{'word': '##Ășj', 'score': 0.9993743896484375, 'entity': 'I-PER', 'index': 4}, 
{'word': '##o', 'score': 0.9992871880531311, 'entity': 'I-PER', 'index': 5}, 
{'word': 'No', 'score': 0.9993029236793518, 'entity': 'I-PER', 'index': 6}, 
{'word': '##guera', 'score': 0.9981776475906372, 'entity': 'I-PER', 'index': 7}, 
{'word': 'Andrés', 'score': 0.9998136162757874, 'entity': 'B-PER', 'index': 15}, 
{'word': 'Pas', 'score': 0.999740719795227, 'entity': 'I-PER', 'index': 16}, 
{'word': '##tran', 'score': 0.9997414350509644, 'entity': 'I-PER', 'index': 17}, 
{'word': '##a', 'score': 0.9996136426925659, 'entity': 'I-PER', 'index': 18}, 
{'word': 'Far', 'score': 0.9989739060401917, 'entity': 'B-ORG', 'index': 28}, 
{'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29}]

When using grouped_entities=True, the last entity word piece (##c) is lost; it is not even included as a separate group:

{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]

Environment info

  • transformers version: 2.11.0
  • Platform: OSX
  • Python version: 3.7
  • PyTorch version (GPU?): 1.5.0
  • Tensorflow version (GPU?):
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

@dav009 Thanks for posting this issue!

  1. Inconsistent grouping - correct, the B- and I- prefixes are not yet considered. I will have to include this in a new PR.
  2. Lost tokens - the skipped tokens are those with an entity type found in the ignore_labels argument for TokenClassificationPipeline, which is set as ["O"] by default. If you don’t want to skip any token, you can just set ignore_labels=[].
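For example (a minimal usage sketch with the model from this report):

from transformers import pipeline

NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"

# Pass ignore_labels=[] so tokens tagged "O" are no longer skipped
nlp_ner = pipeline("ner", model=NER_MODEL,
                   tokenizer=(NER_MODEL, {"use_fast": False}),
                   ignore_labels=[])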

I’m happy to work on 1 within the next week or so since I’ve already been planning to apply this fix.

@Nighthyst I see, you’re bringing up a different issue now. This is the case where the entity type of one of a word’s word pieces differs from that of the other word pieces.

A fix I can apply here is to automatically group word pieces together regardless of entity type. I can apply this to a new PR after merging this existing one.
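As a rough illustration (not the actual PR code), such a fix could glue '##' continuation pieces onto the preceding piece and keep the first piece's label, regardless of what was predicted for the continuation:

def merge_word_pieces(tokens):
    # Attach '##' continuation pieces to the preceding piece and keep the
    # first piece's entity label (illustrative sketch only)
    words = []
    for tok in tokens:
        if tok["word"].startswith("##") and words:
            words[-1]["word"] += tok["word"][2:]
        else:
            words.append(dict(tok))
    return words

merge_word_pieces([
    {'word': 'Far', 'score': 0.9989739060401917, 'entity': 'B-ORG', 'index': 28},
    {'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29},
])
# -> [{'word': 'Farc', 'score': 0.9989739060401917, 'entity': 'B-ORG', 'index': 28}]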

@dav009 Understood now! Thanks for clarifying. Yes, it does seem to be related to the I and B issue. I think I can handle this in the same PR.

Hi everyone, this PR was recently merged to resolve the original issue #4987.

@sudharsan2020 Setting grouped_entities=True should work for your example under the new PR, since similar entities w/ different prefixes are now grouped (e.g. “I-PER” and “B-PER”) 😄

@dav009 Opened a PR (above) that should resolve this 😄

@enzoampil 👋 thanks for your prompt answer

Lost tokens - the skipped tokens are those with an entity type found in the ignore_labels argument for TokenClassificationPipeline, which is set as ["O"] by default. If you don’t want to skip any token, you can just set ignore_labels=[].

In the given sample, the missing token is not tagged as O:

  • ##c is tagged as I-ORG when grouped_entities=False: {'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29}

However, it did not get included in the grouping results (grouped_entities=True).