transformers: NER pipeline: Inconsistent entity grouping
🐛 Bug
Information
Model I am using (Bert, XLNet …): "mrm8488/bert-spanish-cased-finetuned-ner"
Language I am using the model on (English, Chinese …): Spanish
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- create a ner pipeline
- pass flag grouped_entities
- entities are not grouped as expected, see sample below
from transformers import pipeline

NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"
# Build the NER pipeline with the slow tokenizer and entity grouping enabled.
nlp_ner = pipeline("ner", model=NER_MODEL,
                   grouped_entities=True,
                   tokenizer=(NER_MODEL, {"use_fast": False}))
t = """Consuelo Araújo Noguera, ministra de cultura del presidente Andrés Pastrana (1998.2002) fue asesinada por las Farc luego de haber permanecido secuestrada por algunos meses."""
nlp_ner(t)
>>>
[ {'entity_group': 'B-PER', 'score': 0.901019960641861, 'word': 'Consuelo'},
{'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Araújo Noguera'},
{'entity_group': 'B-PER', 'score': 0.9998136162757874, 'word': 'Andrés'},
{'entity_group': 'I-PER', 'score': 0.9996985991795858, 'word': 'Pastrana'},
{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]
Expected behavior
Inconsistent grouping
I expect the first two items of the given sample (B-PER and I-PER) to be grouped, as they are contiguous tokens and correspond to a single entity span. The same applies to the third and fourth items. It seems the current code does not take the B and I prefixes into account.
expected output:
[ {'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Consuelo Araújo Noguera'},
{'entity_group': 'I-PER', 'score': 0.9998136162757874, 'word': 'Andrés Pastrana'},
{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Farc'}]
Lost tokens?
For the same input, passing grouped_entities=False generates the following output:
[
{'word': 'Cons', 'score': 0.9994944930076599, 'entity': 'B-PER', 'index': 1},
{'word': '##uelo', 'score': 0.802545428276062, 'entity': 'B-PER', 'index': 2},
{'word': 'Ara', 'score': 0.9993102550506592, 'entity': 'I-PER', 'index': 3},
{'word': '##új', 'score': 0.9993743896484375, 'entity': 'I-PER', 'index': 4},
{'word': '##o', 'score': 0.9992871880531311, 'entity': 'I-PER', 'index': 5},
{'word': 'No', 'score': 0.9993029236793518, 'entity': 'I-PER', 'index': 6},
{'word': '##guera', 'score': 0.9981776475906372, 'entity': 'I-PER', 'index': 7},
{'word': 'Andrés', 'score': 0.9998136162757874, 'entity': 'B-PER', 'index': 15},
{'word': 'Pas', 'score': 0.999740719795227, 'entity': 'I-PER', 'index': 16},
{'word': '##tran', 'score': 0.9997414350509644, 'entity': 'I-PER', 'index': 17},
{'word': '##a', 'score': 0.9996136426925659, 'entity': 'I-PER', 'index': 18},
{'word': 'Far', 'score': 0.9989739060401917, 'entity': 'B-ORG', 'index': 28},
{'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29}]
When using grouped_entities=True, the last word piece (##c) gets lost. It is not even returned as a separate group:
{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]
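The expected grouping can be illustrated with a small standalone sketch. This is not the code of the transformers pipeline; group_entities is a hypothetical helper, and two choices in it are assumptions: the B-/I- prefix is dropped from the group label, and the group score is the mean of its token scores. It takes the grouped_entities=False output shown above as input.

```python
from typing import Dict, List

# Token-level output copied from the grouped_entities=False run above.
RAW = [
    ("Cons", 0.9994944930076599, "B-PER", 1),
    ("##uelo", 0.802545428276062, "B-PER", 2),
    ("Ara", 0.9993102550506592, "I-PER", 3),
    ("##új", 0.9993743896484375, "I-PER", 4),
    ("##o", 0.9992871880531311, "I-PER", 5),
    ("No", 0.9993029236793518, "I-PER", 6),
    ("##guera", 0.9981776475906372, "I-PER", 7),
    ("Andrés", 0.9998136162757874, "B-PER", 15),
    ("Pas", 0.999740719795227, "I-PER", 16),
    ("##tran", 0.9997414350509644, "I-PER", 17),
    ("##a", 0.9996136426925659, "I-PER", 18),
    ("Far", 0.9989739060401917, "B-ORG", 28),
    ("##c", 0.7188423275947571, "I-ORG", 29),
]
tokens = [{"word": w, "score": s, "entity": e, "index": i} for w, s, e, i in RAW]


def group_entities(tokens: List[Dict]) -> List[Dict]:
    """Hypothetical helper: merge contiguous tokens of the same entity type."""
    groups: List[Dict] = []
    for tok in tokens:
        # "B-PER" and "I-PER" both reduce to the entity type "PER".
        entity_type = tok["entity"].split("-", 1)[-1]
        is_subword = tok["word"].startswith("##")
        prev = groups[-1] if groups else None
        adjacent = prev is not None and tok["index"] == prev["last_index"] + 1
        # A word piece always continues the previous token; a full word
        # continues it only if the type matches and it is not tagged B-.
        continues = adjacent and (
            is_subword
            or (prev["entity_group"] == entity_type
                and not tok["entity"].startswith("B-"))
        )
        if continues:
            prev["word"] += tok["word"][2:] if is_subword else " " + tok["word"]
            prev["scores"].append(tok["score"])
            prev["last_index"] = tok["index"]
        else:
            groups.append({
                "entity_group": entity_type,
                "word": tok["word"],
                "scores": [tok["score"]],
                "last_index": tok["index"],
            })
    # Assumption: report the mean token score for each group.
    return [
        {"entity_group": g["entity_group"],
         "score": sum(g["scores"]) / len(g["scores"]),
         "word": g["word"]}
        for g in groups
    ]


grouped = group_entities(tokens)
print([g["word"] for g in grouped])
# ['Consuelo Araújo Noguera', 'Andrés Pastrana', 'Farc']
```

With this logic the ##c piece stays attached to Far, and each person name comes out as a single group.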
Environment info
- transformers version: 2.11.0
- Platform: OSX
- Python version: 3.7
- PyTorch version (GPU?): 1.5.0
- Tensorflow version (GPU?):
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (9 by maintainers)
@dav009 Thanks for posting this issue!
1. B and I tokens are not yet considered. Will have to include this in a new PR.
2. Skipped tokens are those whose entity type is listed in the ignore_labels argument for TokenClassificationPipeline, which is set as ["O"] by default. If you don't want to skip any token, you can just set ignore_labels=[].
I'm happy to work on 1 within the next week or so since I've already been planning to apply this fix.
@Nighthyst I see, you're bringing up a different issue now. This is the case where the entity type of one of a word's word pieces is different from that of its other word pieces.
A fix I can apply here is to automatically group word pieces together regardless of entity type. I can apply this to a new PR after merging this existing one.
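A rough sketch of that idea, for illustration only: merge_word_pieces is a hypothetical helper, not the function that ended up in the library. It attaches every "##" piece to the word before it regardless of its label and, as an assumption, keeps the label of the first piece. The I-MISC label in the sample input is made up to show a mismatch.

```python
from typing import Dict, List


def merge_word_pieces(tokens: List[Dict]) -> List[Dict]:
    """Hypothetical helper: join '##' pieces onto the preceding word."""
    words: List[Dict] = []
    for tok in tokens:
        prev = words[-1] if words else None
        if (tok["word"].startswith("##") and prev is not None
                and tok["index"] == prev["end"] + 1):
            # Continuation piece: extend the word and ignore the piece's label.
            prev["word"] += tok["word"][2:]
            prev["end"] = tok["index"]
        else:
            # New word. Assumption: the first piece decides the entity label.
            words.append({"word": tok["word"], "entity": tok["entity"],
                          "index": tok["index"], "end": tok["index"]})
    return words


# Hypothetical input: the second piece carries a different (made-up) label.
pieces = [
    {"word": "Far", "entity": "B-ORG", "index": 28},
    {"word": "##c", "entity": "I-MISC", "index": 29},
]
merged = merge_word_pieces(pieces)
print(merged)
# [{'word': 'Farc', 'entity': 'B-ORG', 'index': 28, 'end': 29}]
```

Other label choices are possible, for example taking the label of the highest-scoring piece; the thread does not say which one the fix uses.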
@dav009 Understand now! Thanks for clarifying. Yes, it does seem to be related to the I and B issue. I think I can handle this in the same PR.
Hi everyone, this PR was recently merged to resolve the original issue #4987.
@sudharsan2020 Setting grouped_entities=True should work for your example under the new PR, since similar entities w/ different prefixes are now grouped (e.g. "I-PER" and "B-PER").
@dav009 Opened a PR (above) that should resolve this.
@enzoampil thanks for your prompt answer
In the given sample, the missing entity is not tagged as O: ##c is tagged as I-ORG with grouped_entities=False
{'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29}]
however it did not get included in the grouping results (grouped_entities=True).
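To make the point concrete, here is a minimal sketch of a label filter in the spirit of ignore_labels. drop_ignored is a hypothetical helper, not the pipeline's code, and the "las" token with label O is an assumed example that does not appear in the outputs above. With the default ["O"], the ##c piece survives the filter, so the filter alone does not explain the missing piece.

```python
from typing import Dict, Iterable, List


def drop_ignored(tokens: List[Dict], ignore_labels: Iterable[str] = ("O",)) -> List[Dict]:
    """Hypothetical helper: drop tokens whose label is in ignore_labels."""
    ignored = set(ignore_labels)
    return [t for t in tokens if t["entity"] not in ignored]


sample = [
    {"word": "las", "entity": "O", "index": 27},      # assumed token, for illustration
    {"word": "Far", "entity": "B-ORG", "index": 28},  # from the output above
    {"word": "##c", "entity": "I-ORG", "index": 29},  # from the output above
]

kept_default = drop_ignored(sample)                # default: skip "O" only
kept_all = drop_ignored(sample, ignore_labels=[])  # skip nothing

print([t["word"] for t in kept_default])
# ['Far', '##c']
print([t["word"] for t in kept_all])
# ['las', 'Far', '##c']
```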