spaCy: IndexError: list index out of range while training parser

Training pipeline: ['parser']
Starting with blank model 'ko'
Counting training words (limit=0)
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ksjae/.local/lib/python3.7/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/ksjae/.local/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/ksjae/.local/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/ksjae/.local/lib/python3.7/site-packages/spacy/cli/train.py", line 213, in train
    optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
  File "/home/ksjae/.local/lib/python3.7/site-packages/spacy/language.py", line 583, in begin_training
    **kwargs
  File "nn_parser.pyx", line 576, in spacy.syntax.nn_parser.Parser.begin_training
  File "arc_eager.pyx", line 346, in spacy.syntax.arc_eager.ArcEager.get_actions
  File "nonproj.pyx", line 123, in spacy.syntax.nonproj.projectivize
  File "nonproj.pyx", line 172, in spacy.syntax.nonproj._get_smallest_nonproj_arc
  File "nonproj.pyx", line 58, in spacy.syntax.nonproj.is_nonproj_arc
  File "nonproj.pyx", line 26, in ancestors
IndexError: list index out of range

How to reproduce the behaviour

Training command, as-is from the documentation:

python3 -m spacy train ko model KNLI-spacy.json KNLI-spacy-dev.json -p parser

Use these JSON files (EDIT: these files are faulty but have been left in place; use Corpus.zip for the newest ones):
https://1drv.ms/u/s!Aq0-1ykl7mZBqWCbqo6cq4X1amma?e=1eBExo for KNLI-spacy.json
https://1drv.ms/u/s!Aq0-1ykl7mZBqWGLptXC0Ba5nGFK?e=suX2RJ for KNLI-spacy-dev.json

Your Environment

spaCy version: 2.1.9
Platform: Linux-4.4.0-178-generic-x86_64-with-debian-stretch-sid
Python version: 3.7.7

About this issue

State: closed
Created 4 years ago
Comments: 20 (2 by maintainers)

Most upvoted comments

Hi, the problem is that the heads haven’t been converted correctly for spaCy’s training format. The heads should be relative to the current token, not absolute IDs. The root should have head 0 and all other tokens should have heads relative to their position, so a head of -2 would mean the head is two words to the left, 1 would mean one word to the right, etc.

The data loader should fail with a more useful error in this case, though. I’ll take a look to see how this could be improved.

adrianeboyd on Sep 7, 2020
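
To illustrate the relative-head convention described in the comment above, here is a minimal sketch of the conversion and a sanity check. It assumes the source treebank stores heads as 1-based absolute token IDs, with 0 marking the root (the CoNLL-U convention); the actual layout of the KNLI files may differ. The function names are hypothetical and are not part of spaCy's API.

```python
def absolute_to_relative_heads(heads):
    """Convert absolute head IDs to offsets relative to each token.

    Assumption: `heads` holds 1-based absolute IDs for one sentence,
    with 0 marking the root (CoNLL-U style).
    """
    n = len(heads)
    relative = []
    for index, head in enumerate(heads):
        token_id = index + 1  # 1-based ID of the current token
        if head == 0:
            # The root gets a relative head of 0.
            relative.append(0)
            continue
        if not 1 <= head <= n:
            raise ValueError(
                f"token {token_id}: head {head} is outside the sentence (1..{n})"
            )
        if head == token_id:
            raise ValueError(f"token {token_id} is its own head but not marked as root")
        # Negative offset: head is to the left; positive: to the right.
        relative.append(head - token_id)
    return relative


def find_out_of_range_heads(relative_heads):
    """Return (token_index, offset) pairs whose head falls outside the sentence.

    Token indices here are 0-based. Unconverted absolute IDs typically
    show up in this list because index + offset runs past the last token.
    """
    n = len(relative_heads)
    return [
        (index, offset)
        for index, offset in enumerate(relative_heads)
        if not 0 <= index + offset < n
    ]


# "She ate the apple": She -> ate, ate = root, the -> apple, apple -> ate
absolute = [2, 0, 4, 2]
print(absolute_to_relative_heads(absolute))  # [1, 0, 1, -2]

# Feeding the absolute IDs in unconverted flags tokens pointing past the end.
print(find_out_of_range_heads(absolute))  # [(2, 4), (3, 2)]
```

Running a check like `find_out_of_range_heads` over every sentence before training is one way to catch unconverted heads early, rather than hitting an IndexError deep inside the parser.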