transformers: [run_summarization.py] wrong dataset leads to CUDA errors

Feeding --dataset_name cnn_dailymail to --model_name_or_path google/pegasus-xsum leads to a flood of errors from PyTorch. Perhaps there is a way to detect that the dataset is inappropriate for the model and raise a clear, relevant assert instead?
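For example, a pre-flight check along these lines in the example script would turn the crash into a readable error (a hypothetical sketch, not the script's current behavior; the helper name and how it is wired in are assumptions):

def check_source_length(model, max_source_length: int) -> None:
    # Hypothetical check: fail fast instead of letting an over-long input
    # trip a CUDA device-side assert deep inside the model.
    max_pos = getattr(model.config, "max_position_embeddings", None)
    if max_pos is not None and max_source_length > max_pos:
        raise ValueError(
            f"--max_source_length ({max_source_length}) exceeds the model's "
            f"max_position_embeddings ({max_pos}); lower --max_source_length "
            f"or resize the model's position embeddings."
        )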

You’d think that --dataset_name cnn_dailymail and --dataset_name xsum should be interchangeable…

python examples/seq2seq/run_summarization.py --model_name_or_path google/pegasus-xsum --do_train \
--do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0"  \
--output_dir /tmp/tst-summarization --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
--overwrite_output_dir --predict_with_generate
[....]
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(crashes w/o traceback here)

If I run it on a single GPU I get:

[...]
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [138,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 763, in forward
    layer_outputs = encoder_layer(
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 323, in forward
    hidden_states, attn_weights, _ = self.self_attn(
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 190, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/functional.py", line 1860, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Thanks.

@sgugger, @patil-suraj

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 24 (18 by maintainers)

Most upvoted comments

Ok so the plan is to:

  1. Add a resize_position_embeddings to PreTrainedModel, just like we do for the word embeddings
  2. resize_position_embeddings should probably log or warn depending on whether the position embeddings are sinusoidal or learned
  3. The function should overwrite config.max_position_embeddings

=> Happy to open a PR for this one, but would be great to first hear @LysandreJik and @sgugger’s opinion on it as well

@stas00, I checked and the problem simply seems to be that max_source_length is too high. It’s set to 1024 by default even though Pegasus can only handle 512. So, the following command should just run fine:

python examples/pytorch/summarization/run_summarization.py --model_name_or_path google/pegasus-xsum --do_train \
--do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0"  \
--output_dir /tmp/tst-summarization --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
--overwrite_output_dir --predict_with_generate --max_source_length 512

By the way, errors like /workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [174,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed are, in my experience, very often out-of-range index errors, and it helps to run the same code on CPU, which then gives a much better error message.
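A minimal standalone repro of that pattern (toy shapes, not the Pegasus code) shows the difference: the same out-of-range embedding lookup that triggers the opaque device-side assert on CUDA raises a plain IndexError on CPU.

import torch

emb = torch.nn.Embedding(512, 64)   # position table with 512 slots
ids = torch.arange(1024)            # positions 512..1023 do not exist in the table

emb(ids)                            # CPU: IndexError: index out of range in self
# emb.cuda()(ids.cuda())            # CUDA: Assertion `srcIndex < srcSelectDimSize` failed,
#                                   # and a later kernel then fails with an unrelated CUDA error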

We should also overwrite the config.max_position_embeddings when doing so

OK, good arguments! IMO we should only allow this resizing for models that use sinusoidal position embeddings, i.e. position embeddings that have requires_grad set to False.

In terms of implementation, I’d suggest adding a general resize_position_embeddings(self, max_position_embeddings) to PreTrainedModel that throws a NotImplementedError and is then overridden in Pegasus.
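Roughly (a sketch of the suggestion above, not a finished API):

import torch.nn as nn

class PreTrainedModel(nn.Module):
    # Generic hook; models that support resizing override this method.
    def resize_position_embeddings(self, new_num_position_embeddings: int):
        raise NotImplementedError(
            f"resize_position_embeddings is not implemented for {self.__class__.__name__}"
        )

The Pegasus override would then rebuild its sinusoidal tables at the new length and set config.max_position_embeddings accordingly.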

@stas00, @patrickvonplaten, Pegasus actually uses SinusoidalPositionalEmbedding, so there is no sequence-length limit. We should resize the embedding if the current length is greater than the default length. That’s what we do in FSMT and M2M100.
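For reference, a sketch of the standard sinusoidal formula (not necessarily Pegasus’s exact implementation) shows why resizing is cheap and lossless: the table is a pure function of the position index, so extending it just recomputes more rows.

import math
import torch

def sinusoidal_position_embeddings(num_positions: int, dim: int) -> torch.Tensor:
    # Fixed (non-learned) table: row i depends only on the position index i.
    position = torch.arange(num_positions, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim))
    table = torch.zeros(num_positions, dim)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table

old = sinusoidal_position_embeddings(512, 1024)
new = sinusoidal_position_embeddings(1024, 1024)
assert torch.allclose(old, new[:512])   # resizing keeps the existing rows intact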