transformers: [run_summarization.py] wrong dataset leads to CUDA error:s
Feeding --dataset_name cnn_dailymail to --model_name_or_path google/pegasus-xsum leads to lots of errors from pytorch - perhaps there is a way to detect that the dataset is inappropriate and give a nice relevant assert instead?
You’d think that --dataset_name cnn_dailymail and --dataset_name xsum should be interchangeable…
python examples/seq2seq/run_summarization.py --model_name_or_path google/pegasus-xsum --do_train \
--do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" \
--output_dir /tmp/tst-summarization --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
--overwrite_output_dir --predict_with_generate
[....]
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(crashes w/o traceback here)
If I run it on one gpu I get:
[...]
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [138,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 763, in forward
layer_outputs = encoder_layer(
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 323, in forward
hidden_states, attn_weights, _ = self.self_attn(
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 190, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/functional.py", line 1860, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Thanks.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 24 (18 by maintainers)
Ok so the plan is to:
resize_position_embeddingstoPreTrainedModeljust like we are doing it for the word embeddingsresize_position_embeddingsshould probably log or warn depending on whether it’s sinus position embeddings or learned onesconfig.max_position_embeddings=> Happy to open a PR for this one, but would be great to first hear @LysandreJik and @sgugger’s opinion on it as well
@stas00, I checked and the problem simply seems to be that
max_source_lengthis too high. It’s set to 1024 by default even though Pegasus can only handle512. So, the following command should just run fine:By the way errors like those
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [174,0,0], thread: [95,0,0] AssertionsrcIndex < srcSelectDimSizefailedare in my experience very often out of index errors and it helps to run the same code on CPU which then gives a better error messageWe should also overwrite the
config.max_position_embeddingswhen doing soOk - good arguments! IMO we should only allow this resizing though for models that use Sinusoidal position embeddings a.k.a. position embeddings that have set
.gradto False.In terms of implementation, I’d suggest to add a general
resize_position_embeddings(self, max_posituon_embeddings)toPreTrainedModelthat throws a NotImplementedError and is then overwritten in Pegasus@stas00 , @patrickvonplaten , Pegasus actually uses SinusoidalPositionalEmbedding, so there is no seq length limit. We should resize the embedding if cur len is greater than the default len. That’s what we do in FSMT and M2M100