transfer-learning-conv-ai: RuntimeError: cublas runtime error : resource allocation failed at THCGeneral.cpp:250
I was trying to run it on a single GPU first (local_rank = -1), and it failed with the error below. Any ideas on resolving this issue would be greatly appreciated!

GPU details: Tesla K80 (8 GPUs), NVIDIA-SMI 410.79, Driver Version: 410.79, CUDA Version: 10.0
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250.
Traceback (most recent call last):
File "train.py", line 358, in <module>
train()
File "train.py", line 349, in train
trainer.run(train_loader, max_epochs=args.n_epochs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
self._handle_exception(e)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
raise e
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
hours, mins, secs = self._run_once_on_dataset()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
self._handle_exception(e)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
raise e
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
self.state.output = self._process_function(self, batch)
File "train.py", line 275, in update
lm_loss, mc_loss = model(*batch)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 808, in forward
hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 643, in forward
hidden_states = block(hidden_states)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 334, in forward
a = self.attn(x)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 297, in forward
x = self.c_attn(x)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 248, in forward
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250
I also tried multi-GPU with python -m torch.distributed.launch --nproc_per_node=8 train.py <my cmdline options>, and it threw the same error.
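As an aside, a common first step for opaque asynchronous CUDA errors like this is to force synchronous kernel launches, so the real failing operation (often a device-side assert rather than cuBLAS itself) is reported at its actual call site. A minimal sketch, assuming the environment variable is set before the first CUDA call:

```python
# Sketch: force synchronous CUDA kernel launches so the failing op is
# reported where it happens instead of surfacing later as a cuBLAS error.
# Must run before the first CUDA call, e.g. at the very top of train.py.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```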
@thomwolf I think I have figured this out. Let me know what you think.
https://github.com/huggingface/transfer-learning-conv-ai/blob/b7f295f840f719056287504554083ec3f2688651/train.py#L55
If len(instance["input_ids"]) above is greater than 512 (the default value of n_positions in OpenAIGPTConfig in modeling_openai.py in pytorch-pretrained-bert), then the position_ids created at the link below will contain values much larger than 512:

https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L620

Consequently, this line will fail:

https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L633
I think you need to add truncation logic for the sequence above, prior to doing instance["input_ids"] = list(chain(*sequence)), so that the length is always less than or equal to 512.
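Something along these lines could work (a minimal sketch: truncate_sequence is a hypothetical helper, MAX_LEN and the drop-oldest-history policy are assumptions, and sequence is assumed to be the list of token-id segments that gets flattened above):

```python
MAX_LEN = 512  # assumed cap; in practice read n_positions from the model config

def truncate_sequence(sequence, max_len=MAX_LEN):
    """Hypothetical helper: drop the oldest middle segments of `sequence`
    (a list of token-id lists) until the flattened length fits max_len,
    keeping the first segment (persona) and the last (reply) intact."""
    while sum(len(s) for s in sequence) > max_len and len(sequence) > 2:
        sequence.pop(1)  # discard the oldest history utterance first
    return sequence

# in build_input_from_segments, before flattening:
# sequence = truncate_sequence(sequence)
# instance["input_ids"] = list(chain(*sequence))
```

Dropping whole utterances keeps the speaker-token boundaries intact; if the persona and reply alone still exceed the cap, they would need shortening as well.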
@g-karthik I think the device-side assert that gets triggered is due to the position embedding, which is limited to 512 positions, while the sequence-length dimension of your inputs in the training loader is greater than that.
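One quick way to confirm this diagnosis before training (a hypothetical sanity check, assuming the loader yields input_ids as the first tensor in each batch and the model exposes config.n_positions):

```python
# Hypothetical pre-flight check: count batches whose sequence length exceeds
# the position-embedding table, which would trigger the device-side assert.
n_positions = model.config.n_positions  # 512 for the default OpenAI GPT config
too_long = sum(1 for batch in train_loader if batch[0].size(-1) > n_positions)
print(f"{too_long} batches have input_ids longer than n_positions={n_positions}")
```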