apex: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx` after the first couple of epochs
I am training a version of UNet with joint classification and semantic segmentation using the O1 opt level. The training crashes after I explicitly cast box_coord_tensor in the roi_pool function:
rois = roi_pool(
input=classification_feature_map_tensor, # FLOAT16
boxes=box_coord_tensor.half(), # FLOAT32 IF NOT CASTED EXPLICITLY
output_size=roi_size,
spatial_scale=1,
)
The thing is, classification_feature_map_tensor comes out as float16 since it is handled by amp, while box_coord_tensor comes from the input batch, which is float32. However, roi_pool requires its tensors to have the same precision and throws:
RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type Variable[CUDAHalfType] does not equal Variable[CUDAFloatType] (while checking arguments for ROIPool_forward_cuda) (checkSameType at /pytorch/aten/src/ATen/TensorUtils.cpp:140)
But if I cast box_coord_tensor to float16, CUDA throws the memory access error below.
File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
next(self.gen)
File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
scale_override=grads_have_scale/out_scale)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
self.unscale_python(model_grads, master_grads, scale)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
self.dynamic)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered
Is there anything I could try? So far, all of my attempts result in the error above.
I get a similar error in the forward pass. After some batches, it gives one of the following errors. Sometimes it is error 1, sometimes error 2 or error 3. Sometimes the error is thrown after processing the 1st batch, and sometimes at the 2nd, 9th, 13th, 17th, or 21st batch.
Error 1
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
Error 2
RuntimeError: CUDA error: device-side assert triggered
Error 3
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCBlas.cu:258
Maybe this issue discussion can bring more perspective to it.
Got the same traceback as above (RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)) on pytorch 1.4.x and 1.5.x. With the latest nightly build (1.7.0.dev20200709), CUDA V10.1.243, apex master (https://github.com/NVIDIA/apex/commit/1ff54b8fed441c39dac181091b44fecdca31a403) and cudnn 7.6.3_0, it seems to work fine (no overflows or segfaults) using the apex API on CycleGAN (https://github.com/seovchinnikov/pytorch-CycleGAN-and-pix2pix).
When in doubt, always prefer casting to FP32. In this case (I think) you’re calling into a custom torchvision op that may not have an FP16 implementation. Cast both inputs to FP32 instead of FP16 and it should work.
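Here is a minimal sketch of that workaround, reusing the variable names from the snippet above (roi_size and both tensors are assumed to be defined as in the original code):
from torchvision.ops import roi_pool

# Cast both inputs to FP32 so the dtypes match and no FP16 kernel is needed.
rois = roi_pool(
    input=classification_feature_map_tensor.float(),  # FP16 -> FP32
    boxes=box_coord_tensor.float(),                    # already FP32; .float() is a no-op
    output_size=roi_size,
    spatial_scale=1,
)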
We had a similar issue with another cuBLAS API (cublasSgemm()), and @anjani-dhrangadhariya also experienced this.
The CUDA Toolkit 11.1 release notes mention an issue fixed in cuBLAS:
We had cublasSgemm() failing with CUBLAS_STATUS_EXECUTION_FAILED when built with CUDA 10.0 and running on an Ampere GPU (3060 Ti). It ran fine on older GPUs (Pascal, Turing), and it ran successfully on Ampere once we built it with CUDA 11.2. Basically: try building against the newest CUDA Toolkit available and see if it helps.
P.S. This was with another framework/project, but it should still be relevant. P.P.S. Related issue in PyTorch: pytorch/pytorch#29795.
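As a quick sanity check (a small sketch using standard torch introspection calls, not specific to this issue), you can print which CUDA toolkit your PyTorch was built against and what compute capability the GPU reports; a mismatch such as a CUDA 10.x build on an Ampere (sm_86) card would fit the failure pattern described above:
import torch

print("PyTorch:", torch.__version__)                    # framework version
print("Built with CUDA:", torch.version.cuda)           # toolkit PyTorch was compiled against
print("cuDNN:", torch.backends.cudnn.version())         # cuDNN build version
print("GPU:", torch.cuda.get_device_name(0))            # e.g. a 3060 Ti would be Ampere
print("Compute capability:", torch.cuda.get_device_capability(0))  # e.g. (8, 6) for Ampere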
@mcarilli I converted my training loop to use torch.cuda.amp instead of apex. It runs… but there doesn't seem to be any indication that it's actually using 16-bit floats. Memory usage is identical to non-fp16, as is the speed. Do you know if there is a way to verify that amp is working with fp16 correctly?
Here’s my modified code from pix2pixHD:
Update: using DataParallel, I need to wrap my module's forward in @autocast. It works now… for a while, and then I start getting nan losses 😦.
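For reference, here is a minimal sketch of the two points above: decorating forward with @autocast so each DataParallel replica runs under autocast, and printing a dtype inside the region to confirm fp16 is actually being used. The model and tensor shapes are made up for illustration:
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

class ToyModel(nn.Module):  # hypothetical model, stands in for the real network
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 64)

    @autocast()  # ensures autocast is active inside each DataParallel worker thread
    def forward(self, x):
        out = self.linear(x)
        print(out.dtype)  # should print torch.float16 on CUDA if autocast is working
        return out

model = nn.DataParallel(ToyModel().cuda())
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()

x = torch.randn(8, 128, device="cuda")
with autocast():
    loss = model(x).float().mean()

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()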
The problem is solved now. How? It was actually caused by the BioBERT model I was using. Using BERT in PyTorch works smoothly; the problem seems to come from BioBERT.