apex: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx` after the first couple of epochs
I am training a version of UNet with joint classification and semantic segmentation using the O1 opt level. The training crashes after I explicitly cast box_coord_tensor in the roi_pool function:
rois = roi_pool(
input=classification_feature_map_tensor, # FLOAT16
boxes=box_coord_tensor.half(), # FLOAT32 IF NOT CASTED EXPLICITLY
output_size=roi_size,
spatial_scale=1,
)
The thing is, classification_feature_map_tensor comes out as float16 since it is handled by amp, while box_coord_tensor comes from the input batch, which is float32. However, roi_pool requires its tensors to have the same precision and throws:
RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type Variable[CUDAHalfType] does not equal Variable[CUDAFloatType] (while checking arguments for ROIPool_forward_cuda) (checkSameType at /pytorch/aten/src/ATen/TensorUtils.cpp:140)
But if I cast box_coord_tensor to float16, CUDA throws the memory access error below.
File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
next(self.gen)
File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
scale_override=grads_have_scale/out_scale)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
self.unscale_python(model_grads, master_grads, scale)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
self.dynamic)
File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered
Is there anything I could try? So far, all of my attempts result in the error above.
I get a similar error in the forward pass. After some batches, it gives one of the following errors. Sometimes it is error 1, sometimes error 2 or error 3. Sometimes the error is thrown after processing the 1st batch, and sometimes at the 2nd, 9th, 13th, 17th, or 21st batch.
Error 1
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
Error 2
RuntimeError: CUDA error: device-side assert triggered
Error 3
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCBlas.cu:258
Maybe this issue discussion can bring more perspective to it.
Got the same traceback as above (RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)) on pytorch 1.4.x and 1.5.x. With the latest nightly build (1.7.0.dev20200709), CUDA V10.1.243, apex master (https://github.com/NVIDIA/apex/commit/1ff54b8fed441c39dac181091b44fecdca31a403) and cudnn 7.6.3_0, it seems to work fine (no overflows or segfaults) using the apex API on CycleGAN (https://github.com/seovchinnikov/pytorch-CycleGAN-and-pix2pix).
When in doubt, always prefer casting to FP32. In this case (I think) you’re calling into a custom torchvision op that may not have an FP16 implementation. Cast both inputs to FP32 instead of FP16 and it should work.
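Here is a minimal sketch of that workaround, reusing the variable names from the snippet above (roi_size and both tensors are assumed to be defined as in the original code):
from torchvision.ops import roi_pool

# Cast both inputs to FP32 so the dtypes match and no FP16 kernel is needed.
rois = roi_pool(
    input=classification_feature_map_tensor.float(),  # FP16 -> FP32
    boxes=box_coord_tensor.float(),                    # already FP32; .float() is a no-op
    output_size=roi_size,
    spatial_scale=1,
)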
We had a similar issue with another cuBLAS API (cublasSgemm()), and @anjani-dhrangadhariya also experienced this.
The CUDA Toolkit 11.1 release notes mention an issue fixed in cuBLAS:
We had cublasSgemm() failing with CUBLAS_STATUS_EXECUTION_FAILED when built with CUDA 10.0 and running on an Ampere GPU (3060 Ti). It ran fine on older GPUs (Pascal, Turing), and it ran successfully on Ampere once we built it with CUDA 11.2. Basically: try building against the newest CUDA Toolkit available and see if it helps.
P.S. This was with another framework/project, but it should still be relevant. P.P.S. Related issue in PyTorch: pytorch/pytorch#29795.
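As a quick sanity check (a small sketch using standard torch introspection calls, not specific to this issue), you can print which CUDA toolkit your PyTorch was built against and what compute capability the GPU reports; a mismatch such as a CUDA 10.x build on an Ampere (sm_86) card would fit the failure pattern described above:
import torch

print("PyTorch:", torch.__version__)                    # framework version
print("Built with CUDA:", torch.version.cuda)           # toolkit PyTorch was compiled against
print("cuDNN:", torch.backends.cudnn.version())         # cuDNN build version
print("GPU:", torch.cuda.get_device_name(0))            # e.g. a 3060 Ti would be Ampere
print("Compute capability:", torch.cuda.get_device_capability(0))  # e.g. (8, 6) for Ampere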
@mcarilli I converted my training loop to use torch.cuda.amp instead of apex. It runs… but there doesn't seem to be any indication that it's actually using 16-bit floats. Memory usage is identical to non-fp16, as is the speed. Do you know if there is a way to verify that amp is working with fp16 correctly?
Here’s my modified code from pix2pixHD:
Update: using DataParallel, I need to wrap my module's forward in @autocast. It works now… for a while, and then I start getting nan losses 😦.
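For reference, here is a minimal sketch of the two points above: decorating forward with @autocast so each DataParallel replica runs under autocast, and printing a dtype inside the region to confirm fp16 is actually being used. The model and tensor shapes are made up for illustration:
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

class ToyModel(nn.Module):  # hypothetical model, stands in for the real network
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 64)

    @autocast()  # ensures autocast is active inside each DataParallel worker thread
    def forward(self, x):
        out = self.linear(x)
        print(out.dtype)  # should print torch.float16 on CUDA if autocast is working
        return out

model = nn.DataParallel(ToyModel().cuda())
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()

x = torch.randn(8, 128, device="cuda")
with autocast():
    loss = model(x).float().mean()

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()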
The problem is solved now. How? It was actually caused by the BioBERT model I was using. Using BERT in PyTorch works smoothly; the problem seems to come from BioBERT.