onnxruntime: Wrong requested shape after a few thousand inference steps when using CUDA

Describe the bug After many (typically several thousand) successful inference steps with the same data, ONNX Runtime with CUDA suddenly fails with an error. The error message below suggests that a value set by an initializer has changed, which yields an invalid requested shape in a Reshape node:

[E:onnxruntime:, sequential_executor.cc:309 Execute] Non-zero status code returned while running Reshape node. Name:'Reshape_15' Status Message: /code/onnxruntime/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:43 onnxruntime::ReshapeHelper::ReshapeHelper(const onnxruntime::TensorShape&, std::vector<long int>&) gsl::narrow_cast<int64_t>(input_shape.Size()) == size was false. The input tensor cannot be reshaped to the requested shape. Input shape:{4}, requested shape:{4,2,2}

This problem does not occur with the CPU or TensorRT version of ONNX Runtime.
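
To cross-check this on a machine that only has the CUDA build installed, the session can be pinned to the CPU execution provider. The sketch below is one way to do that; set_providers availability and behavior may vary between onnxruntime releases:

    import onnxruntime

    # Sketch: force the CPU execution provider for comparison with the CUDA run.
    # set_providers() is assumed to be available in the installed onnxruntime version.
    session = onnxruntime.InferenceSession('issue.onnx')
    session.set_providers(['CPUExecutionProvider'])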

This is what the graph of the ONNX model looks like (I suspect %19 is the issue here):

graph torch-jit-export (
  %shape[INT64, 2]
) initializers (
  %19[INT64, 1]
) {
  %1 = Constant[value = <Scalar Tensor []>]()
  %2 = Gather[axis = 0](%shape, %1)
  %3 = Constant[value = <Scalar Tensor []>]()
  %4 = Gather[axis = 0](%shape, %3)
  %5 = Mul(%2, %4)
  %6 = Unsqueeze[axes = [0]](%5)
  %7 = Concat[axis = 0](%6)
  %8 = ConstantOfShape[value = <Tensor>](%7)
  %9 = Constant[value = <Scalar Tensor []>]()
  %10 = Gather[axis = 0](%shape, %9)
  %11 = Constant[value = <Scalar Tensor []>]()
  %12 = Gather[axis = 0](%shape, %11)
  %15 = Unsqueeze[axes = [0]](%10)
  %16 = Unsqueeze[axes = [0]](%12)
  %17 = Concat[axis = 0](%19, %15, %16)
  %output = Reshape(%8, %17)
  return %output
}

The input is always a one-dimensional array with value [2,2].

The ONNX model was created from the following PyTorch code (using PyTorch 1.6.0):

    def forward(self, shape):
        r = torch.zeros(shape[0] * shape[1])
        return r.view(1, shape[0], shape[1])
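
For completeness, the export presumably looked roughly like the sketch below. The wrapper class name and the opset version are assumptions; the input/output names match what the repro script feeds and fetches:

    import torch

    class IssueModel(torch.nn.Module):  # hypothetical wrapper around the forward() above
        def forward(self, shape):
            r = torch.zeros(shape[0] * shape[1])
            return r.view(1, shape[0], shape[1])

    torch.onnx.export(
        IssueModel(),
        (torch.tensor([2, 2], dtype=torch.int64),),  # example input of shape [2]
        'issue.onnx',
        input_names=['shape'],
        output_names=['output'],
        opset_version=11,  # assumption; any opset where Unsqueeze takes axes as an attribute fits the graph above
    )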

System information

  • OS Platform and Distribution: Container created from Dockerfile.cuda (problem also occurs in containers based on Ubuntu 18.04)
  • ONNX Runtime installed from (source or binary): from source (GitHub master branch)
  • ONNX Runtime version: 1.4.0
  • Python version: 3.7.0
  • GCC/Compiler version: 7.4.0
  • CUDA/cuDNN version: CUDA 10.1 / cuDNN 7
  • GPU model and memory: Nvidia GeForce GTX 1060 (problem also occurs on 1080Ti with 11GB)

To Reproduce Run the following code in an environment (or container) with the CUDA build of ONNX Runtime (not TensorRT), from a directory that contains the file issue.onnx attached to this issue:

#!/usr/bin/env python3
"""Minimal example to reproduce issue with with ONNX Runtime on GPU"""

import numpy as np
import onnxruntime


def create_inputs():
    """create input for the model as a numpy array"""
    return np.array([2, 2])


def run_onnx(file_name):
    """run ONNX model until it fails (when run on GPU)

    on my computer this tends to fail within the first 100000 iterations
    """
    options = onnxruntime.SessionOptions()
    session = onnxruntime.InferenceSession(file_name, options)

    shape = create_inputs()
    result = None  # keep the last successful result if the loop aborts on the first iteration
    for iteration in range(int(1e6)):
        try:
            result = session.run(output_names=['output'],
                                 input_feed={
                                     'shape': shape,
                                 })
        except Exception as error:
            print(f"\nerror occurred during iteration {iteration}: {error}")
            break
    return result


def main():
    """try to run inference"""
    filename = 'issue.onnx'
    result = run_onnx(filename)
    if result is not None:
        print(f"result: {result[0].shape}")


if __name__ == '__main__':
    main()

After a few seconds and a few thousand (sometimes tens of thousands of) iterations, the loop in run_onnx() should abort with the error message above. While I have managed to reproduce this issue on multiple systems in various configurations, the number of iterations before the error occurs has varied dramatically. Any sort of additional GPU load (from watching YouTube videos to multiplying large matrices) seems to make the issue more reproducible.
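
For example, a crude way to generate such background GPU load is to run something like the following sketch in a separate process while the repro script is running (a CUDA-enabled PyTorch install is assumed; this is just one way to keep the GPU busy):

    import torch

    # Background load generator: run alongside the repro script, interrupt to stop.
    a = torch.randn(4096, 4096, device='cuda')
    while True:
        torch.matmul(a, a.t())  # repeatedly queue large matrix multiplications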

Expected behavior Inference should work deterministically.

Additional context The ONNX/PyTorch model provided here may not be very useful in itself. However, the same issue also appears in more complex models that contain similar steps. Larger models seem to require fewer iterations until the error occurs.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 6
  • Comments: 16 (7 by maintainers)

Most upvoted comments

Sorry for being late to the discussion, and thanks @maherzog for this repro. As @HectorSVC pointed out, the bug is indeed caused by memory reuse. The issue is that the copy and compute streams in CUDA have a race condition in the BFC arena. The BFC arena is an arena allocator on top of cudaMalloc/cudaFree that reduces the cost of syncing CPU and GPU on every alloc/free.

To let the CPU and GPU run asynchronously, buffers freed on the CPU side may still be in use on the GPU. This is fine if there is only one stream, because then the execution order on the CPU and GPU is consistent. For example, with two kernels A and B, when the CPU runs in the order allocA->computeA->freeA->allocB->computeB->freeB, computeA and computeB cannot race even if A and B share the same memory, since both run on the same GPU compute stream. However, if the CPU order is allocA->copyA->freeA->allocB->computeB->freeB and the copy and compute happen on different GPU streams, the GPU may end up executing copyA after computeB.

For this particular case, the CPU-side allocation and execution plan is:

Allocation Plan:
(ort_value_idx) output_name : <allocation plan>
(18) 17 : Allocate, OrtMemoryInfo:[name:CudaPinned id:0 OrtMemType:-1 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:1 DeviceId:0]], use fence when async
(17) 17_CUDAExecutionProvider : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]], use fence when async
(4) 4 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(12) 11 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(19) output : AllocateOutput, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(3) 3 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(2) 2 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(10) 9 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(0) shape : PreExisting, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(5) 5 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(9) 8 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(6) 6 : Reuse 5, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(1) 1 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(13) 12 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(7) 7_CUDAExecutionProvider : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]], use fence when async
(8) 7 : Allocate, OrtMemoryInfo:[name:CudaPinned id:0 OrtMemType:-1 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:1 DeviceId:0]], use fence when async
(11) 10 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(16) 19 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(14) 15 : Reuse 11, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(15) 16 : Reuse 13, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]

Execution Plan:
[0] Gather (Gather_11)
[1] Unsqueeze (Unsqueeze_13)
[2] Gather (Gather_9)
[3] Unsqueeze (Unsqueeze_12)
[4] Concat (Concat_14)
Free ml-values: (11) 10, (13) 12
[5] MemcpyToHost (Memcpy)
Free ml-values: (17) 17_CUDAExecutionProvider
[6] Gather (Gather_3)
[7] Gather (Gather_1)
[8] Mul (Mul_4)
Free ml-values: (2) 2, (4) 4
[9] Unsqueeze (Unsqueeze_5)
[10] Concat (Concat_6)
Free ml-values: (5) 5
[11] MemcpyToHost (Memcpy_token_0)
Free ml-values: (7) 7_CUDAExecutionProvider
[12] ConstantOfShape (ConstantOfShape_7)
Free ml-values: (8) 7
[13] Reshape (Reshape_15)
Free ml-values: (9) 8, (18) 17

Here in step [5], the input buffer to MemcpyToHost is freed back to the BFC arena and then reassigned by the static allocation plan to the output of step [8]. Because the compute and copy streams run concurrently on the GPU, the GPU execution order may not match the CPU's plan, causing that memory to be overwritten.

In this repro, the problem goes away if we add one line of code to disable the memory pattern optimization:

    options = onnxruntime.SessionOptions()
    options.enable_mem_pattern = False
    session = onnxruntime.InferenceSession(file_name, options)

However, disabling the memory pattern is not a full solution to the stream race in the BFC arena. As a short-term fix, we might force the copy streams to be the same as the compute stream, by changing here to:

  // nullptr makes the copy streams fall back to the same default stream used for compute,
  // so copies can no longer race against compute kernels in the BFC arena.
  streams_[kCudaStreamCopyIn] = nullptr;
  streams_[kCudaStreamCopyOut] = nullptr;

We would also need to remove the corresponding cudaStreamDestroy calls in the dtor. This approach might cause some performance degradation for certain models, though. A thorough fix to make the BFC arena support multiple streams is being looked at; once that is in, we can go back to having concurrent copy and compute streams.

There's nothing wrong with the CUDA Concat implementation; this should be related to memory re-use. Still debugging.

Thank you. I can repro the bug.