onnxruntime: SafeIntOnOverflow() Integer overflow error when running inference in an ASGI server

Describe the bug In a Docker container running uvicorn + FastAPI with an ORT inference session (a single model, a single uvicorn worker, handling at most 3 requests at a time), we regularly see errors from the ORT session, always of this form:
RUNTIME_EXCEPTION : Non-zero status code returned while running InstanceNormalization node. Name:'InstanceNormalization_15' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow

The failing node is not always the same; sometimes it is one of the many Conv nodes in the network.
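For reference, here is a minimal sketch of the setup described above; the model path, endpoint, and input handling are placeholders, not taken from the actual service:

```python
# Sketch of the reported setup: one FastAPI app, one uvicorn worker,
# one shared ORT session. Model path and input name are hypothetical.
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI

app = FastAPI()

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

@app.post("/predict")
def predict(payload: list[list[float]]):
    # Sync endpoints run in FastAPI's threadpool, so with up to 3
    # in-flight requests several threads may call run() on this session.
    x = np.asarray(payload, dtype=np.float32)
    outputs = session.run(None, {input_name: x})
    return {"result": outputs[0].tolist()}
```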

Urgency Prevents us from going to production

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 20.04
  • ONNX Runtime installed from (source or binary): pip
  • ONNX Runtime version: 1.11.1
  • Python version: 3.9
  • CUDA/cuDNN version: 11.4/8
  • GPU model and memory: Tesla K80, 12GB

Expected behavior No errors thrown. I was under the impression the model inference session would wait for a given prediction to finish before accepting another.

Additional context I’m wondering whether this arises because too many inference runs happen concurrently. Based on my reading of the documentation, I assumed the InferenceSession would queue up runs, but I may be mistaken. I’d appreciate clarification on that, if possible.
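As far as I understand, InferenceSession does not queue run() calls itself; concurrent calls simply execute concurrently. To rule out concurrency as the cause, one option is to serialize the calls explicitly with a lock. A rough sketch (the helper name is mine, not an ORT API):

```python
import threading

# Shared guard so at most one session.run() is in flight at any time.
_run_lock = threading.Lock()

def run_serialized(session, feeds):
    """Wrap session.run() so concurrent requests take turns."""
    with _run_lock:
        return session.run(None, feeds)
```

If the overflow disappears with the lock in place, that would point at concurrent runs; if it persists, the problem lies elsewhere.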

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 21 (4 by maintainers)

Most upvoted comments

I’m facing a similar issue when multiple onnxruntime processes (2 is usually enough to reproduce) run in the same Docker container. Curiously, this does not happen with all models, only some. Is there some way to debug this reliably? Could it be connected to the way the model is converted to ONNX, or to how it is being optimised?
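One starting point for narrowing this down is onnxruntime’s own logging knobs; this is not a fix, but the verbose logs can show which session and node trip the overflow. A sketch (model path is a placeholder):

```python
import onnxruntime as ort

# 0 = VERBOSE, 1 = INFO, 2 = WARNING, 3 = ERROR, 4 = FATAL
ort.set_default_logger_severity(0)

so = ort.SessionOptions()
so.log_severity_level = 0    # per-session severity
so.log_verbosity_level = 1   # extra detail when severity is VERBOSE

session = ort.InferenceSession("model.onnx", sess_options=so)
```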

@zhanghuanrong Faced the same issue on v1.14.1 (upgraded from v1.10.0) with ~30 models loaded (some are constantly loaded and unloaded to/from the GPU, other models persist on the GPU):

onnx runtime error 6: Non-zero status code returned while running Conv node. Name:'/model.8/cv1/conv/Conv' Status Message: /workspace/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow

This error occurs on a model I don’t unload from the GPU. When it happens, inference on this specific model ALWAYS fails afterwards, but other models are not affected. However, if I restart everything, the model works fine again (and most of the time it works well). The bug is not easy to reproduce, but I saw several other people facing the same issue above. Might it be due to some rare race condition?
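Given that a restart clears it, one possible mitigation might be to recreate just the affected session when the error first appears, rather than restarting everything. A rough sketch (model path is a placeholder; this only treats the symptom, not the root cause):

```python
import onnxruntime as ort

MODEL_PATH = "model.onnx"  # placeholder
PROVIDERS = ["CUDAExecutionProvider", "CPUExecutionProvider"]

session = ort.InferenceSession(MODEL_PATH, providers=PROVIDERS)

def run_with_reset(feeds):
    """Retry once on a fresh session if run() raises (e.g. the
    SafeIntOnOverflow RUNTIME_EXCEPTION discussed above)."""
    global session
    try:
        return session.run(None, feeds)
    except Exception:
        # Rebuild only the broken session instead of restarting the process.
        session = ort.InferenceSession(MODEL_PATH, providers=PROVIDERS)
        return session.run(None, feeds)
```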