onnxruntime: SafeIntOnOverflow() Integer overflow error when running inference in an ASGI server
Describe the bug
When running uvicorn + FastAPI + an ORT inference session (a single model, a single uvicorn worker, at most 3 concurrent requests) inside a Docker container, we regularly see errors from the ORT session, always of this form:
```
RUNTIME_EXCEPTION : Non-zero status code returned while running InstanceNormalization node. Name:'InstanceNormalization_15' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
```
The failing node is not always the same; sometimes it is one of the many Conv nodes in the network.
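For context on how the concurrency arises: an ASGI server like uvicorn runs synchronous endpoint handlers on a thread pool, so up to 3 in-flight requests can mean up to 3 simultaneous calls into the same session. The sketch below illustrates this with the standard library only; `fake_predict` is a stand-in for the real `ort_session.run(...)` call, and all names are hypothetical.

```python
import asyncio
import threading

seen_threads = set()

def fake_predict(x):
    # Stand-in for: ort_session.run(None, {"input": x})
    seen_threads.add(threading.get_ident())
    return x * 2

async def handle_request(x):
    loop = asyncio.get_running_loop()
    # FastAPI dispatches plain `def` endpoints to a thread pool the same
    # way, so concurrent requests become concurrent run() calls.
    return await loop.run_in_executor(None, fake_predict, x)

async def main():
    # Simulate 3 requests arriving at once.
    return await asyncio.gather(*(handle_request(i) for i in range(3)))

results = asyncio.run(main())
print(results)  # [0, 2, 4]
```

This is only a model of the request flow, not the failing service itself, but it shows why the session sees overlapping `run()` calls rather than a queue.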
Urgency
Prevents us from going to production.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
- ONNX Runtime installed from (source or binary): pip
- ONNX Runtime version: 1.11.1
- Python version: 3.9
- CUDA/cuDNN version: 11.4/8
- GPU model and memory: Tesla K80, 12GB
Expected behavior
No errors thrown. I was under the impression the model inference session would wait for a given prediction to finish before accepting another.
Additional context
I’m wondering if this is arising due to too many inference runs happening concurrently. I assumed, based on reading the documentation, that the InferenceSession would queue up runs, but I think I am mistaken. Looking for clarification on that if possible.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 21 (4 by maintainers)
@zhanghuanrong Faced the same issue on v1.14.1 (upgraded from v1.10.0) with ~30 models loaded (some are constantly loaded and unloaded to/from the GPU, while other models persist on the GPU).
This error occurs on a model I don’t unload from the GPU. And when it happens, inference on this specific model will ALWAYS fail afterwards, but other models are not affected. However, if I restart everything, this model works fine again (and most of the time it works well). This bug is not easy to reproduce, but several other people above are facing the same issue. Might it be due to some rare race condition?