onnxruntime: Bug: Converting from ONNX to ORT fails when setting Device=DirectML [C++] [ONNX2ORT converter] [DirectML]
Describe the bug ONNX to ORT conversion works when Device=CPU, but fails with DirectML (exact same code)
Low level details:
- I tried with 7 networks and it happens in all of them (including MNIST and ResNet)
- This affects both v1.7.1 (the version we are using) and the very latest GitHub code.
- ORT doesn't crash during conversion but later, when loading/using the new ORT models. Checking the files, the ORT model for CPU is similar in size to the ONNX file (~MBs), while the DirectML one is only a few KBs, so this is clearly a bug in the converter. The load-time error mentions operators not being implemented, but the underlying problem is that the ORT file is simply too small (error shown in https://github.com/microsoft/onnxruntime/discussions/7931)
Also, all models run (and we verified they match the original PyTorch model accuracies) when loaded from ONNX and set to DML.
What works:
- ONNX file loaded, set to CPU and running inference on it
- ONNX file loaded, set to GPU and running inference on it
- ONNX file loaded, set to CPU, converted to ORT, loaded as ORT file, set to CPU, and running inference on it
What does not work:
- ONNX file loaded, set to GPU, converted to ORT, loaded as ORT file, set to GPU, and running inference on it
- ONNX file loaded, set to CPU, converted to ORT, loaded as ORT file, set to GPU, and running inference on it --> This one does not crash, but it is clearly running on CPU because its runtime timings match the CPU version (not the GPU version). So it seems that whatever session options were baked into the ORT file at conversion time are used regardless of my trying to set a different device
Urgency Urgent --> It blocks ORT file deployment on DirectML networks. We have an internal deadline in August to release this project
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- ONNX Runtime installed from (source or binary): Tried both, but we care mostly about the source one
- ONNX Runtime version: v1.7.1 and also tested in latest GitHub code
- Python version: None, using C++
- Visual Studio version (if applicable): VS 2019 Professional
- GCC/Compiler version (if compiling from source): VS 2019 Professional
- CUDA/cuDNN version: None (DirectML)
- GPU model and memory: Nvidia 3080
To Reproduce
- Describe steps/code to reproduce the behavior.
// Conversion step
{
    // Set up ORT and create an environment
    Ort::InitApi();
    const char* const ModelRelativeFilePathCharPtr = TCHAR_TO_ANSI(*InModelRelativeFilePath);
    Environment = MakeUnique<Ort::Env>(ORT_LOGGING_LEVEL_WARNING, ModelRelativeFilePathCharPtr);
    Allocator = MakeUnique<Ort::AllocatorWithDefaultOptions>();
    SessionOptions = MakeUnique<Ort::SessionOptions>();
    if (Device == GPU)
    {
        SessionOptions->SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
        OrtSessionOptionsAppendExecutionProvider_DML(*SessionOptions, 0);
    }
    else
    {
        SessionOptions->SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    }
    // Generate the ORT file
    SessionOptions->SetOptimizedModelFilePath(*OutputORTOptimizedModelPath);
    Session = MakeUnique<Ort::Session>(*Environment, *FullModelFilePath, *SessionOptions);
    // Result --> ORT file on disk at OutputORTOptimizedModelPath, which is correct if Device == CPU, but far smaller than it should be if Device == GPU
}
// Running step
{
    // Same setup code as above
    // Load/run the ORT file
    // Note the lack of "SetOptimizedModelFilePath()"
    Session = MakeUnique<Ort::Session>(*Environment, *FullModelFilePath, *SessionOptions);
    // Result --> ORT file works fine on CPU, but crashes with DirectML, giving the error shown in https://github.com/microsoft/onnxruntime/discussions/7931
}
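For reference, here is a self-contained sketch of the same conversion step with the Unreal-specific helpers (MakeUnique, TCHAR_TO_ANSI, the FString paths) replaced by plain C++. The paths are placeholders, and the memory-pattern/execution-mode settings follow the DML EP documentation rather than the snippet above:

#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

// Convert an ONNX model by creating a DML session and saving the optimized graph.
// "model.onnx" and "model.dml.ort" are placeholder paths.
void ConvertOnnxWithDml()
{
    Ort::Env Env(ORT_LOGGING_LEVEL_WARNING, "onnx2ort");
    Ort::SessionOptions Options;
    Options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
    // The DML EP docs recommend disabling memory pattern and using sequential execution.
    Options.DisableMemPattern();
    Options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(Options, 0));
    // Saving the optimized model is what produces the (truncated) file described above.
    Options.SetOptimizedModelFilePath(L"model.dml.ort");
    Ort::Session Session(Env, L"model.onnx", Options);
}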
- Attach the ONNX model to the issue (where applicable) to expedite investigation. Here is a zip file with 3 models (SqueezeNet, MNIST, and ResNet), and for each one: the original ONNX model, the ORT model when Device=CPU, and the ORT model when Device=DirectML: https://drive.google.com/file/d/1F13H3HW4PoEZqLBg2sJoXfBB-zcfQGwq
Expected behavior I expect both ORT files to be approximately the same size, and the DirectML one not to crash when used later
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 28 (28 by maintainers)
Commits related to this issue
- [DML EP] Disable DML Graph Fusion for lower graph optimization level OR setOptimizedFilePath true (#13913) ### Description DML EP won't fuse the ONNX Graph if ORT Graph optimization level is <= 1 o... — committed to microsoft/onnxruntime by sumitsays 2 years ago
@diablodale Correct, nodes will remain distinct operators (or fused operators).
Yes, it will have operator fusions (e.g. Conv + Relu -> ConvRelu), but not whole-graph-fusion.
Yes, that final whole-graph-fusion will be done upon reload.
That final fusion happens in either case, whether loading the original model or the pre-operator-fused model. Exporting to an .onnx file and reloading, I noticed a time saving during session load of about 5-15% depending on the model, and run time is the same. Exporting to the .ort file format and reloading, I noticed a substantial time saving in session load, from 2-7x depending on the model. But as enticing as that is, beware that .ort was only recently enabled by https://github.com/microsoft/onnxruntime/pull/13913, and I can't yet vouch for its robustness without more exhaustive testing (I have only tried it with a few models), because the interaction with the DML EP might exercise new code paths. Also, we should verify whether this issue still applies: https://github.com/microsoft/onnxruntime/issues/13535.
Great. I'd also include the driver version in your hash, just in case updating the driver changes registered data type support.
I create a DLL plugin for the Cycling74 Max runtime patching system. My customers are educators, researchers, artists, musicians, etc. I provide one ONNX model for a specific use case plus the ability to run any ONNX model. My DLL transforms inputs/outputs to and from native Max data. My plugin allows running the model on the CPU, DirectML, CUDA, or TensorRT providers with a single setting change. I hide all the technical complexities so my customers can focus on their art/research/education.
The Max environment is always running; it is a graphical hack/patch environment where nodes are connected by patchcords. Patchcords and nodes are reshaped/connected hundreds of times a day as customers experiment and try ideas. This realtime iteration necessitates caching and reuse. The time burden of running the ONNX optimization process every time they connect a patchcord or click "go" hampers their creativity and kills their "flow".
I know when hardware, models, or settings change… therefore I can cache models after they go through the optimization process. I already do this successfully with the TensorRT provider. A similar ability with DirectML is desired; I attempted it with SetOptimizedModelFilePath() but ran into the same problem as the OP… the saved DirectML model is unusable.
@gedoensmax If you have a model with dynamic dimensions and want to make them fixed, you could use this tool: https://onnxruntime.ai/docs/reference/mobile/make-dynamic-shape-fixed.html
I don’t quite understand how model load time would be affected by having fixed shapes. If anything, I would expect more optimizations to be possible when shapes are fixed.
I would suggest running the ‘basic’ level optimizations on the model with just the CPU EP enabled to do those optimizations ahead of time. They are not specific to any EP, only use official ONNX operators, and cover things like constant folding and common subexpression elimination.
Beyond the ‘basic’ level you get into EP specific optimizations which may involve compiling nodes or fusing nodes that will use a custom operator. Currently there’s no general purpose way to save a compiled node like TensorRT engine caching does. An inference session is intended to be re-used though, so this cost during loading is not per-inference.
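As a rough illustration of that suggestion, here is a minimal C++ sketch (paths are placeholders, error handling omitted) of running only the 'basic' optimizations ahead of time with just the CPU EP and saving the result:

#include <onnxruntime_cxx_api.h>

// Offline step: apply only the EP-agnostic 'basic' optimizations (constant folding,
// common subexpression elimination, ...) and save the optimized graph, which still
// contains only official ONNX operators. Paths are placeholders.
void SaveBasicOptimizedModel()
{
    Ort::Env Env(ORT_LOGGING_LEVEL_WARNING, "offline_optimize");
    Ort::SessionOptions Options;
    Options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_BASIC);
    Options.SetOptimizedModelFilePath(L"model.basic.onnx"); // wide-char path on Windows
    // No execution provider is appended, so only the default CPU EP participates.
    Ort::Session Session(Env, L"model.onnx", Options);
}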
POC for adding support for DML when using an ORT format model: https://github.com/microsoft/onnxruntime/compare/skottmckay/ORT_model_support_with_DML_EP
Technically we could create the ORT format model with just basic optimizations and DML disabled to not require the changes in the DML graph partitioning. At runtime, if DML was enabled it could still execute the same nodes.
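A sketch of the runtime half of that idea, assuming the model was pre-optimized without DML as above (paths are again placeholders): append the DML EP when loading, and it can still claim the plain ONNX nodes.

#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

// Runtime step: load the pre-optimized model with the DML EP enabled. The saved
// graph contains no DML-specific fusions, so the DML EP can still execute its nodes.
void LoadPreOptimizedModelWithDml()
{
    Ort::Env Env(ORT_LOGGING_LEVEL_WARNING, "runtime_dml");
    Ort::SessionOptions Options;
    // The DML EP docs recommend disabling memory pattern and using sequential execution.
    Options.DisableMemPattern();
    Options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(Options, 0));
    Ort::Session Session(Env, L"model.basic.onnx", Options); // placeholder path
}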
Thanks to those last answers we were able to feed the ONNX buffer into ORT directly, which is a working workaround for us!
We will keep an eye on this post to know when the DML-ORT file issue is solved, as we'd need to switch to it once it's working, but we are no longer blocked.
Thanks for the quick answers and the great work!
Example of reading bytes from file: https://github.com/microsoft/onnxruntime/blob/894fc828587c919d815918c4da6cde314e5d54ed/onnxruntime/test/shared_lib/test_model_loading.cc#L21-L31
The bytes are just passed directly when creating the inference session.
https://github.com/microsoft/onnxruntime/blob/894fc828587c919d815918c4da6cde314e5d54ed/onnxruntime/test/shared_lib/test_model_loading.cc#L41
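For the C++ API, a minimal sketch of that pattern could look like the following (no error handling; assumes the model fits in memory; names are illustrative):

#include <onnxruntime_cxx_api.h>
#include <fstream>
#include <vector>

// Read the model file into a byte buffer and create the session from memory,
// mirroring the linked test. Works for both ONNX and ORT format models.
Ort::Session CreateSessionFromBytes(Ort::Env& Env, const char* ModelPath,
                                    const Ort::SessionOptions& Options)
{
    std::ifstream File(ModelPath, std::ios::binary | std::ios::ate);
    const std::streamsize Size = File.tellg();
    File.seekg(0, std::ios::beg);
    std::vector<char> Bytes(static_cast<size_t>(Size));
    File.read(Bytes.data(), Size);
    // Session constructor overload that takes a memory buffer instead of a file path.
    return Ort::Session(Env, Bytes.data(), Bytes.size(), Options);
}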
We’ll look into the DML issue as it should be possible to use that with an ORT format model.
ONNX format files are supported on all platforms. It’s just that the binary size of the ORT library will be bigger vs. a minimal build that only supports ORT format models (by a few MB). For that you get a lot more flexibility though, such as the ability to use CPU or GPU depending on what’s available at runtime.
Can you provide more details on how you were trying to feed the ONNX format file at runtime? InferenceSession has an API where raw bytes can be provided, which can be used for both ONNX and ORT format models. Given that, I'm not quite following how 'the ORT API seems to open the onnx file in many places', since it only sees bytes and not a filename if that API is used.
https://github.com/microsoft/onnxruntime/blob/894fc828587c919d815918c4da6cde314e5d54ed/onnxruntime/core/session/inference_session.cc#L686
I did a quick test using the python API and it seemed to work fine with the ONNX format model being provided as bytes.