godot: Vulkan: GPU Timeout on MacOS [tested multiple hardware]
Godot version
4.0.dev (cut from 880a0177d12463b612268afe95bd3d8dd565bf52), with @Zylann’s terrain module on top
System information
Multiple machines tested:
- macOS 12.4, Intel i9
- macOS 12.2, Intel i7
Issue description
The game crashes when a sufficiently complex scene is loaded.
There are no UBSAN/ASAN errors anymore either (I spent an entire day fixing them all; patch incoming soon ™️).
I tried MVK_ALLOW_METAL_EVENTS=1, as suggested by some search results, but it only gave me an additional 2 seconds after the scene had loaded before the GPU timeout.
We’re having this on ARM processors too. I will try a bisect tomorrow, but any help is appreciated.
This is for the Mirror.
Symptoms: the entire app hangs, there is no display output anymore, and we get infinite swap issues or a crash with VK_ERROR_DEVICE_LOST, etc.
2022-10-14 00:20:24.210364+0100 godot.macos.opt.tools.x86_64[4888:74403] Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (00000002:kIOAccelCommandBufferCallbackErrorTimeout)
[mvk-error] VK_ERROR_DEVICE_LOST: Command buffer 0x7fe954f59a00 "vkQueueSubmit CommandBuffer on Queue 0-0" execution failed (code 2): Caused GPU Timeout Error (00000002:kIOAccelCommandBufferCallbackErrorTimeout)
2022-10-14 00:20:24.210533+0100 godot.macos.opt.tools.x86_64[4888:74403] ERROR: - Message Id Number: 0 | Message Id Name:
VK_ERROR_DEVICE_LOST: Command buffer 0x7fe954f59a00 "vkQueueSubmit CommandBuffer on Queue 0-0" execution failed (code 2): Caused GPU Timeout Error (00000002:kIOAccelCommandBufferCallbackErrorTimeout)
Objects - 1
Object[0] - VK_OBJECT_TYPE_QUEUE, Handle 105553183017928
at: _debug_messenger_callback (drivers/vulkan/vulkan_context.cpp:171)
ERROR: - Message Id Number: 0 | Message Id Name:
VK_ERROR_DEVICE_LOST: Command buffer 0x7fe954f59a00 "vkQueueSubmit CommandBuffer on Queue 0-0" execution failed (code 2): Caused GPU Timeout Error (00000002:kIOAccelCommandBufferCallbackErrorTimeout)
Objects - 1
Object[0] - VK_OBJECT_TYPE_QUEUE, Handle 105553183017928
Steps to reproduce
Honestly, I dug around a lot and don’t know how to reproduce this outside of our codebase due to its complexity.
Minimal reproduction project
Unable to provide one, as I am unsure how to reproduce this outside our game.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 24 (23 by maintainers)
RevoluPowered and I debugged this a little further today. We think that the issue is not reproducible on the mobile renderer, which points to the root issue being in either the scene shader or a compute shader. We analyzed typical draw calls in the Xcode debugger and did not see anything out of the ordinary (although between Vulkan -> MoltenVK -> Metal a lot of debug info is lost, so I am not confident in what we saw).
My best guess is that something is breaking in the cluster building, resulting in pathological loops forming in the scene shader.
Next steps:
- Add #define MODE_UNSHADED to the top of the clustered rendering shader. See if this impacts the crash.
- [...] for (uint i = item_from; i < item_to; i++) { [...] See if this impacts the crash.
Fixed by https://github.com/godotengine/godot/pull/67915
Fixes:
- https://github.com/godotengine/godot/pull/67912 - important if you like your debugger to work
- https://github.com/godotengine/godot/pull/67913 - important if you don’t want MSAA to randomly crash on you
- https://github.com/godotengine/godot/pull/67915 - the most important one
A patch to fix the issue is incoming; we found it was the subgroup support in MoltenVK.
Patch to correct data types so that unsigned int is used (the GLSL compiler complains they’re int, not uint).
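To illustrate why int/uint mismatches in loop bounds are worth cleaning up, here is a minimal C++ sketch. It is not the actual patch and not Godot shader code. The parameter names mirror the for (uint i = item_from; i < item_to; i++) loop quoted earlier in the thread, and count_iterations is a hypothetical helper. It shows how a negative signed value converted to an unsigned bound would turn a short loop into billions of iterations, which is the kind of pathological loop that could plausibly outlast a GPU watchdog.

```cpp
#include <cstdint>

// Hypothetical helper (not from Godot): number of iterations that
//   for (uint i = item_from; i < item_to; i++)
// would perform, computed without actually running the loop.
std::uint64_t count_iterations(std::uint32_t item_from, std::uint32_t item_to) {
    return item_to > item_from ? std::uint64_t(item_to) - item_from : 0;
}

// Converts a signed value to an unsigned loop bound.
// A negative input wraps modulo 2^32, e.g. -1 becomes 4294967295.
std::uint32_t to_unsigned_bound(std::int32_t value) {
    return static_cast<std::uint32_t>(value);
}
```

With a sane bound, count_iterations(0, 8) is 8. With a signed -1 passed through to_unsigned_bound, the same loop would run 4294967295 times.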
I noticed something potentially bad: as far as I can see, we don’t lock the version of the Metal API we use. Metal 3 has vastly different features compared to Metal 2.
I posted this question in the Godot rendering chat, so I am putting it here too:
Our bug in this case could also be caused by a machine using an outdated version of Metal. I don’t believe that is the issue, but I am going to try testing with a newer OS and Xcode version.
I spent a few more hours on this and debugged a lot. Eventually I found that if shader validation in Xcode is disabled, the crash occurs in Xcode.
This leads me to believe that the timeout IS inside the shader, and that some part of the shader is triggering undefined behaviour and thus causing the GPU timeout.
I will write up how to configure the frame debugger properly once I can get the “Show in source” button resolving to MoltenVK.
I have a theory that the reason we can’t reproduce this on non-Intel Macs is that the issue is hidden by the higher performance of the M1. Those machines have roughly a 30% uplift, which might be why the issue cannot be reproduced there, since it is timing-critical.
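The timing theory can be made concrete with a small C++ sketch. Everything in it is hypothetical: the function name exceeds_timeout, the watchdog threshold and the workload durations are made-up numbers, not measurements. Only the rough 30% uplift comes from the comment above.

```cpp
// Hypothetical model, illustrative numbers only.
// Returns true if a workload that takes `work_ms` on the baseline GPU
// would still exceed `timeout_ms` on a GPU that is `speedup` times faster.
bool exceeds_timeout(double work_ms, double speedup, double timeout_ms) {
    return (work_ms / speedup) > timeout_ms;
}
```

Assuming a 2000 ms watchdog, a 2400 ms command buffer times out at speedup 1.0 but not at speedup 1.3 (about 1846 ms). Under this model, a workload only slightly over the limit on the slower machine would never trip the timeout on the faster one.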
We time out when something broken is sent to the GPU. I enabled Vulkan shader validation and it doesn’t report many issues.
My first port of call is a random blank texture that is sent to the compute shader; it is 256 MB and never accessed by the GPU.
Another cause for concern is that non-unique textures are sent to the GPU more than once and left alive.
We could potentially reduce these to a single texture to lower memory consumption. This happens quite a lot.
Setting some to use a volatile texture might be a good idea.
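As a rough sketch of the "reduce these to a single texture" idea, here is a content-keyed cache in C++. This is not Godot's RenderingDevice API: TextureHandle, TextureCache and get_or_create are hypothetical names, and the "upload" is simulated by handing out a new integer handle. A real implementation would more likely key on a hash of the pixel data than on the data itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for a GPU texture ID.
using TextureHandle = std::uint32_t;

class TextureCache {
public:
    // Returns the existing handle when identical pixel data was seen before,
    // otherwise "uploads" it (here: just allocates a new handle).
    TextureHandle get_or_create(const std::vector<std::uint8_t>& pixels) {
        auto it = handles_.find(pixels);
        if (it != handles_.end()) {
            return it->second;
        }
        TextureHandle handle = next_handle_++;
        handles_.emplace(pixels, handle);
        return handle;
    }

    // Number of distinct textures actually kept alive.
    std::size_t unique_count() const { return handles_.size(); }

private:
    std::map<std::vector<std::uint8_t>, TextureHandle> handles_;
    TextureHandle next_handle_ = 1;
};
```

Requesting the same blank pixel buffer twice returns the same handle, so only one copy would need to stay resident.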
We also seem to bind too many objects at once and duplicate bindings; perhaps a logic error is present.
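One way to check for duplicated bindings on the CPU side is sketched below in C++. It assumes a binding can be identified by a (descriptor set, binding slot) pair, which is only one possible reading of the observation above. Binding and find_duplicate_bindings are hypothetical names, not part of Godot or Vulkan.

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical description of one resource binding: (descriptor set, binding slot).
using Binding = std::pair<std::uint32_t, std::uint32_t>;

// Returns every (set, binding) pair that appears more than once,
// each reported once, in the order the first repeat was encountered.
std::vector<Binding> find_duplicate_bindings(const std::vector<Binding>& bindings) {
    std::set<Binding> seen;
    std::set<Binding> reported;
    std::vector<Binding> duplicates;
    for (const Binding& b : bindings) {
        // insert().second is false when the element was already present.
        if (!seen.insert(b).second && reported.insert(b).second) {
            duplicates.push_back(b);
        }
    }
    return duplicates;
}
```

Running a check like this over the bindings recorded for a single draw or dispatch would show whether the duplication is real or just an artifact of how the frame debugger displays them.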
These are just my initial observations from the past day.
I did more investigation into the issue with the Xcode frame debugger and found two relevant issues with the same error:
- https://github.com/KhronosGroup/MoltenVK/issues/602
- https://github.com/KhronosGroup/MoltenVK/issues/836
With the Xcode frame debugger attached I could not reproduce the crash.
However, both of those issues are pertinent and show exactly the same symptoms and log output.
I believe the lighting configuration change is a symptom of the problem, not the actual issue. The actual issue seems to be that we cause some code to time out in the compute shaders.