godot: Skinned Skeleton3D calling `add_child()` with specific classes cause crash on Android with directional shadow in forward+/mobile
NOTE:
Significant research has gone into this issue. The initial report is a bit dated. See Summary of current findings below the Original Issue Report for more up to date details.
Original Issue Report
Tested versions
- Reproducible in: v4.3.dev.mono.custom_build [a7b860250]
System information
Godot v4.3.dev.mono (a7b860250) - Pop!_OS 22.04 LTS - Wayland - Vulkan (Forward+) - dedicated Intel® Arc™ A750 Graphics (DG2) () - AMD Ryzen 7 5800X 8-Core Processor (16 Threads)
Issue description
When running on Android, if shadows are enabled on the directional light while my model is in the scene, it will crash. I tested with a simple CsgBox3d and it was fine, but my Blocky_Guy.blend model causes it to crash. Disabling shadows works fine tho.
Previous to the current master, it was actually working with shadows, but crashing when I eqipped additional items to the player’s skeleton at runtime. So I am unsure if its specifically an issue with shadows, or an issue with how Godot handles my model on Android.
Steps to reproduce
- Download the MRP
- ~Build godot with double and mono support~ We reproduced this without double and mono. Just standard godot causes it.
- Export to Android
- Run the scene on Android and see that it crashes
This was specifically tested against Galaxy S10e model number: SM-G970U1
Edit: Also reproduced on Google Pixel
Minimal reproduction project (MRP)
Edit:
Here is an MRP using a standard t-pose from mixamo. So it does not appear to be specific to my model: https://github.com/godotengine/godot/issues/90459#issuecomment-2049998710
Edit 2:
Here is a current summary of the findings for easy reading
Summary of current findings
Renderers Tested:
forward+ = crash ❌ mobile = crash ❌ gl_compatibility = works ✅
Bug introducing PRs:
MRP test summaries
- Skeleton mesh + directional light + shadows = crash ❌
- Skeleton mesh + directional light + shadows + changing skeleton + add_child in _process = crash ❌
- Skeleton mesh + directional light + shadows + bone attachment + add_child gpu particles 3d = crash ❌
- Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh with skeleton = crash ❌
- Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh without skeleton = works ✅
MRP Details
Skeleton mesh + directional light + shadows
Pr that introduced the bug: https://github.com/godotengine/godot/pull/87888
Steps to reproduce:
- Export any of the below MRPs to Android.
- Open the application
- Watch it instantly crash when it tries to render the character
Known workarounds:
- Build with DISABLED_DEPRECATED as suggested by @TokageItLab here: https://github.com/godotengine/godot/issues/90459#issuecomment-2052238329
- Disable shadows on the directional light
- use gl_compatibility renderer
MRP with Blocky Ball model: https://github.com/godotengine/godot/files/14925776/example-crash.zip
MRP with Mixamo T Pose model: https://github.com/godotengine/godot/files/14948327/example-crash2.zip
MRP with Godot Character Model from examples repo: https://github.com/godotengine/godot/files/14951312/example-crash3.zip
IMPORTANT:
The following crash examples require https://github.com/godotengine/godot/pull/87888 to be reverted, fixed, or a commit checked out before it was in the repo. Otherwise, the following examples will crash on startup due to the fact they all require a skeleton and mesh present.
Skeleton mesh + directional light + shadows + changing skeleton + add_child in _process
Pr that introduced the bug: https://github.com/godotengine/godot/pull/84976 Steps to reproduce:
- Export any of the below MRPs to Android.
- Open the application
- Watch it crash after 1 second when it attempts to equip the “crown” to the Blocky Ball character
Known Workarounds
- Disable shadows on directional light
- use gl_compatibility renderer
MRP with Blocky Ball model: https://github.com/godotengine/godot/files/14967690/example-crash-change-skeleton.zip
And here is the code for equipping:
extends Node
var secs: float = 0
var is_equipped: bool = false
# # This works
# func _ready() -> void:
# var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
# var crown = (load("res://Cosmetics_Hat_Crown.blend") as PackedScene).instantiate()
# var mesh = crown.get_child(0).get_child(0).get_child(0) as MeshInstance3D
# mesh.get_parent().remove_child(mesh)
# crown.queue_free()
# skeleton.add_child(mesh)
# This crashes
func _process(delta: float) -> void:
if secs < 1:
secs += delta
return
if is_equipped:
return
is_equipped = true
var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
var crown = (load("res://Cosmetics_Hat_Crown.blend") as PackedScene).instantiate()
var mesh = crown.get_child(0).get_child(0).get_child(0) as MeshInstance3D
mesh.get_parent().remove_child(mesh)
crown.queue_free()
skeleton.add_child(mesh)
Note that equipping in _ready
works. Equipping in _process
after waiting for a second crashes. I tried equipping in _process
right away without a delay and that also worked.
Skeleton mesh + directional light + shadows + bone attachment + add_child gpu particles 3d
Pr that introduced the bug: https://github.com/godotengine/godot/pull/84976
Steps to reproduce:
- Export any of the below MRPs to Android.
- Open the application
- Watch it crash after 1 second when it attempts to equip the “crown” to the Blocky Ball character
MRP: example-crash-particles-3d.zip
Code for equipping:
extends Node
var secs: float = 0
var is_equipped: bool = false
# This crashes
#func _ready() -> void:
#var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
#var particles = load("res://gpu_particles_3d.tscn") as PackedScene
#skeleton.get_node("Foot_Front_R").add_child(particles.instantiate())
#skeleton.get_node("Foot_Front_L").add_child(particles.instantiate())
#skeleton.get_node("Foot_Back_R").add_child(particles.instantiate())
#skeleton.get_node("Foot_Back_L").add_child(particles.instantiate())
# This crashes
func _process(delta: float) -> void:
if secs < 1:
secs += delta
return
if is_equipped:
return
is_equipped = true
var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
var particles = load("res://gpu_particles_3d.tscn") as PackedScene
skeleton.get_node("Foot_Front_R").add_child(particles.instantiate())
skeleton.get_node("Foot_Front_L").add_child(particles.instantiate())
skeleton.get_node("Foot_Back_R").add_child(particles.instantiate())
skeleton.get_node("Foot_Back_L").add_child(particles.instantiate())
Note: In this situation, equipping in _ready and _process both cause a crash
Known Workarounds
- Disable shadows on directional light
- use gl_compatibility renderer
Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh with skeleton
Pr that introduced the bug: https://github.com/godotengine/godot/pull/84976
Steps to reproduce:
- Export any of the below MRPs to Android.
- Open the application
- Watch it crash after 1 second when it attempts to equip another Blocky Ball character as a hat on the main Blocky Ball character
MRP: example-crash-equip-mesh.zip
Code for equipping:
extends Node
var secs: float = 0
var is_equipped: bool = false
# This works
# func _ready() -> void:
# var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
# var hat = load("res://Cube_Guy_Hat.tscn") as PackedScene
# skeleton.add_child(hat.instantiate())
# This crashes
func _process(delta: float) -> void:
if secs < 1:
secs += delta
return
if is_equipped:
return
is_equipped = true
var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
var hat = load("res://Cube_Guy_Hat.tscn") as PackedScene
skeleton.add_child(hat.instantiate())
Note: In this situation, equipping in _ready works
Known Workarounds
- Disable shadows on directional light
- use gl_compatibility renderer
Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh without skeleton
This situation actually works! ✅ Here is an MRP demonstrating it working: example-working-mesh.zip
So if the model being added as a child of a bone attachment doesn’t itself have a skeleton, it works fine. In our game, we allow users to equip pets as “hats” for fun. We want to play cool animations with them. Hence the reason we are equipping items with skeletons.
About this issue
- Original URL
- State: open
- Created 3 months ago
- Reactions: 1
- Comments: 81 (81 by maintainers)
Clay just added the device name detection for devices for Adreno 6XX devices. Give it a shot.
Judging from this report, it seems at least 620, 630, 640, and 650 are affected. I think that it is probably safe to just use the workaround for all Adreno 6XX devices
I’ve submitted https://github.com/godotengine/godot/pull/91514 for fixing this issue.
At the moment however, the detection logic is not implemented. I pointed out the relevant line where it can be forced to use the workaround if you wish to check if it fixes the crash. It might be worth gathering the exact driver versions and device names that require this workaround.
Small update on the ARG side of things. We have confirmed that this is indeed a driver bug in certain Adreno drivers (seems to be 6XX series). The exact bug is that if a scissor rect is set in a given command buffer, any compute commands dispatched after will crash. Technically compute should ignore the scissor rect, but on this driver version it doesn’t.
We are still investigating solutions. The reason this didn’t crash previously is that in the mobile renderer the only compute work we do is skeletons and particles which are submitted at the beginning of the frame. So they always ran before the scissor rect was set. But now that we re-order work, it is possible for a graphics command to set the scissor rect before certain compute commands and cause this crash.
Interesting progress. I can’t find anything unique about the dispatch that crashes. I have also tested out forcing the engine to rebind the pipeline and all uniform sets, but the crash persists.
I noticed that both devices (the Samsung S10e and my device, the Pixel 4) have an Adreno 640 GPU. So I tested on a device with a Mali GPU and confirmed that the crash is not happening there. I have also tested on a device with an Adreno 530 GPU and it didn’t crash there either.
Looks like this is a driver bug after all that we are somehow triggering
To refine my thoughts. Given that:
FORCE_FULL_ACCESS_BITS
andRENDER_GRAPH_FULL_BARRIERS
and they didn’t resolve the crashI just tested the “Skeleton mesh + directional light + shadows + bone attachment + add_child gpu particles 3d” on a pixel 4 with the following results: 4.2.2: works fine 4.3-dev2: crash 4.3-dev4: crash
So clearly this is a mobile specific issue and something is getting through both our validation code and Vulkan’s validation layers
@TokageItLab Here are the results of the requested tests:
I tested the “disable_deprecated” and “revert https://github.com/godotengine/godot/pull/87888” separately. Both gave the same results. They are as follows:
✅ = works ❌ = crash
add_child()
in skinned Skeleton +Node
type child = ✅add_child()
in skinned Skeleton +Node2D
type child = ✅add_child()
in skinned Skeleton +Node3D
type child = ✅add_child()
in skinned Skeleton + Mesh Without Skeleton child = ✅add_child()
in skinned Skeleton + Mesh With Skeleton child = ❌add_child()
in skinned Skeleton + change skeleton of child to be parent = ❌add_child()
in skinned Skeleton +GPUParticles3D
type child = ❌✅ MRP Mesh Without Skeleton Child: https://github.com/godotengine/godot/files/14971317/example-working-mesh.zip ❌ MRP Mesh With Skeleton Child: https://github.com/godotengine/godot/files/14971312/example-crash-equip-mesh.zip ❌ MRP Particles Child: https://github.com/godotengine/godot/files/14971175/example-crash-particles-3d.zip ❌ MRP Change Skeleton Child: https://github.com/godotengine/godot/files/14967690/example-crash-change-skeleton.zip
So I’m beginning to think the MRPs that causes the crash may actually be parent skeleton adding child with skeleton AND parent skeleton adding child of ParticlesGPU3D
(deleting this for now as it wasn’t relevant)
@TokageItLab I just tested the
gl_compatibility
render on mobile and that worked! So there’s another piece of information for debugging 😃@TokageItLab I found the problematic commit!
https://github.com/godotengine/godot/commit/e9695d9fa22113f41348a0d2c7edb1138efc2b22
And here is the PR with it:
https://github.com/godotengine/godot/pull/84976
I tried reverting it in my fork but it wasn’t as simple as just “taking their changes” on everything when reverting. So I couldn’t quickly test if the revert fixes it off master.
But we found it! 😃 This is when the bug was introduced! At least in regards for this MRP: https://github.com/godotengine/godot/issues/90459#issuecomment-2053669239
Checking out the PR just before this PR, the MRP works. Then checking out that PR, it crashes.
@TokageItLab I have some interesting discoveries! 😃 Here is the list:
I simply commented those out on my local for now while we debug this.
It fixes the MRP. Yay!!! 🎉🎉🎉. It behaves the same as when I revert the PR.
It does not fix equipping items to a character by changing that item’s skeleton or adding particle3ds as a child of a bone attachment. Both of these scenarios still result in a crash.
Is there anything you would like me to try next?
After A LOT of debugging and tireless help from @clayjohn , I am reasonably confident to have bisected and determined this PR to be the root cause of the crash: https://github.com/godotengine/godot/pull/87888
Yay!!! 😃 Now I can sleep peacefully!
Edit:
Literally, its almost midnight and I’m very tired 😅 Excited to see what new discoveries await tomorrow!
The Vulkan profile for the affected device is here: https://vulkan.gpuinfo.org/displayreport.php?id=16301
I can reproduce the validation layer error when using this profile.
It appears that the crash is coming from the skeleton compute shader. I think there is a good chance that the crash is happening due to an issue with this specific skeleton. But we can’t rule out that the Vulkan error is somehow related. So let’s try to fix the validation layer error first and then re-evaluate and see if the crash goes away.
For the validation error, it seems that it comes from not supporting linear samplers with
VK_FORMAT_D16_UNORM
. On my device the validation layer is slightly different, but I think it is because it is falling back to another code path it thinks will be better support. I get the following warning print:format VK_FORMAT_D16_UNORM is simulating unsupported features!
and then the validation layer error I get is:I just tested and can confirm this PR fixes it! Great work everyone! Shoutout to @clayjohn @DarioSamo and @TokageItLab !! 😃
Given Clay just got a crash I figure we can investigate it from here. And no, I did not run into any visual issues with them either.
It’s been brought to my attention from the call stack posted before the crash is actually at the driver level but not the GPU execution, so it’s probably some situation the renderer causes on this particular driver.
Thanks @clayjohn ! 😃 Should we update the label with “confirmed” instead of “needs testing”? Or is that not what those labels mean?
I figure I should give my two cents since it is suspected #84976 might be the cause of the issue.
I gave some of the MRPs linked here a test on desktop at least in
master
with the strictest possible validation enabled. None of the projects reported anything back, so we can probably rule out synchronization as far as the VLL goes.However, there’s some flags you can play around with if you feel it should be more paranoid about it:
When reordering is disabled, the ARG will basically fall back to the behavior of the previous version and submit commands in their original order. When full barriers are enabled, then it’ll basically be completely paranoid about synchronization. However, given we have no reliance on stuff like indirect drawing or dispatches, I’m very suspicious of synchronization being a problem here that could cause a crash. At worst I could imagine reordering having some effect significant enough for a crash.
There’s a lot of degrees of separation here between what was originally reported and the issue that could be causing a crash so it’s a bit tough to understand the possible cause to me yet. Instancing elements from scripts or what they’re attached to shouldn’t cause any visible difference to the renderer, as no hierarchy is visible from its side nor are the objects treated differently depending on their origin. So I think pretty much anything should be reproducible on a static version of the scene if it exists.
My suggestion for now would be to see if you see any behavior change whatsoever messing with these options, but I figured it’d be worth noting that the pattern has usually been that the introduction of the ARG has exposed some long-standing bugs due to minimizing the amount of barriers and transitions necessary. Either Godot’s own issues were exposed or more obscure driver issues that end up requiring workarounds (like how @akien-mga’s setup required the introduction of buffer barriers as the driver seemingly did not follow global barriers very well).
Yes I think this is a good title change. With the exception of the first MRP where skeletons + mesh in general are crashing. Everything else is add_child.
I will do those tests tomorrow as well! 😃
@TCROC Thanks for the effort! I think we have focus on the cause of the problem quite well. Finally, it would be great if you could identify which environment have crashes with MRP between MONO and GDScript, and precision for double and float, respectively.
Yay!! Good news! I found a version that works: https://godotengine.org/download/archive/4.0-stable/
Granted it is almost a year back… BUT at least I have a window to work with when bisecting! 😃 I’ll keep working off official releases until I find which official release broke it to narrow my window
@TokageItLab And one more really interesting thing that I just tested.
If I disabled shadows on the directional light, EVERYTHING works fine. Including changing skeletons of items and making particle gpu 3ds children of bone attachments.
This was the case for the MRP. I had not tested it on the Blocky Ball OT project. I just did and can confirm disabling shadows fixes things.
Maybe shadows are the issue? Maybe not and shadows simply push it over the edge due to a different mistake? Not sure. Is there anything you would like me to explore in this area (lighting and shadows)?
Just woke up. I’m gonna grab some coffee and check these out. Thanks @TokageItLab 😃
Here is a stack trace captured from debugging in Android Studio: