godot: Mesh Instances randomly not rendering (regression since 3.3)
Godot version
3.3.4 official
System information
Win 10/64 NVIDIA GeForce GTX 1060/PCIe/SSE2
Issue description
Too frequently some meshes won’t render in the editor, and sometimes in game. Switching to another scene and switching back, or closing and reopening will temporarily fix it.
We were developing on 3.2.1 and I never had the issue. As we explored the 3.3 branch I started noticing this error. It’s frequent enough that it’s a problem. It could have happened any time from 3.2.2.
@Arnklit Already reported this in #47051, and provided a test project, but it was closed prematurely. Though since it’s intermittent I doubt the core devs will see it until they start making a large scene with tons of objects. @mbrlabs also confirmed the problem in that ticket.
It happens to all sorts of meshes. It has also affected @Zylann’s heightmap terrain, one time, in the editor. One quad (say 5mx5m) was invisible and I couldn’t get it to become visible until reloading the scene (prior to any of my gdnative work).
Here I’m building a house from a kit. I already had this scene open. I was in another scene, then switched back to it and saw this. I moved the camera closer for the shot:
I switched to the other scene, then came back, and the two meshes rendered again:
The floor is a fairly simple mesh with one wood material. The wall is more complex. Notice in the first picture you can see the shutter and glass. Here is the full object:
This wall object has separate glass and shutter meshes. The mesh that disappeared includes two materials, the same wood used on all objects in all of these pictures, and a rusty metal frame. Both materials on the mesh didn’t render, and both materials are visible on all the other meshes.
The mesh instances have their mesh data stored in the tscn
file. They were originally imported from fbx, but that file has already been deleted. The textures are separate dds files. The materials are separate binary files, though most of our other materials are tres
. Here we are using MTLOD and an immediate geometry node, however I’ve experienced this on our meshes that are not using MTLOD. These shots are all close enough that we’re only looking at LOD0.
I do not believe this is not an issue with LOD, file types or materials. It decides whether it is going to render these objects when the scene opens. Nothing I can do in the scene will get it to render. Not selecting it, moving it, changing visibility, or moving the camera. Only by switching or reopening scenes fixes (or breaks) these meshes.
My console is full of these errors, which repeat when I switch back and forth between my scenes sometimes.
ERROR: get: Condition "!id_map.has(p_rid.get_data())" is true. Returned: __null
At: ./core/rid.h:150
ERROR: instance_set_transform: Condition "!instance" is true.
At: servers/visual/visual_server_scene.cpp:713
ERROR: get: Condition "!id_map.has(p_rid.get_data())" is true. Returned: __null
At: ./core/rid.h:150
ERROR: instance_set_scenario: Condition "!instance" is true.
At: servers/visual/visual_server_scene.cpp:629
ERROR: get: Condition "!id_map.has(p_rid.get_data())" is true. Returned: __null
At: ./core/rid.h:150
ERROR: instance_set_visible: Condition "!instance" is true.
At: servers/visual/visual_server_scene.cpp:780
ERROR: get: Condition "!id_map.has(p_rid.get_data())" is true. Returned: __null
At: ./core/rid.h:150
ERROR: instance_attach_skeleton: Condition "!instance" is true.
At: servers/visual/visual_server_scene.cpp:913
Steps to reproduce
Build out a complex scene for a game with a lot of objects to greatly increase your likelihood of seeing the error. Now at least 3 known gamedevs experience this problem.
Minimal reproduction project
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 34 (34 by maintainers)
Commits related to this issue
- Clears RIDs wherever they are freed by VisualServer or PhysicsServer and possibly reused. Fixes #53374 — committed to TokisanGames/godot by TokisanGames 3 years ago
- Clears RIDs wherever they are freed by VisualServer or PhysicsServer and possibly reused. Fixes #53374 (cherry picked from commit 3d20218dae68f0c50d0c9a943ce67ea4260099ce) — committed to godotengine/godot by TokisanGames 3 years ago
Turning off the grid in the viewport prevents the error messages and symptoms on my system. Specifically disabling all three
Editor->Editor Settings->Editors->3d->grid_*_plane
, though onlyxz
is enabled by default.Guys, please test this on your systems and let me know if you get any
!id_map.has(p_rid.get_data())
error messages. You should be able to test this with any build of 3.3 and later.If that fixes it for everyone, then it looks like I’ve solved the issue.
The Cause
SpatialEditor has two 3-item arrays:
grid
andgrid_instance
. These hold the meshes and instances of the 3D grid shown in the viewports for each of the 3 grid planes.These arrays are continually freed and repopulated whenever the camera or an object is moved, or several other events. They are repopulated only if the grid is enabled. Which means that sometimes (all the time), the old RIDs are repeatedly freed.
This is the root problem: It is not safe to free RIDs multiple times. When they are freed, they leave a dangling pointer.
Every update cycle in SpatialEditor, RIDs are sent to the visual server to be freed. Sometimes with current pointers, sometimes with dangling pointers.
VisualServer has no way to tell if an RID is valid or not.
rid.is_valid()
is true even if its dangling, andrid.get_id()!=0)
will cause an exception if the pointer is dangling.VS gets the pointer from each RID and searches for it in each of ~20 red black trees to see if the pointer matches any instance in any tree. If there’s no match, nothing happens. However if there is a match, the instance is removed from the tree.
When you have a scene with hundreds of objects, it only takes a few scene switches to generate thousands of compares. Eventually it finds a match and a random instance gets removed from whichever r/b tree owned it (cameras, lights, meshes, etc).
The instance is still referenced elsewhere in the engine, but since the VisualServer can’t find the instance in any of its id_maps, it spams the console with error messages, usually for the same instance over and over. Switching scenes or closing/reopening scenes causes SpatialEditor to repopulate its grid arrays, but it could free another random instance.
The Fix
I could just edit SpatialEditor and manually assign RID() to the grid arrays after free(). However this issue has highlighted a bigger problem in the engine. Freeing RIDs leaves dangling pointers. It’s not safe to free an RID multiple times. If you do, you could remove instances from random id_maps throughout the engine, not just in VisualServer but also in Physics and anywhere else they are used.
My PR will tweak
RID
and every instance offree(RID)
to enforce cleaning up these dangling pointers (passing the RIDs by reference and setting_data=nullptr
). This also meansRID.is_valid()
will actually mean something. I’ve already got it working in all the VisualServers. I just need to do it for Physics and any other place. I’ll push up a PR probably tomorrow with the RID changes and VS::free() sanity checks.Also, RID.h, handles release and debug mode differently. Debug mode searches a red/black tree to see if it contains an RID. In release mode it doesn’t even use a red/black tree. Instead the RID just keeps a direct reference to its owner. I really don’t like having two versions of code, with release mode having more functions available. I intend to bring the two versions closer together to reduce complexity of finding bugs in the future.
Just as a reminder, #44642 identifies a place in the
SpatialEditor
that can free rids that dont belong to the editor. With the checks you are adding, you could verify this place too to make sure it is not the same cause.@mbrlabs The core issue is this part of all of the various error messages:
!id_map.has(p_rid.get_data()) is true
. This is because the RID was previously removed from its container (id_map
) for an unknown reason.@lawnjelly I can’t use the callstack PR as it requires EngineDebugger which doesn’t exist in 3.x. However, I learned how to dump the call stack with CaptureStackBackTrace() in the Windows SDK.
I ran through several errors and got 4 unique traces (callstacks.xlsx). All of them go back to
VisualServerWrapMT::free
(as a wrapper forVisualServerRaster::free
), then diverge from there.One thing I have found is that
VisualServerRaster::free()
is often called with invalid RIDs (ie! p_rid.is_valid()
and sometimes with RID==0, which should never happen (SinceRID_OwnerBase::init_rid()
guarantees it will always be >=1). These invalid conditions are then sent to 5 VisualServers to be searched for in 5-10 redblack trees each. I’ll introduce some sanity checks to skip these conditions.When running in a fully loaded scene, not switching, free() is called often, including both valid and invalid calls. On my village gallery scene shown above, it’s called around 10 times per frame, whenever the camera or an object is moved.
I haven’t yet figured out why these objects are being freed or if they are legitimate, why only 10 free calls are being made when there are 300 objects in this scene, or why invalid RIDs and RID==0s are attempting to be freed over and over again.
So far, searching backwards with the callstacks hasn’t been effective due to the design. I’ve started tracing forward by logging markers at various points in the code, starting with what happens when editor tabs are switched. So far on one instance, I got errors when this line adds the new scene to the tree:
I haven’t traced into add_child() yet.
I tested printing out when the
VisualServerScene::instance_owner::id_map
has something added and removed. Turns out it’s constantly being added to and removed from through operations moving stuff around in a scene in addition to switching between scenes. I hit a breakpoint for a missing RID from the id_map, then scrolled back through my log. I found where the RID had been added to id_map, then later it was removed, then later the error occurred.Since I’m predominantly just switching scenes, the object class and instance should both continue to exist in memory while the scene is open. But it won’t be in the tree until switching to the scene. I assume the object and/or instance have their own global RIDs, but the VSS creates a new, unique (SafeRefCount), temporary RID for use in it’s id_map, which is constantly being populated and depopulated.
When these errors occur, it appears that the RIDs are valid, the underlying objects should also still be valid and still in memory, though maybe not in the tree. And at some point the instance was added to the id_map and then removed. The question is, “Why was it removed when the object is still there in the scene?”
I’m still thinking about the other ideas you had. I’m more of a hacker than a rigorous developer, so me adding debugging facilities to Godot is unlikely. Being able to print a call stack on demand looks helpful. I didn’t know that was possible without a debugger. Maybe my next step is to dump a call stack to a logging file for every id_map.free(). Then repeat the same test above so I can retrieve the callstack for missing RIDs. Then I can start writing tests in that code to figure out why, and of course analyze the code history so we know who to blame 😉.
I tried running my game in 3.2.1, but there are too many problems with it. Code that should work is invalid. It doesn’t read all of the meshes that were imported in 3.3 or 3.4 and saved in a native mesh format. So I’m not going to work backward.
Instead, I built a new 3.x da5c843bd debug version w/ threadsafe BVH on, and worked today while running the editor through the MSVC debugger. I set some breakpoints in the code and recorded a video of my screen whenever I hit an error to debug the callstack and objects.
Here are some conclusions so far:
set_visible
,set_transform
, orset_scenario
on a particular resource (RID), which happens on all objects every time it switches scenes, including editor viewport objects (e.g. SpatialEditorViewport whatever that is). Only once in a while does it trigger onset_visible
, hence point 1. On each scene switch, the scene goes through_enter_tree()
and_exit_tree()
.Here is one of the three areas where the error messages come from, where VisualServer attempts to get the instance of the RID.
instance_owner
is anRID_Owner
, which is a class with aSet
called id_map. A Set is a red-black tree and is used to contain all RIDs for the scene:Set<RID_Data *> id_map;
Here is
instance_owner.get()
The first line doesn’t produce an error, so the RID object is valid.
I expanded the second macro and added some debugging code to determine that in almost all cases the RID object and the resource ID integer within are valid. However when this fails, the ID is NOT in the Set (id_map). In only one instance today was the RID data invalid.
Here’s a video showing my debugging progress. It’s long, and gets more efficient in later chapters. Look for the chapter markers on the timeline. I captured 5 error sessions, but only one was visual, at 1:06:45.
https://youtu.be/ozET9itjeD4
It would be awesome to get an MRP based in the project in the original post. The one in #47051 has indeed a problem related to tab switching, but it’s no longer the one reported there.
What happens there now is related to timing. If you switch to and away from the fur ball scene, a piece of script in the Fur addon causes errors about certain nodes not being in the tree:
While the fur scene is the active one,
update_physics_object()
is called. If you switch away from it in a shorter time thandelay
(which seems to be 0.5), by the time theyield()
ends, the “current physics object” is no longer in the tree, causing the errors. I woulnd’t call that a Godot bug. It may be possible to freeze the state of a scene in a tab that becomes inactive and unfreeze when it’s active again, but it’s possibly not worth the hassle. Sounds like a non trivial at all thing to implement.I had to tear RID apart to find the issue, so I’m familiar with it. I will ask reduz.
Either every caller of any
free(RID)
needs to also dorid=RID()
, or RID needs to clean itself up, or a combination, like it’s clean in the engine and dirty in script.I currently have it working the original way for the script API, and a clean way for the engine. Several areas of the engine regularly send bad data to free.
As for script, it could possibly generate an error if desired, but there’s no way for it to know that the pointer is dangling, unless 1) it try{}s to access the memory and 2) it’s not pointing to a valid object, but this issue demonstrates that it does far too often. Many times I watched
SpatialEditor
, which should have only referencedVisualInstance
, referencingAnimationNodeOneShot
.I’ve been using a 3.4 build @akien-mga made (v3.4.beta-pr53411-a2e374d.official [a2e374d41]) w/ bvh and threadsafe bvh turned on. It took a while, but it finally hid a mesh. It seems this set up works a lot better than 3.3, which seemed to occur daily or even a couple times a day. This is the first time in several days that it occurred where I caught it. However that does include life, work, and working on other aspects so it’s possible this is just an appearance and not reality.
Here you can see I found my Stairs5 failed to render.
The object was there and I could select it in the tree. When I moved the gizmo, it did move, and these error messages appeared in the console. Then I opened the stairs5 scene, switched back, and the stairs reappeared. Then I when I moved it, I didn’t receive any error messages.
What I reported in the last comment was that the mesh was actually corrupted. Every other instance on this issue has only been a display error. I have not encountered that again. One of my team members did have an interesting bug where the mesh was gone from the viewport, and the mesh display in the inspector was blank. Mine have always displayed except the one time when the mesh got corrupted. His mesh did not get corrupted either.
Also what Zylann connected in the terrain ticket above ( https://github.com/Zylann/godot_heightmap_plugin/issues/260#issuecomment-944895257) is exactly what I was referring to in the original post.
@lawnjelly
I have assets I cannot share. Also the minimum project is not useful because it’s a temporary problem. As soon as I switch scenes, it goes away. If I saved it for you, closed it and reopened it on my own system, the problem is gone. There’s nothing in the file that is a problem. You just have to work with a large project yourself if you want to see it on your own system.
I suspect the relationship with drivers may be a red herring - it is likely this is affecting the timing, and it is likely a semi-random timing issue which is triggering the bug. Driver issues can cause objects to disappear, but I struggle to think of how they could cause error message our side with RIDs etc.
Yes i remember that issue. I did a lot of testing back then but could not reproduce it reliably (only happend in the editor for me so it was not that critical). I still occasionally have this problem when working on big scenes.
The earliest version i encountered this with was 3.2.4 beta3. See https://github.com/godotengine/godot/issues/47051#issuecomment-802819194