godot: Mesh Instances randomly not rendering (regression since 3.3)

Godot version

3.3.4 official

System information

Win 10/64 NVIDIA GeForce GTX 1060/PCIe/SSE2

Issue description

Too frequently some meshes won’t render in the editor, and sometimes in game. Switching to another scene and switching back, or closing and reopening will temporarily fix it.

We were developing on 3.2.1 and I never had the issue. As we explored the 3.3 branch I started noticing this error. It’s frequent enough that it’s a problem. It could have happened any time from 3.2.2.

@Arnklit Already reported this in #47051, and provided a test project, but it was closed prematurely. Though since it’s intermittent I doubt the core devs will see it until they start making a large scene with tons of objects. @mbrlabs also confirmed the problem in that ticket.

It happens to all sorts of meshes. It has also affected @Zylann’s heightmap terrain, one time, in the editor. One quad (say 5mx5m) was invisible and I couldn’t get it to become visible until reloading the scene (prior to any of my gdnative work).

Here I’m building a house from a kit. I already had this scene open. I was in another scene, then switched back to it and saw this. I moved the camera closer for the shot:

image

I switched to the other scene, then came back, and the two meshes rendered again:

image

The floor is a fairly simple mesh with one wood material. The wall is more complex. Notice in the first picture you can see the shutter and glass. Here is the full object:

image

This wall object has separate glass and shutter meshes. The mesh that disappeared includes two materials, the same wood used on all objects in all of these pictures, and a rusty metal frame. Both materials on the mesh didn’t render, and both materials are visible on all the other meshes.

The mesh instances have their mesh data stored in the tscn file. They were originally imported from fbx, but that file has already been deleted. The textures are separate dds files. The materials are separate binary files, though most of our other materials are tres. Here we are using MTLOD and an immediate geometry node, however I’ve experienced this on our meshes that are not using MTLOD. These shots are all close enough that we’re only looking at LOD0.

I do not believe this is not an issue with LOD, file types or materials. It decides whether it is going to render these objects when the scene opens. Nothing I can do in the scene will get it to render. Not selecting it, moving it, changing visibility, or moving the camera. Only by switching or reopening scenes fixes (or breaks) these meshes.

My console is full of these errors, which repeat when I switch back and forth between my scenes sometimes.

ERROR: get: Condition "!id_map.has(p_rid.get_data())" is true. Returned: __null
   At: ./core/rid.h:150
ERROR: instance_set_transform: Condition "!instance" is true.
   At: servers/visual/visual_server_scene.cpp:713
ERROR: get: Condition "!id_map.has(p_rid.get_data())" is true. Returned: __null
   At: ./core/rid.h:150
ERROR: instance_set_scenario: Condition "!instance" is true.
   At: servers/visual/visual_server_scene.cpp:629
ERROR: get: Condition "!id_map.has(p_rid.get_data())" is true. Returned: __null
   At: ./core/rid.h:150
ERROR: instance_set_visible: Condition "!instance" is true.
   At: servers/visual/visual_server_scene.cpp:780
ERROR: get: Condition "!id_map.has(p_rid.get_data())" is true. Returned: __null
   At: ./core/rid.h:150
ERROR: instance_attach_skeleton: Condition "!instance" is true.
   At: servers/visual/visual_server_scene.cpp:913

Steps to reproduce

Build out a complex scene for a game with a lot of objects to greatly increase your likelihood of seeing the error. Now at least 3 known gamedevs experience this problem.

Minimal reproduction project

Look in https://github.com/godotengine/godot/issues/47051

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 34 (34 by maintainers)

Commits related to this issue

Most upvoted comments

Turning off the grid in the viewport prevents the error messages and symptoms on my system. Specifically disabling all three Editor->Editor Settings->Editors->3d->grid_*_plane, though only xz is enabled by default.

Guys, please test this on your systems and let me know if you get any !id_map.has(p_rid.get_data()) error messages. You should be able to test this with any build of 3.3 and later.

If that fixes it for everyone, then it looks like I’ve solved the issue.

The Cause

SpatialEditor has two 3-item arrays: grid and grid_instance. These hold the meshes and instances of the 3D grid shown in the viewports for each of the 3 grid planes.

These arrays are continually freed and repopulated whenever the camera or an object is moved, or several other events. They are repopulated only if the grid is enabled. Which means that sometimes (all the time), the old RIDs are repeatedly freed.

This is the root problem: It is not safe to free RIDs multiple times. When they are freed, they leave a dangling pointer.

Every update cycle in SpatialEditor, RIDs are sent to the visual server to be freed. Sometimes with current pointers, sometimes with dangling pointers.

VisualServer has no way to tell if an RID is valid or not. rid.is_valid() is true even if its dangling, and rid.get_id()!=0) will cause an exception if the pointer is dangling.

VS gets the pointer from each RID and searches for it in each of ~20 red black trees to see if the pointer matches any instance in any tree. If there’s no match, nothing happens. However if there is a match, the instance is removed from the tree.

When you have a scene with hundreds of objects, it only takes a few scene switches to generate thousands of compares. Eventually it finds a match and a random instance gets removed from whichever r/b tree owned it (cameras, lights, meshes, etc).

The instance is still referenced elsewhere in the engine, but since the VisualServer can’t find the instance in any of its id_maps, it spams the console with error messages, usually for the same instance over and over. Switching scenes or closing/reopening scenes causes SpatialEditor to repopulate its grid arrays, but it could free another random instance.

The Fix

I could just edit SpatialEditor and manually assign RID() to the grid arrays after free(). However this issue has highlighted a bigger problem in the engine. Freeing RIDs leaves dangling pointers. It’s not safe to free an RID multiple times. If you do, you could remove instances from random id_maps throughout the engine, not just in VisualServer but also in Physics and anywhere else they are used.

My PR will tweak RID and every instance of free(RID) to enforce cleaning up these dangling pointers (passing the RIDs by reference and setting _data=nullptr). This also means RID.is_valid() will actually mean something. I’ve already got it working in all the VisualServers. I just need to do it for Physics and any other place. I’ll push up a PR probably tomorrow with the RID changes and VS::free() sanity checks.

Also, RID.h, handles release and debug mode differently. Debug mode searches a red/black tree to see if it contains an RID. In release mode it doesn’t even use a red/black tree. Instead the RID just keeps a direct reference to its owner. I really don’t like having two versions of code, with release mode having more functions available. I intend to bring the two versions closer together to reduce complexity of finding bugs in the future.

Just as a reminder, #44642 identifies a place in the SpatialEditor that can free rids that dont belong to the editor. With the checks you are adding, you could verify this place too to make sure it is not the same cause.

@mbrlabs The core issue is this part of all of the various error messages: !id_map.has(p_rid.get_data()) is true. This is because the RID was previously removed from its container (id_map) for an unknown reason.

@lawnjelly I can’t use the callstack PR as it requires EngineDebugger which doesn’t exist in 3.x. However, I learned how to dump the call stack with CaptureStackBackTrace() in the Windows SDK.

I ran through several errors and got 4 unique traces (callstacks.xlsx). All of them go back to VisualServerWrapMT::free (as a wrapper for VisualServerRaster::free), then diverge from there.

One thing I have found is that VisualServerRaster::free() is often called with invalid RIDs (ie ! p_rid.is_valid() and sometimes with RID==0, which should never happen (Since RID_OwnerBase::init_rid() guarantees it will always be >=1). These invalid conditions are then sent to 5 VisualServers to be searched for in 5-10 redblack trees each. I’ll introduce some sanity checks to skip these conditions.

When running in a fully loaded scene, not switching, free() is called often, including both valid and invalid calls. On my village gallery scene shown above, it’s called around 10 times per frame, whenever the camera or an object is moved.

I haven’t yet figured out why these objects are being freed or if they are legitimate, why only 10 free calls are being made when there are 300 objects in this scene, or why invalid RIDs and RID==0s are attempting to be freed over and over again.

So far, searching backwards with the callstacks hasn’t been effective due to the design. I’ve started tracing forward by logging markers at various points in the code, starting with what happens when editor tabs are switched. So far on one instance, I got errors when this line adds the new scene to the tree:

EditorNode::set_current_scene()
     scene_root->add_child(new_scene);

I haven’t traced into add_child() yet.

I tested printing out when the VisualServerScene::instance_owner::id_map has something added and removed. Turns out it’s constantly being added to and removed from through operations moving stuff around in a scene in addition to switching between scenes. I hit a breakpoint for a missing RID from the id_map, then scrolled back through my log. I found where the RID had been added to id_map, then later it was removed, then later the error occurred.

Since I’m predominantly just switching scenes, the object class and instance should both continue to exist in memory while the scene is open. But it won’t be in the tree until switching to the scene. I assume the object and/or instance have their own global RIDs, but the VSS creates a new, unique (SafeRefCount), temporary RID for use in it’s id_map, which is constantly being populated and depopulated.

When these errors occur, it appears that the RIDs are valid, the underlying objects should also still be valid and still in memory, though maybe not in the tree. And at some point the instance was added to the id_map and then removed. The question is, “Why was it removed when the object is still there in the scene?”

I’m still thinking about the other ideas you had. I’m more of a hacker than a rigorous developer, so me adding debugging facilities to Godot is unlikely. Being able to print a call stack on demand looks helpful. I didn’t know that was possible without a debugger. Maybe my next step is to dump a call stack to a logging file for every id_map.free(). Then repeat the same test above so I can retrieve the callstack for missing RIDs. Then I can start writing tests in that code to figure out why, and of course analyze the code history so we know who to blame 😉.

I tried running my game in 3.2.1, but there are too many problems with it. Code that should work is invalid. It doesn’t read all of the meshes that were imported in 3.3 or 3.4 and saved in a native mesh format. So I’m not going to work backward.

Instead, I built a new 3.x da5c843bd debug version w/ threadsafe BVH on, and worked today while running the editor through the MSVC debugger. I set some breakpoints in the code and recorded a video of my screen whenever I hit an error to debug the callstack and objects.

Here are some conclusions so far:

  1. The errors are happening all the time. While opening, closing, and creating object scenes, and switching between scenes, I’m hitting a block of 3-4 errors at least hourly if not more like every 10-15 minutes. However the times I hit an error that provides an obvious visual artifact is much rarer, maybe 1/10th the time.
  2. The problems occur when VisualServer wants to set_visible, set_transform, or set_scenario on a particular resource (RID), which happens on all objects every time it switches scenes, including editor viewport objects (e.g. SpatialEditorViewport whatever that is). Only once in a while does it trigger on set_visible, hence point 1. On each scene switch, the scene goes through _enter_tree() and _exit_tree().
  3. The Set (red-black tree) that stores the RIDs of the objects in the scene is not properly populated. These errors are occurring because the RIDs of the objects in the scene are missing from the Set of RIDs the visual server queries.

Here is one of the three areas where the error messages come from, where VisualServer attempts to get the instance of the RID.

VisualServerScene::instance_set_transform(RID p_instance, const Transform &p_transform) {
	Instance *instance = instance_owner.get(p_instance);
	ERR_FAIL_COND(!instance);

instance_owner is an RID_Owner, which is a class with a Set called id_map. A Set is a red-black tree and is used to contain all RIDs for the scene: Set<RID_Data *> id_map;

Here is instance_owner.get()

rid.h::get(p_rid)
		ERR_FAIL_COND_V(!p_rid.is_valid(), nullptr);
		ERR_FAIL_COND_V(!id_map.has(p_rid.get_data()), nullptr);

The first line doesn’t produce an error, so the RID object is valid.

I expanded the second macro and added some debugging code to determine that in almost all cases the RID object and the resource ID integer within are valid. However when this fails, the ID is NOT in the Set (id_map). In only one instance today was the RID data invalid.

  1. I dumped the entire red-black tree across several error instances to confirm that the element wasn’t just misplaced. It’s unlikely the problem is in the implementation of the tree. It’s much more likely that the problem occurs when the Set is being populated. On another day I’ll explore that.

Here’s a video showing my debugging progress. It’s long, and gets more efficient in later chapters. Look for the chapter markers on the timeline. I captured 5 error sessions, but only one was visual, at 1:06:45.

https://youtu.be/ozET9itjeD4

It would be awesome to get an MRP based in the project in the original post. The one in #47051 has indeed a problem related to tab switching, but it’s no longer the one reported there.

What happens there now is related to timing. If you switch to and away from the fur ball scene, a piece of script in the Fur addon causes errors about certain nodes not being in the tree:

func update_physics_object(delay : float) -> void:
	yield(_shell_fur_object.get_tree().create_timer(delay), "timeout")
	_physics_pos = _current_physics_object().global_transform.origin
	_physics_rot = _current_physics_object().global_transform.basis.get_rotation_quat()

While the fur scene is the active one, update_physics_object() is called. If you switch away from it in a shorter time than delay (which seems to be 0.5), by the time the yield() ends, the “current physics object” is no longer in the tree, causing the errors. I woulnd’t call that a Godot bug. It may be possible to freeze the state of a scene in a tab that becomes inactive and unfreeze when it’s active again, but it’s possibly not worth the hassle. Sounds like a non trivial at all thing to implement.

I had to tear RID apart to find the issue, so I’m familiar with it. I will ask reduz.

Either every caller of any free(RID) needs to also do rid=RID(), or RID needs to clean itself up, or a combination, like it’s clean in the engine and dirty in script.

I currently have it working the original way for the script API, and a clean way for the engine. Several areas of the engine regularly send bad data to free.

As for script, it could possibly generate an error if desired, but there’s no way for it to know that the pointer is dangling, unless 1) it try{}s to access the memory and 2) it’s not pointing to a valid object, but this issue demonstrates that it does far too often. Many times I watched SpatialEditor, which should have only referenced VisualInstance, referencing AnimationNodeOneShot.

I’ve been using a 3.4 build @akien-mga made (v3.4.beta-pr53411-a2e374d.official [a2e374d41]) w/ bvh and threadsafe bvh turned on. It took a while, but it finally hid a mesh. It seems this set up works a lot better than 3.3, which seemed to occur daily or even a couple times a day. This is the first time in several days that it occurred where I caught it. However that does include life, work, and working on other aspects so it’s possible this is just an appearance and not reality.

Here you can see I found my Stairs5 failed to render.

image

The object was there and I could select it in the tree. When I moved the gizmo, it did move, and these error messages appeared in the console. Then I opened the stairs5 scene, switched back, and the stairs reappeared. Then I when I moved it, I didn’t receive any error messages.

ERROR: Condition "!instance" is true.
   at: instance_set_transform (servers/visual/visual_server_scene.cpp:680)
ERROR: Condition "!id_map.has(p_rid.get_data())" is true. Returned: nullptr
   at: get (./core/rid.h:140)
ERROR: Condition "!instance" is true.
   at: instance_set_visible (servers/visual/visual_server_scene.cpp:745)
ERROR: Condition "!id_map.has(p_rid.get_data())" is true. Returned: nullptr
   at: get (./core/rid.h:140)
ERROR: Condition "!instance" is true.
   at: instance_set_scenario (servers/visual/visual_server_scene.cpp:599)
ERROR: Condition "!id_map.has(p_rid.get_data())" is true. Returned: nullptr
   at: get (./core/rid.h:140)
ERROR: Condition "!instance" is true.
   at: instance_attach_skeleton (servers/visual/visual_server_scene.cpp:876)

What I reported in the last comment was that the mesh was actually corrupted. Every other instance on this issue has only been a display error. I have not encountered that again. One of my team members did have an interesting bug where the mesh was gone from the viewport, and the mesh display in the inspector was blank. Mine have always displayed except the one time when the mesh got corrupted. His mesh did not get corrupted either.

Also what Zylann connected in the terrain ticket above ( https://github.com/Zylann/godot_heightmap_plugin/issues/260#issuecomment-944895257) is exactly what I was referring to in the original post.

image

@lawnjelly

Again I would ask if you or anyone is able to paste one of these projects into the thread (especially a before / after corruption, if that seems to be happening). Without a minimum reproduction project it is hard to work out what is happening.

I have assets I cannot share. Also the minimum project is not useful because it’s a temporary problem. As soon as I switch scenes, it goes away. If I saved it for you, closed it and reopened it on my own system, the problem is gone. There’s nothing in the file that is a problem. You just have to work with a large project yourself if you want to see it on your own system.

I’m using the nvdia studio driver. The issue has been going on so long that I’ve upgraded my drivers at least a few times.

I suspect the relationship with drivers may be a red herring - it is likely this is affecting the timing, and it is likely a semi-random timing issue which is triggering the bug. Driver issues can cause objects to disappear, but I struggle to think of how they could cause error message our side with RIDs etc.

Yes i remember that issue. I did a lot of testing back then but could not reproduce it reliably (only happend in the editor for me so it was not that critical). I still occasionally have this problem when working on big scenes.

The earliest version i encountered this with was 3.2.4 beta3. See https://github.com/godotengine/godot/issues/47051#issuecomment-802819194