gdext: Implicit binds in virtual calls cause borrow errors

Edit bromeon – title was “Cannot add_child in a &mut self function if Self overrides on_notification

#[derive(GodotClass)]
#[class(init, base = Node)]
pub struct Bar {
    #[base]
    base: Base<Node>,
}

#[godot_api]
impl NodeVirtual for Bar {
    fn ready(&mut self) {
        if Engine::singleton().is_editor_hint() {
            return;
        }

        let foo = Node::new_alloc();
        self.add_child(foo.share().upcast());
    }

    fn on_notification(&mut self, what: NodeNotification) {}
}

add_child will trigger a notification (https://docs.godotengine.org/en/stable/classes/class_node.html#class-node-constant-notification-child-order-changed), and that will in turn trigger a call to on_notification, which will try to get a mutable borrow of self. This causes a crash.

It seems like this is a big issue tbh because it basically makes self.add_child almost unusable whenever on_notification is overridden.

I’m not sure what a good solution to this is, but it’s related to the common issue of calling a signal that will trigger on self.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 4
  • Comments: 18 (18 by maintainers)

Most upvoted comments

Allowing Gd<Self> #359 as the receiver could be used to do a workaround for #383, though it’d be kinda awkward code imo:

#[func]
fn foo(_self: Gd<Self>) {
  {
    let _self = _self.bind_mut();
    // do stuff with self
  }
  // some function call that would potentially grab a mutable reference to self
  _self.call("some_func", ...);
  {
    let _self = _self.bind_mut();
    // do other stuff with self
  }
}

Also one important thing to note is that deferring is not really a workaround here: It explicitly changes behavior by introducing delays. For some applications that is unacceptable

I have an idea for how to mostly fix these issues.

Background

The basic issue is that we have an indirect call of a rust method through ffi, but this shouldn’t theoretically be an issue as you are allowed to do this in rust.

For instance, let’s say you have a signal foo, which when emitted calls bar. Then this code:

fn process(&mut self, delta: f64) {
    self.base.emit("foo")
}

fn bar(&mut self) {
  // code
}

Is the same as doing

fn process(&mut self, delta: f64) {
    self.bar()
}

fn bar(&mut self) {
  // code
}

which is fine, but the original code would fail.

The reason this is fine is because rust knows that calling bar() requires a mutable reference to self, i.e there must be no aliasing other references around.

So if we can somehow create an API that allows a user to call methods which may potentially call back into self (such as emit or add_child) only when we could get a &mut self reference. Then it would be safe to create a new &mut self reference when we’re called from Godot later.

Basic Idea

The idea has two major components (all names used are up for bikeshedding):

  1. We add a new method that can get a mutable reference to a rust class, which i will call bind_self(), only when the object hasn’t been bound by either bind() or bind_mut(), it also blocks both bind() and bind_mut(). [^1]
  2. We create a mechanism to detect when an aliasing reference may exist. And use bind_mut() if one may exist, and bind_self() when we know one cant.

Current Idea For Implementation

My current idea for implementing this is:

We add a counter to InstanceStorage that i’ll call possibly_aliasing: usize[^2] which tracks the number of possibly aliasing references that exist. This counter is set to 0 by default.

possibly_aliasing will be incremented by 1 whenever we enter a method from Godot, or whenever bind() or bind_mut() are called on a Gd<T>. As we can then get arbitrary aliasing references.

possibly_aliasing is decremented by 1 whenever one of the bind-guards are dropped.

Additionally we create a drop-guard which Base can return. This drop-guard decrements possibly_aliasing by 1, and then increments it by 1 when it’s dropped. This drop-guard requires a &mut self to be created, ensuring that we do not have any other aliasing references around when it’s called.[^3] This drop-guard dereferences to the engine-type similar to a Gd<T>.

When a rust-method is called from Godot (that takes &mut self), we check possibly_aliasing. If it is 0, then we use bind_self(), otherwise we use bind_mut().

So as some code-examples we’d take this:

#[derive(GodotClass)]
#[class(base = Node, init)]
struct MyClass {
  #[base]
  base: Base<Node>
}

#[godot_api]
impl INode for MyClass {
  fn ready(&mut self) {
    // base() here returns the drop guard 
    self.base().call("some_func".into(), &[]);
  }
}

#[godot_api]
impl MyClass {
  #[func]
  fn some_func(&mut self) {
    // code
  }
}

And we generate this (very abbreviated) glue-code:

fn generated_ready(storage: &InstanceStorage<MyClass>) {
  if storage.get_possibly_aliased() == 0 {
    let mut instance = unsafe { storage.bind_self() };
    instance.ready()
  } else {
    let mut instance = storage.bind_mut();
    instance.ready()
  }
}

fn generated_some_func(storage: &InstanceStorage<MyClass>) {
  if storage.get_possibly_aliased() == 0 {
    let mut instance = unsafe { storage.bind_self() };
    instance.some_func()
  } else {
    let mut instance = storage.bind_mut();
    instance.some_func()
  }
}

One thing to consider is adding an immutable/mutable pair of these, as it’d be safe to call a method that takes a &self from one that takes a &mut self. However the other way is not safe. This could be done with a second counter that just tracks how many immutable self-binds we’ve taken. And if it’s greater than 0 then a mutable self bind will fail, but this should also allow bind() to succeed.

old contents i wonder if we can add another kind of unsafe bind, like `bind_self`. which returns a `GdSelf` that derefs to `&mut self`. where the safety invariant is that it is only used to call a method on it.

it fails if there exists a bind or bind_mut (and also blocks both of those calls), but not if there already exists a bind_self. that way we can allow for reentrancy without borrow failures. and i dont think it’d be possible to get two aliasing mut pointers, since it cant exist alongside the other binds and i dont think you can ever observe two GdSelf at the same time.

this needs one modification, accessing base must require a &mut self reference. we can do this by changing how base is accessed to a method on self. so we’d have self.base().add_child() for instance. that way rust can understand that any usage of base could potentially require a mutable reference to self. this could be done by making dereferencing a Base an unsafe method and auto-generating the accessor method.

This should avoid the unsoundness with allowing reentrancy @Bromeon identified above.

This is additionally also semantically more correct imo, since a Base is a pointer to self, so it makes sense that calling a method on it should require an appropriate reference to self.

[^1]: I am unsure how user-facing bind_self() should be. We may choose to only implement it on InstanceStorage thus making it entirely internal and hidden from the user. bind_self() would likely need to be unsafe also as it can return multiple &mut T references to the same object. [^2]: This will actually be a RefCell<usize> or AtomicUsize as relevant. [^3]: We might be able to make this a safe function, if we check that the &mut self points to the same rust object that the Base refers to.

upcast() is no longer needed; since #370 you can call Godot functions on the Gd<T> directly.

This is an interesting challenge, and probably far more general than the specific example in the issue title. Working with notifications or signals in general has the potential to “backfire”, leading to a panic.

Adding another data point: I encountered this issue while implementing a custom item list, which basically started as a port of item_list.cpp of Godot. The pattern used in that implementation doesn’t work out-of-the box in Rust, because it involves backfiring signals, even in a relatively subtle way:

  • The item list uses a scroll back, and its "value_changed" signal gets connected to a callback that simply does queue_redraw. That makes sense, in general the item list wants to react to scroll events.
  • The main layouting actually happens inside NOTIFICATION_DRAW.
  • Under certain conditions this drawing logic actually interacts with the scroll bar, calling its set_max and set_page functions. What is not immediately obvious: Under certain conditions these setter actually internally delegate to the set_value setter, which in turn triggers the signal. In Rust this results in a panic, because the callback cannot be called while the drawing is happening. Interestingly there is a set_value_no_signal setter, that allows to set the scrollbar value without generating a signal. However, there are no set_max_no_signal or set_page_no_signal variations, which makes it tricky to realize that the code can backfire a signal. In fact it took me a while to encounter the bug in practice, because it requires the special condition of the scroll bar being scrolled all the way to the bottom, followed by resizing the window in a way that the height increases.
  • The work-around I’m using in this particular example is to temporarily disconnect the signal during the drawing logic that messes with the scrollbar and re-connect the signal after the interaction is done. It actually turns out that doing a queue_redraw is pointless in this case when the scrollbar changes its value triggered by the drawing logic itself, because the drawing is already happening. Triggering yet another redraw is actually a waste of resources. So while the bug was not so easy to find, it has actually revealed a code smell of the C++ version, and the resolution doesn’t even feel so bad.

Some thoughts about more general solutions:

The idea in #359 looks quite helpful. For instance, instead of disconnecting and reconnecting the signal, the drawing logic could have set self.is_within_redraw = true (and at the end back to false)

#[func]
fn on_scroll_changed(_self: Gd<Self>) {
    // Only try to mutable bind & trigger a redraw if we are not already drawing...
    if self.is_within_redraw {
        let _self = self.bind_mut();
        _self.base.queue_redraw()
    }
}

But the manual maintenance of self.is_within_redraw feels a bit clumsy. This makes me wonder if it would be possible to offer a kind of try_bind_mut() which basically knows whether we are already bound or not? This would allow to write:

#[func]
fn on_scroll_changed(_self: Gd<Self>) {
    if let Some(_self) = _self.try_bind_mut() {
        // No other mut binding exists, so we know we have been triggered from "externally".
        // In this case we want to trigger a redraw.
        _self.base.queue_redraw()
    } else {
        // Apparently we cannot get a mut binding. This indicates that we have triggered ourselves!
        // And assuming this can only come from the drawing/layouting logic itself, we actually don't
        // have to do anything here in this particular example...
    }
}

The interesting question is what to do in the else branch in the general case. One possibility would be to use interior mutability to encode “what I would have liked to do, but couldn’t”. At the end of the other methods we could check whether such actions exist and execute them there. I’m not sure if it is possible/sensible to generalize this, but a crazy idea was to have a kind of DeferredQueue that uses interior mutability like this:

#[func]
fn on_scroll_changed(&self) { // due to interior mutability we don't need &mut
    // The idea of `run_or_enqueue` would be to either execute the code block synchronously
    // if a mutable binding is possible, or to push the FnOnce into a queue with interior mutability.
    // If desirable, a plain `.enqueue()` that is consistently lazy could be offered a well.
    self.deferred_queue.run_or_enqueue(|&mut this| {
        // here we have a &mut binding
    });
}

#[func]
fn process(&mut self, delta: f64) {
    // This executes all the accumulated actions. Since the `process` function runs once per frame, the amount
    // of delay can be okay-ish. To reduce the delay the user could also place the `run_all()` strategically in
    // certain other methods.
    self.deferred_queue.run_all();
}

Not sure if this makes sense, but I feel that offering a simple way to “defer” signals/notifications could help a lot to mitigate this issue, because it breaks the cyclic nature of the problem. In many use cases the slight delay may be a reasonable price to pay for being guaranteed not to run into “signal cycle panics”.