vz: panic: runtime/cgo: misuse of an invalid Handle on RequestStop and waiting for vm stopped state

Describe the bug Issue is similar to the #119. But now it happens mainly during RequestStop and waiting for vm state to change to stop

To Reproduce Steps to reproduce the behavior:

  1. limactl start template://experimental/vz
  2. limactl stop vz
  3. Notice panic happened

Expected behavior No panic occurs during stop

Environment that you use to compile (please complete the following information):

  • Xcode version: [Xcode 14.1 Build version 14B47b]
  • macOS Version: [ProductName: macOS, ProductVersion: 13.2, BuildVersion: 22D49]
  • mac architecture: [x86_64]
  • Go Version: [go version go1.20.1 darwin/amd64]

Additional context https://github.com/lima-vm/lima/issues/1381

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 5
  • Comments: 18 (17 by maintainers)

Commits related to this issue

Most upvoted comments

@balajiv113 I apologize for not being able to keep the promise I made earlier.

Due to a hectic workload, I was unable to complete today’s tasks. Unfortunately, I will be going on a trip for the next four days and it seems unlikely that I’ll be able to work on this issue during that time.

It appears that the release will be pushed back to next week 🙇

@balajiv113 I released this fix as https://github.com/Code-Hex/vz/releases/tag/v3.0.6 Thank you so much!

@balajiv113 Thanks! I handle this issue on Wednesday 🙏

@Code-Hex This issue is still happening even with StateChangeHandler() I was able to identify the root cause, with that was able to come-up with a reproduction test as well

Root cause StateHandle object here https://github.com/Code-Hex/vz/blob/main/virtualization.go#L106 gets GC’ed in go side.

Due to which, in change observer https://github.com/Code-Hex/vz/blob/main/virtualization.go#L157 while converting unsafe.pointer to cgo.Handler, the uintptr value is different / invalid

Possible Solution Holding unsafe.Pointer of stateHandler in vm here https://github.com/Code-Hex/vz/blob/main/virtualization.go#L121 makes sure that this doesn’t gets GC’ed

Reproduction Test

func TestRunIssue(t *testing.T) {
	container := newVirtualizationMachine(t,
		func(vmc *VirtualMachineConfiguration) error {
			return setupConsoleConfig(vmc)
		},
	)
	defer container.Close()

	sshSession := container.NewSession(t)
	defer sshSession.Close()

	vm := container.VirtualMachine

	if got := vm.State(); VirtualMachineStateRunning != got {
		t.Fatalf("want state %v but got %v", VirtualMachineStateRunning, got)
	}
	if got := vm.CanPause(); !got {
		t.Fatal("want CanPause is true")
	}
	if err := vm.Pause(); err != nil {
		t.Fatal(err)
	}

	timeout := 5 * time.Second
	waitState(t, timeout, vm, VirtualMachineStatePausing)
	waitState(t, timeout, vm, VirtualMachineStatePaused)

	if got := vm.State(); VirtualMachineStatePaused != got {
		t.Fatalf("want state %v but got %v", VirtualMachineStatePaused, got)
	}
	if got := vm.CanResume(); !got {
		t.Fatal("want CanPause is true")
	}
	if err := vm.Resume(); err != nil {
		t.Fatal(err)
	}

	waitState(t, timeout, vm, VirtualMachineStateResuming)
	waitState(t, timeout, vm, VirtualMachineStateRunning)

	if got := vm.CanRequestStop(); !got {
		t.Fatal("want CanRequestStop is true")
	}

	ch := make(chan bool)
	runtime.SetFinalizer(&vm.stateHandle, func(self *cgo.Handle) {
		ch <- true
	})

	runtime.GC()
	select {
	case <-ch:
	case <-time.After(5 * time.Minute):
		t.Errorf("finalizer didn't run")
	}

	runtime.GC()
	sshSession.Run("poweroff")
	waitState(t, timeout, vm, VirtualMachineStateStopped)

	if got := vm.State(); VirtualMachineStateStopped != got {
		t.Fatalf("want state %v but got %v", VirtualMachineStateStopped, got)
	}
}

Test will take time as it has to wait till stateHandle gets finalised.

Do let me know if you need any more input. I hope we can reopen this issue as we now have a test to reproduce

I’m working this!

@Code-Hex, Thanks a lot. Will integrate with lima. Hopefully we should not see this issue again 😃

@Code-Hex No problem at all.

Please do enjoy your day off 🎉

Thank you for the fix 😃 I can confirm that this fixes panic that happened with in StateChangeHandler()

But even then a panic occurs if we call Status() directly after the RequestStop() is called. But unfortunately i still don’t have vz specific reproducible code. So i will keep this issue closed.

PS: Lima issue is hopefully fixed now by not using Status() method https://github.com/lima-vm/lima/pull/1494

I’ve probably fixed it. I don’t have the code to reproduce it at hand, which is worrying, but I’ll see how this goes!

If there are still problems, we expect the issue to be opened again with reproduced code.

@ryancurrah I know supported the RequestStop API from macOS 11.0+. I mentioned to Linux Kernel one. https://github.com/Code-Hex/vz/issues/124#issuecomment-1459582551

Ah that makes more sense now. Thanks!

@pmalek @ryancurrah I’m sorry I didn’t notice! I will check it on Saturday!

@ryancurrah I know supported the RequestStop API from macOS 11.0+. I mentioned to Linux Kernel one. https://github.com/Code-Hex/vz/issues/124#issuecomment-1459582551

btw, is this using Kernel which is supported RequestStop?