podman: [Bug]: Podman machine fails to start with exit status 255 on Mac
Issue Description
There seems to be an issue when trying to start Podman with podman machine start when using macOS and QEMU. I created a Podman machine about two months ago but now the machine fails to start. Somehow, starting the machine got gradually worse after a time before completely failing in startup.
Here is what happens when I try to start the machine.
podman machine start
Starting machine "podman-machine-default"
Waiting for VM ...
Mounting volume... /Users/me:/Users/me
Error: exit status 255
From the podman machine start --log-level debug output, I can see the last statement that is executed is the SSH command related to creating the mount point directories (maybe line 669 in qemu/machine.go), which fails. Previous issues have considered this to be a sign of e.g. invalid SSH configuration (e.g. https://github.com/containers/podman/issues/14237) but maybe there is something more to it. When running with log-level debug and the QEMU window open, the exit status 255 error shows before all the Fedora services have started.
Could this be a race condition issue so that SSH related services have not yet started when the SSH mount commands are executed? I found one closed and apparently fixed issue where the race condition was suggested (https://github.com/containers/podman/issues/11532).
I tried connecting to the QEMU QMP monitor socket with nc -U qmp_podman-machine-default.sock and running the following queries just after QEMU has started and before all the services are running.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 7}, "package": ""}, "capabilities": ["oob"]}}
{ "execute": "qmp_capabilities"}
{"return": {}}
{ "execute": "query-status" }
{"return": {"status": "running", "singlestep": false, "running": true}}
So the VM is in an running state and also the gvproxy port (50810 here) is in a listening state early on.
netstat -ap tcp | grep -i listen
tcp4 0 0 localhost.50810 *.* LISTEN
Is it possible that condition in qemu/machine.go (line 645) for state != machine.Running || !listening { cannot hold back the execution of the SSH statement before the machine is fully initialized?
Steps to reproduce the issue
Steps to reproduce the issue (happens on an existing Podman machine)
podman machine start
Describe the results you received
podman machine start
Starting machine "podman-machine-default"
Waiting for VM ...
Mounting volume... /Users/me:/Users/me
Error: exit status 255
Describe the results you expected
Podman machine should be able to start. Thinking more broadly, maybe there should additional guarantees that SSH on the QEMU machine is up and running before issuing commands to the machine. Perhaps the SSH connection should be polled for a couple of times with sensible timeout or then other events from QEMU used for the purpose? For instance, the QMP monitor emits a NIC_RX_FILTER_CHANGED event towards the end of the initialization (not sure of its purpose, though).
{"timestamp": {"seconds": 1675763180, "microseconds": 921089}, "event": "NIC_RX_FILTER_CHANGED", "data": {"path": "/machine/peripheral-anon/device[0]/virtio-backend"}}
podman info output
The machine (or gvproxy) is not up and running so cannot get info.
podman info
Error: failed to connect: dial tcp [::1]:50810: connect: connection refused
Some supplemental environment details to podman info .
Podman (brew): stable 4.4.0 (bottled), HEAD
OS: macOS Ventura 13.2
Architecture (QEMU): aarch64
Chip: Apple M1 Pro
Podman in a container
No
Privileged Or Rootless
Rootless
Upstream Latest Release
Yes
Additional environment details
Here are the results of podman machine inspect.
[
{
"ConfigPath": {
"Path": "/Users/me/.config/containers/podman/machine/qemu/podman-machine-default.json"
},
"ConnectionInfo": {
"PodmanSocket": {
"Path": "/Users/me/.local/share/containers/podman/machine/podman-machine-default/podman.sock"
}
},
"Created": "2022-12-12T12:34:18.127475+02:00",
"Image": {
"IgnitionFilePath": {
"Path": "/Users/me/.config/containers/podman/machine/qemu/podman-machine-default.ign"
},
"ImageStream": "testing",
"ImagePath": {
"Path": "/Users/me/.local/share/containers/podman/machine/qemu/podman-machine-default_fedora-coreos-37.20221127.2.0-qemu.aarch64.qcow2"
}
},
"LastUp": "2023-01-31T16:41:24.124649+02:00",
"Name": "podman-machine-default",
"Resources": {
"CPUs": 1,
"DiskSize": 100,
"Memory": 3072
},
"SSHConfig": {
"IdentityPath": "/Users/me/.ssh/podman-machine-default",
"Port": 50810,
"RemoteUsername": "core"
},
"State": ""
}
]
Additional information
It appears that the problem affects Podman machines that have been in use for some time (i.e. a few months). Why fresh Podman machines straight out-of-the-box work and what cause the slow decay remain a mystery.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 21
- Comments: 47 (9 by maintainers)
Commits related to this issue
- Invert state and listening logic When waiting for a VM to boot up, we check for two indicators to before mounting volumes: is the vm "running" and is the qemu socket listening. In some cases, presum... — committed to baude/podman by baude a year ago
- podman machine start: fix ready service When debugging #17403, the logs of sshd indicates that Podman tried to ssh into the machine too soon as the `core` user has not yet been fully set up: > erro... — committed to vrothberg/libpod by vrothberg a year ago
- podman machine start: fix ready service When debugging #17403, the logs of sshd indicates that Podman tried to ssh into the machine too soon as the `core` user has not yet been fully set up: > erro... — committed to openshift-cherrypick-robot/podman by vrothberg a year ago
- podman machine start: fix ready service When debugging #17403, the logs of sshd indicates that Podman tried to ssh into the machine too soon as the `core` user has not yet been fully set up: > erro... — committed to ashley-cui/podman by vrothberg a year ago
- machine start: qemu: adjust backoffs Make sure that starting a qemu machine uses proper exponential backoffs and that a single variable isn't shared across multiple backoffs. DO NOT BACKPORT: I want... — committed to vrothberg/libpod by vrothberg a year ago
- machine start: qemu: adjust backoffs Make sure that starting a qemu machine uses proper exponential backoffs and that a single variable isn't shared across multiple backoffs. DO NOT BACKPORT: I want... — committed to vrothberg/libpod by vrothberg a year ago
- machine start: qemu: wait for SSH readiness During the exponential backoff waiting for the machine to be fully up and running, also make sure that SSH is ready. The systemd dependencies of the ready... — committed to vrothberg/libpod by vrothberg a year ago
- machine start: qemu: wait for SSH readiness During the exponential backoff waiting for the machine to be fully up and running, also make sure that SSH is ready. The systemd dependencies of the ready... — committed to vrothberg/libpod by vrothberg a year ago
- machine start: qemu: wait for SSH readiness During the exponential backoff waiting for the machine to be fully up and running, also make sure that SSH is ready. The systemd dependencies of the ready... — committed to vrothberg/libpod by vrothberg a year ago
- podman machine start: fix ready service When debugging #17403, the logs of sshd indicates that Podman tried to ssh into the machine too soon as the `core` user has not yet been fully set up: > erro... — committed to cgiradkar/podman by vrothberg a year ago
- machine start: qemu: adjust backoffs Make sure that starting a qemu machine uses proper exponential backoffs and that a single variable isn't shared across multiple backoffs. DO NOT BACKPORT: I want... — committed to cgiradkar/podman by vrothberg a year ago
- machine start: qemu: wait for SSH readiness During the exponential backoff waiting for the machine to be fully up and running, also make sure that SSH is ready. The systemd dependencies of the ready... — committed to vrothberg/libpod by vrothberg a year ago
- machine start: qemu: wait for SSH readiness During the exponential backoff waiting for the machine to be fully up and running, also make sure that SSH is ready. The systemd dependencies of the ready... — committed to vrothberg/libpod by vrothberg a year ago
- podman machine start: fix ready service When debugging #17403, the logs of sshd indicates that Podman tried to ssh into the machine too soon as the `core` user has not yet been fully set up: > erro... — committed to cgiradkar/podman by vrothberg a year ago
- machine start: qemu: adjust backoffs Make sure that starting a qemu machine uses proper exponential backoffs and that a single variable isn't shared across multiple backoffs. DO NOT BACKPORT: I want... — committed to cgiradkar/podman by vrothberg a year ago
- machine start: qemu: adjust backoffs Make sure that starting a qemu machine uses proper exponential backoffs and that a single variable isn't shared across multiple backoffs. DO NOT BACKPORT: I want... — committed to ashley-cui/podman by vrothberg a year ago
- machine start: qemu: wait for SSH readiness During the exponential backoff waiting for the machine to be fully up and running, also make sure that SSH is ready. The systemd dependencies of the ready... — committed to ashley-cui/podman by vrothberg a year ago
- podman machine start: fix ready service When debugging #17403, the logs of sshd indicates that Podman tried to ssh into the machine too soon as the `core` user has not yet been fully set up: > erro... — committed to cgiradkar/podman by vrothberg a year ago
- machine start: qemu: adjust backoffs Make sure that starting a qemu machine uses proper exponential backoffs and that a single variable isn't shared across multiple backoffs. DO NOT BACKPORT: I want... — committed to cgiradkar/podman by vrothberg a year ago
- podman machine start: fix ready service When debugging #17403, the logs of sshd indicates that Podman tried to ssh into the machine too soon as the `core` user has not yet been fully set up: > erro... — committed to cgiradkar/podman by vrothberg a year ago
I’ve come up with a slightly improved version of my script (read hack) above, which waits for the VM to finish booting before letting podman mount directories.
The above script ran podman/qemu in debug mode, which slowed down performance of the running containers.
This improved version of the script runs podman/qemu in normal mode, and relies on the user waiting for the CPU utilization of the
qemuprocess to lower and stabilize (something that happens once the VM is done loading) before hittingenterand thus allowing podman to mount shared directories.@smakinen I think you are correct about this being a race condition, and some bug internal to Podman
I’m able to get podman to start a machine that was previously failing using the following shell script, which manually pauses the
podmanandgvproxyprocess (no idea what that second command does), and waits until the user verifies that the Qemu VM has finished booting before unpausing those processes.Unfortunately, this is only a bandaid. We need something to address whatever the underlying root problem is.
For me, the issue was my SSH agent with a lot of keys already.
My fix:
Yes, that’s it. The fixes will ship with the upcoming 4.6 release. The plan is to release 4.6 by the end of this week, so I the fix will reach you soon.
I kept on having issue and I am seeing some success thanks to this comment in this issue. I think the topic of this issue could be edited and the M1 part removed, I don’t think the issues discussed here are M1 specific. I run into the same on an Intel Mac. The issue is however likely related to the fact that users here use a Mac.
The podman troubleshooting guide mentions some extra cleanup so the following may help some:
then your regular:
AFAIK, the fix for me was to add the following:
to my
~/.ssh/config.I was consistently getting this
255error even from a clean rebooted system and when using a default podman machine.With the fix above of the ssh config and yet-another cleanup, I was able to finally get things back in order.
The bug is not fixed in 4.5.1, but will be part of the 4.6.0 release (see changelog)
It happened on an Intel MacBook (MacBookPro16,1) too, not just the M1. @jamesmikesell’s script works for me, as a workaround.
Hi, did you try this script: https://github.com/laurent-martin/podman_x86_64_on_apple_aach64
It is similar to @jamesmikesell script, but it automatically detect SSH availability in the VM.
I have updated the script here: https://github.com/laurent-martin/podman_x86_64_on_apple_aach64 So that it automatically fixes this dondition as well now:
Hope to fix this BIG BUG as soon as possible.
Same here, i just ran into this issue. I just temporarily unset SSH_AUTH_SOCK.
@samuel-phan didn’t know about
ssh-add -D, thx!Hitting same issue with the following:
Workarounds mentioned here did not resolve the issue (removing VM and recreating, and script). Running into same failure location as what berndlosert is seeing.
I observed the same: podman vm starting successfully only once every 4 times. but pausing podman until qemu is stable solves the situation. thanks @jamesmikesell
See the script I use podmac.sh start.
Example: