podman: [Bug]: Podman machine fails to start with exit status 255 on Mac

Issue Description

There seems to be an issue starting Podman with podman machine start on macOS with QEMU. I created a Podman machine about two months ago, but now the machine fails to start. Startup gradually became less reliable over time before failing completely.

Here is what happens when I try to start the machine.

podman machine start 
Starting machine "podman-machine-default"
Waiting for VM ...
Mounting volume... /Users/me:/Users/me
Error: exit status 255

From the podman machine start --log-level debug output, I can see that the last statement executed is the SSH command that creates the mount point directories (around line 669 in qemu/machine.go), and that it fails. Previous issues have treated this as a sign of, e.g., an invalid SSH configuration (e.g. https://github.com/containers/podman/issues/14237), but maybe there is more to it. When running with the debug log level and the QEMU window open, the exit status 255 error appears before all the Fedora services have started.

Could this be a race condition, where the SSH-related services have not yet started when the SSH mount commands are executed? I found one closed and apparently fixed issue where such a race condition was suggested (https://github.com/containers/podman/issues/11532).

I tried connecting to the QEMU QMP monitor socket with nc -U qmp_podman-machine-default.sock and running the following queries just after QEMU has started and before all the services are running.

{"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 7}, "package": ""}, "capabilities": ["oob"]}}
{ "execute": "qmp_capabilities"} 
{"return": {}}
{ "execute": "query-status" }
{"return": {"status": "running", "singlestep": false, "running": true}}

So the VM is in a running state, and the gvproxy port (50810 here) is also in a listening state early on.

netstat -ap tcp | grep -i listen
tcp4       0      0  localhost.50810        *.*                    LISTEN

Is it possible that the condition in qemu/machine.go (line 645), for state != machine.Running || !listening {, cannot hold back the execution of the SSH statement until the machine is fully initialized?
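To make the suspected gap concrete (this is my assumption about the failure mode, not confirmed Podman behavior): gvproxy opens the forwarded port as soon as it starts, so a TCP-level listening check passes while sshd inside the guest is still booting. Reading the SSH version banner distinguishes the two states. A rough sketch using this machine's forwarded port 50810:

```shell
#!/bin/sh
# ssh_ready succeeds only when an actual sshd answers on the port: a
# plain TCP connect (what a listening check sees) is not enough,
# because sshd greets every client with an "SSH-2.0-..." banner first.
ssh_ready() {
  host="$1"; port="$2"
  banner=$(nc -w 2 "$host" "$port" </dev/null 2>/dev/null | head -c 8)
  [ "$banner" = "SSH-2.0-" ]
}

if ssh_ready localhost 50810; then
  echo "sshd is answering"
else
  echo "port may be listening, but sshd is not answering yet"
fi
```

If something like this were run right after the query-status check above, it would show whether the guest's sshd is actually reachable at the moment the mount commands are issued.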

Steps to reproduce the issue

This happens on an existing Podman machine:

  1. podman machine start

Describe the results you received

podman machine start 
Starting machine "podman-machine-default"
Waiting for VM ...
Mounting volume... /Users/me:/Users/me
Error: exit status 255

Describe the results you expected

The Podman machine should be able to start. Thinking more broadly, maybe there should be additional guarantees that SSH on the QEMU machine is up and running before commands are issued to it. Perhaps the SSH connection should be polled a few times with a sensible timeout, or other events from QEMU could be used for this purpose? For instance, the QMP monitor emits a NIC_RX_FILTER_CHANGED event towards the end of initialization (I am not sure of its purpose, though).

{"timestamp": {"seconds": 1675763180, "microseconds": 921089}, "event": "NIC_RX_FILTER_CHANGED", "data": {"path": "/machine/peripheral-anon/device[0]/virtio-backend"}}
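As a sketch of the suggested polling (hypothetical, not Podman's actual code): a small retry helper that re-runs a probe command a few times with a delay instead of failing on the first refused connection. The attempt count, delay, and the commented ssh probe are all arbitrary choices based on this report's port and identity file:

```shell
#!/bin/sh
# retry runs a probe command up to N times with a fixed delay between
# attempts, returning success as soon as the probe succeeds.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0       # probe succeeded
    i=$((i + 1))
    sleep "$delay"
  done
  return 1                 # still failing after all attempts
}

# Hypothetical usage against the machine's forwarded SSH port:
# retry 10 2 ssh -o ConnectTimeout=2 -i ~/.ssh/podman-machine-default \
#   -p 50810 core@localhost true
```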

podman info output

The machine (or gvproxy) is not up and running, so podman info is unavailable.

podman info   
Error: failed to connect: dial tcp [::1]:50810: connect: connection refused

Some supplemental environment details in lieu of podman info:

Podman (brew): stable 4.4.0 (bottled), HEAD
OS: macOS Ventura 13.2
Architecture (QEMU): aarch64
Chip: Apple M1 Pro

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

Here are the results of podman machine inspect.

[
  {
    "ConfigPath": {
      "Path": "/Users/me/.config/containers/podman/machine/qemu/podman-machine-default.json"
    },
    "ConnectionInfo": {
      "PodmanSocket": {
        "Path": "/Users/me/.local/share/containers/podman/machine/podman-machine-default/podman.sock"
      }
    },
    "Created": "2022-12-12T12:34:18.127475+02:00",
    "Image": {
      "IgnitionFilePath": {
        "Path": "/Users/me/.config/containers/podman/machine/qemu/podman-machine-default.ign"
      },
      "ImageStream": "testing",
      "ImagePath": {
        "Path": "/Users/me/.local/share/containers/podman/machine/qemu/podman-machine-default_fedora-coreos-37.20221127.2.0-qemu.aarch64.qcow2"
      }
    },
    "LastUp": "2023-01-31T16:41:24.124649+02:00",
    "Name": "podman-machine-default",
    "Resources": {
      "CPUs": 1,
      "DiskSize": 100,
      "Memory": 3072
    },
    "SSHConfig": {
      "IdentityPath": "/Users/me/.ssh/podman-machine-default",
      "Port": 50810,
      "RemoteUsername": "core"
    },
    "State": ""
  }
]

Additional information

It appears that the problem affects Podman machines that have been in use for some time (i.e., a few months). Why fresh out-of-the-box Podman machines work and what causes the slow decay remain a mystery.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 21
  • Comments: 47 (9 by maintainers)

Most upvoted comments

I’ve come up with a slightly improved version of my script (read hack) above, which waits for the VM to finish booting before letting podman mount directories.

The above script ran podman/qemu in debug mode, which slowed down performance of the running containers.

This improved version of the script runs podman/qemu in normal mode, and relies on the user waiting for the CPU utilization of the qemu process to lower and stabilize (something that happens once the VM is done loading) before hitting enter and thus allowing podman to mount shared directories.

#!/bin/bash

# kill any previously running podman/qemu processes
pkill podman
pkill qemu

# start the machine in the background and give it a moment to spawn qemu
podman machine start &
PID=$!
sleep 2

# pause podman and gvproxy before the mount commands are issued
pkill -STOP podman
pkill -STOP gvproxy

echo "^^^^"
echo "^^^^"
echo "^^^^"
echo "If the above says the VM already running or started"
echo "then edit the json file located at ~/.config/containers/podman/machine/qemu/"
echo "and change the line"
echo "\"Starting\": true"
echo "to be"
echo "\"Starting\": false"
echo ""
echo "don't forget to save, and rerun this script."
echo ""
echo "Else, continue with instructions below"
echo ""
echo "Wait until the displayed CPU utilization lowers and stabilizes to 1% or less"
echo "Then hit enter"


PID_QEMU=$(pgrep qemu)

while true; do
  CPU=$(ps -p "$PID_QEMU" -o %cpu | awk 'NR>1 {print $1}')
  printf "\rCPU utilization: %s%%             " "$CPU"

  # read returns 0 only if the user hits enter within the 1-second timeout
  if read -s -t 1; then
    break
  fi
done


pkill -CONT podman
pkill -CONT gvproxy

@smakinen I think you are correct about this being a race condition and about there being some bug internal to Podman.

I’m able to get podman to start a previously failing machine using the following shell script, which manually pauses the podman and gvproxy processes (no idea what that second process does) and waits until the user verifies that the QEMU VM has finished booting before unpausing them.

#!/bin/bash

# kill any previously running podman/qemu processes
pkill podman
pkill qemu

# start the machine in the background and give it a moment to spawn qemu
podman machine start --log-level debug &
PID=$!
sleep 2

# pause podman and gvproxy before the mount commands are issued
pkill -STOP podman
pkill -STOP gvproxy

echo "^^^^"
echo "^^^^"
echo "^^^^"
echo "If the above says the VM already running or started"
echo "then edit the json file located at ~/.config/containers/podman/machine/qemu/"
echo "and change the line"
echo "\"Starting\": true"
echo "to be"
echo "\"Starting\": false"
echo ""
echo "don't forget to save, and rerun this script."
echo ""
echo "Else, continue with instructions below"
echo ""
echo "Qemu will open in another window (likely in the background)"
echo "wait until you see a login prompt on that window"
read -p "then return to THIS terminal and hit enter"

pkill -CONT podman
pkill -CONT gvproxy

Unfortunately, this is only a bandaid. We need something to address whatever the underlying root problem is.

For me, the issue was my SSH agent with a lot of keys already.

My fix:

ssh-add -D  # clear the SSH keys from my SSH agent

podman machine stop
podman machine start
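A plausible explanation for why this fix works (my assumption, in line with the IdentitiesOnly workaround from issue #14237): sshd's default MaxAuthTries is 6, so an agent offering many keys can get the connection dropped with "Too many authentication failures" before podman's machine key is ever tried, and ssh reports such connection failures as exit status 255. A quick way to check how many keys your agent is offering:

```shell
#!/bin/sh
# Count the keys currently loaded in the SSH agent. If this is near or
# above sshd's default MaxAuthTries (6), the agent's keys alone can
# exhaust the server's attempt limit before the right key is offered.
# Prints 0 when no agent is running.
n=$(ssh-add -l 2>/dev/null | grep -c '^[0-9]')
echo "agent is offering $n key(s)"
```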

Yes, that’s it. The fixes will ship with the upcoming 4.6 release. The plan is to release 4.6 by the end of this week, so the fix will reach you soon.

I kept on having issues, and I am seeing some success thanks to this comment in this issue. I think the title of this issue could be edited and the M1 part removed; I don’t think the issues discussed here are M1-specific. I ran into the same on an Intel Mac. The issue is, however, likely related to the fact that users here are on a Mac.

The podman troubleshooting guide mentions some extra cleanup, so the following may help some:

podman machine stop
podman machine rm -f
rm -rf ~/.local/share/containers/podman
rm -rf ~/.config/containers/

then your regular:

podman machine init
podman machine start

AFAIK, the fix for me was to add the following:

# to fix podman issues
# see https://github.com/containers/podman/issues/14237
Host localhost
  IdentitiesOnly yes

to my ~/.ssh/config.

I was consistently getting this 255 error even from a clean rebooted system and when using a default podman machine.

With the fix above of the ssh config and yet-another cleanup, I was able to finally get things back in order.

The bug is not fixed in 4.5.1 but will be part of the 4.6.0 release (see the changelog).

It happened on an Intel MacBook (MacBookPro16,1) too, not just the M1. @jamesmikesell’s script works for me as a workaround.

I’ve tried adding a 1-second sleep, and also putting the SSH portion into a for loop to retry on failure.

Hi, did you try this script: https://github.com/laurent-martin/podman_x86_64_on_apple_aach64

It is similar to @jamesmikesell’s script, but it automatically detects SSH availability in the VM.

I also had to change “Starting” to false in the JSON file.
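The hand-edit of "Starting": true mentioned in the scripts above can also be done non-interactively. A sketch using jq (which podmac.sh already checks for); the config path is taken from the inspect output earlier in this issue, so adjust the machine name if yours differs:

```shell
#!/bin/sh
# reset_starting flips a stale "Starting": true back to false in a
# machine config JSON, so podman no longer believes a start is already
# in progress. Requires jq.
reset_starting() {
  cfg="$1"
  tmp=$(mktemp)
  jq '.Starting = false' "$cfg" > "$tmp" && mv "$tmp" "$cfg"
}

# Hypothetical usage (path from the inspect output above):
# reset_starting ~/.config/containers/podman/machine/qemu/podman-machine-default.json
```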

I have updated the script here: https://github.com/laurent-martin/podman_x86_64_on_apple_aach64 so that it now automatically fixes this condition as well:

./podmac.sh start intel_64
Checking jq...OK, jq found
Checking podman...OK, podman found
Checking curl...OK, curl found
Resetting stale starting state.         <--------------------
Starting machine "intel_64"
Waiting for VM ...
Pausing podman
SSH available: Resuming podman.
Mounting volume... /Users/laurent:/Users/laurent

This machine is currently configured in rootless mode. If your containers
require root permissions (e.g. ports < 1024), or if you run into compatibility
issues with non-podman clients, you can switch using the following command:

	podman machine set --rootful intel_64

API forwarding listening on: /var/run/docker.sock
Docker API clients default to this address. You do not need to set DOCKER_HOST.

Machine "intel_64" started successfully

I hope this BIG BUG gets fixed as soon as possible.

For me, the issue was my SSH agent with a lot of keys already.

Same here, I just ran into this issue. I temporarily unset SSH_AUTH_SOCK as a workaround.

@samuel-phan I didn’t know about ssh-add -D, thx!

Hitting same issue with the following:

  • A1990 (2019 Intel)
  • podman 4.3.1 and 4.4.4 (upgrade did not help)

The workarounds mentioned here (removing and recreating the VM, and the script) did not resolve the issue. I am running into the same failure location as berndlosert.

I observed the same: the podman VM started successfully only about once every 4 attempts, but pausing podman until qemu is stable resolves the situation. Thanks @jamesmikesell

See the script I use: podmac.sh start.

Example:

$ ./podmac.sh start intel_64

Starting machine "intel_64"
Waiting for VM ...
Pausing podman
Waiting for SSH
SSH-2.0-OpenSSH_8.8
Resuming podman
...
Machine "intel_64" started successfully