podman: [Bug]: Podman machine fails to start with exit status 255 on Mac

Issue Description

There seems to be an issue starting Podman with podman machine start on macOS with QEMU. I created a Podman machine about two months ago, but now the machine fails to start. Startup gradually became less reliable over time before failing completely.

Here is what happens when I try to start the machine.

podman machine start 
Starting machine "podman-machine-default"
Waiting for VM ...
Mounting volume... /Users/me:/Users/me
Error: exit status 255

From the podman machine start --log-level debug output, I can see that the last statement executed is the SSH command that creates the mount point directories (around line 669 in qemu/machine.go), and that it fails. Previous issues have treated this as a sign of, e.g., an invalid SSH configuration (e.g. https://github.com/containers/podman/issues/14237), but maybe there is more to it. When running with the debug log level and the QEMU window open, the exit status 255 error appears before all the Fedora services have started.

Could this be a race condition, where the SSH-related services have not yet started when the SSH mount commands are executed? I found one closed and apparently fixed issue where such a race condition was suggested (https://github.com/containers/podman/issues/11532).

I tried connecting to the QEMU QMP monitor socket with nc -U qmp_podman-machine-default.sock and running the following queries just after QEMU has started and before all the services are running.

{"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 7}, "package": ""}, "capabilities": ["oob"]}}
{ "execute": "qmp_capabilities"} 
{"return": {}}
{ "execute": "query-status" }
{"return": {"status": "running", "singlestep": false, "running": true}}

So the VM is in a running state, and the gvproxy port (50810 here) is also in a listening state early on.

netstat -ap tcp | grep -i listen
tcp4       0      0  localhost.50810        *.*                    LISTEN

Is it possible that the condition in qemu/machine.go (line 645), for state != machine.Running || !listening {, cannot hold back the execution of the SSH statement until the machine is fully initialized?
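To make the suspected gap concrete (this is my assumption about the failure mode, not confirmed Podman behavior): gvproxy opens the forwarded port as soon as it starts, so a TCP-level listening check passes while sshd inside the guest is still booting. Reading the SSH version banner distinguishes the two states. A rough sketch using this machine's forwarded port 50810:

```shell
#!/bin/sh
# ssh_ready succeeds only when an actual sshd answers on the port: a
# plain TCP connect (what a listening check sees) is not enough,
# because sshd greets every client with an "SSH-2.0-..." banner first.
ssh_ready() {
  host="$1"; port="$2"
  banner=$(nc -w 2 "$host" "$port" </dev/null 2>/dev/null | head -c 8)
  [ "$banner" = "SSH-2.0-" ]
}

if ssh_ready localhost 50810; then
  echo "sshd is answering"
else
  echo "port may be listening, but sshd is not answering yet"
fi
```

If something like this were run right after the query-status check above, it would show whether the guest's sshd is actually reachable at the moment the mount commands are issued.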

Steps to reproduce the issue

This happens on an existing Podman machine:

  1. podman machine start

Describe the results you received

podman machine start 
Starting machine "podman-machine-default"
Waiting for VM ...
Mounting volume... /Users/me:/Users/me
Error: exit status 255

Describe the results you expected

The Podman machine should be able to start. Thinking more broadly, maybe there should be additional guarantees that SSH on the QEMU machine is up and running before commands are issued to it. Perhaps the SSH connection should be polled a few times with a sensible timeout, or other events from QEMU could be used for this purpose? For instance, the QMP monitor emits a NIC_RX_FILTER_CHANGED event towards the end of initialization (I am not sure of its purpose, though).

{"timestamp": {"seconds": 1675763180, "microseconds": 921089}, "event": "NIC_RX_FILTER_CHANGED", "data": {"path": "/machine/peripheral-anon/device[0]/virtio-backend"}}
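As a sketch of the suggested polling (hypothetical, not Podman's actual code): a small retry helper that re-runs a probe command a few times with a delay instead of failing on the first refused connection. The attempt count, delay, and the commented ssh probe are all arbitrary choices based on this report's port and identity file:

```shell
#!/bin/sh
# retry runs a probe command up to N times with a fixed delay between
# attempts, returning success as soon as the probe succeeds.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0       # probe succeeded
    i=$((i + 1))
    sleep "$delay"
  done
  return 1                 # still failing after all attempts
}

# Hypothetical usage against the machine's forwarded SSH port:
# retry 10 2 ssh -o ConnectTimeout=2 -i ~/.ssh/podman-machine-default \
#   -p 50810 core@localhost true
```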

podman info output

The machine (or gvproxy) is not up and running, so podman info is unavailable.

podman info   
Error: failed to connect: dial tcp [::1]:50810: connect: connection refused

Some supplemental environment details in lieu of podman info:

Podman (brew): stable 4.4.0 (bottled), HEAD
OS: macOS Ventura 13.2
Architecture (QEMU): aarch64
Chip: Apple M1 Pro

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

Here are the results of podman machine inspect.

[
  {
    "ConfigPath": {
      "Path": "/Users/me/.config/containers/podman/machine/qemu/podman-machine-default.json"
    },
    "ConnectionInfo": {
      "PodmanSocket": {
        "Path": "/Users/me/.local/share/containers/podman/machine/podman-machine-default/podman.sock"
      }
    },
    "Created": "2022-12-12T12:34:18.127475+02:00",
    "Image": {
      "IgnitionFilePath": {
        "Path": "/Users/me/.config/containers/podman/machine/qemu/podman-machine-default.ign"
      },
      "ImageStream": "testing",
      "ImagePath": {
        "Path": "/Users/me/.local/share/containers/podman/machine/qemu/podman-machine-default_fedora-coreos-37.20221127.2.0-qemu.aarch64.qcow2"
      }
    },
    "LastUp": "2023-01-31T16:41:24.124649+02:00",
    "Name": "podman-machine-default",
    "Resources": {
      "CPUs": 1,
      "DiskSize": 100,
      "Memory": 3072
    },
    "SSHConfig": {
      "IdentityPath": "/Users/me/.ssh/podman-machine-default",
      "Port": 50810,
      "RemoteUsername": "core"
    },
    "State": ""
  }
]

Additional information

It appears that the problem affects Podman machines that have been in use for some time (i.e., a few months). Why fresh out-of-the-box Podman machines work and what causes the slow decay remain a mystery.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 21
  • Comments: 47 (9 by maintainers)

Most upvoted comments

I’ve come up with a slightly improved version of my script (read hack) above, which waits for the VM to finish booting before letting podman mount directories.

The above script ran podman/qemu in debug mode, which slowed down performance of the running containers.

This improved version of the script runs podman/qemu in normal mode, and relies on the user waiting for the CPU utilization of the qemu process to lower and stabilize (something that happens once the VM is done loading) before hitting enter and thus allowing podman to mount shared directories.

#!/bin/bash

# kill any previously running podman/qemu processes
pkill podman
pkill qemu

# start the machine in the background and give it a moment to spawn qemu
podman machine start &
PID=$!
sleep 2

# pause podman and gvproxy before the mount commands are issued
pkill -STOP podman
pkill -STOP gvproxy

echo "^^^^"
echo "^^^^"
echo "^^^^"
echo "If the above says the VM already running or started"
echo "then edit the json file located at ~/.config/containers/podman/machine/qemu/"
echo "and change the line"
echo "\"Starting\": true"
echo "to be"
echo "\"Starting\": false"
echo ""
echo "don't forget to save, and rerun this script."
echo ""
echo "Else, continue with instructions below"
echo ""
echo "Wait until the displayed CPU utilization lowers and stabilizes to 1% or less"
echo "Then hit enter"


PID_QEMU=$(pgrep qemu)

while true; do
  CPU=$(ps -p "$PID_QEMU" -o %cpu | awk 'NR>1 {print $1}')
  printf "\rCPU utilization: %s%%             " "$CPU"

  # read returns 0 only if the user hits enter within the 1-second timeout
  if read -s -t 1; then
    break
  fi
done


pkill -CONT podman
pkill -CONT gvproxy

@smakinen I think you are correct about this being a race condition and about there being some bug internal to Podman.

I’m able to get podman to start a previously failing machine using the following shell script, which manually pauses the podman and gvproxy processes (no idea what that second process does) and waits until the user verifies that the QEMU VM has finished booting before unpausing them.

#!/bin/bash

# kill any previously running podman/qemu processes
pkill podman
pkill qemu

# start the machine in the background and give it a moment to spawn qemu
podman machine start --log-level debug &
PID=$!
sleep 2

# pause podman and gvproxy before the mount commands are issued
pkill -STOP podman
pkill -STOP gvproxy

echo "^^^^"
echo "^^^^"
echo "^^^^"
echo "If the above says the VM already running or started"
echo "then edit the json file located at ~/.config/containers/podman/machine/qemu/"
echo "and change the line"
echo "\"Starting\": true"
echo "to be"
echo "\"Starting\": false"
echo ""
echo "don't forget to save, and rerun this script."
echo ""
echo "Else, continue with instructions below"
echo ""
echo "Qemu will open in another window (likely in the background)"
echo "wait until you see a login prompt on that window"
read -p "then return to THIS terminal and hit enter"

pkill -CONT podman
pkill -CONT gvproxy

Unfortunately, this is only a bandaid. We need something to address whatever the underlying root problem is.

For me, the issue was my SSH agent with a lot of keys already.

My fix:

ssh-add -D  # clear the SSH keys from my SSH agent

podman machine stop
podman machine start
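A plausible explanation for why this fix works (my assumption, in line with the IdentitiesOnly workaround from issue #14237): sshd's default MaxAuthTries is 6, so an agent offering many keys can get the connection dropped with "Too many authentication failures" before podman's machine key is ever tried, and ssh reports such connection failures as exit status 255. A quick way to check how many keys your agent is offering:

```shell
#!/bin/sh
# Count the keys currently loaded in the SSH agent. If this is near or
# above sshd's default MaxAuthTries (6), the agent's keys alone can
# exhaust the server's attempt limit before the right key is offered.
# Prints 0 when no agent is running.
n=$(ssh-add -l 2>/dev/null | grep -c '^[0-9]')
echo "agent is offering $n key(s)"
```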

Yes, that’s it. The fixes will ship with the upcoming 4.6 release. The plan is to release 4.6 by the end of this week, so the fix will reach you soon.

I kept on having issues, and I am seeing some success thanks to this comment in this issue. I think the title of this issue could be edited and the M1 part removed; I don’t think the issues discussed here are M1-specific. I ran into the same on an Intel Mac. The issue is, however, likely related to the fact that users here are on a Mac.

The podman troubleshooting guide mentions some extra cleanup, so the following may help some:

podman machine stop
podman machine rm -f
rm -rf ~/.local/share/containers/podman
rm -rf ~/.config/containers/

then your regular:

podman machine init
podman machine start

AFAIK, the fix for me was to add the following:

# to fix podman issues
# see https://github.com/containers/podman/issues/14237
Host localhost
  IdentitiesOnly yes

to my ~/.ssh/config.

I was consistently getting this 255 error even from a clean rebooted system and when using a default podman machine.

With the fix above of the ssh config and yet-another cleanup, I was able to finally get things back in order.

The bug is not fixed in 4.5.1 but will be part of the 4.6.0 release (see the changelog).

It happened on an Intel MacBook (MacBookPro16,1) too, not just the M1. @jamesmikesell’s script works for me as a workaround.

I’ve tried adding a 1-second sleep, and also putting the SSH portion into a for loop to retry on failure.

Hi, did you try this script: https://github.com/laurent-martin/podman_x86_64_on_apple_aach64

It is similar to @jamesmikesell’s script, but it automatically detects SSH availability in the VM.

I also had to change “Starting” to false in the JSON file.
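The hand-edit of "Starting": true mentioned in the scripts above can also be done non-interactively. A sketch using jq (which podmac.sh already checks for); the config path is taken from the inspect output earlier in this issue, so adjust the machine name if yours differs:

```shell
#!/bin/sh
# reset_starting flips a stale "Starting": true back to false in a
# machine config JSON, so podman no longer believes a start is already
# in progress. Requires jq.
reset_starting() {
  cfg="$1"
  tmp=$(mktemp)
  jq '.Starting = false' "$cfg" > "$tmp" && mv "$tmp" "$cfg"
}

# Hypothetical usage (path from the inspect output above):
# reset_starting ~/.config/containers/podman/machine/qemu/podman-machine-default.json
```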

I have updated the script here: https://github.com/laurent-martin/podman_x86_64_on_apple_aach64 so that it now automatically fixes this condition as well:

./podmac.sh start intel_64
Checking jq...OK, jq found
Checking podman...OK, podman found
Checking curl...OK, curl found
Resetting stale starting state.         <--------------------
Starting machine "intel_64"
Waiting for VM ...
Pausing podman
SSH available: Resuming podman.
Mounting volume... /Users/laurent:/Users/laurent

This machine is currently configured in rootless mode. If your containers
require root permissions (e.g. ports < 1024), or if you run into compatibility
issues with non-podman clients, you can switch using the following command:

	podman machine set --rootful intel_64

API forwarding listening on: /var/run/docker.sock
Docker API clients default to this address. You do not need to set DOCKER_HOST.

Machine "intel_64" started successfully

I hope this BIG BUG gets fixed as soon as possible.

For me, the issue was my SSH agent with a lot of keys already.

Same here, I just ran into this issue. I temporarily unset SSH_AUTH_SOCK as a workaround.

@samuel-phan I didn’t know about ssh-add -D, thx!

Hitting same issue with the following:

  • A1990 (2019 Intel)
  • podman 4.3.1 and 4.4.4 (upgrade did not help)

The workarounds mentioned here (removing and recreating the VM, and the script) did not resolve the issue. I am running into the same failure location as berndlosert.

I observed the same: the podman VM started successfully only about once every 4 attempts, but pausing podman until qemu is stable resolves the situation. Thanks @jamesmikesell

See the script I use: podmac.sh start.

Example:

$ ./podmac.sh start intel_64

Starting machine "intel_64"
Waiting for VM ...
Pausing podman
Waiting for SSH
SSH-2.0-OpenSSH_8.8
Resuming podman
...
Machine "intel_64" started successfully