crc: CodeReady Containers becomes 'Not Reachable' due to etcd crashing and being unable to restart

General information

OS: Linux
Hypervisor: KVM
Did you run crc setup before starting it (Yes/No)? Yes

CRC version

# Put the output of `crc version`
CodeReady Containers version: 1.21.0+68a4cdd7
OpenShift version: 4.6.9 (embedded in executable)

CRC status

# Put the output of `crc status`
CRC VM:          Running
OpenShift:       Not Reachable (v4.6.9)
Disk Usage:      25.71GB of 74.6GB (Inside the CRC VM)
Cache Usage:     27.04GB
Cache Directory: /home/crcuser/.crc/cache

CRC config

# Put the output of `crc config view`
- consent-telemetry                     : no
- cpus                                  : 12
- disk-size                             : 70
- enable-cluster-monitoring             : true
- memory                                : 48000

Host Operating System

# Put the output of `cat /etc/os-release` in case of Linux
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"

Steps to reproduce

Start CRC and leave it running.
After 1-7 days (perhaps more; the interval is unclear), the CRC OpenShift API stops responding and crc status shows ‘Not Reachable’.
Some container workloads (e.g. other pods, services, and routes for applications) stay operational.
Stopping and restarting crc does not recover the cluster.

Expected

The CRC OpenShift API keeps working, so oc login and other actions against the cluster API succeed:

oc login -u developer -p developer
Login successful.

You have one project on this server: "victim"

Using project "victim".

Actual

Unable to log in with the CLI or the web console:

oc login -u developer -p developer
error: dial tcp 192.168.130.11:6443: connect: connection refused - verify you have provided the correct host and port and that the server is currently running.
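
Because the failure only shows up after days of uptime, a small probe can help record when the API port starts refusing connections. The following is a minimal sketch, assuming the API address 192.168.130.11:6443 from the error above; the function names api_port_open and watch are hypothetical and not part of crc or oc.

```python
import socket
import time
from datetime import datetime, timezone

API_HOST = "192.168.130.11"  # address taken from the oc login error above
API_PORT = 6443


def api_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


def watch(host: str, port: int, interval: float = 60.0) -> None:
    """Print a timestamped line each time reachability changes (runs forever)."""
    last = None
    while True:
        state = api_port_open(host, port)
        if state != last:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            label = "reachable" if state else "NOT reachable"
            print(f"{stamp} {host}:{port} {label}")
            last = state
        time.sleep(interval)


# Example (hypothetical usage): watch(API_HOST, API_PORT)
```

This only checks that the TCP port accepts connections; it does not authenticate or verify that the API server is healthy.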

Logs

You can start crc with crc start --log-level debug to collect logs. Link to gist with logs

About this issue

State: closed
Created 3 years ago
Comments: 37 (16 by maintainers)

Most upvoted comments

Figured it out:

crc start -p pull-secret
...
INFO Loading bundle: crc_libvirt_4.6.9.crcbundle ... 
INFO Verifying bundle crc_libvirt_4.6.9.crcbundle ... 
INFO Creating CodeReady Containers VM for OpenShift 4.7.0-0.nightly-2021-01-27-110023...

I will keep this instance running and monitor status. May take a little while to see any clear signs one way or another.

timroster on Jan 27, 2021

I experimented a bit and found that with cobra, when a parameter is passed more than once, only the last occurrence is accepted. As a result, node-ip is always empty in crc.

The kubelet code then does a lookup to get the IP: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/nodestatus/setters.go#L225

Then the result is compared to the list of all interface addresses (net.InterfaceAddrs()) to see if it’s valid. If it doesn’t match, the kubelet picks the interface of the default gateway. Can you give me the output of ifconfig and route -n?

I guess we fall into the last case.
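
The selection order described in this comment can be sketched as a small pure function. This is a simplified model of the description above, not the kubelet implementation: the function name choose_node_ip and all of its parameters are hypothetical, and the caller supplies the lookup result, the interface addresses, and the default-gateway interface address instead of the function querying the system.

```python
from ipaddress import ip_address
from typing import Iterable


def choose_node_ip(
    node_ip_flag: str,
    resolved_ips: Iterable[str],
    interface_ips: Iterable[str],
    default_route_ip: str,
) -> str:
    """Pick a node IP following the order described in the comment.

    1. A non-empty --node-ip value is used directly.
    2. Otherwise, the first looked-up address that is also assigned to a
       local interface is used.
    3. Otherwise, fall back to the address of the default-gateway interface.
    """
    if node_ip_flag:
        return str(ip_address(node_ip_flag))

    local = {ip_address(ip) for ip in interface_ips}
    for candidate in resolved_ips:
        if ip_address(candidate) in local:
            return str(ip_address(candidate))

    # "The last case": nothing matched, so use the default-gateway interface.
    return str(ip_address(default_route_ip))
```

For illustration only: with an empty flag, a lookup result of 192.168.126.11, and a single interface holding 192.168.130.11, this model returns the default-gateway interface address. Whether those are the actual values inside the VM is exactly what the ifconfig and route -n output would show.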

guillaumerose on Jan 27, 2021

Hi @timroster, the release is almost out. Can you try with http://mirror.openshift.com/pub/openshift-v4/clients/crc/1.22.0/ ? Thanks.

guillaumerose on Feb 10, 2021

I think I may have found the issue. We define the same parameter, node-ip, twice in /etc/systemd/system/kubelet.service:

ExecStart=/usr/bin/hyperkube \
    kubelet \
      --node-ip=192.168.126.11 \
      --config=/etc/kubernetes/kubelet.conf \
      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --container-runtime=remote \
      --container-runtime-endpoint=/var/run/crio/crio.sock \
      --runtime-cgroups=/system.slice/crio.service \
      --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \
      --node-ip=${KUBELET_NODE_IP} \

I don’t know where this KUBELET_NODE_IP env variable is coming from, but it is definitely suspect!
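
A repeated flag like this can be spotted mechanically. The following is a minimal sketch, assuming the "last occurrence wins" behavior reported for cobra elsewhere in this thread; the helper names are hypothetical, and the parser only understands the --name=value form used in the excerpt above.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Excerpt of the ExecStart line quoted above (it is truncated in the source too).
EXEC_START = r"""/usr/bin/hyperkube \
    kubelet \
      --node-ip=192.168.126.11 \
      --config=/etc/kubernetes/kubelet.conf \
      --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \
      --node-ip=${KUBELET_NODE_IP} \
"""


def collect_flags(exec_start: str) -> Dict[str, List[str]]:
    """Map each --name=value flag to every value it was given, in order."""
    joined = exec_start.replace("\\\n", " ")  # undo backslash line continuations
    seen: Dict[str, List[str]] = defaultdict(list)
    for token in joined.split():
        if token.startswith("--") and "=" in token:
            name, value = token[2:].split("=", 1)
            seen[name].append(value)
    return dict(seen)


def duplicated_flags(exec_start: str) -> Dict[str, List[str]]:
    """Return only the flags that appear more than once."""
    return {n: v for n, v in collect_flags(exec_start).items() if len(v) > 1}


def effective_value(exec_start: str, name: str) -> Optional[str]:
    """Value a last-wins parser would keep, or None if the flag is absent."""
    values = collect_flags(exec_start).get(name)
    return values[-1] if values else None


print(duplicated_flags(EXEC_START))
print(effective_value(EXEC_START, "node-ip"))
```

Under that assumption the effective node-ip is the unexpanded ${KUBELET_NODE_IP}. systemd substitutes an empty string for an unset variable, so if KUBELET_NODE_IP is not defined when the unit starts, the kubelet would receive an empty node-ip.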

The second parameter was introduced in OpenShift 4.6. https://github.com/openshift/machine-config-operator/commit/0b1b2d5b10751e41af79d2d75705ca03589a1f7e

guillaumerose on Jan 27, 2021