origin: cluster up fails with "getsockopt: connection refused ()"

I am facing a problem with oc cluster up when using it in nested virtualization environments, for example a RHEL7 VM in which I run a CentOS VM on which I deploy the cluster. Deployment sometimes goes well, but in about 90% of cases it fails with getsockopt: connection refused (). It is also reproducible with v3.9.0, although the error looks a little different there.

Version

v3.11.0, v3.10.0, v3.9.0

Steps To Reproduce
  1. oc cluster up --public-hostname 192.168.42.18 --routing-suffix 192.168.42.18.nip.io --base-dir /var/lib/minishift/base
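In case the command needs to be re-run by hand for debugging, roughly the same invocation (taken from the ssh command shown in the output below) can be repeated inside the Minishift VM with higher verbosity. This is only a sketch: it assumes a user that can talk to the Docker daemon, and --loglevel is the generic oc verbosity flag, nothing specific to this setup.

  minishift ssh
  /var/lib/minishift/bin/oc cluster up \
      --public-hostname 192.168.42.18 \
      --routing-suffix 192.168.42.18.nip.io \
      --base-dir /var/lib/minishift/base \
      --loglevel=5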
Current Result

v3.10.0:

-- Starting OpenShift cluster ..................................................................................Error during 'cluster up' execution: Error starting the cluster. ssh command error:
command : /var/lib/minishift/bin/oc cluster up --public-hostname 192.168.42.18 --routing-suffix 192.168.42.18.nip.io --base-dir /var/lib/minishift/base
err     : exit status 1
output  : Getting a Docker client ...
Checking if image openshift/origin-control-plane:v3.10 is available ...
Pulling image openshift/origin-control-plane:v3.10
Image pull complete
Pulling image openshift/origin-cli:v3.10
Pulled 1/4 layers, 32% complete
Pulled 2/4 layers, 51% complete
Pulled 3/4 layers, 94% complete
Pulled 4/4 layers, 100% complete
Extracting
Image pull complete
Pulling image openshift/origin-node:v3.10
Pulled 5/6 layers, 85% complete
Pulled 6/6 layers, 100% complete
Extracting
Image pull complete
Checking type of volume mount ...
Determining server IP ...
Using public hostname IP 192.168.42.18 as the host IP
Checking if OpenShift is already running ...
Checking for supported Docker version (=>1.22) ...
Checking if insecured registry is configured properly in Docker ...
Checking if required ports are available ...
Checking if OpenShift client is configured properly ...
Checking if image openshift/origin-control-plane:v3.10 is available ...
Starting OpenShift using openshift/origin-control-plane:v3.10 ...
I0809 07:34:08.277022    2024 config.go:42] Running "create-master-config"
I0809 07:34:34.990393    2024 config.go:46] Running "create-node-config"
I0809 07:34:38.233104    2024 flags.go:30] Running "create-kubelet-flags"
I0809 07:34:40.487997    2024 run_kubelet.go:48] Running "start-kubelet"
I0809 07:34:41.401100    2024 run_self_hosted.go:172] Waiting for the kube-apiserver to be ready ...
E0809 07:39:42.208988    2024 run_self_hosted.go:542] API server error: Get https://192.168.42.18:8443/healthz?timeout=32s: dial tcp 192.168.42.18:8443: getsockopt: connection refused ()
Error: timed out waiting for the condition

v3.9:

[hudson@agajdosi-test1 ~]$ minishift start
-- Starting profile 'minishift'
[...]
   Version: v3.9.0
-- Pulling the Openshift Container Image ........................................ OK
-- Copying oc binary from the OpenShift container image to VM ... OK
-- Starting OpenShift cluster ...........................Error during 'cluster up' execution: Error starting the cluster. ssh command error:
command : /var/lib/minishift/bin/oc cluster up --use-existing-config --host-config-dir /var/lib/minishift/openshift.local.config --host-data-dir /var/lib/minishift/hostdata --host-volumes-dir /var/lib/minishift/openshift.local.volumes --host-pv-dir /var/lib/minishift/openshift.local.pv --public-hostname 192.168.42.206 --routing-suffix 192.168.42.206.nip.io
err     : exit status 1
output  : Using nsenter mounter for OpenShift volumes
Using public hostname IP 192.168.42.206 as the host IP
Using 192.168.42.206 as the server IP
Starting OpenShift using openshift/origin:v3.9.0 ...
-- Starting OpenShift container ... 
   Creating initial OpenShift configuration
   Starting OpenShift using container 'origin'
   Waiting for API server to start listening
FAIL
   Error: timed out waiting for OpenShift container "origin" 
   WARNING: 192.168.42.206:8443 may be blocked by firewall rules
   Details:
     Last 10 lines of "origin" container log:
     E0807 13:04:13.932270    2468 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: Get https://127.0.0.1:8443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager: net/http: TLS handshake timeout
     E0807 13:04:15.511476    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/cmd/kube-scheduler/app/server.go:594: Failed to list *v1.Pod: Get https://127.0.0.1:8443/api/v1/pods?fieldSelector=spec.schedulerName%3Ddefault-scheduler%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.713451    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1beta1.ReplicaSet: Get https://127.0.0.1:8443/apis/extensions/v1beta1/replicasets?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.784421    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Node: Get https://127.0.0.1:8443/api/v1/nodes?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.787247    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1beta1.StatefulSet: Get https://127.0.0.1:8443/apis/apps/v1beta1/statefulsets?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.793474    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1beta1.PodDisruptionBudget: Get https://127.0.0.1:8443/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.795902    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.PersistentVolume: Get https://127.0.0.1:8443/api/v1/persistentvolumes?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.798232    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Service: Get https://127.0.0.1:8443/api/v1/services?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.802930    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.PersistentVolumeClaim: Get https://127.0.0.1:8443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.805170    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ReplicationController: Get https://127.0.0.1:8443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: net/http: TLS handshake timeout


   Solution:
     Ensure that you can access 192.168.42.206:8443 from your machine
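A quick way to verify that suggestion from inside the guest VM; a sketch only, where the IP is the one reported above and ss/curl are used purely as generic reachability checks:

  # is anything listening on the API server port?
  sudo ss -tlnp | grep 8443
  # does the health endpoint answer on the public hostname?
  curl -k https://192.168.42.206:8443/healthz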
Expected Result

Cluster should be up and running.

Additional Information

Minishift issue: https://github.com/minishift/minishift/issues/2675


About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 39 (5 by maintainers)

Most upvoted comments

I’m seeing this problem when I use Minishift. My system is a standalone CentOS box on my home network. The Minishift version is 1.24.0, which I pulled down 2 days ago, and it appears to be running OpenShift 3.10.0. Is there a workaround for this issue?

This issue is also reproducible with OKD v3.11.0. It affects Minishift users and any QE efforts that depend on cluster up (for example the Minishift QE and DevStudio QE teams) by making the tests quite unstable. ping @deads2k

I had done everything according to the described procedures, including setting up the firewall zone as described here: https://github.com/openshift/origin/blob/release-3.11/docs/cluster_up_down.md. I was still getting this error while trying to run “oc cluster up” on a CentOS7 VM on macOS: API server error: Get https://XXX.XXX.XXX.XXX:8443/healthz?timeout=32s: dial tcp XXX.XXX.XXX.XXX:8443: getsockopt: connection refused () Error: timed out waiting for the condition. The solution for me was to allocate more RAM and CPU to the guest OS.
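For anyone hitting the same resource starvation, Minishift lets you raise the guest resources before starting the cluster. A minimal sketch; the values are illustrative, not a recommendation from this thread, and the exact value format for memory may differ between Minishift versions:

  minishift config set memory 8GB   # or e.g. 8192 (MB) on older versions
  minishift config set cpus 4
  minishift delete                  # an existing VM keeps its old sizing, so recreate it
  minishift start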

After some testing I found that iptables rules can interfere with the oc cluster up execution, so I created a little script [1] that covers part of the recommended best practices in the manual [2].

[1] https://github.com/imcsk8/origin-tools/blob/master/run-oc-cluster-up.sh
[2] https://docs.okd.io/latest/getting_started/administrators.html
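For reference, the firewalld portion of those recommendations boils down to roughly the following. This is a sketch based on the linked docs: the docker bridge subnet should be queried first (172.17.0.0/16 is only the common default), and the commands assume root or sudo.

  # find the subnet used by the default docker bridge
  docker network inspect -f "{{range .IPAM.Config }}{{ .Subnet }}{{end}}" bridge
  # let containers on that subnet reach the API server and DNS ports on the host
  firewall-cmd --permanent --new-zone dockerc
  firewall-cmd --permanent --zone dockerc --add-source 172.17.0.0/16
  firewall-cmd --permanent --zone dockerc --add-port 8443/tcp
  firewall-cmd --permanent --zone dockerc --add-port 53/udp
  firewall-cmd --permanent --zone dockerc --add-port 8053/udp
  firewall-cmd --reload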

I would like to confirm the same issue as described above.

Error: timed out waiting for the condition

My case: VM machines: Windows 7/10 + RHEL7 with 8 CPUs and 16 GB RAM, CDK 3.7.0-alpha-1.1 (oc v3.11.16).

Please take a look at this issue, thanks.

Alerting the team that owns oc cluster up, but nested virtualization and corporate proxies just sound like a recipe for problems.

@openshift/sig-master

What would help you folks diagnose this problem? If you let people know what you need and any changes that need to occur, I’m sure someone on the thread would be willing to help. Just give folks directions so we can help you.

/Bill