vHive: Errors in setting up Single Node Cluster

Description

While setting up vHive for a single-node cluster (following Section III of the Quick-Start Guide), the last step of the first subsection, the single-node cluster setup script scripts/cluster/create_one_node_cluster.sh, fails. It always fails with a timeout error suggesting that the kubelet service is not configured properly, but the console output is not detailed enough to debug the problem.
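
For reference, gathering more detail from the kubelet (using the same commands kubeadm itself suggests later in the log) would look roughly like this; output is omitted here:

$ systemctl status kubelet
$ journalctl -xeu kubelet
$ sudo crictl --runtime-endpoint /etc/firecracker-containerd/fccd-cri.sock ps -a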

Logs and System Configurations

Follow the steps in Section III of the Quick-Start Guide up to step 6 of Part 1 (Manual), then run:

$ ./scripts/cluster/create_one_node_cluster.sh

The console log is as follows:

+++ dirname ./scripts/cluster/create_one_node_cluster.sh
++ cd ./scripts/cluster
++ pwd
+ DIR=/home/alan/vhive/scripts/cluster
++ cd /home/alan/vhive/scripts/cluster
++ cd ..
++ cd ..
++ pwd
+ ROOT=/home/alan/vhive
+ STOCK_CONTAINERD=
+ /home/alan/vhive/scripts/cluster/setup_worker_kubelet.sh
[sudo] password for alan:
+ '[' '' == stock-only ']'
+ CRI_SOCK=/etc/firecracker-containerd/fccd-cri.sock
+ sudo kubeadm init --ignore-preflight-errors=all --cri-socket /etc/firecracker-containerd/fccd-cri.sock --pod-network-cidr=192.168.0.0/16
W0902 20:05:33.734787 1834100 version.go:103] could not fetch a Kubernetes version from the internet: unable to get URL "https://dl.k8s.io/release/stable-1.txt": Get "https://dl.k8s.io/release/stable-1.txt": proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
W0902 20:05:33.734874 1834100 version.go:104] falling back to the local client version: v1.22.0
[init] Using Kubernetes version: v1.22.0
[preflight] Running pre-flight checks
        [WARNING HTTPProxy]: Connection to "https://10.223.93.218" uses proxy "http://http://proxy-dmz.intel.com:912%22%20". If that is not intended, adjust your proxy settings
        [WARNING HTTPProxyCIDR]: connection to "10.96.0.0/12" uses proxy "http://http://proxy-dmz.intel.com:912%22%20". This may lead to malfunctional cluster setup. Make sure that Pod and Services IP ranges specified correctly as exceptions in proxy configuration
        [WARNING HTTPProxyCIDR]: connection to "192.168.0.0/16" uses proxy "http://http://proxy-dmz.intel.com:912%22%20". This may lead to malfunctional cluster setup. Make sure that Pod and Services IP ranges specified correctly as exceptions in proxy configuration
        [WARNING CRI]: container runtime is not running: output: time="2021-09-02T20:05:35+05:30" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
, error: exit status 1
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
        [WARNING ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.22.0: output: time="2021-09-02T20:05:48+05:30" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
, error: exit status 1
        [WARNING ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.22.0: output: time="2021-09-02T20:06:00+05:30" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
, error: exit status 1
        [WARNING ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.22.0: output: time="2021-09-02T20:06:12+05:30" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
, error: exit status 1
        [WARNING ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.22.0: output: time="2021-09-02T20:06:24+05:30" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
, error: exit status 1
        [WARNING ImagePull]: failed to pull image k8s.gcr.io/pause:3.5: output: time="2021-09-02T20:06:36+05:30" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
, error: exit status 1
        [WARNING ImagePull]: failed to pull image k8s.gcr.io/etcd:3.5.0-0: output: time="2021-09-02T20:06:48+05:30" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
, error: exit status 1
        [WARNING ImagePull]: failed to pull image k8s.gcr.io/coredns/coredns:v1.8.4: output: time="2021-09-02T20:07:00+05:30" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
, error: exit status 1
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [adr-par-inspur5.iind.intel.com kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.223.93.218]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [adr-par-inspur5.iind.intel.com localhost] and IPs [10.223.93.218 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [adr-par-inspur5.iind.intel.com localhost] and IPs [10.223.93.218 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.

        Unfortunately, an error has occurred:
                timed out waiting for the condition

        This error is likely caused by:
                - The kubelet is not running
                - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

        If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
                - 'systemctl status kubelet'
                - 'journalctl -xeu kubelet'

        Additionally, a control plane component may have crashed or exited when started by the container runtime.
        To troubleshoot, list all containers using your preferred container runtimes CLI.

        Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
                - 'crictl --runtime-endpoint /etc/firecracker-containerd/fccd-cri.sock ps -a | grep kube | grep -v pause'
                Once you have found the failing container, you can inspect its logs with:
                - 'crictl --runtime-endpoint /etc/firecracker-containerd/fccd-cri.sock logs CONTAINERID'

error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
+ mkdir -p /home/alan/.kube
+ sudo cp -i /etc/kubernetes/admin.conf /home/alan/.kube/config
cp: overwrite '/home/alan/.kube/config'? y
++ id -u
++ id -g
+ sudo chown 1004:1004 /home/alan/.kube/config
+ '[' 1004 -eq 0 ']'
+ kubectl taint nodes --all node-role.kubernetes.io/master-
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
+ /home/alan/vhive/scripts/cluster/setup_master_node.sh
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
error: failed to create secret Post "https://10.223.93.218:6443/api/v1/namespaces/metallb-system/secrets?fieldManager=kubectl-create": proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   102  100   102    0     0     79      0  0:00:01  0:00:01 --:--:--    79
100  4549  100  4549    0     0   2751      0  0:00:01  0:00:01 --:--:--  2751

Downloading istio-1.7.1 from https://github.com/istio/istio/releases/download/1.7.1/istio-1.7.1-linux-amd64.tar.gz ...
tar: istio-1.7.1/bin/istioctl: Cannot open: File exists
...
tar: istio-1.7.1: Cannot utime: Operation not permitted
tar: Exiting with failure status due to previous errors
Error: failed to install manifests: Get "https://10.223.93.218:6443/api?timeout=32s": proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
Unable to connect to the server: proxyconnect tcp: dial tcp: lookup http on 127.0.0.53:53: server misbehaving
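
The preflight warnings above show that connections to the node address, the service CIDR (10.96.0.0/12), and the pod CIDR (192.168.0.0/16) are routed through the corporate proxy, and the configured proxy URL appears malformed (a doubled "http://" plus trailing encoded characters), which would also be consistent with the repeated "lookup http" DNS failures. A sketch of the proxy exceptions that kubeadm's warnings ask for, assuming the proxy variables genuinely need to stay set for external traffic, might be:

$ export HTTP_PROXY=http://proxy-dmz.intel.com:912   # single scheme, no trailing characters (host taken from the warnings above)
$ export HTTPS_PROXY=$HTTP_PROXY
$ export NO_PROXY=localhost,127.0.0.1,10.223.93.218,10.96.0.0/12,192.168.0.0/16
$ export http_proxy=$HTTP_PROXY https_proxy=$HTTPS_PROXY no_proxy=$NO_PROXY

Note that services started by systemd (containerd, firecracker-containerd, kubelet) do not inherit shell environment variables, so the same values would likely have to be set in a systemd drop-in as well.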

Note that there is no issue with internet connectivity outside this script. The ~/.kube/config is as follows:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: [ KEY ]
    server: https://10.223.93.218:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: [ KEY ]
    client-key-data: [ KEY ]

The [ KEY ] placeholders all contain the same key. The server field shows the machine's IP address followed by port 6443, which (per lsof -i -P -n) is not in use by any service.
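
For completeness, the port check mentioned above, together with an illustrative direct probe of the API server address that bypasses any proxy settings, looks like this:

$ sudo lsof -i -P -n | grep 6443                             # nothing is listening on the API server port
$ curl -k --noproxy '*' https://10.223.93.218:6443/healthz   # illustrative probe that bypasses the proxy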

If anyone recognizes this issue or knows a solution or workaround, that would be really helpful. Please let me know if any additional information is needed. Thanks.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 24 (15 by maintainers)

Most upvoted comments

ok, thanks for the heads up. Please keep us posted, I’ll keep the Issue open for now.