kubernetes: weave-net CrashLoopBackOff for the second node

Is this a request for help?

I think it is an issue with either the software or the documentation, but I am not quite sure. I started with a question on Stack Overflow: http://stackoverflow.com/questions/39872332/how-to-fix-weave-net-crashloopbackoff-for-the-second-node

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

I think it is either a bug or a request to improve the documentation.

Kubernetes version (use kubectl version): 1.4.0

Environment:

  • Cloud provider or hardware configuration: Vagrant
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a):
  • Install tools: kubeadm init/join
  • Others:

What happened:

I have two VM nodes. They can reach each other either by hostname (through /etc/hosts) or by IP address. One has been provisioned with kubeadm as a master, the other as a worker node. Following the instructions (http://kubernetes.io/docs/getting-started-guides/kubeadm/) I have added weave-net. The list of pods looks like the following:

vagrant@vm-master:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY     STATUS             RESTARTS   AGE
kube-system   etcd-vm-master                          1/1       Running            0          3m
kube-system   kube-apiserver-vm-master                1/1       Running            0          5m
kube-system   kube-controller-manager-vm-master       1/1       Running            0          4m
kube-system   kube-discovery-982812725-x2j8y          1/1       Running            0          4m
kube-system   kube-dns-2247936740-5pu0l               3/3       Running            0          4m
kube-system   kube-proxy-amd64-ail86                  1/1       Running            0          4m
kube-system   kube-proxy-amd64-oxxnc                  1/1       Running            0          2m
kube-system   kube-scheduler-vm-master                1/1       Running            0          4m
kube-system   kubernetes-dashboard-1655269645-0swts   1/1       Running            0          4m
kube-system   weave-net-7euqt                         2/2       Running            0          4m
kube-system   weave-net-baao6                         1/2       CrashLoopBackOff   2          2m

CrashLoopBackOff appears for each worker node that connects. I have spent several hours playing with the network interfaces, but the network seems fine. I found a similar question on Stack Overflow, where the answer advised looking into the logs, with no follow-up. So, here are the logs:

vagrant@vm-master:~$ kubectl logs weave-net-baao6 -c weave --namespace=kube-system
2016-10-05 10:48:01.350290 I | error contacting APIServer: Get https://100.64.0.1:443/api/v1/nodes: dial tcp 100.64.0.1:443: getsockopt: connection refused; trying with blank env vars
2016-10-05 10:48:01.351122 I | error contacting APIServer: Get http://localhost:8080/api: dial tcp [::1]:8080: getsockopt: connection refused
Failed to get peers
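
The failing call can be reproduced directly from the worker host (weave-net runs with host networking), which narrows the problem down to routing/kube-proxy rather than weave itself. This is just a diagnostic sketch, not part of the original report:

# 100.64.0.1 is the in-cluster kubernetes service IP; if this is refused or
# times out from the worker host, the node cannot reach the API server
# through the service network
curl -k https://100.64.0.1:443/version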

What you expected to happen:

I would expect the weave-net pods to be in the Running state.

How to reproduce it (as minimally and precisely as possible):

I have not done anything special, just followed the Getting Started documentation. If it is essential, I can share the Vagrant project I used to provision everything. Please let me know if you need it.
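
Roughly, the provisioning amounts to the quick-start steps below (a sketch only; the join token and the weave-net manifest location are placeholders, not the exact Vagrant scripts):

# on the master
kubeadm init
kubectl apply -f <weave-net-daemonset.yaml>   # weave-net add-on from the guide

# on each worker
kubeadm join --token=<token> <master-ip>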

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 24
  • Comments: 56 (27 by maintainers)

Most upvoted comments

What does your last command do?

As this thread is getting quite noisy, here is a recap.

First, find out which IP address you want to use on the master; it's probably the one on the second network interface. For this example I'll use IP="172.42.42.1".

Next, run kubeadm init --api-advertise-addresses=$IP.

Now, you want to append --advertise-address to kube-apiserver in the static pod manifest; you can do it like this:

# --arg passes the shell variable into the jq program as $ip, so it actually expands
jq --arg ip "$IP" \
   '.spec.containers[0].command |= . + ["--advertise-address=" + $ip]' \
   /etc/kubernetes/manifests/kube-apiserver.json > /tmp/kube-apiserver.json
mv /tmp/kube-apiserver.json /etc/kubernetes/manifests/kube-apiserver.json

And finally, you need to update the flags in the kube-proxy daemonset and append --proxy-mode=userspace, which can be done like this:

kubectl -n kube-system get ds -l 'component=kube-proxy-amd64' -o json \
  | jq '.items[0].spec.template.spec.containers[0].command |= . + ["--proxy-mode=userspace"]' \
  | kubectl apply -f - \
  && kubectl -n kube-system delete pods -l 'component=kube-proxy-amd64'
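
To confirm the recreated kube-proxy pods actually picked up the new flag, something like this works (just a verification sketch; the jsonpath prints each pod's name and command):

kubectl -n kube-system get pods -l 'component=kube-proxy-amd64' \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[0].command}{"\n"}{end}'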

It seems Calico has a similar issue:

vagrant@vm-master:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY     STATUS             RESTARTS   AGE
kube-system   calico-etcd-iemkd                       1/1       Running            0          57m
kube-system   calico-node-178gc                       1/2       CrashLoopBackOff   8          56m
kube-system   calico-node-zsym2                       2/2       Running            0          57m
kube-system   calico-policy-controller-6gh7b          1/1       Running            0          57m
kube-system   etcd-vm-master                          1/1       Running            0          56m
kube-system   kube-apiserver-vm-master                1/1       Running            0          58m
kube-system   kube-controller-manager-vm-master       1/1       Running            0          57m
kube-system   kube-discovery-982812725-7jsmb          1/1       Running            0          57m
kube-system   kube-dns-2247936740-xiee1               3/3       Running            0          57m
kube-system   kube-proxy-amd64-iywb7                  1/1       Running            0          57m
kube-system   kube-proxy-amd64-ok9bx                  1/1       Running            0          56m
kube-system   kube-scheduler-vm-master                1/1       Running            0          56m
kube-system   kubernetes-dashboard-1655269645-g4cyd   1/1       Running            0          57m
  info: 1 completed object(s) was(were) not shown in pods list. Pass --show-all to see all objects.

vagrant@vm-master:~$ kubectl logs calico-node-178gc -c calico-node --namespace=kube-system
Waiting for etcd connection...
No IP provided. Using detected IP: 10.0.10.11
Traceback (most recent call last):
  File "startup.py", line 336, in <module>
    main()
  File "startup.py", line 287, in main
    warn_if_hostname_conflict(ip)
  File "startup.py", line 210, in warn_if_hostname_conflict
    current_ipv4, _ = client.get_host_bgp_ips(hostname)
  File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 134, in wrapped
    "running?" % (fn.__name__, e.message))
pycalico.datastore_errors.DataStoreError: get_host_bgp_ips: Error accessing etcd (Connection to etcd failed due to MaxRetryError("HTTPConnectionPool(host='100.78.232.136', port=6666): Max retries exceeded with url: /v2/keys/calico/bgp/v1/host/vm-worker/ip_addr_v4 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f50416bfc50>, 'Connection to 100.78.232.136 timed out. (connect timeout=60)'))",)).  Is etcd running?
Calico node failed to start

What could I try in order to make progress on this issue?

I am starting kubeadm with the advertise-address option: kubeadm init --api-advertise-addresses=$master_address, where my $master_address is not the NAT interface. It is still not enough to resolve this issue.

I’m also experiencing the same issue:

kubectl get pods --namespace=kube-system
NAME                                   READY     STATUS             RESTARTS   AGE
etcd-kubernetes-1                      1/1       Running            0          2h
kube-apiserver-kubernetes-1            1/1       Running            0          2h
kube-controller-manager-kubernetes-1   1/1       Running            0          2h
kube-dns-3913472980-79r8q              0/3       Pending            0          2h
kube-proxy-11bfl                       1/1       Running            0          2h
kube-proxy-8qn3z                       1/1       Running            0          1h
kube-proxy-f7ptd                       1/1       Running            0          1h
kube-scheduler-kubernetes-1            1/1       Running            0          2h
tiller-deploy-1651596238-87wvd         0/1       Pending            0          1h
weave-cortex-agent-2343136017-gnt6z    0/1       Pending            0          45m
weave-cortex-node-exporter-9qh20       1/1       Running            0          45m
weave-cortex-node-exporter-g6hlj       1/1       Running            0          45m
weave-cortex-node-exporter-q2zpm       1/1       Running            0          45m
weave-flux-agent-478881469-4dctc       0/1       Pending            0          45m
weave-net-0rlkr                        1/2       CrashLoopBackOff   14         50m
weave-net-35tkz                        1/2       CrashLoopBackOff   14         50m
weave-net-ph8wj                        1/2       CrashLoopBackOff   14         50m
weave-scope-agent-750g6                1/1       Running            0          45m
weave-scope-agent-f6cd5                1/1       Running            0          45m
weave-scope-agent-lwwfl                1/1       Running            0          45m



$ kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"clean", BuildDate:"2017-05-19T18:44:27Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"clean", BuildDate:"2017-05-19T18:33:17Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}


$ kubeadm version
kubeadm version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"clean", BuildDate:"2017-05-19T18:33:17Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

$ uname -a
Linux kubernetes-1 4.4.0-77-generic #98-Ubuntu SMP Wed Apr 26 08:34:02 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Any ideas on how to deal with the issue?

I am having the same issue as @avkonst. If you leave validation on, you get this error: error validating "STDIN": error validating data: items[0].apiVersion not set. I included the input from kubectl -n kube-system get ds -l 'component=kube-proxy-amd64' -o json | jq '.items[0].spec.template.spec.containers[0].command |= .+ ["--proxy-mode=userspace"]' and both kind and apiVersion are clearly there:
{ "kind": "List", "apiVersion": "v1", "metadata": {}, "items": [ { "spec": { "template": { "spec": { "containers": [ { "command": [ "--cluster-cidr=10.32.0.0/12" ] } ] } } } } ] }

kubectl apply -f - does not seem to pick up the outermost portion of the input, which is where kind and apiVersion both are.
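
One possible workaround (a sketch, not something confirmed in this thread) is to fetch the DaemonSet by name instead of via a label selector: a single named object comes back with its own kind and apiVersion, so the List wrapper and its stripped items are avoided entirely:

# kube-proxy-amd64 is the DaemonSet name used elsewhere in this thread
kubectl -n kube-system get ds kube-proxy-amd64 -o json \
  | jq '.spec.template.spec.containers[0].command |= . + ["--proxy-mode=userspace"]' \
  | kubectl apply -f - \
  && kubectl -n kube-system delete pods -l 'component=kube-proxy-amd64'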

OK, so it turns out that this flag is not enough; we still have an issue reaching the kubernetes service IP. The simplest solution to this is to run kube-proxy with --proxy-mode=userspace. To enable this, you can use kubectl -n kube-system edit ds kube-proxy-amd64 && kubectl -n kube-system delete pods -l name=kube-proxy-amd64.

Okay I ran into this exact same issue and here is how I fixed it.

This problem seems to be due to kube-proxy looking at the wrong network interface. If you look at the kube-proxy logs on a worker node you will most likely see something like:

-A KUBE-SEP-4C6YEJQ2VXV53FEZ -m comment --comment default/kubernetes:https -s 10.0.2.15/32 -j KUBE-MARK-MASQ

This is the wrong network interface: kube-proxy should be using the master node's IP address, not the NAT IP address.
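
One way to see which address the API server registered (and therefore what kube-proxy will program rules for) is to look at the endpoints of the kubernetes service; a diagnostic sketch, not part of the original comment:

# If ENDPOINTS shows the NAT address (e.g. 10.0.2.15) instead of the
# host-only address, the API server's --advertise-address needs fixing
kubectl get endpoints kubernetes

# On a worker, the rules kube-proxy generated for that endpoint
sudo iptables-save | grep 'default/kubernetes:https'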

As far as I know, kube-proxy gets this value from the Kube API Server when starting up. The Kube API Server's documentation states that if the --advertise-address flag isn't set it defaults to --bind-address, and if --bind-address isn't set it defaults to the host's default interface, which in my case and yours seems to be the NAT interface, and that isn't what we want. So what I did was set the Kube API Server's --advertise-address flag and everything started working. So, right after Step 2 and before Step 3 of

Installing Kubernetes on Linux with kubeadm

You will need to update your /etc/kubernetes/manifests/kube-apiserver.json and add the --advertise-address flag to point to your master node’s IP address.

For example: my master node's IP address is 172.28.128.2, which means right after Step 2 I do:

cat <<EOF > /etc/kubernetes/manifests/kube-apiserver.json
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-apiserver",
    "namespace": "kube-system",
    "creationTimestamp": null,
    "labels": {
      "component": "kube-apiserver",
      "tier": "control-plane"
    }
  },
  "spec": {
    "volumes": [
      {
        "name": "certs",
        "hostPath": {
          "path": "/etc/ssl/certs"
        }
      },
      {
        "name": "pki",
        "hostPath": {
          "path": "/etc/kubernetes"
        }
      }
    ],
    "containers": [
      {
        "name": "kube-apiserver",
        "image": "gcr.io/google_containers/kube-apiserver-amd64:v1.4.0",
        "command": [
          "/usr/local/bin/kube-apiserver",
          "--v=4",
          "--insecure-bind-address=127.0.0.1",
          "--etcd-servers=http://127.0.0.1:2379",
          "--admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota",
          "--service-cluster-ip-range=100.64.0.0/12",
          "--service-account-key-file=/etc/kubernetes/pki/apiserver-key.pem",
          "--client-ca-file=/etc/kubernetes/pki/ca.pem",
          "--tls-cert-file=/etc/kubernetes/pki/apiserver.pem",
          "--tls-private-key-file=/etc/kubernetes/pki/apiserver-key.pem",
          "--token-auth-file=/etc/kubernetes/pki/tokens.csv",
          "--secure-port=443",
          "--allow-privileged",
          "--advertise-address=172.28.128.2",
          "--etcd-servers=http://127.0.0.1:2379"
        ],
        "resources": {
          "requests": {
            "cpu": "250m"
          }
        },
        "volumeMounts": [
          {
            "name": "certs",
            "mountPath": "/etc/ssl/certs"
          },
          {
            "name": "pki",
            "readOnly": true,
            "mountPath": "/etc/kubernetes/"
          }
        ],
        "livenessProbe": {
          "httpGet": {
            "path": "/healthz",
            "port": 8080,
            "host": "127.0.0.1"
          },
          "initialDelaySeconds": 15,
          "timeoutSeconds": 15
        }
      }
    ],
    "hostNetwork": true
  },
  "status": {}
}
EOF
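
The kubelet watches the static pod manifest directory, so rewriting the file should be enough for the API server pod to be restarted with the new flag. A quick check (assuming the pod name carries the node-name suffix, as in the listings above):

# the --advertise-address flag should now appear in the running pod's command
kubectl -n kube-system get pod kube-apiserver-vm-master -o json \
  | jq -r '.spec.containers[0].command[]' | grep advertise-address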

I am not too sure if this is a valid long-term solution, because if the default kube-apiserver.json changes, those changes wouldn't get reflected by doing what I am doing. Ideally, I think the user would want some way to set these flags via kubeadm, or maybe the user should be responsible for editing the JSON themselves. Thoughts?

However, it still may be a good idea to update Step 2 of Installing Kubernetes on Linux with kubeadm to at least mention to users that they can update the kube component flags by modifying the JSON manifests found at /etc/kubernetes/manifests/.

I have the same issue. I'm using VirtualBox to run 2 VMs based on a minimal CentOS 7 image. Both VMs are attached to 2 interfaces, a NAT and a host-only network. The two VMs are able to connect to each other using the host-only network interfaces.

I also tried the instructions for Calico and Canal, and I cannot make them work either.

Same problem, solved it with https://stackoverflow.com/questions/39872332/how-to-fix-weave-net-crashloopbackoff-for-the-second-node

Resolved my issue by adding a routing rule on the node machines so that traffic to the Kubernetes service IP range goes out via eth1. Example:

echo "100.64.0.0/12 dev eth1" >> /etc/sysconfig/network-scripts/route-eth1
ip route add 100.64.0.0/12 dev eth1
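
To verify the route is actually used for the service range (a quick check, not part of the original comment):

# should report the route going out via "dev eth1"
ip route get 100.64.0.1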

@avkonst see https://github.com/kubernetes/kubernetes/pull/34607.

Also, you can do this for now:

jq \
   '.spec.containers[0].command |= .+ ["--advertise-address=172.42.42.1"]' \
   /etc/kubernetes/manifests/kube-apiserver.json > /tmp/kube-apiserver.json
mv /tmp/kube-apiserver.json /etc/kubernetes/manifests/kube-apiserver.json

I encountered this issue too.

Adding --advertise-address when starting kube-apiserver solved this issue.