openshift-ansible: Empty /etc/cni/net.d/ causing nodes to fail

Description

The Ansible installer sometimes fails to create the file /etc/cni/net.d/80-openshift-network.conf, which causes the node to fail. The failure appears to be non-deterministic and occurs in both OpenStack and AWS environments. In one OpenStack environment it affects roughly 50% of nodes; in other environments it is less frequent.

If OpenShift is uninstalled (using the uninstall.yaml playbook) and then reinstalled, the same nodes fail again.

This has been discussed on the users mailing list in the following threads, where at least one other user reports similar problems:
https://lists.openshift.redhat.com/openshift-archives/users/2018-April/msg00035.html
https://lists.openshift.redhat.com/openshift-archives/users/2018-April/msg00045.html
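
A quick way to see which nodes are affected is an ad-hoc Ansible check run against the same inventory used for the install (a convenience sketch only, using the inventory shown under Additional Information below):

# Affected nodes show an empty /etc/cni/net.d/ directory
$ ansible -i inventory nodes -m shell -a 'ls -la /etc/cni/net.d/'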

Version
$ ansible --version
ansible 2.5.0
  config file = /home/centos/37-test/ansible.cfg
  configured module search path = [u'/home/centos/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Aug  4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
$ cd ../openshift-ansible
$ git describe
openshift-ansible-3.7.44-1-6-gbb4a8d7

Nodes are based on plain CentOS 7 cloud images, with the prerequisite packages described in the Origin docs installed.

OpenShift is version 3.7.

Steps To Reproduce

Using an inventory file for a minimal 4-node setup (see below), we run the installer:

$ ansible-playbook -i inventory -vv ~/openshift-ansible/playbooks/byo/config.yml
<snip>
PLAY RECAP *****************************************************************************************************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
test37-infra               : ok=163  changed=61   unreachable=0    failed=1   
test37-master              : ok=615  changed=264  unreachable=0    failed=1   
test37-node-001            : ok=187  changed=65   unreachable=0    failed=0   
test37-node-002            : ok=163  changed=61   unreachable=0    failed=1

In this case the test37-infra and test37-node-002 nodes failed; on both of them the /etc/cni/net.d/ directory was empty. On the test37-master and test37-node-001 nodes the file 80-openshift-network.conf was present. The master node is reported as failed because there was no infra node, not because of the problem described here. If new nodes were created and the process repeated, a different set of nodes would fail.
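
To see which nodes actually registered with the API server after a run like this, the node list can be checked from the master (a generic check, not part of the installer output):

# On the master: nodes that hit this problem typically show NotReady or are missing entirely
$ oc get nodes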

Expected Results

All nodes should start. The file /etc/cni/net.d/80-openshift-network.conf should be present on all nodes and the origin-node service should start successfully.

Observed Results

On the failed nodes the file /etc/cni/net.d/80-openshift-network.conf is not present and the origin-node service fails to start. Note that the missing file may be just a symptom of the real problem, which lies further upstream.

On the failed nodes:

$ sudo systemctl status -l origin-node.service
● origin-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/origin-node.service.d
           └─openshift-sdn-ovs.conf
   Active: activating (start) since Mon 2018-04-16 09:36:24 UTC; 39s ago
     Docs: https://github.com/openshift/origin
  Process: 18844 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 18840 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 18886 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 18882 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
 Main PID: 18889 (openshift)
   Memory: 43.7M
   CGroup: /system.slice/origin-node.service
           ├─18889 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
           └─18933 journalctl -k -f

Apr 16 09:36:54 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:54.950946   18889 manager.go:306] Starting recovery of all containers
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.020197   18889 kubelet_node_status.go:270] Setting node annotation to enable volume controller attach/detach
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027120   18889 kubelet_node_status.go:433] Recording NodeHasSufficientDisk event message for node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027172   18889 kubelet_node_status.go:433] Recording NodeHasSufficientMemory event message for node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027194   18889 kubelet_node_status.go:433] Recording NodeHasNoDiskPressure event message for node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027224   18889 kubelet_node_status.go:82] Attempting to register node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.071793   18889 manager.go:311] Recovery completed
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: E0416 09:36:55.237121   18889 eviction_manager.go:238] eviction manager: unexpected err: failed GetNode: node 'test37-infra.openstacklocal' not found
Apr 16 09:37:00 test37-infra.openstacklocal origin-node[18889]: W0416 09:37:00.225298   18889 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
Apr 16 09:37:00 test37-infra.openstacklocal origin-node[18889]: E0416 09:37:00.227084   18889 kubelet.go:2112] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

The message "Unable to update cni config: No networks found in /etc/cni/net.d" appears to be the key symptom, and is a direct consequence of the missing file.
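
On a failed node the same symptom can be confirmed directly (a minimal check, using the default origin-node paths):

# The CNI config directory is empty on failed nodes
$ ls -la /etc/cni/net.d/
# The kubelet repeats the warning roughly every 30 seconds
$ sudo journalctl -u origin-node.service | grep 'No networks found'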

Another result of this is that DNS does not get set up correctly. On failed nodes:

$ sudo netstat -tunlp | grep tcp | grep 53 | grep -v tcp6
tcp        0      0 10.0.0.30:53            0.0.0.0:*               LISTEN      14567/dnsmasq       
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      14567/dnsmasq 

On successful nodes:

$ sudo netstat -tunlp | grep tcp | grep 53 | grep -v tcp6
tcp        0      0 127.0.0.1:53            0.0.0.0:*               LISTEN      17353/openshift     
tcp        0      0 10.129.0.1:53           0.0.0.0:*               LISTEN      14544/dnsmasq       
tcp        0      0 10.0.0.35:53            0.0.0.0:*               LISTEN      14544/dnsmasq       
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      14544/dnsmasq

Ansible log: https://gist.github.com/a4f6e17554e6c77db7d97eeecd2cde8f

The log makes no mention of 80-openshift-network.conf, so there are no obvious clues as to why it was not created.
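
One way to confirm this is to search the captured installer output for the filename (ansible.log is a hypothetical name for wherever the -vv output was saved):

# No matches, i.e. the installer never references the file
$ grep -n '80-openshift-network.conf' ansible.log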

Additional Information

$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core) 

Inventory file for a minimal 4 node setup:

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_ssh_user=centos
ansible_become=yes

openshift_deployment_type=origin
openshift_release=v3.7

openshift_disable_check=disk_availability,docker_storage,memory_availability
openshift_clock_enabled=true

# Enable htpasswd authentication
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/users.htpasswd'}]
# make sure this htpasswd file exists
openshift_master_htpasswd_file=/home/centos/users.htpasswd

openshift_master_cluster_public_hostname=130.238.28.199.nip.io
openshift_master_default_subdomain=130.238.28.131.nip.io

# default project node selector
osm_default_node_selector='zone=default'

openshift_docker_additional_registries=registry.access.redhat.com
openshift_docker_insecure_registries=registry.access.redhat.com

openshift_metrics_install_metrics=false
openshift_logging_install_logging=false
openshift_hosted_prometheus_deploy=false

ansible_service_broker_image_prefix=registry.access.redhat.com/openshift3/ose-
ansible_service_broker_registry_url=registry.access.redhat.com

[masters]
test37-master

[etcd]
test37-master

[nodes]
test37-master   openshift_hostname=test37-master.openstacklocal
test37-infra    openshift_hostname=test37-infra.openstacklocal openshift_node_labels="{'region': 'infra', 'zone': 'default'}" 
test37-node-001 openshift_hostname=test37-node-001.openstacklocal openshift_node_labels="{'region': 'primary', 'zone': 'default'}" 
test37-node-002 openshift_hostname=test37-node-002.openstacklocal openshift_node_labels="{'region': 'primary', 'zone': 'default'}" 
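
The inventory itself can be sanity-checked before running the playbook (a convenience step, not something the installer requires):

# Dump the parsed inventory to confirm groups and host variables are read as intended
$ ansible-inventory -i inventory --list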

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 60 (16 by maintainers)

Most upvoted comments

Here’s how it works, so everyone is clear.

  1. kubelet/openshift-node starts
  2. kubelet registers itself as a Node object with the apiserver (eg ‘oc get nodes’)
  3. master SDN controller notices the new Node and creates a new HostSubnet for the node (eg ‘oc get hostsubnets’)
  4. node SDN (built into openshift-node in <= 3.9, but a DaemonSet in 3.10+) periodically polls for its HostSubnet; eventually times out if it doesn’t find one
  5. if the node SDN finds a HostSubnet matching its hostname, it continues initialization
  6. when initialization has completed successfully, it writes out /etc/cni/net.d/80-openshift-sdn.conf
  7. kubelet polls /etc/cni/net.d every 30 seconds for a config file, so it finally sees the config that’s written and updates the apiserver with network-ready status

So an empty /etc/cni/net.d is a symptom of the SDN not being able to initialize, not a cause of any particular problem.
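
Given that sequence, the useful question on a failed node is which step stalled. A rough way to walk the chain, using only the commands already mentioned above:

# Step 2: did the kubelet register the node with the apiserver?
$ oc get nodes
# Step 3: did the master SDN controller create a HostSubnet for it?
$ oc get hostsubnets
# Steps 4-6: on the node itself (SDN is built into openshift-node in <= 3.9), look for SDN startup errors
$ sudo journalctl -u origin-node.service | grep -i sdn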

For me there should be a file named /etc/cni/net.d/80-openshift-network.conf that has content like this:

{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}

But I believe the file gets created as a result of successful creation of the SDN. Just creating it manually will not solve the problem.

What I’m finding now with the latest code on the release-3.9 branch of the ansible installer is that this problem is NOT manifesting itself. I ran through the installation process about 8 times and did not hit this problem.

I’m not confident that this means it’s ‘fixed’, but right now I cannot reproduce it.

If the other users who have encountered this could also check whether it’s still happening for them, that would be useful.

I hit this issue with 3.10, but manually creating the directory and file did solve the problem; I rebooted after creating the file. 80-openshift-network.conf was missing on approximately 50% of the nodes.
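
For anyone trying that workaround, it would look roughly like the following. This is only a transcription of the comment above, using the file content quoted earlier in the thread; per the maintainer explanation, the missing file is normally a symptom rather than the cause, so this may not help in every case:

$ sudo mkdir -p /etc/cni/net.d
$ sudo tee /etc/cni/net.d/80-openshift-network.conf <<'EOF'
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}
EOF
$ sudo reboot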

I’m using “openshift-ansible-3.9.40-1” and seeing this problem. Is there a workaround so I can get past this?

For me the error was caused by disabled IP forwarding:

F0716 09:39:01.849262 2645 network.go:46] SDN node startup failed: node SDN setup failed: net/ipv4/ip_forward=0, it must be set to 1

After setting it to 1, the Docker container started and the node became “Ready”.
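
Checking and fixing that setting is a standard sysctl step (the file name under /etc/sysctl.d/ is arbitrary):

# Check the current value; per the error above it must be 1
$ sysctl net.ipv4.ip_forward
# Enable it immediately
$ sudo sysctl -w net.ipv4.ip_forward=1
# Persist it across reboots
$ echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-ip-forward.conf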

This is happening to our cluster as well, on the same OpenShift version. Has anybody found a fix for it yet?