openshift-ansible: Empty /etc/cni/net.d/ causing nodes to fail
Description
The ansible installer sometimes fails to create the file /etc/cni/net.d/80-openshift-network.conf, which causes the node to fail. This seems to be a random event and happens in both OpenStack and AWS environments. In one OpenStack environment it happens for approximately 50% of the nodes; in other environments it is less frequent.
If OpenShift is uninstalled (using the uninstall.yaml playbook) and then reinstalled the same nodes fail.
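For reference, the uninstall and reinstall cycle is roughly the following (a sketch; playbooks/adhoc/uninstall.yml is the path in our openshift-ansible checkout and may differ between releases):
$ ansible-playbook -i inventory ~/openshift-ansible/playbooks/adhoc/uninstall.yml
$ ansible-playbook -i inventory ~/openshift-ansible/playbooks/byo/config.yml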
This has been discussed on the users mailing list in the following threads, and at least one other user has reported similar problems: https://lists.openshift.redhat.com/openshift-archives/users/2018-April/msg00035.html and https://lists.openshift.redhat.com/openshift-archives/users/2018-April/msg00045.html
Version
$ ansible --version
ansible 2.5.0
config file = /home/centos/37-test/ansible.cfg
configured module search path = [u'/home/centos/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
[centos@orndev-bastion-002 37-test]$ cd ../openshift-ansible
[centos@orndev-bastion-002 openshift-ansible]$ git describe
openshift-ansible-3.7.44-1-6-gbb4a8d7
Nodes are based on plain CentOS 7 cloud images with the necessary packages described in the Origin docs installed.
OpenShift is version 3.7.
Steps To Reproduce
Using an inventory file for a minimal 4-node setup (see below), we run the installer:
$ ansible-playbook -i inventory -vv ~/openshift-ansible/playbooks/byo/config.yml
<snip>
PLAY RECAP *****************************************************************************************************************************************************
localhost : ok=12 changed=0 unreachable=0 failed=0
test37-infra : ok=163 changed=61 unreachable=0 failed=1
test37-master : ok=615 changed=264 unreachable=0 failed=1
test37-node-001 : ok=187 changed=65 unreachable=0 failed=0
test37-node-002 : ok=163 changed=61 unreachable=0 failed=1
In this case the test37-infra and test37-node-002 nodes failed. On both of those the /etc/cni/net.d/ directory was empty, while on the test37-master and test37-node-001 nodes the file 80-openshift-network.conf was present. (The master node failed because there was no infra node, not because of the problem described here.) If new nodes were created and the process repeated, a different set of nodes would fail.
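One quick way to see which nodes are affected (a sketch using an Ansible ad-hoc command against the same inventory) is to list the CNI config directory on every node:
$ ansible -i inventory nodes -m command -a 'ls -l /etc/cni/net.d/'
Nodes that are going to fail report an empty directory here.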
Expected Results
All nodes should start. The file /etc/cni/net.d/80-openshift-network.conf should be present on all nodes and the origin-node service should start successfully.
Observed Results
On the failed nodes the file /etc/cni/net.d/80-openshift-network.conf is not present and the origin-node service fails to start. Note that this missing file may just be a symptom of the real problem that lies upstream.
On the failed nodes:
$ sudo systemctl status -l origin-node.service
● origin-node.service - OpenShift Node
Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/origin-node.service.d
└─openshift-sdn-ovs.conf
Active: activating (start) since Mon 2018-04-16 09:36:24 UTC; 39s ago
Docs: https://github.com/openshift/origin
Process: 18844 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
Process: 18840 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
Process: 18886 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
Process: 18882 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
Main PID: 18889 (openshift)
Memory: 43.7M
CGroup: /system.slice/origin-node.service
├─18889 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
└─18933 journalctl -k -f
Apr 16 09:36:54 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:54.950946 18889 manager.go:306] Starting recovery of all containers
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.020197 18889 kubelet_node_status.go:270] Setting node annotation to enable volume controller attach/detach
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027120 18889 kubelet_node_status.go:433] Recording NodeHasSufficientDisk event message for node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027172 18889 kubelet_node_status.go:433] Recording NodeHasSufficientMemory event message for node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027194 18889 kubelet_node_status.go:433] Recording NodeHasNoDiskPressure event message for node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.027224 18889 kubelet_node_status.go:82] Attempting to register node test37-infra.openstacklocal
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: I0416 09:36:55.071793 18889 manager.go:311] Recovery completed
Apr 16 09:36:55 test37-infra.openstacklocal origin-node[18889]: E0416 09:36:55.237121 18889 eviction_manager.go:238] eviction manager: unexpected err: failed GetNode: node 'test37-infra.openstacklocal' not found
Apr 16 09:37:00 test37-infra.openstacklocal origin-node[18889]: W0416 09:37:00.225298 18889 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
Apr 16 09:37:00 test37-infra.openstacklocal origin-node[18889]: E0416 09:37:00.227084 18889 kubelet.go:2112] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
The message "Unable to update cni config: No networks found in /etc/cni/net.d" seems to be the key symptom, and is the direct result of the missing file.
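A quick check for this symptom on any node, without reading the full status output, is to grep the origin-node journal (a minimal sketch):
$ sudo journalctl -u origin-node.service | grep -E 'No networks found|NetworkPluginNotReady'
On affected nodes this keeps returning the two messages shown above; on healthy nodes it should come back empty, or only show transient messages from before the SDN finished initializing.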
Another result of this is that DNS does not get set up correctly. On failed nodes:
$ sudo netstat -tunlp | grep tcp | grep 53 | grep -v tcp6
tcp 0 0 10.0.0.30:53 0.0.0.0:* LISTEN 14567/dnsmasq
tcp 0 0 172.17.0.1:53 0.0.0.0:* LISTEN 14567/dnsmasq
On successful nodes:
$ sudo netstat -tunlp | grep tcp | grep 53 | grep -v tcp6
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN 17353/openshift
tcp 0 0 10.129.0.1:53 0.0.0.0:* LISTEN 14544/dnsmasq
tcp 0 0 10.0.0.35:53 0.0.0.0:* LISTEN 14544/dnsmasq
tcp 0 0 172.17.0.1:53 0.0.0.0:* LISTEN 14544/dnsmasq
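One way to confirm the DNS difference directly (assuming dig from bind-utils is installed; the service name queried here is just an example) is to ask the node-local resolver, which on working nodes is answered by the openshift process on 127.0.0.1:53:
$ dig +short @127.0.0.1 kubernetes.default.svc.cluster.local
On failed nodes this query gets no answer, because nothing is listening on 127.0.0.1:53.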
Ansible log: https://gist.github.com/a4f6e17554e6c77db7d97eeecd2cde8f
The log makes no mention of 80-openshift-network.conf, so there are no obvious clues as to why it did not get created.
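For anyone checking their own run, a simple grep over the captured output (the filename ansible.log here is just an example) is enough to confirm this:
$ grep -n -i -E '80-openshift-network|cni/net.d' ansible.log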
Additional Information
$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
Inventory file for a minimal 4 node setup:
[OSEv3:children]
masters
nodes
etcd
[OSEv3:vars]
ansible_ssh_user=centos
ansible_become=yes
openshift_deployment_type=origin
openshift_release=v3.7
openshift_disable_check=disk_availability,docker_storage,memory_availability
openshift_clock_enabled=true
# Enable htpasswd authentication
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/users.htpasswd'}]
# make sure this htpasswd file exists
openshift_master_htpasswd_file=/home/centos/users.htpasswd
openshift_master_cluster_public_hostname=130.238.28.199.nip.io
openshift_master_default_subdomain=130.238.28.131.nip.io
# default project node selector
osm_default_node_selector='zone=default'
openshift_docker_additional_registries = registry.access.redhat.com
openshift_docker_insecure_registries = registry.access.redhat.com
openshift_metrics_install_metrics=false
openshift_logging_install_logging=false
openshift_hosted_prometheus_deploy=false
ansible_service_broker_image_prefix=registry.access.redhat.com/openshift3/ose-
ansible_service_broker_registry_url=registry.access.redhat.com
[masters]
test37-master
[etcd]
test37-master
[nodes]
test37-master openshift_hostname=test37-master.openstacklocal
test37-infra openshift_hostname=test37-infra.openstacklocal openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
test37-node-001 openshift_hostname=test37-node-001.openstacklocal openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
test37-node-002 openshift_hostname=test37-node-002.openstacklocal openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
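Host reachability is not the issue (the play recap shows unreachable=0 everywhere); a sanity check along these lines (a sketch) can confirm the inventory parses and all four hosts respond:
$ ansible-inventory -i inventory --graph
$ ansible -i inventory all -m ping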
About this issue
- State: closed
- Created 6 years ago
- Comments: 60 (16 by maintainers)
Here’s how it works, so everyone is clear.
So an empty /etc/cni/net.d is a symptom of the SDN not being able to initialize, not a cause of any particular problem.
For me there should be a file named /etc/cni/net.d/80-openshift-network.conf containing the standard openshift-sdn CNI config (a sketch of the content follows below). But I believe the file gets created as a result of the SDN being set up successfully, so just creating it manually will not solve the underlying problem.
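For reference, on a working node the content is the minimal openshift-sdn plugin definition, something like the following (the exact cniVersion may differ between releases, so treat this as a sketch rather than an authoritative copy):
$ cat /etc/cni/net.d/80-openshift-network.conf
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}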
What I’m finding now with the latest code on the release-3.9 branch of the ansible installer is that this problem is NOT manifesting itself. I ran through the installation process about 8 times and did not hit this problem.
I’m not confident that this means it’s ‘fixed’, but right now I cannot reproduce it.
If the other users who have encountered this could also check whether it’s still happening for them, that would be useful.
I found the issue with 3.10, but manually creating the directory and file did solve the problem for me (I rebooted after creating the file). 80-openshift-network.conf was missing on some nodes (approximately 50% of them).
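A sketch of that manual workaround, assuming the file content shown earlier in the thread (note the caveat above that this may only mask the underlying SDN problem):
$ sudo mkdir -p /etc/cni/net.d
$ echo '{"cniVersion": "0.2.0", "name": "openshift-sdn", "type": "openshift-sdn"}' | sudo tee /etc/cni/net.d/80-openshift-network.conf
$ sudo reboot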
I’m using “openshift-ansible-3.9.40-1” and seeing this problem. Is there a work-around so I can get past this?
For me the error was caused by IP forwarding being disabled: F0716 09:39:01.849262 2645 network.go:46] SDN node startup failed: node SDN setup failed: net/ipv4/ip_forward=0, it must be set to 1
After setting it to 1, the docker container started and the node became “Ready”.
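For anyone hitting the same SDN startup failure, a sketch of checking and persistently enabling IP forwarding (the file name under /etc/sysctl.d/ is just an example):
$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0
$ sudo sysctl -w net.ipv4.ip_forward=1
$ echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-ip-forward.conf
$ sudo systemctl restart origin-node.service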
This is happening to our cluster as well, on the same OpenShift version. Has anybody found a fix for it yet?