rancher: Unable to provision K8s 1.19 cluster with firewalld enabled when using Calico, Canal, or Flannel
What kind of request is this (question/bug/enhancement/feature request): Bug
Steps to reproduce (least amount of steps as possible): In Rancher, update rke-metadata-config to the following:
{
"refresh-interval-minutes": "1440",
"url": "https://raw.githubusercontent.com/Oats87/kontainer-driver-metadata/k8s-1-19-v2.5/data/data.json"
}
Refresh Kubernetes metadata
Prepare 1 node with Oracle Linux 7.7 by following the documentation: https://rancher.com/docs/rancher/v2.x/en/installation/options/firewall/
(Or use this AMI which has ports opened: ami-06dd5f94499093e3d)
In Rancher, add a v1.19.1-rancher1-1 custom cluster with 1 node running all roles.
Result:
Cluster is stuck provisioning:

Error in kubelet:
E0910 20:10:19.442169 25373 pod_workers.go:191] Error syncing pod c658b6ca-fbff-42e0-a3aa-f79e02f26a2c ("cattle-cluster-agent-987c678c-77rkl_cattle-system(c658b6ca-fbff-42e0-a3aa-f79e02f26a2c)"), skipping: failed to "StartContainer" for "cluster-register" with CrashLoopBackOff: "back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-987c678c-77rkl_cattle-system(c658b6ca-fbff-42e0-a3aa-f79e02f26a2c)"
I0910 20:10:31.441910 25373 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 23ea099c02a5e01f00adc5e7ae2cc098b1dd3ff3fa2f4f9005574c38b8b8fe78
E0910 20:10:31.442379 25373 pod_workers.go:191] Error syncing pod c658b6ca-fbff-42e0-a3aa-f79e02f26a2c ("cattle-cluster-agent-987c678c-77rkl_cattle-system(c658b6ca-fbff-42e0-a3aa-f79e02f26a2c)"), skipping: failed to "StartContainer" for "cluster-register" with CrashLoopBackOff: "back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-987c678c-77rkl_cattle-system(c658b6ca-fbff-42e0-a3aa-f79e02f26a2c)"
I0910 20:10:44.441995 25373 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 23ea099c02a5e01f00adc5e7ae2cc098b1dd3ff3fa2f4f9005574c38b8b8fe78
E0910 20:10:44.442565 25373 pod_workers.go:191] Error syncing pod c658b6ca-fbff-42e0-a3aa-f79e02f26a2c ("cattle-cluster-agent-987c678c-77rkl_cattle-system(c658b6ca-fbff-42e0-a3aa-f79e02f26a2c)"), skipping: failed to "StartContainer" for "cluster-register" with CrashLoopBackOff: "back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-987c678c-77rkl_cattle-system(c658b6ca-fbff-42e0-a3aa-f79e02f26a2c)"
I0910 20:10:58.442061 25373 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 23ea099c02a5e01f00adc5e7ae2cc098b1dd3ff3fa2f4f9005574c38b8b8fe78
E0910 20:10:58.442582 25373 pod_workers.go:191] Error syncing pod c658b6ca-fbff-42e0-a3aa-f79e02f26a2c ("cattle-cluster-agent-987c678c-77rkl_cattle-system(c658b6ca-fbff-42e0-a3aa-f79e02f26a2c)"), skipping: failed to "StartContainer" for "cluster-register" with CrashLoopBackOff: "back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-987c678c-77rkl_cattle-system(c658b6ca-fbff-42e0-a3aa-f79e02f26a2c)"
Other details that may be helpful:
Environment information
- Rancher version (`rancher/rancher`/`rancher/server` image tag or shown bottom left in the UI): `rancher/rancher:master-head` version 198ec5b
- Installation option (single install/HA): HA
Cluster information
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
- Kubernetes version (use `kubectl version`): v1.19.1-rancher1-1
gz#14269
For Calico (e.g. an RKE cluster for Rancher with mostly defaults), these rules work. (They allow all traffic to/from the Calico interfaces.)
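A minimal sketch of rules with that effect, assuming firewalld direct rules, Calico's default `cali` interface prefix, and the `tunl0` IPIP device (not the commenter's exact rules):

```bash
# Hedged sketch: accept everything entering or leaving the Calico interfaces.
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD 0 -i cali+ -j ACCEPT
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD 0 -o cali+ -j ACCEPT
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD 0 -i tunl0 -j ACCEPT
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD 0 -o tunl0 -j ACCEPT
firewall-cmd --reload
```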
Hi,
I solved my problem on CentOS 8 by creating a new firewalld zone for Kubernetes pods and setting its target to ACCEPT. So firewalld will accept packets going into the pod subnet CIDR (ingress zone) and also packets coming out of the pod subnet CIDR (egress zone).
Commands :
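A minimal sketch of those commands, assuming Rancher's default pod subnet CIDR 10.42.0.0/16 (substitute your own):

```bash
# Create a zone for pod traffic and accept everything sourced from the pod CIDR.
firewall-cmd --permanent --new-zone=kubernetes-pods
firewall-cmd --permanent --zone=kubernetes-pods --set-target=ACCEPT
firewall-cmd --permanent --zone=kubernetes-pods --add-source=10.42.0.0/16  # assumed pod CIDR
firewall-cmd --reload
```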
firewall-cmd man :
Versions :
To see what is getting rejected by firewalld, use the commands below:
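For example (a hedged sketch using firewalld's standard log-denied switch):

```bash
# Log every packet firewalld rejects or drops, then follow the kernel log.
firewall-cmd --set-log-denied=all
journalctl -kf | grep -i reject   # or: dmesg -w | grep -i reject
```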
This is an issue in any EL7 operating system where `firewalld` is running. This includes RHEL 7.x and CentOS 7.

The root cause appears to be a change in where Calico places the `Policy explicitly accepted packet` rule. In Calico 3.16.x, this rule is placed at the end of the `FORWARD` chain, whereas on earlier versions of Calico it was on the `cali-FORWARD` chain. This change was implemented for https://github.com/projectcalico/felix/pull/2424, which is an orthogonal issue to the one currently at hand.
By appending this rule to the `cali-FORWARD` chain, traffic would automatically be accepted, and things “worked”. Now that the `Policy explicitly accepted packet` rule is at the end of the `FORWARD` chain, there is a firewalld-inserted rule that can blackhole traffic, because it is inserted before the rest of the chain. Thus, the traffic is never able to make it to the final `ACCEPT` rule, as it is dropped instead.

Further investigation will need to be done to determine the best course of action to mitigate this, as Calico is advertised as “not working well with firewalld” in the first place.
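As a hedged illustration of the ordering described above (the exact listing varies by node):

```bash
# Show the FORWARD chain with rule numbers. On an affected EL7 node,
# firewalld's "REJECT ... reject-with icmp-host-prohibited" entry sits above
# Calico's final "Policy explicitly accepted packet" ACCEPT rule, so forwarded
# pod traffic is rejected before it can reach that ACCEPT.
iptables -nvL FORWARD --line-numbers
```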
@Oats87, forgive me for not being that familiar with Calico yet, but can you point me in the direction of info regarding Calico 3.16.x and that being expected behaviour?
I have firewalls turned off on all my nodes because of this, and it makes me kinda uncomfortable. I would like more info on what works in the security sense.
I had the same issue on CentOS 7 nodes (fully patched as of 8.1.2021) that I was installing a new RKE cluster on. If I enable masquerade on my default firewalld zone (`sudo firewall-cmd --add-masquerade --permanent; sudo firewall-cmd --reload`), my overlay network tests work as expected.
I have not added any route or anything else.
Hope this helps.
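The same workaround as discrete commands:

```bash
# Permanently enable masquerade on the default firewalld zone, then reload.
sudo firewall-cmd --add-masquerade --permanent
sudo firewall-cmd --reload
```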
If you are not a heavy user of firewalld, you can try my approach: replace firewalld with ufw. After I switched, the whole cluster ran normally. (See the sketch below.)

Switching to this works as well, and it's easier to translate if you use `linux-system-roles.firewall` or similar things on Red Hat based systems. This is what we have in ansible:

Works for both `iptables` (el7) and `nftables` (el8+/fedora). This is probably the “correct” way of going forward.
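A hedged sketch of the ufw swap mentioned above, assuming an EL system where ufw is available via EPEL (open Rancher's required ports before enabling):

```bash
# Replace firewalld with ufw (assumption: ufw packaged in EPEL on this system).
systemctl disable --now firewalld
yum install -y epel-release && yum install -y ufw
ufw allow 6443/tcp     # kube-apiserver; add the rest of Rancher's port list
ufw --force enable
```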
Do you mind sharing the Calico policies?
We were facing the problem described by @Oats87 above.
As we are not (yet) allowed to disable firewalld, we set up a workaround. As explained by @Oats87, the issue happens on the iptables `FORWARD` chain, where firewalld rules conflict with rules set by Calico. The problem is that rule 18 rejects all traffic, so that rules 19 to 21 are never reached. As a consequence, traffic coming from/going to 10.42.0.0/16 (i.e. the node/pod address space) is not routed by the node. Instead rule 18 is applied, which rejects the traffic, causing clients to get a “no route to host” response (reject-with icmp-host-prohibited).

Our workaround is to replicate the rules added by Calico to the iptables built-in `FORWARD` chain (19 to 21) in firewalld's `FORWARD_direct` chain, which is traversed thanks to rule 12. This is done permanently with the following firewalld commands:
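A sketch of such commands, assuming the default 10.42.0.0/16 pod CIDR and Calico's 0x10000 accept mark; verify the actual rules 19 to 21 on your own nodes first:

```bash
# Replicate Calico's three FORWARD rules into firewalld's FORWARD_direct chain.
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD_direct 1 \
  -m mark --mark 0x10000/0x10000 -j ACCEPT   # "Policy explicitly accepted packet"
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD_direct 2 \
  -s 10.42.0.0/16 -j ACCEPT
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD_direct 3 \
  -d 10.42.0.0/16 -j ACCEPT
firewall-cmd --reload
firewall-cmd --direct --get-all-rules   # verify the replicated rules
```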
Now when we list the content of the `FORWARD_direct` chain, we should see the three replicated rules. This way Kubernetes packets are correctly routed and firewalld is still rejecting other packets not matching these rules.
Important: we do not remove Calico's original rules from the `FORWARD` chain (they are never reached), so we can still compare them to the replicated rules to check whether Calico changed them.

@andrew-landsverk-win I tried that, and I was not able to get a better/working result. @finnzi Do you happen to have a link to Rancher stating that? It would help me with researching the problem more. So far I have not observed any issues with the cluster, though nothing is cut over from prod yet. Still testing.
This is an ongoing issue, with the missing route and the firewalld errors on CentOS 7 clusters. Is there any new guidance for this issue? Currently the “best” workaround seems to be to disable firewalld.
This is more or less expected behavior, and comes with using Calico v3.16.x+.