calico: [kubernetes 1.21] [calico 3.18] pods cannot route traffic to external services
Hi folks,
I’m operating a Kubernetes cluster that until recently was running Calico 3.14. On consumer-facing request paths, the cluster mostly handles incoming user traffic, while occasionally making requests out to internet-based services.
After upgrading to Kubernetes 1.21, an issue report led to the discovery of external connectivity problems from pods in the cluster.
Following that report, I upgraded the cluster to Calico 3.18 to catch up with the latest fixes and the currently supported releases. Despite the Calico 3.18 upgrade, the issue persists.
Documentation for the Kubernetes cluster configuration is available here.
Only a single Calico IPPool is in place; it has natOutgoing enabled (true) and a CIDR that covers the pod network CIDR.
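For reference, this is roughly how I’ve been checking the pool; the pool name and the 192.168.0.0/16 CIDR below are assumptions on my part, inferred from the 192.168.75.x pod IPs mentioned later:

```sh
# List the configured IPPools and dump the single pool as YAML.
# Pool name and CIDR are placeholders/assumptions, not confirmed values.
calicoctl get ippool -o wide
calicoctl get ippool default-ipv4-ippool -o yaml
# Relevant fields expected in the output:
#   spec:
#     cidr: 192.168.0.0/16
#     natOutgoing: true
```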
The host is an Ubuntu Linux 20.04 machine running kernel 5.4.0-72-generic (x86_64).
I’ve previously reported some notes about this under a probably-unrelated issue at https://github.com/projectcalico/calico/issues/3467#issuecomment-820767111.
The only routing table entry that appears inside the pods is a default route via a link-local 169.254.x.x gateway, which I understand is expected as the way traffic is routed out to the Calico host side.
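For completeness, this is the kind of routing table I see inside a pod; the deployment name is just the example from the reproduction steps below, and the addresses are representative:

```sh
# Dump the routing table from inside an affected pod.
kubectl exec -it deployments/blog-deployment -- ip route
# Typical Calico pod output (169.254.1.1 is the usual link-local gateway):
#   default via 169.254.1.1 dev eth0
#   169.254.1.1 dev eth0 scope link
```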
Expected Behavior
All four types of IP traffic below should be routable:
- ✔️ Pod-to-internet traffic
- ✔️ Pod-to-kubernetes-DNS-service traffic
- ✔️ Pod-to-kubernetes-service traffic
- ✔️ Pod-to-pod traffic
Current Behavior
Three of the four types of traffic are routable (the test commands are collected into a runnable sketch after this list):
- ✖️ Pod-to-internet traffic (tested by using `wget github.com`)
- ✔️ Pod-to-kubernetes-DNS-service traffic (tested by using `nslookup github.com 10.96.0.10` – IP from `/etc/resolv.conf`)
- ✔️ Pod-to-kubernetes-service traffic (tested by using `wget x-service` – service name from `kubectl describe svc`)
- ✔️ Pod-to-pod traffic (tested by using `wget 192.168.75.x` – pod IP from `kubectl describe pods`)
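Here are the checks collected into one sequence; the service name, pod IP, and deployment name are the placeholder examples from the list above:

```sh
# Open a shell inside an affected pod...
kubectl exec -it deployments/blog-deployment -- sh

# ...then run the four checks from inside the pod:
wget -T 5 github.com            # pod-to-internet: currently fails
nslookup github.com 10.96.0.10  # pod-to-DNS-service: works
wget -T 5 x-service             # pod-to-service: works (substitute a real service name)
wget -T 5 192.168.75.x          # pod-to-pod: works (substitute a real pod IP)
```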
Possible Solution
My guess is that this relates to CIDR allocations, iptables rules, or connection tracking; the allocation of the DNS service on a 10.x subnet may also be a factor.
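A few node-side things I plan to inspect along those lines; the grep patterns are the standard Calico/kube-proxy chain prefixes, and the addresses are placeholders:

```sh
# Outgoing-NAT rules installed by Calico (natOutgoing should produce MASQUERADE rules):
sudo iptables-save -t nat | grep -i masquerade
sudo iptables-save -t nat | grep -i cali | head -n 20

# kube-proxy's service DNAT programming for cluster services:
sudo iptables-save -t nat | grep KUBE-SERVICES | head -n 20

# Connection tracking entries for a test pod (substitute a real pod IP):
sudo conntrack -L -s 192.168.75.x 2>/dev/null | head -n 20
```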
Despite the strikethrough commentary in https://github.com/projectcalico/calico/issues/3467#issuecomment-821885318, I still think there may be something going on related to WorkloadEndpointUpdate events.
Steps to Reproduce (for bugs)
- Configure a Kubernetes cluster following the instructions here
- Create a deployment in the cluster (`make build && make deploy` of a `blog` service is a quickly-deployable example I’ve been using)
- Attempt outbound connectivity from a pod within the deployment (`kubectl exec -it deployments/blog-deployment -- sh` …)
Context
One of the workloads within the cluster makes outbound SMTP requests in order to send feedback emails; it is this functionality that is currently broken, and this is what led to the discovery of the connectivity issues.
Your Environment
- Calico version 3.18
- Orchestrator version (e.g. kubernetes, mesos, rkt): `kubectl` and `kubeadm` managing a CRI-O Kubernetes cluster
- Operating System and version: Ubuntu 20.04
Edit 1: fixup for the working-and-not-working traffic types (intended: DNS and external traffic routing are not working)
Edit 2: update to indicate that routing for DNS is working
For the record: `coredns` pod logs appear very similar.

@jayaddison sorry for the delay!
NetworkManager sounds like a potential culprit… we have some steps you can take to configure it to stand down: https://docs.projectcalico.org/maintenance/troubleshoot/troubleshooting#configure-networkmanager
Essentially:
Create the following configuration file at /etc/NetworkManager/conf.d/calico.conf to prevent NetworkManager from interfering with the interfaces:
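Roughly like this — the interface patterns follow the linked doc; newer Calico versions add more (e.g. vxlan.calico), so treat this as a sketch rather than the canonical version:

```sh
# Tell NetworkManager not to manage Calico's interfaces.
sudo tee /etc/NetworkManager/conf.d/calico.conf <<'EOF' >/dev/null
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*
EOF
# A NetworkManager restart/reload may be needed for the change to take effect.
sudo systemctl restart NetworkManager
```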
Might be worth a shot if you’re still seeing this.
Very curious that it’s a single-node cluster (meaning the source pod and the DNS pod are likely co-located).
One other thing to try would be to do an nslookup using the exact IP address of the coredns pod (not the Service IP) to see if there is a difference between accessing the DNS server via the Service or not.
If it can be accessed via pod IP but not via service IP, then that points to something wrong with kube-proxy programming service rules.
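Something along these lines — the label selector is the usual coredns one, and the deployment name is the example from the reproduction steps:

```sh
# Find the coredns pod IP(s)...
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide

# ...then query a pod IP directly, bypassing the 10.96.0.10 Service IP:
kubectl exec -it deployments/blog-deployment -- nslookup github.com <coredns-pod-ip>
```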
Other thing might be if some other agent like NetworkManager or something is trying to take control over the interfaces created by Calico. Given all the pods are on the same node, there’s not a ton of network convergence that needs to happen for pods to talk to each other (i.e., no BGP or route sharing across nodes, etc). But, something might be fighting with Calico’s local programming.
Yeah, let’s leave this open for a little bit longer. It does seem suspicious that it just “went away”.
I wouldn’t expect this to be a problem with the routing table within the pods - the network configuration within a pod is pretty unaware of where the traffic is ultimately going and what you’ve described sounds like business as usual in that area.
Most likely this is something to do with your nodes, or something else in your network between the nodes and the internet. Where are you running this cluster, by the way?
Sorry, I made a mistake in the list of working/not-working traffic types. Trust me to make the pretty emojified traffic summary first, spend ages on the writeup, and then not notice that the (important) summary itself was incorrect.