calico: [kubernetes 1.21] [calico 3.18] pods cannot route traffic to external services
Hi folks,
I’m operating a Kubernetes cluster that until recently was running Calico 3.14. On consumer-facing request paths, the cluster mostly handles incoming user traffic, while occasionally making requests out to internet-based services.
After upgrading to Kubernetes 1.21, an issue report led to the discovery of external connectivity problems from pods in the cluster.
Following that report, I upgraded the cluster to Calico 3.18 to catch up with the latest fixes and the currently supported releases. Despite the Calico 3.18 upgrade, the issue persists.
Documentation for the Kubernetes cluster configuration is available here.
Only a single Calico IPPool is in place; it has natOutgoing enabled (true) and a CIDR that covers the pod network CIDR.
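For reference, this is roughly how I’ve been checking the pool; the pool name and the 192.168.0.0/16 CIDR below are assumptions on my part, inferred from the 192.168.75.x pod IPs mentioned later:

```sh
# List the configured IPPools and dump the single pool as YAML.
# Pool name and CIDR are placeholders/assumptions, not confirmed values.
calicoctl get ippool -o wide
calicoctl get ippool default-ipv4-ippool -o yaml
# Relevant fields expected in the output:
#   spec:
#     cidr: 192.168.0.0/16
#     natOutgoing: true
```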
The host is an Ubuntu Linux 20.04 machine running kernel 5.4.0-72-generic (x86_64).
I’ve previously reported some notes about this under a probably-unrelated issue at https://github.com/projectcalico/calico/issues/3467#issuecomment-820767111.
The only routing table entry that appears inside the pods is a default route via a link-local 169.254.x.x gateway, which I understand is expected as the way traffic is routed out to the Calico host side.
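For completeness, this is the kind of routing table I see inside a pod; the deployment name is just the example from the reproduction steps below, and the addresses are representative:

```sh
# Dump the routing table from inside an affected pod.
kubectl exec -it deployments/blog-deployment -- ip route
# Typical Calico pod output (169.254.1.1 is the usual link-local gateway):
#   default via 169.254.1.1 dev eth0
#   169.254.1.1 dev eth0 scope link
```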
Expected Behavior
All four types of IP traffic below should be routable:
- ✔️ Pod-to-internet traffic
- ✔️ Pod-to-kubernetes-DNS-service traffic
- ✔️ Pod-to-kubernetes-service traffic
- ✔️ Pod-to-pod traffic
Current Behavior
Three of the four types of traffic are routable (the test commands are collected into a runnable sketch after this list):
- ✖️ Pod-to-internet traffic (tested by using `wget github.com`)
- ✔️ Pod-to-kubernetes-DNS-service traffic (tested by using `nslookup github.com 10.96.0.10` – IP from `/etc/resolv.conf`)
- ✔️ Pod-to-kubernetes-service traffic (tested by using `wget x-service` – service name from `kubectl describe svc`)
- ✔️ Pod-to-pod traffic (tested by using `wget 192.168.75.x` – pod IP from `kubectl describe pods`)
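Here are the checks collected into one sequence; the service name, pod IP, and deployment name are the placeholder examples from the list above:

```sh
# Open a shell inside an affected pod...
kubectl exec -it deployments/blog-deployment -- sh

# ...then run the four checks from inside the pod:
wget -T 5 github.com            # pod-to-internet: currently fails
nslookup github.com 10.96.0.10  # pod-to-DNS-service: works
wget -T 5 x-service             # pod-to-service: works (substitute a real service name)
wget -T 5 192.168.75.x          # pod-to-pod: works (substitute a real pod IP)
```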
Possible Solution
My guess is that this relates to CIDR allocations, iptables rules, or connection tracking; the allocation of the DNS service on a 10.x subnet may also be a factor.
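A few node-side things I plan to inspect along those lines; the grep patterns are the standard Calico/kube-proxy chain prefixes, and the addresses are placeholders:

```sh
# Outgoing-NAT rules installed by Calico (natOutgoing should produce MASQUERADE rules):
sudo iptables-save -t nat | grep -i masquerade
sudo iptables-save -t nat | grep -i cali | head -n 20

# kube-proxy's service DNAT programming for cluster services:
sudo iptables-save -t nat | grep KUBE-SERVICES | head -n 20

# Connection tracking entries for a test pod (substitute a real pod IP):
sudo conntrack -L -s 192.168.75.x 2>/dev/null | head -n 20
```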
Despite the strikethrough commentary in https://github.com/projectcalico/calico/issues/3467#issuecomment-821885318, I still think there may be something going on related to WorkloadEndpointUpdate events.
Steps to Reproduce (for bugs)
- Configure a Kubernetes cluster following the instructions here
- Create a deployment in the cluster (`make build && make deploy` of a `blog` service is a quickly-deployable example I’ve been using)
- Attempt outbound connectivity from a pod within the deployment (`kubectl exec -it deployments/blog-deployment -- sh` …)
Context
One of the workloads within the cluster makes outbound SMTP requests in order to send feedback emails; it is this functionality that is currently broken, and this is what led to the discovery of the connectivity issues.
Your Environment
- Calico version 3.18
- Orchestrator version (e.g. kubernetes, mesos, rkt): `kubectl` and `kubeadm` managing a CRI-O Kubernetes cluster
- Operating System and version: Ubuntu 20.04
Edit 1: fixup for the working-and-not-working traffic types (intended: DNS and external traffic routing are not working)
Edit 2: update to indicate that routing for DNS is working
For the record: `coredns` pod logs appear very similar.

@jayaddison sorry for the delay!
NetworkManager sounds like a potential culprit… we have some steps you can take to configure it to stand down: https://docs.projectcalico.org/maintenance/troubleshoot/troubleshooting#configure-networkmanager
Essentially:
Create the following configuration file at /etc/NetworkManager/conf.d/calico.conf to prevent NetworkManager from interfering with the interfaces:
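Roughly like this — the interface patterns follow the linked doc; newer Calico versions add more (e.g. vxlan.calico), so treat this as a sketch rather than the canonical version:

```sh
# Tell NetworkManager not to manage Calico's interfaces.
sudo tee /etc/NetworkManager/conf.d/calico.conf <<'EOF' >/dev/null
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*
EOF
# A NetworkManager restart/reload may be needed for the change to take effect.
sudo systemctl restart NetworkManager
```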
Might be worth a shot if you’re still seeing this.
Very curious that it’s a single-node cluster (meaning the source pod and the DNS pod are likely co-located).
One other thing to try would be to do an nslookup using the exact IP address of the coredns pod (not the Service IP) to see if there is a difference between accessing the DNS server via the Service or not.
If it can be accessed via pod IP but not via service IP, then that points to something wrong with kube-proxy programming service rules.
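Something along these lines — the label selector is the usual coredns one, and the deployment name is the example from the reproduction steps:

```sh
# Find the coredns pod IP(s)...
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide

# ...then query a pod IP directly, bypassing the 10.96.0.10 Service IP:
kubectl exec -it deployments/blog-deployment -- nslookup github.com <coredns-pod-ip>
```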
Other thing might be if some other agent like NetworkManager or something is trying to take control over the interfaces created by Calico. Given all the pods are on the same node, there’s not a ton of network convergence that needs to happen for pods to talk to each other (i.e., no BGP or route sharing across nodes, etc). But, something might be fighting with Calico’s local programming.
Yeah, let’s leave this open for a little bit longer. It does seem suspicious that it just “went away”.
I wouldn’t expect this to be a problem with the routing table within the pods - the network configuration within a pod is pretty unaware of where the traffic is ultimately going and what you’ve described sounds like business as usual in that area.
Most likely this is something to do with your nodes, or something else in your network between the nodes and the internet. Where are you running this cluster, by the way?
Sorry, I made a mistake in the list of working/not-working traffic types. Trust me to make the pretty emojified traffic summary first, spend ages on the writeup, and then not notice that the (important) summary itself was incorrect.