calico: Calico in eBPF mode doesn't forward traffic from AWS NLBs properly
Motivating Scenario
After creating a Kubernetes Service of type “LoadBalancer” annotated with “service.beta.kubernetes.io/aws-load-balancer-type” set to “nlb,” the Kubernetes AWS cloud provider creates an EC2 Network Load Balancer. My Service has an external traffic policy of “Cluster,” in a Kubernetes cluster with three worker machines spanning three subnets in three availability zones, so in this simple case the NLB winds up with one listener per Service port, each forwarding to a target group with three targets, reached on the corresponding node port. Since the external traffic policy is “Cluster,” all three worker machines are eligible targets, regardless of whether a pod selected by the Service is running there.
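For concreteness, here is a minimal sketch of the kind of Service I’m describing, expressed with the official Kubernetes Python client; the name, selector, and port numbers are placeholders rather than my actual manifest:

```python
# Sketch of the Service described above, using the official "kubernetes"
# Python client. The name, selector, and port numbers are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

service = client.V1Service(
    metadata=client.V1ObjectMeta(
        name="echo",  # hypothetical Service name
        annotations={
            # Ask the AWS cloud provider for an NLB rather than a Classic ELB.
            "service.beta.kubernetes.io/aws-load-balancer-type": "nlb",
        },
    ),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        # "Cluster" makes every worker machine an eligible target, whether or
        # not it hosts a pod selected by the Service.
        external_traffic_policy="Cluster",
        selector={"app": "echo"},  # hypothetical pod selector
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)

core.create_namespaced_service(namespace="default", body=service)
```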
I have one pod running that’s selected by this Service. Calico is in eBPF mode, with BIRD as the backend (for now, per kops issue kubernetes/kops#10168), and my lone IP pool is using VXLAN in cross-subnet mode.
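For completeness, the pool’s encapsulation mode can be confirmed through the IPPool CRD; a small sketch that assumes the conventional default pool name:

```python
# Confirm the IP pool's encapsulation mode via Calico's IPPool CRD.
# "default-ipv4-ippool" is the conventional default name and may differ.
from kubernetes import client, config

config.load_kube_config()
pool = client.CustomObjectsApi().get_cluster_custom_object(
    group="crd.projectcalico.org",
    version="v1",
    plural="ippools",
    name="default-ipv4-ippool",
)
# For the setup described above this should report vxlanMode "CrossSubnet".
print(pool["spec"].get("vxlanMode"), pool["spec"].get("ipipMode"))
```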
The network ACLs in play are all wide open, allowing everything, and the security group rules protecting the worker machines are now also wide open, allowing all traffic on all protocols from everywhere, in order to rule both out as the cause of whatever is blocking this traffic.
Expected Behavior
Connections established with and traffic sent through the NLB should arrive at the target pod, whether or not the NLB has cross-zone load balancing enabled. Even with cross-zone load balancing disabled, each NLB listener should contact the worker machine target in its own availability zone on the node port, and that machine should forward the packets, if necessary, to a different worker machine that’s hosting a target pod.
Current Behavior
My pod only receives this traffic in two cases:
- With cross-zone load balancing disabled, only requests landing at the NLB listener sitting in the same zone as the target pod’s hosting machine reach the target pod. That is, contacting a listener in a different zone fails, even though the AWS API reports the target group as having three healthy targets (again, one per zone) and includes all three listeners’ IP addresses in the NLB’s DNS A record.
- With cross-zone load balancing enabled, approximately one third of the requests landing at any of the NLB listeners reach the target pod. Since each listener can contact any target, there’s roughly a one in three chance of it choosing the machine hosting the target pod.
Presented differently, consider machines M1, M2, and M3 in availability zones Z1, Z2, and Z3, with one pod running on machine M2 in zone Z2 (denoted by 🎯 below).
| Machine | Zone | Success with Cross-Zone LB Enabled | Success with Cross-Zone LB Disabled |
|---|---|---|---|
| M1 | Z1 | ~33% | 🚫 0% |
| M2 🎯 | Z2 | ~33% | ✅ 100% |
| M3 | Z3 | ~33% | 🚫 0% |
I visited each of these three machines and ran tcpdump against different network interfaces in turn, looking to see where the telltale traffic arrived. The successful requests’ HTTP content appeared on one of the veth-prefixed interfaces, originating from the original client IP address on a random port and destined for the pod IP address on the container port.
Past that, I tried, and I mean really tried, to find evidence of this traffic arriving and getting blocked on other interfaces, but to no avail. With cross-zone load balancing disabled, I figured that a given NLB listener could only send traffic to the one worker machine in its zone. Running tcpdump on one of the two worker machines not hosting the target pod, I expected to see traffic arriving and either getting blocked before leaving that machine, or getting forwarded to the machine hosting the pod. There were lots of packets arriving from the NLB listeners, probably the health checks, all small with no payload, but I couldn’t see any of the HTTP requests arriving.
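Roughly what I was looking for on each interface, expressed here as a scapy sketch rather than my exact tcpdump invocations; the interface name and node port are placeholders:

```python
# Rough equivalent of the tcpdump sessions described above, as a scapy sketch.
# The interface name and node port below are placeholders for my actual values.
from scapy.all import TCP, Raw, sniff

NODE_PORT = 30080  # hypothetical node port assigned to the Service

def show(pkt):
    # Print only packets carrying a payload, which skips the NLB health
    # checks: those arrive as small packets with no data.
    if pkt.haslayer(TCP) and pkt.haslayer(Raw):
        print(pkt.summary(), len(pkt[Raw].load), "payload bytes")

# Capture on one interface at a time (the primary ENI, a vethXXXX device, etc.).
sniff(iface="eth0", filter=f"tcp port {NODE_PORT}", prn=show, store=False)
```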
The VPC flow logs indicate no rejected traffic involving the service port, node port, or container port. I see most, if not all, of my connection attempts present in the flow logs with an “ACCEPT” outcome; any connection attempts that are missing don’t show up with a “REJECT” outcome either.
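For reference, this is roughly how I combed the flow logs, assuming they’re delivered to a CloudWatch Logs group in the default format; the log group name and port numbers are placeholders:

```python
# Scan VPC flow logs for rejected traffic on the relevant ports, assuming the
# logs land in CloudWatch Logs in the default format. The log group name and
# port numbers are placeholders.
import boto3

LOG_GROUP = "/vpc/flow-logs"      # hypothetical log group name
PORTS = {"80", "30080", "8080"}   # service, node, and container ports

logs = boto3.client("logs", region_name="us-east-2")
paginator = logs.get_paginator("filter_log_events")

# Term filter: only events whose text contains "REJECT".
for page in paginator.paginate(logGroupName=LOG_GROUP, filterPattern="REJECT"):
    for event in page["events"]:
        fields = event["message"].split()
        # Default format: srcport and dstport are the sixth and seventh fields.
        if len(fields) > 6 and PORTS & {fields[5], fields[6]}:
            print(event["message"])
```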
Variation and Comparison
If I switch from eBPF mode to kube-proxy with either iptables or ipvs, with no other changes, all requests succeed. If I switch from an NLB to a Classic Load Balancer, with no other changes, all requests succeed. If I switch from cross-subnet encapsulation to always encapsulating, the outcome doesn’t change. If I switch from VXLAN to IP-in-IP encapsulation, the outcome doesn’t change. (I think. I may need to test that again.)
Steps to Reproduce
- Set up Calico using eBPF mode in a Kubernetes cluster with machines spread across multiple subnets, and probably across multiple availability zones. Unfortunately, it is difficult for me to add more subnets in a single zone to see whether it’s the zonal or the subnet-level separation that makes the difference here.
- Create a Kubernetes Service of type “LoadBalancer,” annotated to summon an NLB. I’m not certain that there’s anything special about Kubernetes here; this same problem would probably apply to a hand-crafted NLB, and probably outside of a Kubernetes cluster.
- Look up the IP addresses of the NLB’s listeners using its DNS A record (see the probe sketch after this list).
- Create a pod with a server, such as NGINX running an “echo” handler. Note which node hosts the pod, and in which availability zone that node’s machine sits.
- Issue requests against each of the NLB listeners with cross-zone load balancing disabled, and observe that requests against only one of the listeners succeed.
- Issue requests against each of the NLB listeners with cross-zone load balancing enabled, and observe that all listeners behave similarly, each succeeding approximately 1 / (listener count) of the time.
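A rough sketch of the probe behind the last few steps: resolve the NLB’s DNS A record to its per-zone listener addresses, then issue requests against each address and tally the successes. The hostname, port, and attempt count are placeholders:

```python
# Resolve the NLB's DNS A record to its per-zone listener addresses, then
# issue requests against each address and tally successes. The hostname,
# port, and attempt count are placeholders.
import collections
import socket
import urllib.error
import urllib.request

NLB_HOST = "example-0123456789.elb.us-east-2.amazonaws.com"  # hypothetical
PORT = 80
ATTEMPTS = 30

# One A record per enabled availability zone.
listener_ips = sorted({info[4][0] for info in
                       socket.getaddrinfo(NLB_HOST, PORT, socket.AF_INET)})

successes = collections.Counter()
for ip in listener_ips:
    for _ in range(ATTEMPTS):
        request = urllib.request.Request(f"http://{ip}:{PORT}/",
                                         headers={"Host": NLB_HOST})
        try:
            with urllib.request.urlopen(request, timeout=5):
                successes[ip] += 1
        except (urllib.error.URLError, socket.timeout):
            pass  # connection timed out or was refused

for ip in listener_ips:
    print(f"{ip}: {successes[ip]}/{ATTEMPTS} requests succeeded")
```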
Context
We would like to have the option of using NLBs for our Kubernetes Services together with Calico in eBPF mode, partly to take advantage of being able to preserve client IP addresses with external traffic policy “Cluster” without using the PROXY protocol.
Your Environment
- Cloud Provider: AWS EC2 (“us-east-2” region, VPC networking)
- Calico version: 3.16.4 (as deployed by kops 1.19)
- Orchestrator version: Kubernetes 1.19.3
- Operating System and version: Flatcar Container Linux
/etc/os-release
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=2643.1.0
VERSION_ID=2643.1.0
BUILD_ID=2020-10-13-1801
PRETTY_NAME="Flatcar Container Linux by Kinvolk 2643.1.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar-linux.org/"
BUG_REPORT_URL="https://issues.flatcar-linux.org"
Comments: 18 (5 by maintainers); selected replies follow.
My testing confirms that this problem is fixed by projectcalico/felix#2589 (and projectcalico/felix#2588). Thank you, @tomastigera and @fasaxc!
I did confirm that changing my Service to use an external traffic policy of “Local” alleviates this problem.
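The change amounts to flipping that one field on the existing Service; a minimal sketch with the Python client, where the Service name and namespace are placeholders:

```python
# Flip the existing Service's external traffic policy to "Local".
# The Service name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()
client.CoreV1Api().patch_namespaced_service(
    name="echo",
    namespace="default",
    body={"spec": {"externalTrafficPolicy": "Local"}},
)
```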
From outside the cluster. I tried these experiments both from outside the VPC and from inside on a bastion machine, though, with equivalent results.
It’s internet-facing (public).
It times out trying to establish the TCP connection.
Right, these targets are registered by instance ID, which makes them ineligible for hairpin connections, as you noted below.
I don’t think so. Considering a client outside of AWS’s network, the connection attempt lands at the NLB listener, which should in turn connect to any of my Kubernetes worker machines. Each of those might in turn establish a connection to a sibling machine, but none of them should go back out to the NLB as a client for any reason.
@hakman, @rifelpet, and @johngmyers, I thought you might be interested in this issue at the intersection of kops, Calico, and NLBs.