calico: projectcalico.org/IPv4Address annotation pointing to wrong node's IP CIDR

On Kubernetes clusters newer than 1.20, calico-node fails for this reason. When it occurs, the logs from the affected calico-node (running on node2 in this example) look like:

  startup/startup.go 411: Determined node name: node2
  startup/startup.go 103: Starting node node2 with version v3.18.1
  ...
  startup/reachaddr.go 57: Checking CIDR CIDR="10.240.0.4/16"
  startup/reachaddr.go 59: Found matching interface CIDR CIDR="10.240.0.4/16"
  startup/startup.go 808: Using autodetected IPv4 address 10.240.0.4/16, detected by connecting to 1.1.1.1
  startup/startup.go 585: Node IPv4 changed, will check for conflicts
  startup/startup.go 1128: Calico node 'node1' is already using the IPv4 address 10.240.0.4. <----- problem
  startup/startup.go 347: Clearing out-of-date IPv4 address from this node IP="10.240.0.4/16"
  startup/startup.go 1340: Terminating

If you look at the annotations on node1, it will show projectcalico.org/IPv4Address: 10.240.0.4/16. However, 10.240.0.4 is not node1’s IP; it is node2’s IP. Thus node1’s IP annotation is incorrect.
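One quick way to confirm the mismatch; node1 here is a placeholder for the node named in the log:

  # Show the Calico IPv4 annotation currently set on node1
  kubectl get node node1 -o yaml | grep projectcalico.org/IPv4Address

  # Compare against the InternalIP Kubernetes reports for node1 (INTERNAL-IP column)
  kubectl get node node1 -o wide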

Expected Behavior

All nodes should receive an annotation that matches the node’s IP. Example: if the internal IP of a node is 10.240.0.4, then the annotation it receives from calico-node should be projectcalico.org/IPv4Address: 10.240.0.4/16.

Current Behavior

A node will receive an annotation that does not match its IP. Example: a node’s internal IP may be 10.240.0.5, but the annotation it receives from calico-node will be projectcalico.org/IPv4Address: 10.240.0.4/16.
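A rough cluster-wide check for this mismatch, assuming you just want to eyeball the two outputs side by side:

  # InternalIP of every node (INTERNAL-IP column)
  kubectl get nodes -o wide

  # Calico IPv4 annotation of every node; a node whose annotation does not
  # contain its own InternalIP is affected by this bug
  kubectl describe nodes | grep -E "^Name:|projectcalico.org/IPv4Address"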

Steps to Reproduce (for bugs)

The issue seems to be intermittent, with low reproducibility. However, every time it has happened, it has been on clusters that went through an upgrade.

In particular, clusters upgrading from a Kubernetes version below 1.20 to one above 1.20, which introduces the tigera-operator for managing the Calico installation.

Context

It seems likely that this comes from a race during an upgrade, possibly similar to https://github.com/projectcalico/calico/issues/4525.

Your Environment

  • Calico version: v3.18.1 (from tigera-operator v1.15.1)
  • Orchestrator version: Upgraded to 1.20.x from previous version
  • Operating System and version: Ubuntu 18

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Mitigation

If anyone comes across this problem in their cluster, here are the mitigation steps:

Identify which node has the incorrect annotation by running kubectl logs <failed calico-node pod> -n calico-system and looking for a line like:

  startup/startup.go 1128: Calico node 'node1' is already using the IPv4 address <IP of different node>
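For example, assuming an operator-managed install where calico-node runs in the calico-system namespace (the pod name is a placeholder):

  # Find the failing calico-node pod and the node it is scheduled on
  kubectl get pods -n calico-system -o wide

  # Pull its logs and look for the conflict message
  kubectl logs calico-node-xxxxx -n calico-system | grep "is already using the IPv4 address"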

Check node1 and see if it has a projectcalico.org/IPv4Address annotation that does not match its internal IP. If so, update the annotation and then restart the calico-node pod running on that node so it receives the correct annotation (a worked example follows these steps):

  1. kubectl annotate node <node1> projectcalico.org/IPv4Address= --overwrite
  2. kubectl delete pod <running calico-node pod on node1> -n calico-system
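A sketch of those same two steps with placeholder names, plus a check that the annotation gets re-populated once the pod restarts:

  # 1. Clear the stale annotation on node1 (placeholder node name)
  kubectl annotate node node1 projectcalico.org/IPv4Address= --overwrite

  # 2. Restart the calico-node pod scheduled on node1 (placeholder pod name)
  kubectl delete pod calico-node-abcde -n calico-system

  # Verify the annotation now reflects node1's own IP
  kubectl get node node1 -o yaml | grep projectcalico.org/IPv4Address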

At this point, when the failed calico-node restarts, it will come up correctly, but if you’re in a hurry you can restart it manually:

  1. kubectl delete pod <failed calico-node pod> -n calico-system

All the projectcalico.org/IPv4Address node annotations should now match their respective node IPs, which will allow all calico-node pods to run.

Fix is in (upgrade will no longer copy annotations/labels with projectcalico.org) but will take 2 weeks before it is in all AKS regions.

Yep this is fixed everywhere

@lmm there is a good chance this is AKS upgrade related (I can speak for Azure/AKS). During an AKS upgrade, for legacy reasons, we preserve node labels by copying them between nodes during the upgrade. We have a blacklist of labels/annotations that I’m going to expand to contain everything under *.projectcalico.org. Should take another 2+ weeks to hit all regions though. If you’ve seen this on non-AKS clusters then what I say probably doesn’t apply.

@lmm I believe you were looking into this?