calico: projectcalico.org/IPv4Address annotation pointing to wrong node's IP CIDR
On Kubernetes clusters > 1.20, calico-node can fail to start because another node's projectcalico.org/IPv4Address annotation points at this node's IP. When it occurs, the logs from the failing calico-node (running on node2 in this example) look like:
startup/startup.go 411: Determined node name: node2
startup/startup.go 103: Starting node node2 with version v3.18.1
...
startup/reachaddr.go 57: Checking CIDR CIDR="10.240.0.4/16"
startup/reachaddr.go 59: Found matching interface CIDR CIDR="10.240.0.4/16"
startup/startup.go 808: Using autodetected IPv4 address 10.240.0.4/16, detected by connecting to 1.1.1.1
startup/startup.go 585: Node IPv4 changed, will check for conflicts
startup/startup.go 1128: Calico node 'node1' is already using the IPv4 address 10.240.0.4.
<----- problem
startup/startup.go 347: Clearing out-of-date IPv4 address from this node IP="10.240.0.4/16"
startup/startup.go 1340: Terminating
If you look at the annotations on node1, it will show projectcalico.org/IPv4Address: 10.240.0.4/16. However, 10.240.0.4 is not node1's IP; it is node2's IP. Thus node1's IP annotation is incorrect.
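One quick way to see the mismatch (a sketch; it assumes the example node names node1 and node2 from the logs above):
# node1's Calico annotation -- in this failure it wrongly contains node2's IP
kubectl describe node node1 | grep projectcalico.org/IPv4Address
# node2's actual internal IP (INTERNAL-IP column), for comparison
kubectl get node node2 -o wide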
Expected Behavior
All nodes receive an annotation that matches the node's IP.
Example: If the internal IP of a node is 10.240.0.4, then the annotation it receives from calico-node should be projectcalico.org/IPv4Address: 10.240.0.4/16.
Current Behavior
A node will receive an annotation that does not match its IP.
Example: A node's internal IP may be 10.240.0.5, but the annotation it receives from calico-node will be projectcalico.org/IPv4Address: 10.240.0.4/16.
Steps to Reproduce (for bugs)
The issue seems to be intermittent, with low reproducibility. However, every time it has happened it has been on a cluster coming from an upgrade; in particular, clusters upgrading from a Kubernetes version below 1.20 to 1.20 or above, which introduces the tigera-operator for managing the installation of Calico.
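For context, a quick way to confirm that a cluster is running the operator-managed installation after such an upgrade (a sketch, assuming the default namespaces and the k8s-app=calico-node pod label used by a standard tigera-operator install):
# The operator itself runs in the tigera-operator namespace
kubectl get deployment -n tigera-operator tigera-operator
# The calico-node pods it manages run in calico-system
kubectl get pods -n calico-system -l k8s-app=calico-node -o wide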
Context
It seems likely that this comes from a race during an upgrade, possibly similar to https://github.com/projectcalico/calico/issues/4525.
Your Environment
- Calico version: v3.18.1 (from tigera-operator v1.15.1)
- Orchestrator version: Upgraded to 1.20.x from previous version
- Operating System and version: Ubuntu 18
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 15 (6 by maintainers)
Mitigation
If anyone comes across this problem in their cluster, here are the mitigation steps:
1. Identify which node has the incorrect annotation by running
kubectl logs <failed calico-node pod> -n calico-system
and looking for a line like:
startup/startup.go 1128: Calico node 'node1' is already using the IPv4 address <IP of a different node>
2. Check node1 and see if its projectcalico.org/IPv4Address annotation does not match its internal IP. If so, update the annotation and then restart the calico-node running on that node so it receives the correct annotation:
kubectl annotate node <node1> projectcalico.org/IPv4Address= --overwrite
kubectl delete pod <running calico-node pod on node1> -n calico-system
3. At this point, when the failed calico-node restarts it will come up correctly, but if you're in a hurry you can restart it manually:
kubectl delete pod <failed calico-node pod> -n calico-system
All the projectcalico.org/IPv4Address node annotations should now match their respective nodes' IPs, which will allow all calico-node pods to run.
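For reference, here are the steps above condensed into a small shell sketch. The node name, namespace, and pod label are assumptions (calico-system and k8s-app=calico-node match a default tigera-operator install); substitute the values from your own cluster:
# Hypothetical placeholders -- substitute the values from your own cluster.
BAD_NODE=node1            # node whose annotation points at another node's IP
NS=calico-system          # namespace of calico-node under the operator install
# 1. Confirm the mismatch between the annotation and the node's InternalIP.
kubectl get node "$BAD_NODE" -o wide
kubectl describe node "$BAD_NODE" | grep projectcalico.org/IPv4Address
# 2. Clear the stale annotation.
kubectl annotate node "$BAD_NODE" projectcalico.org/IPv4Address= --overwrite
# 3. Restart the calico-node pod running on that node so it rewrites the annotation.
kubectl delete pod -n "$NS" -l k8s-app=calico-node \
  --field-selector spec.nodeName="$BAD_NODE"
# 4. Optionally restart the failing calico-node pod on the other node as well,
#    instead of waiting for its next automatic restart.
kubectl delete pod -n "$NS" <failed calico-node pod>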
Fix is in (upgrade will no longer copy annotations/labels with projectcalico.org) but will take 2 weeks before it is in all AKS regions.
Yep this is fixed everywhere
@lmm there is a good chance this is AKS upgrade related (I can speak for azure/aks). During an AKS upgrade, for legacy reasons, we preserve node labels by copying them between nodes during the upgrade. We have a blacklist of labels/annotations that I'm going to expand to contain everything under *.projectcalico.org. It should take another 2+ weeks to hit all regions though. If you've seen this on non-AKS clusters then what I say probably doesn't apply.
@lmm I believe you were looking into this?