operator: Tigera Operator unable to uninstall cleanly
Expected Behavior
When uninstalling the Tigera Operator helm chart, it should remove all resources and clean itself up. We should then be able to reinstall the Tigera Operator without having to restart the nodes in an EKS cluster.
Current Behavior
Uninstalling the helm release locally (using `helm delete -n tigera-operator tigera-operator`) removes the chart successfully. But when reinstalling the same chart, the `tigera-operator` pod keeps crashing and no `calico-node` pods get created. The following are the logs from the `tigera-operator` container:
2022/06/20 15:13:44 [INFO] Version: v1.27.5
2022/06/20 15:13:44 [INFO] Go Version: go1.17.9b7
2022/06/20 15:13:44 [INFO] Go OS/Arch: linux/amd64
2022/06/20 15:14:04 [ERROR] Get "https://XXX.gr7.eu-west-1.eks.amazonaws.com:443/api?timeout=32s": dial tcp: lookup 1D36CABDB9FD90B6E9FB961600E4045D.gr7.eu-west-1.eks.amazonaws.com on 10.52.X.X:53: read udp 100.65.X.X:50421->10.52.X.X:53: i/o timeout
This breaks all network connectivity in the cluster. Any new images can’t be pulled.
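For reference, logs like the ones above can be pulled straight from the operator with kubectl; the deployment name below is an assumption based on the chart defaults:

```shell
# Tail the operator logs (deployment name assumed from the chart defaults).
kubectl logs -n tigera-operator deployment/tigera-operator --tail=50
```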
When deleting the Tigera Operator helm release through Terraform, the resource fails to delete and times out. This may be due to the following finalizer on the `Installation` resource, which installs Calico:
finalizers:
- tigera.io/operator-cleanup
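As a quick check, the finalizer can be inspected directly on the Installation resource; this is just a read-only sketch using plain kubectl, not anything specific to the chart:

```shell
# Print any finalizers still present on the default Installation resource.
kubectl get installations.operator.tigera.io default -o jsonpath='{.metadata.finalizers}'
```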
This problem began occurring with Tigera Operator helm chart v2.2.0 (Tigera image v1.27.0 & Calico image v3.23.0).
Possible Solution
A temporary workaround is to restart the nodes. The `calico-node` pods then begin running on these new nodes, and the `tigera-operator` pod starts running without any restarts.
Steps to Reproduce (for bugs)
1. Install the chart using the command `helm install tigera-operator stevehipwell/tigera-operator -n tigera-operator --version 2.2.4 --values tigera-values.yaml`. The following values can be used:
dnsPolicy: Default
env:
  - name: KUBERNETES_SERVICE_HOST
    value: XXX.gr7.eu-west-1.eks.amazonaws.com
  - name: KUBERNETES_SERVICE_PORT
    value: "443"
hostNetwork: false
installation:
  enabled: true
  spec:
    cni:
      type: AmazonVPC
    componentResources:
      - componentName: Node
        resourceRequirements:
          limits:
            cpu: 1000m
            memory: 256Mi
          requests:
            cpu: 50m
            memory: 256Mi
      - componentName: Typha
        resourceRequirements:
          limits:
            cpu: 1000m
            memory: 128Mi
          requests:
            cpu: 10m
            memory: 128Mi
      - componentName: KubeControllers
        resourceRequirements:
          limits:
            cpu: 1000m
            memory: 64Mi
          requests:
            cpu: 100m
            memory: 64Mi
    controlPlaneNodeSelector:
      kubernetes.io/os: linux
      lnrs.io/tier: system
    controlPlaneTolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: system
        operator: Exists
    kubernetesProvider: EKS
    nodeMetricsPort: 9091
    nodeUpdateStrategy:
      rollingUpdate:
        maxUnavailable: 25%
      type: RollingUpdate
    registry: quay.io/
    typhaAffinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: lnrs.io/tier
                  operator: In
                  values:
                    - system
    typhaMetricsPort: 9093
    variant: Calico
priorityClassName: ""
rbac:
  create: true
resources:
  limits:
    cpu: 1000m
    memory: 512Mi
  requests:
    cpu: 50m
    memory: 512Mi
serviceAccount:
  create: true
serviceMonitor:
  additionalLabels:
    monitoring-platform: "true"
  enabled: true
tolerations:
  - key: system
    operator: Exists
2. Delete the chart by running `helm delete -n tigera-operator tigera-operator`.
3. Re-install the chart using the same command as in step 1.
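After step 3, the broken state described under Current Behavior can be confirmed with standard kubectl commands, for example:

```shell
# The operator pod shows restarts/CrashLoopBackOff after the re-install.
kubectl get pods -n tigera-operator
# No calico-node pods come up in the calico-system namespace.
kubectl get pods -n calico-system
```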
Context
We are unable to destroy EKS clusters through Terraform, as the destroy times out when uninstalling the Tigera Operator helm release.
Your Environment
- Operating System and version: Amazon EKS, v1.21.
This is possibly related to projectcalico/calico/issues/6210
About this issue
- State: open
- Created 2 years ago
- Reactions: 28
- Comments: 34 (19 by maintainers)
`installations.operator.tigera.io/default` blocks uninstallation in my case.
Adding some of my experience uninstalling the Calico tigera-operator in our EKS cluster:
1. `helm uninstall` to uninstall tigera-operator, but only `tigera-operator` gets uninstalled and the other resources in the `calico-system` namespace still exist (same issue as this thread).
2. `helm install` again to install tigera-operator; all the resources in the `calico-system` namespace then get cleaned up, but only `tigera-operator` is left in the cluster.
3. `helm uninstall` to uninstall tigera-operator again, and then everything is gone.
That’s what I did to uninstall Calico as a workaround; I haven’t dug into the detailed behaviour of this yet. Not sure if anybody else has done this before.
After uninstalling the operator with Helm and cleaning up the rest with kubectl, I’m stuck with a calico-node ServiceAccount in the calico-system namespace that refuses to be deleted. How can I delete it?
Edit: Solved. I had to patch away the finalizer.
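For anyone hitting the same stuck ServiceAccount, the patch mentioned above is presumably something along these lines (use with care: clearing a finalizer skips whatever cleanup it was guarding):

```shell
# Clear the finalizers on the stuck calico-node ServiceAccount so it can be deleted.
kubectl patch serviceaccount calico-node -n calico-system --type=merge -p '{"metadata":{"finalizers":null}}'
```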
+1
For us, the finalizers on `installations.operator.tigera.io/default` block the Tigera helm uninstallation, and the finalizers on `ServiceAccount/calico-node` block deletion of the `calico-system` namespace.
Deleting the `installations.operator.tigera.io` default resource before destroying the Tigera helm release removes the Installation, removes the finalizers on `ServiceAccount/calico-node` in `calico-system`, and everything destroys cleanly.
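A sketch of that ordering, assuming the release name and namespace from the original report:

```shell
# Delete the Installation first so the still-running operator can remove its finalizers.
kubectl delete installations.operator.tigera.io default --wait=true
# Then remove the operator release itself.
helm uninstall tigera-operator -n tigera-operator
```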
@Rambatino in this case it’s explicitly a stuck resource with no associated controller to manage it. I agree that removing finalizers is generally a bad idea but in this case, given the cluster is likely in the process of being destroyed, it’s the lesser of two evils. That said if you’re not destroying the cluster and are removing the finalizer to delete the installation you’re going to need to either replace all the nodes or manually fix them.
Example PR with one approach for resolving this: https://github.com/tigera/operator/pull/2662/
It doesn’t (anymore); although it works for day 0 (kind of), it doesn’t allow for CRD updates, so it’s functionally useless for a project like Tigera Operator where the CRDs change.
Helm understands built-in resource types and their ordering, so it should be good on that front. I also think you can’t delete a resource which is being used by a pod.
This would have been my first thought as well - add a termination grace period, have the operator handle SIGTERM and rather than exit immediately, delay exiting until it is confident it doesn’t need to remove any more finalizers or do any further cleanup. In normal operation that should happen pretty quickly. If we hit the end of the grace period (maybe 60s or so) then we’ll get a SIGKILL and be forced to shutdown.
@SamuZad although you’re technically correct (well, assuming enough of a pause between the `kubectl delete` and the `helm uninstall`), the point here is that the installation method can’t also support uninstallation, which isn’t great UX and means it isn’t suitable for declarative IaC.
The chart `stevehipwell/tigera-operator` is not supported by the tigera/operator team, but I expect this would probably be an issue with the official helm chart also. I’m guessing this happens because the operator deployment is being removed, so the finalizer it adds is not being removed. I believe that first removing the operator CustomResource (CR) (the Installation “default” resource), allowing the operator to remove the finalizer, and then removing the operator would work.
@caseydavenport do you know the correct way to ensure the operator deployment isn’t removed before the Installation CR, since it puts a finalizer on the CR? Should the operator be putting a finalizer on itself (the operator deployment) also? That seems like a bad idea to me; if that were the right way, then it would probably need to put a finalizer on all the resources it uses to prevent helm from removing them before it could clean up too.
Maybe there is a helm chart feature that will delete the Installation CR and ensure it is removed before deleting everything else? (I am doubtful of this since I don’t think helm has much in the way of being able to sequence things.)
Maybe we just need to remove the use of the finalizer which I added in #1710, though given what it was trying to fix, it seems like the same issue would be hit in this use case too.
You should uninstall `custom-resources.yaml` first, and then `tigera-operator.yaml` - just the opposite of the install ordering. If you uninstall the operator first, then it won’t be running in the cluster in order to clean up after itself when `custom-resources.yaml` is deleted.
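For manifest-based installs, that ordering would look roughly like this (file names as used in the Calico install docs):

```shell
# Remove the custom resources first so the still-running operator can clean up after them.
kubectl delete -f custom-resources.yaml --wait=true
# Then remove the operator manifest.
kubectl delete -f tigera-operator.yaml
```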