operator: Tigera Operator unable to uninstall cleanly
Expected Behavior
When uninstalling the Tigera Operator helm chart, it should remove all resources and clean itself up. We should then be able to reinstall the Tigera Operator without having to restart the nodes in an EKS cluster.
Current Behavior
Uninstalling the helm release locally (using `helm delete -n tigera-operator tigera-operator`) removes the chart successfully. But when reinstalling the same chart, the `tigera-operator` pod keeps crashing and no `calico-node` pods get created. The following are the logs from the `tigera-operator` container:
2022/06/20 15:13:44 [INFO] Version: v1.27.5
2022/06/20 15:13:44 [INFO] Go Version: go1.17.9b7
2022/06/20 15:13:44 [INFO] Go OS/Arch: linux/amd64
2022/06/20 15:14:04 [ERROR] Get "https://XXX.gr7.eu-west-1.eks.amazonaws.com:443/api?timeout=32s": dial tcp: lookup 1D36CABDB9FD90B6E9FB961600E4045D.gr7.eu-west-1.eks.amazonaws.com on 10.52.X.X:53: read udp 100.65.X.X:50421->10.52.X.X:53: i/o timeout
This breaks all network connectivity in the cluster. Any new images can’t be pulled.
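For reference, logs like the ones above can be pulled straight from the operator with kubectl; the deployment name below is an assumption based on the chart defaults:

```shell
# Tail the operator logs (deployment name assumed from the chart defaults).
kubectl logs -n tigera-operator deployment/tigera-operator --tail=50
```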
When deleting the Tigera Operator helm release through Terraform, the resource fails to delete and times out. This may be due to the following finalizer on the `Installation` resource, which installs Calico:
finalizers:
- tigera.io/operator-cleanup
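As a quick check, the finalizer can be inspected directly on the Installation resource; this is just a read-only sketch using plain kubectl, not anything specific to the chart:

```shell
# Print any finalizers still present on the default Installation resource.
kubectl get installations.operator.tigera.io default -o jsonpath='{.metadata.finalizers}'
```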
This problem began occurring with Tigera Operator helm chart v2.2.0 (Tigera image v1.27.0 & Calico image v3.23.0).
Possible Solution
A temporary workaround is to restart the nodes. The `calico-node` pods then begin running on these new nodes, and the `tigera-operator` pod starts running without any restarts.
Steps to Reproduce (for bugs)
1. Install the chart using the command `helm install tigera-operator stevehipwell/tigera-operator -n tigera-operator --version 2.2.4 --values tigera-values.yaml`. The following values can be used:
dnsPolicy: Default
env:
  - name: KUBERNETES_SERVICE_HOST
    value: XXX.gr7.eu-west-1.eks.amazonaws.com
  - name: KUBERNETES_SERVICE_PORT
    value: "443"
hostNetwork: false
installation:
  enabled: true
  spec:
    cni:
      type: AmazonVPC
    componentResources:
      - componentName: Node
        resourceRequirements:
          limits:
            cpu: 1000m
            memory: 256Mi
          requests:
            cpu: 50m
            memory: 256Mi
      - componentName: Typha
        resourceRequirements:
          limits:
            cpu: 1000m
            memory: 128Mi
          requests:
            cpu: 10m
            memory: 128Mi
      - componentName: KubeControllers
        resourceRequirements:
          limits:
            cpu: 1000m
            memory: 64Mi
          requests:
            cpu: 100m
            memory: 64Mi
    controlPlaneNodeSelector:
      kubernetes.io/os: linux
      lnrs.io/tier: system
    controlPlaneTolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: system
        operator: Exists
    kubernetesProvider: EKS
    nodeMetricsPort: 9091
    nodeUpdateStrategy:
      rollingUpdate:
        maxUnavailable: 25%
      type: RollingUpdate
    registry: quay.io/
    typhaAffinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: lnrs.io/tier
                  operator: In
                  values:
                    - system
    typhaMetricsPort: 9093
    variant: Calico
priorityClassName: ""
rbac:
  create: true
resources:
  limits:
    cpu: 1000m
    memory: 512Mi
  requests:
    cpu: 50m
    memory: 512Mi
serviceAccount:
  create: true
serviceMonitor:
  additionalLabels:
    monitoring-platform: "true"
  enabled: true
tolerations:
  - key: system
    operator: Exists
2. Delete the chart by running `helm delete -n tigera-operator tigera-operator`.
3. Re-install the chart using the same command as in step 1.
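After step 3, the broken state described under Current Behavior can be confirmed with standard kubectl commands, for example:

```shell
# The operator pod shows restarts/CrashLoopBackOff after the re-install.
kubectl get pods -n tigera-operator
# No calico-node pods come up in the calico-system namespace.
kubectl get pods -n calico-system
```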
Context
We are unable to destroy EKS clusters through Terraform, as the destroy times out when uninstalling the Tigera Operator helm release.
Your Environment
- Operating System and version: Amazon EKS, v1.21.
This is possibly related to projectcalico/calico/issues/6210
About this issue
- State: open
- Created 2 years ago
- Reactions: 28
- Comments: 34 (19 by maintainers)
`installations.operator.tigera.io/default` blocks uninstallation in my case.
Adding some of my experience uninstalling the Calico tigera-operator in our EKS cluster:
1. `helm uninstall` to uninstall tigera-operator, but only `tigera-operator` gets uninstalled and the other resources in the `calico-system` namespace still exist (same issue as this thread).
2. `helm install` again to install tigera-operator; all the resources in the `calico-system` namespace then get cleaned up, but only `tigera-operator` is left in the cluster.
3. `helm uninstall` to uninstall tigera-operator again, and then everything is gone.
That’s what I did to uninstall Calico as a workaround; I haven’t dug into the detailed behaviour of this yet. Not sure if anybody else has done this before.
After uninstalling the operator with Helm and cleaning up the rest with kubectl, I’m stuck with a calico-node ServiceAccount in the calico-system namespace that refuses to be deleted. How can I delete it?
Edit: Solved. I had to patch away the finalizer.
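For anyone hitting the same stuck ServiceAccount, the patch mentioned above is presumably something along these lines (use with care: clearing a finalizer skips whatever cleanup it was guarding):

```shell
# Clear the finalizers on the stuck calico-node ServiceAccount so it can be deleted.
kubectl patch serviceaccount calico-node -n calico-system --type=merge -p '{"metadata":{"finalizers":null}}'
```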
+1
For us, the finalizers on `installations.operator.tigera.io/default` block the Tigera helm uninstallation, and the finalizers on `ServiceAccount/calico-node` block deletion of the `calico-system` namespace.
Deleting the `installations.operator.tigera.io` default resource before destroying the Tigera helm release removes the Installation, removes the finalizers on `ServiceAccount/calico-node` in `calico-system`, and everything destroys cleanly.
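A sketch of that ordering, assuming the release name and namespace from the original report:

```shell
# Delete the Installation first so the still-running operator can remove its finalizers.
kubectl delete installations.operator.tigera.io default --wait=true
# Then remove the operator release itself.
helm uninstall tigera-operator -n tigera-operator
```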
@Rambatino in this case it’s explicitly a stuck resource with no associated controller to manage it. I agree that removing finalizers is generally a bad idea but in this case, given the cluster is likely in the process of being destroyed, it’s the lesser of two evils. That said if you’re not destroying the cluster and are removing the finalizer to delete the installation you’re going to need to either replace all the nodes or manually fix them.
Example PR with one approach for resolving this: https://github.com/tigera/operator/pull/2662/
It doesn’t (anymore); although it works for day 0 (kind of), it doesn’t allow for CRD updates, so it’s functionally useless for a project like Tigera Operator where the CRDs change.
Helm understands built-in resource types and their ordering, so it should be good on that front. I also think you can’t delete a resource which is being used by a pod.
This would have been my first thought as well - add a termination grace period, have the operator handle SIGTERM and rather than exit immediately, delay exiting until it is confident it doesn’t need to remove any more finalizers or do any further cleanup. In normal operation that should happen pretty quickly. If we hit the end of the grace period (maybe 60s or so) then we’ll get a SIGKILL and be forced to shutdown.
@SamuZad although you’re technically correct (well, assuming enough of a pause between the `kubectl delete` and the `helm uninstall`), the point here is that the installation method can’t also support uninstallation, which isn’t great UX and means it isn’t suitable for declarative IaC.
The chart `stevehipwell/tigera-operator` is not supported by the tigera/operator team, but I expect this would probably be an issue with the official helm chart also. I’m guessing this happens because the operator deployment is being removed, so the finalizer it adds is not being removed. I believe that first removing the operator CustomResource (CR) (the Installation “default” resource), allowing the operator to remove the finalizer, and then removing the operator would work.
@caseydavenport do you know the correct way to ensure the operator deployment isn’t removed before the Installation CR, since it puts a finalizer on the CR? Should the operator be putting a finalizer on itself (the operator deployment) also? That seems like a bad idea to me; if that were the right way, then it would probably need to put a finalizer on all the resources it uses to prevent helm from removing them before it could clean up too.
Maybe there is a helm chart feature that will delete the Installation CR and ensure it is removed before deleting everything else? (I am doubtful of this since I don’t think helm has much in the way of being able to sequence things.)
Maybe we just need to remove the use of the finalizer which I added in #1710, though given what it was trying to fix, it seems like the same issue would be hit in this use case too.
You should uninstall `custom-resources.yaml` first, and then `tigera-operator.yaml` - just the opposite of the install ordering. If you uninstall the operator first, then it won’t be running in the cluster in order to clean up after itself when `custom-resources.yaml` is deleted.
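For manifest-based installs, that ordering would look roughly like this (file names as used in the Calico install docs):

```shell
# Remove the custom resources first so the still-running operator can clean up after them.
kubectl delete -f custom-resources.yaml --wait=true
# Then remove the operator manifest.
kubectl delete -f tigera-operator.yaml
```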