calico: Calico Operator Installation got stuck

We had a Calico installation without the operator, running version 3.19.1, and are now trying to move to an operator-based installation. It went smoothly on our other clusters, but on one of our Kubernetes clusters the installation got stuck. I cannot find any errors or logs anywhere. I followed this documentation: https://projectcalico.docs.tigera.io/maintenance/operator-migration

Expected Behavior

Calico resources are migrated from the kube-system namespace used by the Calico manifests to the new calico-system namespace.

Current Behavior

Typha failed to scale, and the operator failed to move the calico-node pods to the calico-system namespace. The good news is that the calico-node pods are still running in the kube-system namespace. There was a calico-typha deployment in the kube-system namespace before the installation, and I suspect that might be the issue. That deployment was removed after the installation got into this stuck state.
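
For reference, the half-migrated state can be confirmed with standard kubectl queries; the resource names below are the ones used by the manifest-based install and by the operator, so treat them as assumptions if your setup differs:

# manifest-managed resources in kube-system (the calico-typha deployment is already gone here)
kubectl get daemonset calico-node -n kube-system
kubectl get deployment calico-typha -n kube-system

# operator-managed resources, partially created in calico-system
kubectl get daemonset calico-node -n calico-system
kubectl get deployment calico-typha -n calico-system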

Possible Solution

Maybe retriggering the installation would work, but how?
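
Two low-risk things to try first, sketched here under the assumption that the operator was installed with the default tigera-operator.yaml manifest (deployment tigera-operator in the tigera-operator namespace): read the operator's own logs for migration-related messages, and restart it to force a fresh reconcile.

# inspect the operator's logs for migration-related errors
kubectl logs -n tigera-operator deployment/tigera-operator --tail=100

# force a fresh reconcile by restarting the operator
kubectl rollout restart deployment tigera-operator -n tigera-operator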

Steps to Reproduce (for bugs)

  1. Install the Tigera Calico operator and custom resource definitions.
kubectl create -f https://projectcalico.docs.tigera.io/manifests/tigera-operator.yaml
  2. Trigger the operator to start a migration by creating an Installation resource. The operator will auto-detect your existing Calico settings and fill out the spec section.
kubectl create -f - <<EOF
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec: {}
EOF
  3. Monitor the migration status with the following command:
# kubectl describe tigerastatus calico
Name:         calico
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  operator.tigera.io/v1
Kind:         TigeraStatus
Metadata:
  Creation Timestamp:  2022-07-19T12:54:49Z
  Generation:          1
  Managed Fields:
    API Version:  operator.tigera.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
      f:status:
        .:
        f:conditions:
    Manager:         operator
    Operation:       Update
    Time:            2022-07-19T12:54:54Z
  Resource Version:  841265704
  UID:               7550ded0-0137-4f67-ac9d-6d2409ef0104
Spec:
Status:
  Conditions:
    Last Transition Time:  2022-07-19T12:54:54Z
    Message:               not enough linux nodes to schedule typha pods on, require 1 and have 0
    Reason:                Failed to scale typha
    Status:                True
    Type:                  Degraded
Events:                    <none>
  4. However, some typha pods are created despite the "have 0" nodes message (a node-label check for this mismatch is sketched after this list):
# kubectl get pods -n calico-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-557cd74586-nnrm2   1/1     Running   0          19h
calico-typha-7648f46566-r27jh              1/1     Running   0          19h
calico-typha-7648f46566-w6trf              1/1     Running   0          19h
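
If the Degraded message above ("not enough linux nodes to schedule typha pods on, require 1 and have 0") only counts nodes the operator already considers migrated, then the projectcalico.org/operator-node-migration node label mentioned in the resolution further down would explain the mismatch. That interpretation is my assumption, but the label state itself is easy to inspect:

# how many nodes still carry the pre-operator migration label
kubectl get nodes -l projectcalico.org/operator-node-migration=pre-operator

# full label view, to see the migration label per node
kubectl get nodes --show-labels | grep operator-node-migration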

Your Environment

  • Calico version
# calicoctl version
Client Version:    v3.23.2
Git commit:        a52cb86db
Cluster Version:   v3.19.1
Cluster Type:      k8s,bgp,kdd,typha
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12", GitCommit:"696a9fdd2a58340e61e0d815c5769d266fca0802", GitTreeState:"clean", BuildDate:"2022-04-13T19:07:00Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12", GitCommit:"696a9fdd2a58340e61e0d815c5769d266fca0802", GitTreeState:"clean", BuildDate:"2022-04-13T19:01:10Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Operating System and version:
Ubuntu 20.04.4 LTS
Linux s941 5.4.0-110-generic #124-Ubuntu SMP Thu Apr 14 19:46:19 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 17 (8 by maintainers)

Most upvoted comments

So, we just removed the projectcalico.org/operator-node-migration: pre-operator label from a single node out of curiosity, and that led to the operator actually starting the migration and moving calico-node pods from the kube-system daemonset to the operator-managed calico-system daemonset.
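
For anyone hitting the same thing, removing that label with kubectl looks roughly like the following; the node name is a placeholder, and the k8s-app=calico-node selector is assumed from the standard manifest and operator installs:

# drop the migration label from one node (the trailing "-" removes a label)
kubectl label node <node-name> projectcalico.org/operator-node-migration-

# watch calico-node pods drain out of kube-system and appear in calico-system
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
kubectl get pods -n calico-system -l k8s-app=calico-node -o wide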