operator-sdk: Scorecard does not wait for finalisers to complete before removing operator

Bug Report

What did you do? Created a simple Ansible Operator with SDK v0.11.0 to reproduce an issue we are seeing on our Ansible operators where finalisers do not complete within a few seconds.

Using the following tasks:

---
- name: "Set state of {{ meta.name }}-test-cfg"
  k8s:
    state: "{{ state }}"
    definition:
      apiVersion: "v1"
      kind: ConfigMap
      metadata:
        name: "{{ meta.name }}-test-cfg"
        namespace: "{{ meta.namespace }}"
      data:
        key: value

- name: test wait
  shell: sleep 10

Set a finaliser in watches:

- version: v1
  group: example.com
  kind: TestCr
  playbook: /opt/ansible/playbook.yml
  finalizer:
    name: example.com
    vars:
      state: absent

Used the following scorecard configuration:

scorecard:
  plugins:
    - basic:
        olm-deployed: false
        cr-manifest:
          - "deploy/crds/icm.ibm.com_v1_testcr_cr.yaml"
        init-timeout: 300

What did you expect to see? With the sleep task disabled I see the following as expected:

operator-sdk scorecard --verbose   
DEBU[0000] Debug logging is set                         
INFO[0000] Using config file: /Users/oseoin/playground/kubernetes/operators/operator-test/.osdk-scorecard.yaml 
Basic Tests:
        Writing into CRs has an effect: 1/1
        Spec Block Exists: 1/1
        Status Block Exists: 1/1

What did you see instead? Under which circumstances? Instead I get an error that cleanup does not complete:

operator-sdk scorecard --verbose       
DEBU[0000] Debug logging is set                         
INFO[0000] Using config file: /Users/oseoin/playground/kubernetes/operators/operator-test/.osdk-scorecard.yaml 
WARN[0044] time="2019-10-23T15:19:35+01:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-23T15:19:45+01:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-23T15:19:45+01:00" level=error msg="Failed to cleanup resources: (a cleanup function failed; see stdout for more details)" 
Basic Tests:
        Spec Block Exists: 1/1
        Status Block Exists: 1/1
        Writing into CRs has an effect: 1/1

Watching the operator deployment I can see that it gets removed before the finaliser completes.

Environment

  • operator-sdk version:

operator-sdk version: "v0.11.0" commit: "39c65c36159a9c249e5f3c178205cc6e86c16f8d"

  • go version: "go1.13.1 darwin/amd64"

  • Kubernetes version information:

Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.4", GitCommit:"67d2fcf276fcd9cf743ad4be9a9ef5828adc082f", GitTreeState:"clean", BuildDate:"2019-09-18T14:41:55Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind: Docker Desktop 2.1.4.0 on macOS 10.15

  • Are you writing your operator in ansible, helm, or go? Ansible

Possible Solution Scorecard should wait before removing the operator deployment if finalisers are set.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 17 (10 by maintainers)

Most upvoted comments

The Deployment.apps is the operator itself, right? If so, the correct order would be:

  1. CustomResourceDefinition.apiextensions.k8s.io
  2. Deployment.apps
  3. RoleBinding.rbac.authorization.k8s.io
  4. Role.rbac.authorization.k8s.io
  5. ServiceAccount

Deleting the CRD causes all CRs of that type to be deleted, but if the CR has a finalizer, then the operator still needs to be running. So we would need to make sure that our code waits until the CRDs from step 1 are actually deleted before proceeding to step 2.

@joelanford I confirmed that order does indeed work, so it sounds like the fix here would be to change scorecard to delete in that order.

based on my debugging so far, it appears the scorecard deletes resources in an order like this:

  1. Deployment.apps
  2. RoleBinding.rbac.authorization.k8s.io
  3. Role.rbac.authorization.k8s.io
  4. ServiceAccount
  5. CustomResourceDefinition.apiextensions.k8s.io

What I'm suspecting is that scorecard assumes that removing the CRD will cause any associated CRs to be removed. That does not happen when a finalizer is specified: instead, the CRD deletion hangs and the CR is never removed.

It seems to me the CR should be removed prior to the CRD, which works around this hang.

Hi @camilamacedo86, apologies for the delay in getting back to you. I tested using init-timeout: 300. Our projects are not public, but I have attached the test operator that I created to recreate the issue (operator-test.zip), which I have verified uses the same permissions as the memcached sample. Unfortunately there is no other output in stdout, and the JSON output does not provide any additional information. I think this has to be a scorecard issue rather than a permissions one, as the finaliser works fine when run manually - unless scorecard uses permissions differently? Thanks!