operator-sdk: Ansible: how to prevent CR deletion on finalizer playbook failure
Type of question
how to implement a specific feature
Question
What did you do?
When deleting a Custom Resource (CR) object, the relevant playbook is executed (as per my finalizer dict config in watches.yml); however, the CR is deleted from k8s no matter what happens in the playbook, i.e. even if it fails. This causes an inconsistent state in which the CR is no longer present in k8s, but the underlying/related resources that live outside the cluster are not actually cleaned up. I was hoping that if the playbook fails, the finalizer would not be removed, which would keep the resource object in k8s. If that were the case, I could use k8s_status from the playbook to flag to the end user that the deletion and related cleanup actions didn't succeed; the user could then clean things up manually, edit the resource, and remove the finalizer by hand for the delete process to complete. Could you please advise how to achieve this? BTW, I'm using `manageStatus: False`.
What did you expect to see?
The CR is not deleted from k8s if the finalizer-linked playbook fails.
What did you see instead? Under which circumstances?
The CR is deleted from k8s even though the finalizer-linked playbook failed to complete all of its actions.
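For context, here is a minimal sketch of the kind of watches.yaml finalizer configuration described in the question. The group/version/kind, finalizer name, and playbook paths are illustrative assumptions, not taken from the reporter's project:

```yaml
---
# watches.yaml (illustrative sketch; names and paths are assumptions)
- version: v1alpha1
  group: app.example.com
  kind: Testcr
  playbook: /opt/ansible/playbook.yml
  manageStatus: False
  finalizer:
    name: finalizer.app.example.com
    playbook: /opt/ansible/finalizer.yml
```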
Environment
- operator-sdk version: quay.io/operator-framework/ansible-operator:v0.15.1
- Kubernetes version information:
- Kubernetes cluster kind:
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 23 (13 by maintainers)
So you have established that when the Ansible run completes but reports an error, it behaves as you expect. But the fact that you had a syntax error is interesting. While a syntax error is the most likely culprit, I doubt it is the only way that the run could fail.
There are two scenarios where ansible-runner (ansible-playbook under the covers) fails that should be considered. What concerns me is that it appears the run is being treated as successful when the actual command fails.
The only question that remains for me is what the expected behavior should be when the ansible command fails. My guess is that the ansible controller should:

`return reconcileResult, err`

Hi @tomsucho,
I did a POC with a syntax error in the Ansible task, and I could check that:
- `Failed to get ansible-runner stdout` will be faced, as described in https://github.com/operator-framework/operator-sdk/issues/2546#issuecomment-585638370
- `manageStatus: False` was used in this scenario.

You can see all the details below.
Then, could you please input here the content of your
build/Dockerfile?PS.: I am asking that because in the 0.14.0 version of SDK we did a fix related to the issue of the reconcile not be re-trigged in failure scenarios. See in the CHANGELOG. You described that you are using the operator-sdk version 0.15.1. However, was the project scaffolded with this version and/or upgrade to use the 0.15.1 ansible image?
Following my POC.
The task with the deliberate syntax error (roles/testcr/tasks/main.yml):
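(The original task content was not preserved in this copy of the issue; below is an illustrative sketch of a task file with a deliberate syntax error, not the exact file used in the POC.)

```yaml
---
# roles/testcr/tasks/main.yml (illustrative sketch, not the original POC file)
# The missing colon after "debug" makes this invalid YAML, so ansible-playbook
# aborts with a parse/syntax error instead of a normal task failure.
- name: Task with a deliberate syntax error
  debug
    msg: "this file does not parse"
```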
Following the logs (kubectl logs deployment.apps/testoperator -c operator -n default), note what the logs show:
And then, see that the reconcile is still re-triggered forever by checking the Ansible logs too (kubectl logs deployment.apps/testoperator -c ansible -n default).
The offending line appears to be:

`ManageStatus=False` as well (watches.yaml).

c/c @joelanford @djzager
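To tie this back to the original question: once a genuine task failure does re-trigger the reconcile (as the POC above shows), the reporter's idea of surfacing the problem via k8s_status while keeping the CR around can be sketched roughly as below. This is only a sketch under assumptions: the cleanup command, CR group/version/kind, and status fields are hypothetical placeholders, while k8s_status and the meta.name/meta.namespace variables are the ones provided by the 0.x ansible-operator image.

```yaml
---
# finalizer.yml (illustrative sketch; cleanup step and CR coordinates are assumptions)
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Attempt to clean up the external/related resources
      command: /usr/local/bin/cleanup-external-resources.sh   # hypothetical placeholder
      register: cleanup
      ignore_errors: true

    - name: Record the cleanup failure on the CR status (works with manageStatus: False)
      k8s_status:
        api_version: app.example.com/v1alpha1
        kind: Testcr
        name: "{{ meta.name }}"
        namespace: "{{ meta.namespace }}"
        status:
          cleanup: "failed: {{ cleanup.stderr | default('unknown error') }}"
      when: cleanup is failed

    - name: Fail the finalizer run so the finalizer is not removed and reconcile retries
      fail:
        msg: "External cleanup failed; fix it manually and remove the finalizer, or let the operator retry."
      when: cleanup is failed
```

With this shape, a failed cleanup leaves the CR (and its finalizer) in place, records the reason in the status for the end user, and lets the operator keep retrying until the cleanup succeeds or the finalizer is removed manually.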