operator-sdk: Ansible: how to prevent CR deletion on finalizer playbook failure
Type of question
how to implement a specific feature
Question
What did you do?
When deleting a Custom Resource (CR) object, the relevant playbook is executed (as per my finalizer dict config in watches.yml); however, the CR is deleted from k8s no matter what happens in the playbook, i.e. even if it fails. This causes an inconsistent state in which the CR is no longer present in k8s, but the underlying/related resources that live outside the cluster are not actually cleaned up. I was hoping that if the playbook fails, the finalizer would not be removed, which would keep the resource object in k8s. If that were the case, I could use k8s_status from the playbook to flag to the end user that the deletion and related cleanup actions didn't succeed; the user could then clean things up manually, edit the resource, and remove the finalizer by hand for the delete process to complete. Could you please advise how to achieve this? BTW, I'm using `manageStatus: False`.
What did you expect to see?
The CR is not deleted from k8s if the finalizer-linked playbook fails.
What did you see instead? Under which circumstances?
The CR is deleted from k8s even though the finalizer-linked playbook failed to complete all of its actions.
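For context, here is a minimal sketch of the kind of watches.yaml finalizer configuration described in the question. The group/version/kind, finalizer name, and playbook paths are illustrative assumptions, not taken from the reporter's project:

```yaml
---
# watches.yaml (illustrative sketch; names and paths are assumptions)
- version: v1alpha1
  group: app.example.com
  kind: Testcr
  playbook: /opt/ansible/playbook.yml
  manageStatus: False
  finalizer:
    name: finalizer.app.example.com
    playbook: /opt/ansible/finalizer.yml
```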
Environment
- operator-sdk version: quay.io/operator-framework/ansible-operator:v0.15.1
- Kubernetes version information:
- Kubernetes cluster kind:
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 23 (13 by maintainers)
So you have established that when the Ansible run completes but reports an error, it behaves as you expect. But the fact that you had a syntax error is interesting. While a syntax error is the most likely culprit, I doubt it is the only way that the run could fail.
There are two scenarios where ansible-runner (ansible-playbook under the covers) fails that should be considered. What concerns me is that it appears the run is being treated as successful when the actual command fails.
The only question that remains for me is what the expected behavior should be when the ansible command fails. My guess is that the ansible controller should:

`return reconcileResult, err`

Hi @tomsucho,
I did a POC with a syntax error in the Ansible task, and I could check that:
- `Failed to get ansible-runner stdout` will be faced, as described in https://github.com/operator-framework/operator-sdk/issues/2546#issuecomment-585638370
- `manageStatus: False` was used in this scenario.

You can see all the details below.
Then, could you please input here the content of your
build/Dockerfile?PS.: I am asking that because in the 0.14.0 version of SDK we did a fix related to the issue of the reconcile not be re-trigged in failure scenarios. See in the CHANGELOG. You described that you are using the operator-sdk version 0.15.1. However, was the project scaffolded with this version and/or upgrade to use the 0.15.1 ansible image?
Following my POC.
The task with the deliberate syntax error (roles/testcr/tasks/main.yml):
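(The original task content was not preserved in this copy of the issue; below is an illustrative sketch of a task file with a deliberate syntax error, not the exact file used in the POC.)

```yaml
---
# roles/testcr/tasks/main.yml (illustrative sketch, not the original POC file)
# The missing colon after "debug" makes this invalid YAML, so ansible-playbook
# aborts with a parse/syntax error instead of a normal task failure.
- name: Task with a deliberate syntax error
  debug
    msg: "this file does not parse"
```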
Following the logs (kubectl logs deployment.apps/testoperator -c operator -n default), note what the logs show:
And then, see that the reconcile is still re-triggered forever by checking the Ansible logs too (kubectl logs deployment.apps/testoperator -c ansible -n default).
The offending line appears to be:

`ManageStatus=False` as well (watches.yaml).

c/c @joelanford @djzager
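To tie this back to the original question: once a genuine task failure does re-trigger the reconcile (as the POC above shows), the reporter's idea of surfacing the problem via k8s_status while keeping the CR around can be sketched roughly as below. This is only a sketch under assumptions: the cleanup command, CR group/version/kind, and status fields are hypothetical placeholders, while k8s_status and the meta.name/meta.namespace variables are the ones provided by the 0.x ansible-operator image.

```yaml
---
# finalizer.yml (illustrative sketch; cleanup step and CR coordinates are assumptions)
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Attempt to clean up the external/related resources
      command: /usr/local/bin/cleanup-external-resources.sh   # hypothetical placeholder
      register: cleanup
      ignore_errors: true

    - name: Record the cleanup failure on the CR status (works with manageStatus: False)
      k8s_status:
        api_version: app.example.com/v1alpha1
        kind: Testcr
        name: "{{ meta.name }}"
        namespace: "{{ meta.namespace }}"
        status:
          cleanup: "failed: {{ cleanup.stderr | default('unknown error') }}"
      when: cleanup is failed

    - name: Fail the finalizer run so the finalizer is not removed and reconcile retries
      fail:
        msg: "External cleanup failed; fix it manually and remove the finalizer, or let the operator retry."
      when: cleanup is failed
```

With this shape, a failed cleanup leaves the CR (and its finalizer) in place, records the reason in the status for the end user, and lets the operator keep retrying until the cleanup succeeds or the finalizer is removed manually.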