operator-sdk: ansible-operator hangs on k8s state: present
Bug Report
What did you do?
The ansible operator hangs at different places and on different resources, on a task of the form
- k8s:
      state: present
The k8s request executes successfully, but the next operation never starts.
kubectl get/apply still work on the pod.
The presence of resources.requests and resources.limits makes it happen more often, even though the reported memory is well under the limits.
What did you expect to see?
The playbooks should continue on.
What did you see instead? Under which circumstances?
The last log entry marks the start of a task:
{"level":"info","ts":1571992864.8623073,"logger":"logging_event_handler","msg":"[playbook task]","name":"iscplatform","namespace":"isc-relite","gvk":"isc.ibm.com/v1, Kind=ISCSequence","event_type":"playbook_on_task_start","job":"8236951410584007958","EventData.Name":"icp.operator.actions : default deployment for inf-configstore"}
and the next operation never started.
Environment
- operator-sdk version:
happens with v0.10.0 and v0.11.0
FROM quay.io/operator-framework/ansible-operator:v0.11.0
...
ENTRYPOINT ["/tini", "--", "bash", "-c", "${OPERATOR} run ansible --watches-file=/opt/ansible/watches.yaml --inject-owner-ref=false $@"]
ansible-operator version
operator-sdk version: "v0.11.0", commit: "39c65c36159a9c249e5f3c178205cc6e86c16f8d", go version: "go1.12.10 linux/amd64"
- go version: go1.12.10 linux/amd64
- Kubernetes version information:
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2019-09-12T23:41:09Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster kind:
Red Hat Openshift 3.11
- Are you writing your operator in ansible, helm, or go?
ansible
About this issue
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 41 (15 by maintainers)
Just to be clear, the issue is in the client generator. It seems to generate code that does not respect the CPython contract, so the fix belongs in the code generator. In this sense, the bug is swagger's, not CPython's.
I also have the issue that my ansible-based operator gets stuck after a while. Some reconciliation loops are fine, then after a few minutes to hours the ansible process gets stuck, blocking the operator from further processing.
I enabled ANSIBLE_DEBUG=1 to get a bit more output. Here are the last 30 lines:
This last log line is from here: https://github.com/ansible/ansible/blob/480b106d6535978ae6ecab68b40942ca4fa914a0/lib/ansible/plugins/connection/local.py#L127
I can make the thing “unstuck” by killing the python process. Log lines after I killed the process:
It's random which resource the ansible process gets stuck on.
operator-sdk: 0.13 OpenShift 3.11
The root cause seems to be a Python 3 issue reported 6 days ago: https://bugs.python.org/issue39360. We make use of the kubernetes-client/python project, which is generated by swagger-codegen.
Corresponding issue for swagger-codegen is here: https://github.com/swagger-api/swagger-codegen/issues/9991
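As far as I understand the two linked reports, the hang involves a `multiprocessing.pool.ThreadPool` owned by the generated client being cleaned up from a finalizer (`__del__`). A minimal sketch of a pattern that avoids relying on the finalizer is shown below: the pool is created lazily and closed explicitly. `LazyPoolClient` and its methods are hypothetical names for illustration only; this is not the actual kubernetes-client or swagger-codegen code, nor necessarily the fix that was adopted upstream.

```python
from multiprocessing.pool import ThreadPool


class LazyPoolClient:
    """Hypothetical stand-in for a generated API client (illustration only)."""

    def __init__(self, pool_threads=1):
        self.pool_threads = pool_threads
        self._pool = None  # no pool until a caller actually needs one

    @property
    def pool(self):
        # Create the thread pool on first use instead of in __init__.
        if self._pool is None:
            self._pool = ThreadPool(self.pool_threads)
        return self._pool

    def call_async(self, func, *args):
        # Run func(*args) on the pool and return an AsyncResult.
        return self.pool.apply_async(func, args)

    def close(self):
        # Explicit cleanup; nothing here depends on __del__ or GC timing.
        if self._pool is not None:
            self._pool.close()
            self._pool.join()
            self._pool = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


def demo():
    # The pool is joined when the with-block exits, not at garbage collection.
    with LazyPoolClient() as client:
        return client.call_async(pow, 2, 10).get(timeout=5)
```

The design choice is that callers who never make an asynchronous request never start a pool at all, and callers who do are expected to release it deterministically.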
Hi @camilamacedo86, as soon as we have a 4.2 cluster, we will try it out.
But so far, as the documentation is not clear which version of operator-sdk is compatible with which OpenShift version, we will stay with operator-sdk 0.8.1.
Also, according to the k8s module of Ansible 2.7/2.8/2.9, it is compatible with Openshift >= 0.6: https://docs.ansible.com/ansible/latest/modules/k8s_module.html
Even if that k8s module is using Python OpenShift and Python Kubernetes underneath, I don’t see why it would fail on OpenShift 3.11 or OpenShift 4.1. Keep in mind also that numerous operator reconcile runs work fine and it just freezes after a random number of reconciles. If there were a compatibility issue, I would expect to see it from the first run.
We have the same issue on all our OpenShift 3.11 and 4.1 clusters with operators using an ansible role. Our operators are namespaced.
The operator freezes at a random task in the role, and we see that memory usage stays flatlined at its peak (Grafana). It can happen a few minutes after the operator was deployed, but also after two days. Eventually, it always freezes in all namespaces. We tried adding CPU/memory resources and even disabling quotas entirely; nothing has helped so far.
We have the issue with v0.10 and v0.11. We don’t have the issue with v0.8, but that one consumes a lot of CPU (1 full core) and memory (almost 1 GB), which is about ten times more than v0.10 and v0.11.
We prefer not to move to a Go operator because of the refactoring effort, but it seems to be much more stable. It also uses only one container instead of two in the operator pod. Also, all the operators included in OpenShift 4.x are based on Go. I wonder if Red Hat is still investing in Ansible Operators.
So we have the exact same problem. Any update on this?
@fabianvf the output of ps
The AnsiballZ_k8s.py process is a k8s ‘present’ command.
If I send kill 19348, the current ansible command returns 0 and the playbook continues.
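The manual workaround described in this thread (sending `kill` to the stuck Python process) could in principle be automated with a small watchdog. The sketch below only shows the general idea under assumptions of my own: `run_with_timeout` is a hypothetical helper, the timeout values are arbitrary, and neither operator-sdk nor Ansible is claimed to work this way. It starts a child process and sends it a termination signal if it has not finished in time.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and terminate it if it exceeds timeout_s seconds.

    Returns a (returncode, timed_out) tuple.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # terminate() sends SIGTERM on POSIX, the same signal a plain
        # `kill <pid>` sends.
        proc.terminate()
        try:
            return proc.wait(timeout=5), True
        except subprocess.TimeoutExpired:
            # Escalate if the child ignores the termination request.
            proc.kill()
            return proc.wait(), True


if __name__ == "__main__":
    # Simulate a hung task with a child process that sleeps far too long.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, 1.0))
```

This treats the symptom only; it does not address whatever leaves the process stuck in the first place.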