operator-sdk: ansible-operator hangs on k8s state: present

Bug Report

What did you do? The ansible operator hangs in different places/on different resources on

- k8s:
      state: present.

The k8s request is successfully executed, but next operation is never started.

kubectl get/apply are working on the pod.

Presence of resources.requests and requests.limits make it happens more often, while memory reported is will under limits.

What did you expect to see?

The playbooks should continue on.

What did you see instead? Under which circumstances? A clear and concise description of what you expected to happen (or insert a code snippet).

The latest log is on start of task

{"level":"info","ts":1571992864.8623073,"logger":"logging_event_handler","msg":"[playbook task]","name":"iscplatform","namespace":"isc-relite","gvk":"isc.ibm.com/v1, Kind=ISCSequence","event_type":"playbook_on_task_start","job":"8236951410584007958","EventData.Name":"icp.operator.actions : default deployment for inf-configstore"}

and the next operation never started.

Environment

operator-sdk version:

happens with v0.10.0 and v0.11.0

FROM quay.io/operator-framework/ansible-operator:v0.11.0
...

ENTRYPOINT ["/tini", "--", "bash", "-c", "${OPERATOR} run ansible --watches-file=/opt/ansible/watches.yaml --inject-owner-ref=false $@"]

ansible-operator version
operator-sdk version: "v0.11.0", commit: "39c65c36159a9c249e5f3c178205cc6e86c16f8d", go version: "go1.12.10 linux/amd64"

go version: go1.12.10 linux/amd64

Kubernetes version information:

Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2019-09-12T23:41:09Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Kubernetes cluster kind:

Red Hat Openshift 3.11

Are you writing your operator in ansible, helm, or go?

ansible

Possible Solution

Additional context Add any other context about the problem here.

About this issue

Original URL
State: closed
Created 5 years ago
Reactions: 2
Comments: 41 (15 by maintainers)

Most upvoted comments

Just to be clear, the issue is in the client generator. Seems to generate code that does not respect the CPython contract, so the issue/fix will be to fix the code generator. In this sense, the bug is swaggers, not CPythons.

gaborbernat on Jan 22, 2020

also have the issue that my ansible based operator gets stuck after a while. Some reconciliation loops are fine, then after few minutes to hours the ansible process gets stuck blocking the operator from further processing.

I enabled ANSIBLE_DEBUG=1 to get bit more output. here are the last 30 lines:

#> cat ./stdout  | tail -n30
  5054 1576664654.12903: _low_level_execute_command() done: rc=0, stdout=/opt/ansible
, stderr=
  5054 1576664654.13001: _low_level_execute_command(): starting
  5054 1576664654.13204: _low_level_execute_command(): executing: /bin/sh -c '( umask 77 && mkdir -p "` echo /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070 `" && echo ansible-tmp-1576664654.0683-190010863364070="` echo /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070 `" ) && sleep 0'
  5054 1576664654.13292: in local.exec_command()
  5054 1576664654.13405: opening command with Popen()
  5054 1576664654.14975: done running command with Popen()
  5054 1576664654.15097: getting output with communicate()
  5054 1576664654.16764: done communicating
  5054 1576664654.16904: done with local.exec_command()
  5054 1576664654.17008: _low_level_execute_command() done: rc=0, stdout=ansible-tmp-1576664654.0683-190010863364070=/opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070
, stderr=
  5054 1576664654.17213: ANSIBALLZ: using cached module: /opt/ansible/.ansible/tmp/ansible-local-4889xiy1pbw_/ansiballz_cache/k8s-ZIP_DEFLATED
  5054 1576664654.17666: transferring module to remote /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/AnsiballZ_k8s.py
  5054 1576664654.17939: done transferring module to remote
  5054 1576664654.18086: _low_level_execute_command(): starting
  5054 1576664654.18329: _low_level_execute_command(): executing: /bin/sh -c 'chmod u+x /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/ /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/AnsiballZ_k8s.py && sleep 0'
  5054 1576664654.18476: in local.exec_command()
  5054 1576664654.18533: opening command with Popen()
  5054 1576664654.19800: done running command with Popen()
  5054 1576664654.19937: getting output with communicate()
  5054 1576664654.21163: done communicating
  5054 1576664654.21302: done with local.exec_command()
  5054 1576664654.21491: _low_level_execute_command() done: rc=0, stdout=, stderr=
  5054 1576664654.21596: _low_level_execute_command(): starting
  5054 1576664654.21701: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python3.6 /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/AnsiballZ_k8s.py && sleep 0'
  5054 1576664654.21843: in local.exec_command()
  5054 1576664654.21996: opening command with Popen()
  5054 1576664654.44408: done running command with Popen()
  5054 1576664654.44574: getting output with communicate()

This last log line is from here: https://github.com/ansible/ansible/blob/480b106d6535978ae6ecab68b40942ca4fa914a0/lib/ansible/plugins/connection/local.py#L127

I can make the thing “unstuck” by killing the python process. Loglines after I killed the process:

  5054 1576677442.60567: done communicating
  5054 1576677442.61731: done with local.exec_command()
  5054 1576677442.61878: _low_level_execute_command() done: rc=143, stdout=
{"changed": false, "result": {"kind": "NetworkPolicy", "apiVersion": .... }
, stderr=/usr/local/lib/python3.6/site-packages/kubernetes/config/kube_config.py:509: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config_dict=yaml.load(f),
/bin/sh: line 1:  5064 Terminated              /usr/bin/python3.6 /opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/AnsiballZ_k8s.py
  5054 1576677442.62367: done with _execute_module (k8s, {'state': 'present', 'definition': {'apiVersion': 'networking.k8s.io/v1', 'kind': 'NetworkPolicy' ..., '_ansible_check_mode': False, '_ansible_no_log': False, '_ansible_debug': True, '_ansible_diff': False, '_ansible_verbosity': 2, '_ansible_version': '2.9.2', '_ansible_module_name': 'k8s', '_ansible_syslog_facility': 'LOG_USER', '_ansible_selinux_special_fs': ['fuse', 'nfs', 'vboxsf', 'ramfs', '9p', 'vfat'], '_ansible_string_conversion_action': 'warn', '_ansible_socket': None, '_ansible_shell_executable': '/bin/sh', '_ansible_keep_remote_files': False, '_ansible_tmpdir': '/opt/ansible/.ansible/tmp/ansible-tmp-1576664654.0683-190010863364070/', '_ansible_remote_tmp': '~/.ansible/tmp'})

Its random on which resource the ansible process gets stuck.

operator-sdk: 0.13 OpenShift 3.11

vinzent on Dec 18, 2019

Root cause seems to be a Python3 issue reported 6 days ago: https://bugs.python.org/issue39360 . We make use of the kubernetes-client/python project, which is generated by swagger-codegen.

Corresponding issue for swagger-codegen is here: https://github.com/swagger-api/swagger-codegen/issues/9991

fabianvf on Jan 22, 2020

Hi @camilamacedo86, as soon as we have a 4.2 cluster, we will try it out.

But so far, as the documentation is not clear which version of operator-sdk is compatible with which OpenShift version, we will stay with operator-sdk 0.8.1.

Also, according to the k8s module of Ansible 2.7/2.8/2.9, it is compatible with Openshift >= 0.6: https://docs.ansible.com/ansible/latest/modules/k8s_module.html

Even if that k8s module is using Python OpenShift and Python Kubernetes underneath, I don’t see why it would fail on OpenShift 3.11 or OpenShift 4.1. Keep also in mind that there are numerous of operator reconcile runs that work fine and it just freezes after a random number of reconciles. If there would be a compatibility issue, then I guess we would see it from the first run.

fitex on Dec 2, 2019

We have the same issue on all our OpenShift 3.11 and 4.1 clusters with operators using an ansible role. Our operators are namespaced.

The operator freezes at a random task in the role and we see that memory usage stays flatlined at his peak (Grafana). It can happen a few minutes after the operator was deployed, but also after two days. Eventually, it always freezes in all namespaces. We tried to add additional cpu/resources and even disable quota’s entirely, nothing helped so far.

We have the issue with v0.10 and v0.11. We don’t have the issue with v0.8, but that one consumes a lot of CPU (1 full core) and Mem (almost 1gb), which is about ten times more than v0.10 and v0.11.

We prefer not to move to a Go operator because of the effort of refactoring, but it seems that this is much more stable. It also uses only one container iso two in the operator pod. Also, all the operators included in Openshift 4.x are based on Go. I wonder if RedHat is still investing in Ansible Operators.

fitex on Nov 25, 2019

So we have the exact same problem. Any update on this?

depauna on Nov 21, 2019

@fabianvf the output of ps

Tue Oct 29 11:44:51 UTC 2019
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ansible       1  0.0  0.0   4352   412 ?        Ss   11:31   0:00 /tini -- bash -c ${OPERATOR} run ansible --watches-file=/opt/ansible/watches.yaml --inject-owner-ref=false $@
ansible       8  2.6  0.1 158972 56928 ?        Sl   11:31   0:21 /usr/local/bin/ansible-operator run ansible --watches-file=/opt/ansible/watches.yaml --inject-owner-ref=false
ansible     960  0.0  0.0  12036  1708 ?        Ss   11:33   0:00 bash -c while sleep 5; do date; ps auxww; done
ansible    5938  0.0  0.0  12036  1708 ?        Ss   11:37   0:00 bash -c while sleep 5; do date; ps 
auxww; done
ansible   13857 32.6  0.0 121368 25364 ?        S    11:41   0:57 /usr/bin/python3.6 /usr/local/bin/ansible-runner -vv --rotate-artifacts 20 -p /opt/ansible/playbooks/sequence.yml -i 1598098976185383115 run /tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform
ansible   13861 34.0  0.2 297380 87904 pts/0    Ssl+ 11:41   0:59 /usr/bin/python3.6 /usr/local/bin/ansible-playbook -i /tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform/inventory -e @/tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform/env/extravars -vv /opt/ansible/playbooks/sequence.yml
ansible   19336  8.0  0.2 299504 83624 pts/0    S+   11:44   0:00 /usr/bin/python3.6 /usr/local/bin/ansible-playbook -i /tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform/inventory -e @/tmp/ansible-operator/runner/isc.ibm.com/v1/ISCSequence/cp4s/iscplatform/env/extravars -vv /opt/ansible/playbooks/sequence.yml
ansible   19338  0.0  0.0  23032   948 ?        S    11:44   0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 5
ansible   19347  1.0  0.0  12396  1756 pts/0    S+   11:44   0:00 /bin/sh -c /usr/bin/python3.6 /tmp/.ansible/tmp/ansible-tmp-1572349491.1019864-279436973975095/AnsiballZ_k8s.py && sleep 0
ansible   19348 42.0  0.0  90548 21960 pts/0    R+   11:44   0:00 /usr/bin/python3.6 /tmp/.ansible/tmp/ansible-tmp-1572349491.1019864-279436973975095/AnsiballZ_k8s.py
ansible   19350  0.0  0.0  43956  1748 ?        R    11:44   0:00 ps auxww

The AnsiballZ_k8s.py - is a k8s ‘present’ command.

If I’m sending kill 19348 the current ansible command returns 0 and the playbook continues

mishindm on Oct 29, 2019