salt: [BUG] Empty formula makes `salt-call` to hang forever.

Description

Empty formula makes salt-call to hang forever.

Setup

We have a simple formula, that allows to add some custom states during the highstate, based on the minion’s FQDN: custom_minion_state/init.sls:

{%- set include_dir = grains['fqdn'] | replace('.', '_') %}
{#- test: slspath + '/' + include_dir #}
{%- if salt['cp.cache_dir']("salt://{0}/{1}".format(slspath, include_dir)) %}
include:
  - .{{ include_dir }}
{%- endif %}

In 99% of the cases that renders to the empty SLS file, as there is no customization necessary.

Steps to Reproduce the behavior

Applying that formula to the host, which doesn’t require any customization produces:

# salt-call state.apply custom_minion_state
local:

Summary for local
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms

For Salt 2018.3.3 salt-call immediately returns to the shell. But for Salt 3000.3 it hangs after this output forever(?). Well, long enough to lose the temper and kill it with Ctrl-C.

Expected behavior Application of the empty state should exit immediately after the state has been evaluated and applied.

Versions Report

salt --versions-report Salt Version: Salt: 3000.3

Dependency Versions: cffi: 1.11.5 cherrypy: Not Installed dateutil: 2.6.1 docker-py: 4.2.0 gitdb: Not Installed gitpython: Not Installed Jinja2: 2.10.1 libgit2: Not Installed M2Crypto: 0.35.2 Mako: Not Installed msgpack-pure: Not Installed msgpack-python: 0.6.2 mysql-python: Not Installed pycparser: 2.14 pycrypto: Not Installed pycryptodome: 3.9.7 pygit2: Not Installed Python: 3.6.8 (default, Nov 21 2019, 19:31:34) python-gnupg: Not Installed PyYAML: 3.12 PyZMQ: 19.0.0 smmap: Not Installed timelib: Not Installed Tornado: 4.5.3 ZMQ: 4.3.2

System Versions: dist: centos 8.1.1911 Core locale: UTF-8 machine: x86_64 release: 4.18.0-147.8.1.el8_1.x86_64 system: Linux version: CentOS Linux 8.1.1911 Core

Additional context Running the same command with -l debug gives a bit more information:

...
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001::13b7]:4506
[DEBUG   ] Gathering pillar data for state run
[DEBUG   ] Finished gathering pillar data for state run
[INFO    ] Loading fresh modules for state activity
[DEBUG   ] LazyLoaded jinja.render
[DEBUG   ] LazyLoaded yaml.render
[DEBUG   ] Could not find file 'salt://custom_minion_state.sls' in saltenv 'base'
[DEBUG   ] In saltenv 'base', looking at rel_path 'custom_minion_state/init.sls' to resolve 'salt://custom_minion_state/init.sls'
[DEBUG   ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/custom_minion_state/init.sls' to resolve 'salt://custom_minion_state/init.sls'
[DEBUG   ] compile template: /var/cache/salt/minion/files/base/custom_minion_state/init.sls
[DEBUG   ] Jinja search path: ['/var/cache/salt/minion/files/base']
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001::13b7]:4506
[DEBUG   ] LazyLoaded cp.cache_dir
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001:67c:2e8:11::c10:13b7]:4506
[INFO    ] Caching directory 'custom_minion_state/prometheus/' for environment 'base'
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/custom_minion_state/init.sls' using 'jinja' renderer: 0.04874920845031738
[DEBUG   ] Rendered data from file: /var/cache/salt/minion/files/base/custom_minion_state/init.sls:
# vim: sts=2 ts=2 sw=2 et ai ft=sls
#
# This state is loaded from top.sls
#

[DEBUG   ] Results of YAML rendering:
{}
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/custom_minion_state/init.sls' using 'yaml' renderer: 0.00016641616821289062
[DEBUG   ] File /var/cache/salt/minion/accumulator/140045102329928 does not exist, no need to cleanup
[DEBUG   ] LazyLoaded state.check_result
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001::13b7]:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] LazyLoaded highstate.output
local:

Summary for local
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms
[DEBUG   ] Closing AsyncZeroMQReqChannel instance

Running through strace shows:

getpid()                                = 100123
getpid()                                = 100123
write(1, "\33[0;32mlocal:\33[0;0m\n\33[0;36m\nSumm"..., 201) = 201
write(1, "\n", 1)                       = 1
getpid()                                = 100123
epoll_ctl(20, EPOLL_CTL_DEL, 28, 0x7ffda432bbcc) = 0
getpid()                                = 100123
getpid()                                = 100123
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
getpid()                                = 100123
getpid()                                = 100123
getpid()                                = 100123
poll([{fd=23, events=POLLIN}], 1, -1)   = 1 ([{fd=23, revents=POLLIN}])
getpid()                                = 100123
read(23, "\1\0\0\0\0\0\0\0", 8)         = 8
getpid()                                = 100123
write(26, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f89908ed9d0, FUTEX_WAIT, 100201, NULL) = 0
close(27)                               = 0
close(26)                               = 0
close(25)                               = 0
close(24)                               = 0
close(23)                               = 0
epoll_ctl(20, EPOLL_CTL_DEL, 21, 0x7ffda432c0dc) = 0
close(21)                               = 0
close(22)                               = 0
close(20)                               = 0
close(5)                                = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f89a266fdc0}, {sa_handler=0x7f89a296d610, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f89a266fdc0}, 8) = 0
munmap(0x7f8992c15000, 262144)          = 0
munmap(0x7f8992c55000, 262144)          = 0
close(8)                                = 0
close(9)                                = 0
close(31)                               = 0
close(32)                               = 0
close(7)                                = 0
munmap(0x7f8993522000, 262144)          = 0
munmap(0x7f89934e2000, 262144)          = 0
close(30)                               = 0
getpid()                                = 100123
getpid()                                = 100123
getpid()                                = 100123
write(38, "\1\0\0\0\0\0\0\0", 8)        = 8
getpid()                                = 100123
poll([{fd=33, events=POLLIN}], 1, -1)   = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---

Applying the same state from the master completes immediately, but with the note that state completed with the ERROR.

master# salt 'prometheus*' state.apply custom_minion_state -l debug
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[::]:4506
[DEBUG   ] Trying to connect to: tcp://[::]:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] Reading minion list from /var/cache/salt/master/jobs/02/a26ac9e846136c8148eb8e74c4e1507421289c141e915aba2f429e146a8134/.minions.p
[DEBUG   ] get_iter_returns for jid 20200525163928599898 sent to {'prometheus'} will timeout at 18:39:33.605811
[DEBUG   ] jid 20200525163928599898 return from prometheus
[DEBUG   ] return event: {'prometheus': {'ret': {}, 'out': 'highstate', 'retcode': 2, 'jid': '20200525163928599898'}}
[DEBUG   ] LazyLoaded highstate.output
prometheus:

Summary for prometheus
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms
[DEBUG   ] jid 20200525163928599898 found all minions {'prometheus'}
[DEBUG   ] Closing IPCMessageSubscriber instance
ERROR: Minions returned with non-zero exit code

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 48 (36 by maintainers)

Most upvoted comments

@waynew @max-arnold Works for me after downgrading pyzmq to 18.0.1 and started hanging with pyzmq>=18.1.0.

IMO, the issue is pretty serious. Could it be considered for 3001.2? Waiting till 3002 (October?) is too much…

The bug does not sit in pyzmq but rather libzmq. I built pyzmq 18.0.0 with zmq 4.3.2 and got the hang! The break happened when they went from zmq 4.3.1 to 4.3.2. Note: When I ran libzmq tests I go no fails.

i’m sorry but i downgraded from libzmq 4.3.2 to libzmq 4.3.1 and it still hangs.

The bug does not sit in pyzmq but rather libzmq. I built pyzmq 18.0.0 with zmq 4.3.2 and got the hang! The break happened when they went from zmq 4.3.1 to 4.3.2. Note: When I ran libzmq tests I go no fails.

@DmitryKuzmenko @waynew @max-arnold @cmcmarrow unfortunately I can’t invest much time into further debugging of this one, but a number of quick experiments shows that downgrading pyzmq to 18.0.1 also resolves an issue for us (Ubuntu 20.04 / Python 3.8.2 / libzmq 4.3.2). We’ve switched to TCP transport until there’s no clean fix.

@sagetherage thanks!

@cmcmarrow please let me know if there’s anything I could provide in order to help address this one as soon as possible. I’d also be happy to run a few tests once a fix is available. TIA!

Thank you @defanator, this is a critical issue and we attempted to fix it the 3001.1 point release, but other competing priorities and some core team members not being able to reproduce it, caused us to delay.

The issue and the fix is moved to be committed in the Magnesium release. The engineer assigned has reproduced it and will provide a fix. We tried here and didn’t succeed.

Testing 3001 release on Ubuntu 20.04. It still hangs after every state.apply.