salt: [BUG] v3004 leaks fds/pipes causing `Too many open files` crash
Description
Several nodes running salt-minion 3004 hit this crash before highstate completes. We’ve seen the crash originate both in cmd.run and in the ZeroMQ transport code.
coredumpctl suggests it is ZeroMQ holding the sockets open; there are a lot of threads like this in the coredump backtrace:
Stack trace of thread 59913:
#0 0x00007f715250c116 epoll_wait (libc.so.6 + 0xfe116)
#1 0x00007f7151619fe1 n/a (libzmq.so.5 + 0x2dfe1)
#2 0x00007f715165f1cb n/a (libzmq.so.5 + 0x731cb)
#3 0x00007f7152778ea7 start_thread (libpthread.so.0 + 0x8ea7)
#4 0x00007f715250bdef __clone (libc.so.6 + 0xfddef)
[ERROR ] An exception occurred in this state: Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/salt/state.py", line 2144, in call
ret.update(self._run_check(low))
File "/usr/lib/python3/dist-packages/salt/state.py", line 920, in _run_check
_ret = self._run_check_unless(low_data, cmd_opts)
File "/usr/lib/python3/dist-packages/salt/state.py", line 1053, in _run_check_unless
cmd = self.functions["cmd.retcode"](
File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 149, in __call__
return self.loader.run(run_func, *args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1201, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1216, in _run_as
return _func_or_method(*args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/modules/cmdmod.py", line 2507, in retcode
ret = _run(
File "/usr/lib/python3/dist-packages/salt/modules/cmdmod.py", line 525, in _run
env_bytes, env_encoded_err = subprocess.Popen(
File "/usr/lib/python3.9/subprocess.py", line 951, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.9/subprocess.py", line 1722, in _execute_child
errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
...
Jan 26 14:14:49 131m16 salt-highstate[7759]: Too many open files (src/epoll.cpp:65)
Jan 26 14:15:05 131m16 systemd-coredump[109479]: Process 7759 (salt-call) of user 0 dumped core.
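For context, a minimal sketch (stock Python, not Salt code) of why the failure surfaces inside subprocess.Popen: Popen has to create its internal error pipe before it can fork/exec, so once the process fd table is full, os.pipe() is the first thing to raise EMFILE, exactly as in the traceback above.

import os
import resource
import subprocess

# Shrink the soft limit so the demo exhausts the fd table quickly.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

leaked = []
try:
    while True:
        # Simulate the leak: open descriptors and never close them.
        leaked.append(os.open("/dev/null", os.O_RDONLY))
except OSError:
    pass  # the fd table is now full

try:
    subprocess.Popen(["/bin/true"])
except OSError as exc:
    # OSError: [Errno 24] Too many open files, raised from os.pipe()
    print("Popen failed:", exc)
finally:
    for fd in leaked:
        os.close(fd)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))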
jgroocock@131m16:~$ sudo ls -al /proc/114133/fd
total 0
dr-x------ 2 root root 0 Jan 26 13:41 .
dr-xr-xr-x 9 root root 0 Jan 26 13:41 ..
lr-x------ 1 root root 64 Jan 26 13:43 0 -> /dev/null
lrwx------ 1 root root 64 Jan 26 13:43 1 -> 'socket:[3220729]'
lrwx------ 1 root root 64 Jan 26 13:43 10 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 100 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 101 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 102 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 103 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 104 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 105 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 106 -> 'socket:[3881074]'
lrwx------ 1 root root 64 Jan 26 13:56 107 -> 'anon_inode:[eventpoll]'
lr-x------ 1 root root 64 Jan 26 13:56 108 -> 'pipe:[3831934]'
l-wx------ 1 root root 64 Jan 26 13:56 109 -> 'pipe:[3831934]'
lrwx------ 1 root root 64 Jan 26 13:43 11 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 110 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 111 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 112 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 113 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 114 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 115 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 116 -> 'socket:[3830398]'
lrwx------ 1 root root 64 Jan 26 13:56 117 -> 'anon_inode:[eventpoll]'
lr-x------ 1 root root 64 Jan 26 13:56 118 -> 'pipe:[3854314]'
l-wx------ 1 root root 64 Jan 26 13:56 119 -> 'pipe:[3854314]'
lrwx------ 1 root root 64 Jan 26 13:43 12 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 120 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 121 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 122 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 123 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 124 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 125 -> 'anon_inode:[eventfd]'
This was observed doing a full highstate on an unconfigured machine with several thousand states.
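To quantify what is piling up, here is a small sketch that groups the open descriptors by type. It assumes only the /proc layout shown above; the PID is the salt-call instance you are watching (run it as root, like the ls above).

import collections
import os

pid = 114133  # the salt-call PID from the listing above; substitute your own

counts = collections.Counter()
fd_dir = f"/proc/{pid}/fd"
for fd in os.listdir(fd_dir):
    try:
        target = os.readlink(os.path.join(fd_dir, fd))
    except OSError:
        continue  # fd was closed between listdir() and readlink()
    # Group sockets and pipes regardless of inode number, keep everything
    # else (eventfd, eventpoll, regular files) verbatim.
    if target.startswith(("socket:", "pipe:")):
        target = target.split(":")[0]
    counts[target] += 1

for kind, n in counts.most_common():
    print(f"{n:6d}  {kind}")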
jgroocock@131m16:~$ systemctl show -P LimitNOFILE salt-highstate.service
524288
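For what it’s worth, the limit as seen from inside the process (which is what os.pipe() is actually hitting) can be checked with the standard resource module; this is stock Python, nothing Salt-specific:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")
# Under salt-highstate.service this should report 524288/524288; a salt-call
# run from an interactive shell may see a much lower soft limit (often 1024).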
I’ve also noticed that v3004 has started dumping a handful of `[ERROR ] Unable to connect pusher: Stream is closed` errors into the highstate output, which didn’t happen on 2019.2.8. I wonder whether this could be related?
[DEBUG ] Decrypting the current master AES key
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG ] Closing AsyncReqChannel instance
[DEBUG ] Connecting the Minion to the Master publish port, using the URI: tcp://[<snip>]:4505
[DEBUG ] <salt.transport.zeromq.PublishClient object at 0x7fda638a6f70> connecting to tcp://[<snip>]:4505
[TRACE ] IPCClient: Connecting to socket: /var/run/salt/minion/minion_event_f9c4bf71a1_pull.ipc
[TRACE ] IPCClient: Connecting to socket: /var/run/salt/minion/minion_event_f9c4bf71a1_pull.ipc
[ERROR ] Unable to connect pusher: Stream is closed
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/salt/utils/event.py", line 421, in connect_pull
self.pusher.connect(timeout=timeout)
File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 125, in wrap
raise exc_info[1].with_traceback(exc_info[2])
File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 131, in _target
result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3/dist-packages/salt/transport/ipc.py", line 342, in _connect
yield self.stream.connect(sock_addr)
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.ext.tornado.iostream.StreamClosedError: Stream is closed
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
Possibly points at https://github.com/saltstack/salt/commit/4cf62fbc955df7ea90d8267018ff62296e729253 ?
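To illustrate the kind of cleanup that looks like it is being skipped (this is not Salt’s actual transport code, just a pyzmq sketch with a hypothetical helper): as far as I understand, each ZeroMQ context/socket keeps internal eventfd/epoll descriptors alive, and each tornado IOLoop holds an epoll fd plus a waker, until they are explicitly closed. Constructing one per state call without tearing it down would leak a handful of fds each time, which matches the eventfd/eventpoll/pipe pattern in the /proc listing above.

import zmq

def push_event(uri: str, payload: bytes) -> None:
    # Hypothetical helper, for illustration only.
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    try:
        sock.connect(uri)      # e.g. an ipc:// or tcp:// endpoint
        sock.send(payload)
    finally:
        sock.close(linger=0)   # without close()/term() the context's internal
        ctx.term()             # eventfd and epoll fds stay open until the process exits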
Setup
(Please provide relevant configs and/or SLS files (be sure to remove sensitive info). There is no general set-up of Salt.)
Please be as specific as possible and give set-up details.
- on-prem machine
- VM (Virtualbox, KVM, etc. please specify)
- VM running on a cloud service, please be explicit and add details
- container (Kubernetes, Docker, containerd, etc. please specify)
- or a combination, please be explicit
- jails if it is FreeBSD
Steps to Reproduce the behavior
Running salt-call state.highstate seems to trigger it. It’s not yet clear whether the long-running salt-minion process is also affected; a quick way to watch the fd count grow during a highstate is sketched below.
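A rough sketch of what I mean by watching the leak grow (the 5-second interval is arbitrary; run it as root in a second terminal while salt-call state.highstate is executing):

import os
import sys
import time

pid = int(sys.argv[1])          # PID of the running salt-call process
while True:
    try:
        n = len(os.listdir(f"/proc/{pid}/fd"))
    except FileNotFoundError:
        break                   # the process has exited (or crashed)
    print(f"{time.strftime('%H:%M:%S')}  open fds: {n}")
    time.sleep(5)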
Expected behavior
No leaked fds/pipes.
Additional context
We came from 2019.2.8 running on Python 3, which didn’t exhibit this behaviour (as far as I know).
I can also reproduce this on the tip of the master branch.
jgroocock@133m11:~$ salt-call -V | grep -v Not\ Installed
Salt Version:
Salt: 3004
Dependency Versions:
cffi: 1.14.5
dateutil: 2.8.1
Jinja2: 2.11.3
M2Crypto: 0.38.0
msgpack: 1.0.0
pycparser: 2.20
pycryptodome: 3.9.7
Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
PyYAML: 5.3.1
PyZMQ: 20.0.0
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: debian 11 bullseye
locale: utf-8
machine: x86_64
release: 5.10.76-cloudflare-2021.10.8
system: Linux
version: Debian GNU/Linux 11 bullseye
About this issue
- State: open
- Created 2 years ago
- Comments: 16 (15 by maintainers)
Just a heads up, this may not be a v3004 issue in particular. We saw the same or very similar failures with `Too many open files` immediately after upgrading from v2019.2.8 to v3003.3, which took us from Python 2 to 3 as well.
While we were able to resolve this by simply upping open file limits compared to where they were on 2019.2.8, the only box it has occurred on is a heavily active master-of-masters that is also a minion of itself. Like frebib, we hit this very quickly using `salt-call` while bootstrapping a new, fresh box, pointed at itself as the master, with everything identical except the upgraded OS/Python/Salt versions.
We did see this crop up once more when processing many events and doubled the limits again. Unfortunately, I don’t have stack traces from that time saved, but I can tell you that they looked just like the ones above, including: