salt: [BUG] v3004 leaks fds/pipes causing `Too many open files` crash

Description Several nodes all running salt-minion 3004 all seem to cause this crash before highstate completes. We’ve seen this crash both in cmd.run and also zmq code.

coredumpctl seems to think it’s zeromq holding open the sockets. There are a lot of these in the coredump backtrace.

Stack trace of thread 59913:
#0  0x00007f715250c116 epoll_wait (libc.so.6 + 0xfe116)
#1  0x00007f7151619fe1 n/a (libzmq.so.5 + 0x2dfe1)
#2  0x00007f715165f1cb n/a (libzmq.so.5 + 0x731cb)
#3  0x00007f7152778ea7 start_thread (libpthread.so.0 + 0x8ea7)
#4  0x00007f715250bdef __clone (libc.so.6 + 0xfddef)
[ERROR   ] An exception occurred in this state: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/state.py", line 2144, in call
    ret.update(self._run_check(low))
  File "/usr/lib/python3/dist-packages/salt/state.py", line 920, in _run_check
    _ret = self._run_check_unless(low_data, cmd_opts)
  File "/usr/lib/python3/dist-packages/salt/state.py", line 1053, in _run_check_unless
    cmd = self.functions["cmd.retcode"](
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/modules/cmdmod.py", line 2507, in retcode
    ret = _run(
  File "/usr/lib/python3/dist-packages/salt/modules/cmdmod.py", line 525, in _run
    env_bytes, env_encoded_err = subprocess.Popen(
  File "/usr/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.9/subprocess.py", line 1722, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
...
Jan 26 14:14:49 131m16 salt-highstate[7759]: Too many open files (src/epoll.cpp:65)
Jan 26 14:15:05 131m16 systemd-coredump[109479]: [🡕] Process 7759 (salt-call) of user 0 dumped core.
Jan 26 14:15:05 131m16 Process 7759 (salt-call) of user 0 dumped core.
jgroocock@131m16:~$ sudo ls -al /proc/114133/fd
total 0
dr-x------ 2 root root  0 Jan 26 13:41 .
dr-xr-xr-x 9 root root  0 Jan 26 13:41 ..
lr-x------ 1 root root 64 Jan 26 13:43 0 -> /dev/null
lrwx------ 1 root root 64 Jan 26 13:43 1 -> 'socket:[3220729]'
lrwx------ 1 root root 64 Jan 26 13:43 10 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 100 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 101 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 102 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 103 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 104 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 105 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 106 -> 'socket:[3881074]'
lrwx------ 1 root root 64 Jan 26 13:56 107 -> 'anon_inode:[eventpoll]'
lr-x------ 1 root root 64 Jan 26 13:56 108 -> 'pipe:[3831934]'
l-wx------ 1 root root 64 Jan 26 13:56 109 -> 'pipe:[3831934]'
lrwx------ 1 root root 64 Jan 26 13:43 11 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 110 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 111 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 112 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 113 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 114 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 115 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 116 -> 'socket:[3830398]'
lrwx------ 1 root root 64 Jan 26 13:56 117 -> 'anon_inode:[eventpoll]'
lr-x------ 1 root root 64 Jan 26 13:56 118 -> 'pipe:[3854314]'
l-wx------ 1 root root 64 Jan 26 13:56 119 -> 'pipe:[3854314]'
lrwx------ 1 root root 64 Jan 26 13:43 12 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 120 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 121 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 122 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 123 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 124 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 125 -> 'anon_inode:[eventfd]'

This was observed doing a full highstate on an unconfigured machine with several thousand states.

jgroocock@131m16:~$ systemctl show -P LimitNOFILE salt-highstate.service
524288

I’ve also noticed that v3004 has started dumping a handful of [ERROR ] Unable to connect pusher: Stream is closed errors into the highstate, which didn’t happen in 2019.2.8. I wonder if this could be related?

[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] Closing AsyncReqChannel instance
[DEBUG   ] Connecting the Minion to the Master publish port, using the URI: tcp://[<snip>]:4505
[DEBUG   ] <salt.transport.zeromq.PublishClient object at 0x7fda638a6f70> connecting to tcp://[<snip>]:4505
[TRACE   ] IPCClient: Connecting to socket: /var/run/salt/minion/minion_event_f9c4bf71a1_pull.ipc
[TRACE   ] IPCClient: Connecting to socket: /var/run/salt/minion/minion_event_f9c4bf71a1_pull.ipc
[ERROR   ] Unable to connect pusher: Stream is closed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/utils/event.py", line 421, in connect_pull
    self.pusher.connect(timeout=timeout)
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3/dist-packages/salt/transport/ipc.py", line 342, in _connect
    yield self.stream.connect(sock_addr)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.ext.tornado.iostream.StreamClosedError: Stream is closed
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key

Possibly points at https://github.com/saltstack/salt/commit/4cf62fbc955df7ea90d8267018ff62296e729253 ?

Setup (Please provide relevant configs and/or SLS files (be sure to remove sensitive info. There is no general set-up of Salt.)

Please be as specific as possible and give set-up details.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD

Steps to Reproduce the behavior salt-call state.highstate seems to trigger it. It’s unclear if the persistent minion process is affected so far.

??? Unclear as of yet.

Expected behavior No leaked fds/pipes.

Additional context We came from 2019.2.8 running on python3 that didn’t exhibit this behaviour (as far as I know). I can also reproduce this on master branch tip too.

jgroocock@133m11:~$ salt-call -V | grep -v Not\ Installed
Salt Version:
          Salt: 3004

Dependency Versions:
          cffi: 1.14.5
      dateutil: 2.8.1
        Jinja2: 2.11.3
      M2Crypto: 0.38.0
       msgpack: 1.0.0
     pycparser: 2.20
  pycryptodome: 3.9.7
        Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.10.76-cloudflare-2021.10.8
        system: Linux
       version: Debian GNU/Linux 11 bullseye

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

Just a heads up, this may not be a v3004 issue in particular. We saw the same or very similar failures with Too many open files immediately after upgrading from v2019.2.8 to v3003.3, which took us from python2 to 3 as well.

While we were able to resolve this by simply upping open file limits compared to where they were on 2019.2.8, the only box it has occurred on is a heavily active master-of-masters that is also a minion of itself. Like frebib, we hit this very quickly using salt-call while bootstrapping a new, fresh box, pointed at itself as the master, with everything identical except the upgraded os/python/salt versions.

We did see this crop up once more when processing many events and doubled limits again. Unfortunately, I don’t have stack traces from the time saved, but I can tell you that they look just like the ones above, including:

salt.ext.tornado.iostream.StreamClosedError: Stream is closed
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key