salt: Minions randomly becoming unresponsive, sometimes stuck at 100% CPU usage

Description of Issue/Question

I’ve been noticing that a number of my minions will stop responding after being left alone for a while. This did not start with a upgrade of Salt, a new state, or any other change I can attribute to it. There does not appear to be any one action which causes the minions to stop responding, they just do.

Setup

I have over 80 minions which are all identically configured (they are compute nodes in an HPC cluster). If we begin at a starting point of all minions up and running, after a few days I begin to see this when running a state or doing anything else via Salt:

cn32:
    Minion did not return. [No response]
cn66:
    Minion did not return. [No response]
cn59:
    Minion did not return. [No response]
cn68:
    Minion did not return. [No response]

… with more and more minions having the problem the longer time goes on. When I SSH to the system to investigate I sometimes, but not always, find a salt-minion process stuck at 100% CPU. That is a major problem in my environment and is a more pressing matter than being unable to communicate with a minion.

Steps to Reproduce Issue

  1. Start a minion.
  2. Wait.

Versions Report

# salt --versions-report
Salt Version:
           Salt: 2016.3.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.7.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: centos 7.2.1511 Core
        machine: x86_64
        release: 3.10.0-327.4.5.el7.x86_64
         system: Linux
        version: CentOS Linux 7.2.1511 Core

strace.out.gz

Please let me know what more you need. This is a major issue for me and at the moment SaltStack is unusable and is harming my systems (minion stuck at 100% == the scientific workloads of my machines can’t work appropriately).

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

We just saw this happening in production when a minion fails over. Is anybody who sees this running minions with multiple masters?