salt: [BUG] intermittent connection between master and minion
Description
I am seeing a strange connection issue in my Salt setup. There are ~30 minions registered with the master. For a few of them, the master can no longer reach them after a while: salt '*' test.ping fails with the following error message:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230920213139507242
Here are a few observations:
- Restarting the salt-minion service helps, but the same minion loses its connection again after a while.
- salt-call test.ping works fine on the minion side, and other commands like salt-call state.apply also work fine. This indicates that minion-to-master communication is fine, but master-to-minion communication is not.
- Below is the error message I found in the minion log. I tried bumping up the timeout, e.g. salt '*' -t 600 test.ping, but it doesn't help.
2023-09-20 13:00:08,121 [salt.minion :2733][ERROR ][821760] Timeout encountered while sending {'cmd': '_return', 'id': 'minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20230920195941006337', 'fun': 'test.ping', 'fun_args': [], 'user': 'root', '_stamp': '2023-09-20T19:59:41.114944', 'nonce': '3b23a38761fc4e98a694448d36ac7f97'} request
Does anyone have any idea what's wrong here and how to debug this issue?
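For anyone hitting the same symptom, a minimal debugging sketch; it assumes the default ZeroMQ ports (4505 for the publish channel, 4506 for the return channel) and a systemd-managed salt-minion service, which may differ from this setup:

# On the master: which minions does the master currently consider up or down?
salt-run manage.status
salt-run manage.down

# On the affected minion: are both TCP connections to the master still established?
sudo ss -tnp | grep -E ':4505|:4506'

# On the affected minion: stop the service and run the minion in the foreground
# with debug logging to watch the reconnect and job-return traffic live
sudo systemctl stop salt-minion
sudo salt-minion -l debug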
Setup
- The minion was installed with sudo ./bootstrap-salt.sh -A <master-ip-address> -i $(hostname) stable 3006.3. There is no custom config on the minion.
- The master runs inside a container using the image saltstack/salt:3006.3. Master config:
nodegroups:
  prod-early-adopter: L@minion-hostname-1
  prod-general-population: L@minion-hostname-2
  release: L@minion-hostname-3
  custom: L@minion-hostname-4
file_roots:
  base:
    - <path/to/custom/state/file>
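For reference, nodegroups defined this way can be targeted from the CLI with the -N option, e.g. (using a nodegroup name from the config above):

salt -N prod-early-adopter test.ping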
state file:
pull_state_job:
  schedule.present:
    - function: state.apply
    - maxrunning: 1
    - when: 8:00pm

deploy:
  cmd.run:
    - name: '<custom-command-here>'
    - runas: ubuntu
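A small sketch for confirming that this schedule actually landed on the minions; it only assumes the pull_state_job entry defined above:

# apply the state, then list the schedule entries each minion ended up with
salt '*' state.apply
salt '*' schedule.list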
Versions Report
salt --versions-report
Salt Version:
Salt: 3006.3
Python Version:
Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]
Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.2
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: Not Installed
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: alpine 3.14.6
locale: utf-8
machine: x86_64
release: 5.11.0-1022-aws
system: Linux
version: Alpine Linux 3.14.6
About this issue
- Original URL
- State: open
- Created 9 months ago
- Reactions: 3
- Comments: 19 (8 by maintainers)
This still seems to be an issue on 3006.7, even when the minion and master are the same version.
@darkpixel Yes, I have found the same thing and have the same workflow. Something just gets stuck and responses get lost somewhere. In my experience the minions are always still receiving events, however, as you say.