salt: [BUG] Salt master under load produces duplicate jobs

Description

Rarely, when issuing many jobs (salt master machine under heavy load), we observe the phenomenon that one state.highstate job issued via API is internally “duplicated”. Salt Master generates two Job IDs. The first ist processed “normally” but the 202 ACCEPTED response contains a second one. Querying for it, the master (rightly) complains that a state.highstate job is already running.

Setup

Some setup facts:

  • Salt-Master (and minions) run Salt 3005.1 on CentOS 7 installed from official RPM repo.
  • Master is serving 400 Minions. Everything is on AWS.
  • When we do our scheduled update window, we scale the Master VM to 4 vCPUs and 16GB RAM (the latter is not used up).
  • For that update procedure, a service issues state.highstate jobs via Salt REST API and polls for the job statuses regularly, limited to 50 concurrent jobs.
  • That regular poll (GET /jobs/<jid>) is done with a 10s interval, thus resulting on average in ca. 5 requests/s.
  • Performance/AP-relevant salt master config:
interface: 0.0.0.0
timeout: 30
worker_threads: 10
top_file_merging_strategy: merge

rest_cherrypy:
  port: 8000
  debug: True
  disable_ssl: True
...

During the procedure, the system load avg quickly reaches 100. That alone is most likely worth a separate issue. But every now and then, the above phenomenon occurs:

  • POST /minions request with state.highstate on a minion is issued.
  • Salt Master generates (and logs) two Job IDs.
  • The first one is actually executed on the minion
  • the 202 ACCEPTED response contains the second one.

Then querying for the second one obviously results in a message like The function "state.highstate" is running as PID 28193 and was started at 2023, Apr 12 04:54:15.509763 with jid <first-jid>

At first, I obviously suspected our own service (Java, Apache HTTP Client, known to have a builtin retry funcionality), so I set up tcpdump beforehand. And indeed, only one POST request is seen on the wire.

I sorted all timestamped events into a “time line”:

Time (CEST)     Source of time stamp          Message/Comment

06:53:43        (API client)                  starting highstate on minion
06:53:43.680303 (pcap timestamp)              POST /minions request
06:53:43        (HTTP response Date header)   HTTP response for POST received later has a Date: header with this time

06:54:15.509763 (Salt Master response body)   Start time of job 20230412045415509763
06:54:15.724    (Salt Master log)             2023-04-12 04:54:15,724 [salt.master      :2479][INFO    ][1934] User api Published command state.highstate with jid 20230412045415509763
06:54:15.759    (Salt Master log)             2023-04-12 04:54:15,759 [salt.master      :2479][INFO    ][1928] User api Published command state.highstate with jid 20230412045415567009
06:54:16.029    (Salt Master log)             2023-04-12 04:54:16,029 [salt.utils.job   :85  ][INFO    ][1935] Got return from <minion-id> for job 20230412045415567009
06:54:16.289338 (pcap timestamp)              Response Saltmaster 202 ACCEPTED for job 20230412045415567009
06:54:16        (API client)                  receives and logs response Job 20230412045415567009 has been accepted.
06:54:16.290551 (pcap timestamp)              first Request from API client for GET /jobs/20230412045415567009 (regular polling for job status)
06:54:16        (HTTP response Date header)   HTTP response Date: header for GET /jobs/20230412045415567009

06:54:48.817459 (pcap timestamp)              Response of Salt-Master to the GET request (including ["The function \"state.highstate\" is running as PID 28193 and was started at 2023, Apr 12 04:54:15.509763 with jid 20230412045415509763"] )
06:54:48        (API client)                  logs error due to already running job.

Indeed the two logs from salt master for the two jobs happen after receiving the job via REST API, but before responding to it.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
          Salt: 3005.1

Dependency Versions:
          cffi: 1.15.0
      cherrypy: unknown
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.1
       libgit2: 0.28.4
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: 2.6.1
  pycryptodome: Not Installed
        pygit2: 1.0.3
        Python: 3.6.8 (default, Nov 16 2020, 16:55:22)
  python-gnupg: Not Installed
        PyYAML: 3.13
         PyZMQ: 18.0.1
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4

System Versions:
          dist: centos 7 Core
        locale: UTF-8
       machine: x86_64
       release: 3.10.0-1160.88.1.el7.x86_64
        system: Linux
       version: CentOS Linux 7 Core

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 31 (16 by maintainers)

Most upvoted comments

nope that should be the same setting. debug is after profile. and the api doesn’t have anything that gets profiled.