salt: [BUG] Some 3006.6 minions not compatible with 3006.7 master

Description

After some new VMs this week got 3006.7 from the repository out-of-the-box, which did not work with the 3006.6 Salt master, I had to upgrade the salt-master to 3006.7. But now some 3006.6 minions are no longer reachable.

They log every 10 seconds:

2024-02-22 16:50:10,665 [salt.crypt       :823 ][ERROR   ][37572] The Salt Master has rejected this minion's public key.
To repair this issue, delete the public key for this minion on the Salt Master.
The Salt Minion will attempt to re-authenicate.

(sic)

Restarting the minion did not help.

So neither updating the master first nor the minions first resulted in a stable upgrade experience.

After upgrading the minion to 3006.7, reloading systemd and restarting the minion, the connection works.
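For reference, that recovery on an affected minion boils down to something like the following (a sketch assuming CentOS 7 with the official Salt repository configured; adjust the package manager invocation for your distribution):

```bash
# Sketch of the recovery described above; assumes CentOS 7 with the
# official Salt repository configured (package names may differ).
sudo yum upgrade -y salt-minion       # bring the minion up to 3006.7
sudo systemctl daemon-reload          # reload unit files changed by the upgrade
sudo systemctl restart salt-minion    # reconnect to the 3006.7 master
```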

Setup

Master and minions are installed from the official Salt repository (onedir packaging; relenv appears in the versions report) on CentOS 7.

Steps to Reproduce the behavior

Upgrade the master from 3006.6 to 3006.7 while minions stay on 3006.6. Affected minions then log the rejection error above every 10 seconds.

Expected behavior

I would expect minion-master compatibility across any mix of minor versions, and across at least one major-version step (limited, of course, to the features both sides support).

Versions Report

Output of salt --versions-report for both minion and master (note the version difference):

Minion:

```yaml
Salt Version:
          Salt: 3006.6

Python Version:
        Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.14.2
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: centos 7.9.2009 Core
        locale: utf-8
       machine: x86_64
       release: 3.10.0-1160.108.1.el7.x86_64
        system: Linux
       version: CentOS Linux 7.9.2009 Core
```

Master:

```yaml
Salt Version:
          Salt: 3006.7

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: 1.3.0
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.7.0
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: centos 7.9.2009 Core
        locale: utf-8
       machine: x86_64
       release: 3.10.0-1160.105.1.el7.x86_64
        system: Linux
       version: CentOS Linux 7.9.2009 Core
```


About this issue

  • State: closed
  • Created 4 months ago
  • Reactions: 4
  • Comments: 19 (10 by maintainers)

Most upvoted comments

Thank you @MartinEmrich for opening this; we were bitten by the same issue today when upgrading two syndics supporting a compute cluster (bare metal). We had already upgraded our upstream master, but only took it to v3006.6, so we didn't see the problem come up then.

Thank you also @darkpixel for the fix, which resolved the issue for us! We were able to clean up all of our minions whose minion.pub and minion.pem files contained newlines, restart the salt-minion service, and that cleared the problem on the affected hosts.

We dug a little deeper and have some info which may be useful in hunting down the root cause:

  • It turns out our syndics themselves had newline characters in their respective minion.pem and minion.pub files as well (a read-only check for this is sketched after this list). Interestingly, however, their syndic_master.pub files did not contain newline characters, even though syndic_master.pub and minion.pub/minion.pem were created less than 30 minutes apart
  • the version of Salt that performed the key creation was salt-3004.2-1.el8.noarch (both syndics are CentOS Stream 8 systems); the OpenSSL version installed at that time was openssl-1:1.1.1k-6.el8.x86_64
  • we don't believe the issue is strictly tied to a specific OS, as we saw impacted minions running a range of OSes, including but not limited to RHEL 7, RHEL 8, CentOS Linux 7, CentOS Stream 8, and openSUSE Leap 15

Hopefully something in there is useful!
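Building on those observations, a read-only way to spot affected key files might look like this (my sketch, not from the thread; it only reports trailing newlines and modifies nothing):

```bash
# Report Salt key files whose final byte is a newline (0x0a).
# Read-only: nothing is modified.
for f in /etc/salt/pki/minion/minion.pem /etc/salt/pki/minion/minion.pub; do
    last=$(tail -c 1 "$f" | od -An -tx1 | tr -d ' ')
    [ "$last" = "0a" ] && echo "trailing newline: $f"
done
```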

@MartinEmrich If you edit your minion.pub and minion.pem with vi -b minion.pub, run :set noeol, then :wq, and start the salt-minion service again, it will come back online until this gets fixed.

quick and dirty

minions:
sudo truncate -s -1 /etc/salt/pki/minion/minion.pem && sudo truncate -s -1 /etc/salt/pki/minion/minion.pub && sudo systemctl restart salt-minion

master:
sudo salt-key -A --include-denied

If a minion still fails, check its .pem and .pub files for a missing "-" at the end of the last line.
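A slightly more cautious variant of that one-liner (my sketch) backs each file up and truncates only when the last byte really is a newline, since truncate -s -1 would otherwise chop a character off the key:

```bash
#!/bin/sh
# Safer take on the quick-and-dirty fix above (run as root): back up each
# key file and strip the final byte only if it is actually a newline.
for f in /etc/salt/pki/minion/minion.pem /etc/salt/pki/minion/minion.pub; do
    if [ "$(tail -c 1 "$f" | od -An -tx1 | tr -d ' ')" = "0a" ]; then
        cp -p "$f" "$f.bak"      # keep a backup, just in case
        truncate -s -1 "$f"      # drop only the trailing newline
    fi
done
systemctl restart salt-minion
```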

> In case the code is not shut down, I'll credit you where it is appropriate (I'll have to find out).

Don't worry about this too much, I'm not really that attached to the solution 😉

> Also, please let me know if I'm breaking convention by using your solution in my PR.

I don't think you'll break any conventions, since the clean_key was already done on the incoming minion's key before (you can walk through the commit history to see that).

Adding salt.crypt.clean_key on the right side of this condition worked for me:

if salt.crypt.clean_key(pubfn_handle.read()) != salt.crypt.clean_key(load["pub"])
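To make the failure mode concrete, here is a small self-contained demonstration (the clean_key below is a stand-in sketch, not Salt's actual implementation; the point is only that a trailing newline breaks a byte-for-byte comparison):

```bash
# Self-contained demo of the mismatch; needs only python3.
# NOTE: clean_key here is a stand-in sketch, not salt.crypt.clean_key itself.
python3 - <<'EOF'
def clean_key(key):
    # Normalise a PEM string by stripping surrounding whitespace, so a
    # trailing newline no longer makes two otherwise identical keys differ.
    return key.strip()

cached = "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----\n"
sent   = "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----"

print(cached == sent)                        # False -> minion rejected
print(clean_key(cached) == clean_key(sent))  # True  -> minion accepted
EOF
```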

> Adding salt.crypt.clean_key on the right side of this condition worked for me.

Thank you for your feedback @kiniou, I've created a draft PR based on your solution. In case the code is not shut down, I'll credit you where it is appropriate (I'll have to find out). Also, please let me know if I'm breaking convention by using your solution in my PR; I'll retract it if that is the case.

> Adding salt.crypt.clean_key on the right side of this condition worked for me.

This is likely the best path forward for anyone looking for a quick fix that doesn't require changes to minions.

@darkpixel thanks for the hint, but since I have to SSH into every host now anyway, I will just upgrade the minions to 3006.7. For the few I have tried so far, that fixed it.

It's definitely the newline at the end of the file.

service salt-minion stop
vi -b /etc/salt/pki/minion/*

In vi, run :set noeol and then :wn for each file until it complains that you've reached the last file, then :wq. Run salt-call test.ping. If the key is accepted, great. If not, go to the master, run salt-key -d name-of-minion, and try salt-call test.ping again. It should work; then do service salt-minion restart.

I wish that were the case across my infrastructure. I’ve gone as far as removing the keys from the Minion and Master and starting the salt-minion service again. The “new” Minion key lands in the Denied list 10 seconds after it hits the Unaccepted list, with zero interaction in those 10 seconds.

Additionally, their SHA256 fingerprints match on both systems. Minion still gets auto-rejected unless it’s running 3006.7 like the Master.

Even more odd, I've got a couple of 3005 clients that are still connected, while all my 3006 and 3004 clients failed immediately. My 3006 clients were upgraded to 3006.7, keys replaced, and they're working now. My 3004 clients cannot be upgraded for the next four months; I've been working on them all day with no luck. Removing the key (which was not preseeded, and these are not Windows or Mac machines) does not help. I have verified the SHA256 fingerprints (in binary and non-binary modes) on both Minion and Master; they're identical.
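For anyone wanting to repeat that fingerprint comparison, the usual commands are salt-key -f on the master and salt-call --local key.finger on the minion (the minion-id below is a placeholder; make sure hash_type matches on both sides):

```bash
# Compare key fingerprints on both sides.
# On the master: fingerprint of the cached key for one minion.
salt-key -f '<minion-id>'
# On the minion: fingerprint of the local key pair.
salt-call --local key.finger
```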

@darkpixel @twangboy is away for a few days. I will take a look at this soon.