salt: salt-minion error on unstable tcp connection

I have a minion connecting to a master over the Internet. The routing between these two servers is a bit unstable, i.e. there is some packet loss, but many other services run fine; only Salt is having serious problems.

One server is in the Azure cloud, which does not allow ICMP, so here is a tcpping run to give an idea of the packet loss encountered:

root@minion1:~# ./tcpping salt-master.example.com 4506
seq 0: tcp response from salt-master.example.com (1.1.1.1) [open]  310.518 ms
seq 1: tcp response from salt-master.example.com (1.1.1.1) [open]  314.011 ms
seq 2: tcp response from salt-master.example.com (1.1.1.1) [open]  310.209 ms
seq 3: tcp response from salt-master.example.com (1.1.1.1) [open]  310.258 ms
seq 5: tcp response from salt-master.example.com (1.1.1.1) [open]  310.698 ms
seq 4: no response (timeout)
seq 7: tcp response from salt-master.example.com (1.1.1.1) [open]  313.063 ms
seq 8: tcp response from salt-master.example.com (1.1.1.1) [open]  309.968 ms
seq 6: no response (timeout)
seq 9: tcp response from salt-master.example.com (1.1.1.1) [open]  313.527 ms
seq 10: tcp response from salt-master.example.com (1.1.1.1) [open]  311.129 ms
seq 11: tcp response from salt-master.example.com (1.1.1.1) [open]  310.477 ms
seq 12: tcp response from salt-master.example.com (1.1.1.1) [open]  313.292 ms
seq 13: tcp response from salt-master.example.com (1.1.1.1) [open]  313.591 ms
seq 15: tcp response from salt-master.example.com (1.1.1.1) [open]  314.750 ms
seq 16: tcp response from salt-master.example.com (1.1.1.1) [open]  313.983 ms
seq 14: no response (timeout)
seq 17: tcp response from salt-master.example.com (1.1.1.1) [open]  310.181 ms
seq 19: tcp response from salt-master.example.com (1.1.1.1) [open]  312.813 ms
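
(For reference, roughly the same check can be reproduced without the tcpping script by timing plain TCP connects to the master's request port. The sketch below is only an illustration; the hostname, port, and counts are placeholders taken from the output above, not anything Salt provides.)

import socket
import time

def tcp_ping(host, port, count=20, timeout=2.0):
    # Time a plain TCP connect to host:port, roughly like the tcpping output above.
    for seq in range(count):
        start = time.time()
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(timeout)
        try:
            s.connect((host, port))
            print("seq %d: open  %.3f ms" % (seq, (time.time() - start) * 1000.0))
        except (socket.timeout, socket.error):
            print("seq %d: no response (timeout)" % seq)
        finally:
            s.close()
        time.sleep(1)

# Placeholder endpoint; substitute the real master address and port.
tcp_ping("salt-master.example.com", 4506)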

Here is “salt-minion -l debug”. Running “salt minion1 test.ping” from the master only works about 1 time out of 10. I am wondering whether the errors below are normal for a connection that is not 100% stable, and whether it is normal for Salt to work this badly over this kind of link (it is pretty much unusable, whereas other services work fine even though there is some packet loss):

[DEBUG   ] Ping master
[DEBUG   ] Re-using SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[INFO    ] SaltReqTimeoutError: after 60 seconds. (Try 1 of 3)
[INFO    ] SaltReqTimeoutError: after 60 seconds. (Try 2 of 3)
[INFO    ] SaltReqTimeoutError: after 60 seconds. (Try 3 of 3)
[INFO    ] fire_master failed: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 992, in _fire_master
    result = channel.send(load, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 314, in send
    return self._crypted_transfer(load, tries, timeout)
  File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 302, in _crypted_transfer
    return _do_transfer()
  File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 293, in _do_transfer
    timeout)
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 273, in send
    'SaltReqTimeoutError: after {0} seconds, ran {1} tries'.format(timeout * tried, tried)
SaltReqTimeoutError: SaltReqTimeoutError: after 180 seconds, ran 3 tries

(the same “Ping master” / SaltReqTimeoutError sequence and traceback repeat verbatim on every subsequent attempt)

I experience similar problems with other minions; all of them connect to the master over the Internet. You could argue that the connection in the example above is not good, but on a WAN connecting multiple clouds and many servers, a few less-than-ideal links are simply to be expected. Other services between those servers (mostly HTTP/HTTPS based) run fine. Is it normal for ZMQ to be this sensitive to packet loss? Other services recover without problems, yet with ZMQ I always seem to have minions losing their connection to the master.

I wanted to try RAET, but it has been broken for a few releases now; I tried following the RAET bug reports and the patches there, but sadly could never get it to work.

About this issue

  • State: closed
  • Created 9 years ago
  • Comments: 43 (22 by maintainers)

Most upvoted comments

I believe I have a lead as to what is going on. First, a little background: we are using SaltStack in combination with netbooting and Clonezilla to deploy one of several OS images to bare-metal machines and configure them to customer specifications. We use hardware of several different types: some high-powered (Intel i3/i5/i7 with 16 GB RAM) and some very low-powered (Intel J1900 with 4 GB RAM). The system is all on the same IP network, with no firewall or routing of any kind between the master and minions. Finally, for various reasons, we’re running Salt 2015.5.11 on Ubuntu 14.04; even the Clonezilla image is the Ubuntu-based one (14.04), customized with salt-minion.

We’ve noticed that the system seems pretty happy when no J1900-based machines are running. However, we have nearly immediate problems when the J1900s connect and start using resources; they pretty much bring the whole system to a halt with these errors.

Symptoms include many, but not all, connected systems failing their highstate runs with the “SaltReqTimeoutError” error; this includes even the high-powered systems. When this happens we sometimes cannot “test.ping” the machines whose highstate runs failed, but we can successfully restart the highstate runs (which then fail again within minutes with the same “SaltReqTimeoutError”). However, there are no other errors with any useful information in the logs anywhere on the various systems. One other interesting tidbit: when deploying machines using a “masterless” setup from a USB key, the deployment runs without error on ALL of our platforms (so I know the system works, and works correctly)!

Furthermore, I have been unable to replicate these symptoms on my smaller testing system - it only happens on our production floor, and only when we have several machines being deployed at the same time. We’ve gone so far as to replace the cluster hardware running the salt-master as well as change out ethernet switches and cables - none of that solved the issue.

So now with the facts out of the way, here’s my hypothesis: by virtue of how slow they are, the J1900s trigger some sort of wait situation in ZMQ which stops the keepalive processing. However, I do not know how to start debugging with this in mind. Maybe someone with more understanding of SaltStack’s or ZMQ’s internals can find something…

Let me know if I can be of any help. Please remember that the only place I can reliably reproduce this issue is a production floor that must be kept running…

@bymodude, @steverweber, thanks for working on this. Communication over sketchy or unfriendly networks has been a problem for a long time; fixing it is one of the reasons RAET and the tornado TCP transport were created as alternatives to ZMQ.
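
(As a pointer rather than a recommendation: on Salt releases that ship the tornado TCP transport, it is selected by setting “transport: tcp” in both the master and minion configuration; availability and behaviour depend on the Salt version in use.)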

@bymodude, you may have more success upgrading zmq to > 4.0.0 if possible. Another thing to look at is zmq_monitor. I remember another issue where an aggressive network was unhelpfully trimming packets or frames or something but I can’t find that issue, although @cachedout might remember.
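
(As a concrete starting point for the zmq_monitor suggestion, and assuming a reasonably recent libzmq/pyzmq, a standalone sketch like the one below prints the library versions in use and logs low-level connection events against the master's request port. The endpoint is a placeholder, and this is purely a debugging aid run outside of Salt.)

import zmq
from zmq.utils.monitor import recv_monitor_message

print("libzmq %s / pyzmq %s" % (zmq.zmq_version(), zmq.pyzmq_version()))

# Map numeric event codes back to their EVENT_* names for readable output.
EVENT_NAMES = dict((getattr(zmq, name), name)
                   for name in dir(zmq) if name.startswith("EVENT_"))

ctx = zmq.Context()
sock = ctx.socket(zmq.DEALER)
monitor = sock.get_monitor_socket()   # PAIR socket carrying connect/disconnect/retry events
sock.connect("tcp://salt-master.example.com:4506")   # placeholder endpoint

while monitor.poll(60000):            # stop after 60 seconds without any event
    evt = recv_monitor_message(monitor)
    print("%s value=%s endpoint=%s" % (EVENT_NAMES.get(evt["event"], evt["event"]),
                                       evt["value"], evt["endpoint"]))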

please try these settings…

ping_interval: 5
auth_timeout: 10
random_reauth_delay: 10
auth_tries: 2
auth_safemode: False
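
(For reference, these settings belong in the minion configuration, typically /etc/salt/minion or a drop-in file under /etc/salt/minion.d/, and the salt-minion service needs a restart for them to take effect.)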