moby: docker service logs RPC error after network failure in swarm mode from 2/12 nodes
Description: We have a Docker swarm mode cluster with 12 servers, all running Docker 17.06. Tonight our data center had some network problems and some servers were offline for a few minutes. They are all back online and healthy now.
docker node ls shows them all as ready and active and they also received some tasks, which are running. We can also start new tasks on them without problems.
However, we cannot receive logs anymore from 2 of the 12 nodes.
docker service logs --tail=1000 testservice
Shows:
error from daemon in stream: Error grabbing logs: rpc error: code = 2 desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node 1yjkb8d3oh2tc8hrup7641mau is not available
What can we do?
Output of docker version:
Client:
Version: 17.06.0-ce
API version: 1.30
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:23:31 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.0-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:19:04 2017
OS/Arch: linux/amd64
Experimental: false
Additional environment details (AWS, VirtualBox, physical, etc.): Ubuntu 16.04 LTS
About this issue
- Original URL
- State: open
- Created 7 years ago
- Reactions: 52
- Comments: 72 (13 by maintainers)
I think the certificates have changed and are out of sync between the manager and the worker. I have experienced this with multiple Docker versions. My initial workaround was to create a new swarm (if it was on the master node) or to leave and rejoin (if it was on a worker node).
The fix which seems more of a real solution, and which worked for me, was to execute on the manager node:
docker swarm ca --rotate
This reset the certificates between the nodes, and then I could retrieve the logs from the services. I could also redeploy services.
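For reference, a minimal sketch of the two approaches described above; the node name and join details are placeholders:
# Option 1: rotate the swarm CA (run on a manager)
docker swarm ca --rotate
# Option 2: leave and rejoin (for an affected worker)
docker swarm leave                      # on the worker
docker node rm <node-name>              # on a manager, remove the stale entry
docker swarm join-token worker          # on a manager, prints the join command
# then run the printed "docker swarm join --token ..." command on the worker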
Workaround:
docker service logs -tf <service_name>
I’m seeing the same issue with 18.03.
Encountered the same issue on 19.03.8. Resolved via docker swarm ca --rotate.
Same question as @ProteanCode: is there anything that should discourage me from just executing docker swarm ca --rotate once a day to prevent this from happening?
@alexmnt's solution worked for me.
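Regarding the question above about rotating once a day: if someone did want to automate it, a hypothetical crontab entry on a manager might look like the line below (purely illustrative, not a recommendation, and the schedule is made up):
# root crontab on a manager node: rotate the swarm CA daily at 03:00
0 3 * * * /usr/bin/docker swarm ca --rotate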
Hi,
When you run docker service update, use option --force + --image and your logs will be live again 😃.
Like: docker service update --force --image nginx:latest yourservicename
i think it’s likely that the CA rotation is only making logs work again as a side effect. probably something about updating the node. if other things work and logs don’t, this is probably an issue in the log broker component on the manager. i haven’t had time to look at it yet, for which i sincerely apologize, but i’ll try to block out some time to fix it.
the bright side is i’ve become a much better programmer compared to when we first shipped logs, so when i finally do get around to fixing it it’ll likely be much more stable 😄
Had the issue after experimenting with some networks and ports settings on a service within a stack. This resolves my issue. Thanks @alexmnt!
This is true. But I think rotating the certs resolves some invalid state inside Docker. This is just a presumption, though.
Can confirm this still occurs on a regular basis on Docker 19.03.9. I just noticed it has been open for 2.5 years; what’s blocking this issue from moving forward?
faster way to fix this issue
Mixed versions (17.12.1-ce and 18.03.1-ce), 14 nodes (6 managers) with Debian Linux.
It’s 2020 and this still happens. docker swarm ca --rotate helped me, but I remember that I could also use docker stack rm and docker stack deploy (but this is only good for non-production envs).
Should we autorotate the CA cert?
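A rough sketch of the docker stack rm / docker stack deploy workaround mentioned above, assuming a stack named mystack deployed from docker-compose.yml (both names are placeholders):
# Remove the stack entirely (all of its services and tasks are removed)
docker stack rm mystack
# Redeploy it from the compose file
docker stack deploy -c docker-compose.yml mystack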
The CA rotation works, but I have this problem on a single-node server, so it shows that you may be wrong about the certificates being out of sync between manager and worker, since a single node cannot be out of sync with itself.
FYI, I created a stress-testing tool for SwarmKit which builds the latest version of it from source and puts it into an environment where network failures can be generated on purpose.
I was able to reproduce this issue with it. Here are some logs: https://pastebin.com/raw/8RW5AE4w and https://pastebin.com/raw/2tgz57nH
@dperny / @thaJeztah I can also see that there is now more information than earlier because of docker/swarmkit#2541
I hope this helps you find and fix the root cause of the issue.
EDIT: Here is now also full log from all the five nodes: https://pastebin.com/raw/wvFL3Aas
Same thing. 3 nodes, 1 manager. sudo docker service logs kafka3 fails, kafka3 itself works fine. Workaround: the -f option.
Same issue on a 10-node swarm on 17.09.0-ce. Errors vary by service and are a combination of illegal wireType X and received message length X exceeding the max size 134217728, so it looks like a protocol issue or data corruption. Tried @olljanat's directions, but my single manager node was already availability=pause and had no service containers on it.
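For anyone checking the same thing, node availability can be inspected and changed as below; what the referenced directions actually involve is an assumption on my part, and the node name is a placeholder:
# Show availability (Active / Pause / Drain) for every node
docker node ls
# Pause scheduling on a node, then reactivate it
docker node update --availability pause <node-name>
docker node update --availability active <node-name>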
docker service logs mysvc fails, docker service logs -f mysvc works. I’ve tried all combinations of restarts, affinity, etc. Results always hold true. 18.03.1-ce on 5 worker nodes, 18.06.1-ce on the manager node. Pi cluster. Debian 9 stretch.
@olljanat I’ve tried the requested steps and it did not remove the error message/behavior.
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node ymom7wdza0vf40rjabypg8ks2 is not available, node x9y7x0vc8o1jf863jxopsvfo2 is not available, node 5qliewb6axa12m6h1zorphzxg is not available, node sp7mbc191vradycld4e11mhqk is not available
This is with dedicated manager nodes. It seems to happen pretty consistently after we do upgrades.
Upgrades are done by bringing up a new autoscale group and connecting it to the existing swarm, setting all existing nodes to drain, demoting the managers, removing the previous autoscale group and then deleting the down nodes from the swarm.
After this process we start to get the error messages reported and logging becomes impacted.
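A sketch of the per-node part of that upgrade flow (the autoscale-group handling itself happens outside Docker; node names are placeholders):
# For each node being retired:
docker node update --availability drain <old-node>   # stop scheduling tasks on it
docker node demote <old-node>                         # only if it was a manager
# After the old instance is terminated and shows up as Down:
docker node rm <old-node>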
@dperny I did some tuning to my https://github.com/olljanat/swarmkit-stress-tester and now I’m able to reproduce this issue every time I run it.
Key things to get to that state were:
Here you can see how it looks when the swarm is in that state:
Force-updating the services does not fix the issue.
But restarting the managers one by one does.
An important note is also that when the swarm is in this state you can still read logs using the --follow switch.
My theory is that the problem is reaching the state where the log stream actually starts, and that is why restarting the managers fixes the issue.
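A minimal sketch of the manager-by-manager restart described above, assuming systemd-based hosts; the point is to wait for each manager to report Ready again before touching the next one:
# On each manager node, one at a time:
sudo systemctl restart docker
# From any manager, confirm the node is back before continuing:
docker node ls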
@raarts Short answer is yes, we use it as well 😉. But you need to update service by service; it takes a while, but it works without errors.
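Doing it service by service can be scripted roughly like this (a sketch; it force-updates every service in the swarm, which restarts their tasks):
# Force a rolling update of every service, one at a time
for svc in $(docker service ls -q); do
  docker service update --force "$svc"
done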
Updated to 18.02 and we still see this error. However, we did not scale any service up or down. It appears after some time without any outside influence or changes. Restarting nodes (sometimes all of them) in the cluster helps for a few hours/days.
I did fix it. I don’t know what really happened, but to me it sounded like an iptables rules problem. I flushed the rules on the host and it started working again. Hope this helps. Thanks all!
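For completeness, the flush described above would look roughly like this; note that flushing all rules is disruptive (it also removes Docker's own rules and any firewall policy), which is why the daemon is restarted afterwards so it can recreate its chains:
# Flush all iptables rules (disruptive; make sure you have console access)
sudo iptables -F
# Restart Docker so it recreates the DOCKER / DOCKER-USER chains
sudo systemctl restart docker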
We’re seeing an issue that seems similar, where a docker service logs prod-swarm_helloworld command returns some logs, but ends with an error about two nodes. This service only has one replica now, although it might have had more (and been updated down to 1) in the past. (Not 100% sure about that; it might have been removed and re-created since then.)
Another curious thing: node n5srg4wxoyqexvp1yazbicrlg is one of our other nodes, but qsemjduuwxrl5yawc1ofgfinj isn’t, at least not currently. We have added nodes to this swarm and removed them in the past, so it’s possible that this is a node that used to exist.
Anything else we can do to gather more information here? In particular, why does Swarm think that this service with one replica should have logs on three nodes, one of which doesn’t even exist any more?
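A few commands that may help gather that information (the node ID is a placeholder for one of the IDs from the error):
# Which nodes does the swarm currently know about?
docker node ls
# Where are (and were) this service's tasks scheduled?
docker service ps --no-trunc prod-swarm_helloworld
# Inspect a specific node that shows up in the error (fails if the node no longer exists)
docker node inspect --pretty <node-id>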