moby: docker service logs RPC error after network failure in swarm mode from 2/12 nodes

Description: We have a Docker swarm mode cluster with 12 servers, all running Docker 17.06. Tonight our data center had some network problems and some servers were offline for a few minutes. They are all back online and healthy now.

docker node ls shows them all as ready and active, and they have also received some tasks, which are running. We can also start new tasks on them without problems.

However, we can no longer retrieve logs from 2 of the 12 nodes. docker service logs --tail=1000 testservice

Shows: error from daemon in stream: Error grabbing logs: rpc error: code = 2 desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node 1yjkb8d3oh2tc8hrup7641mau is not available

What can we do?

Output of docker version:

Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: false

Additional environment details (AWS, VirtualBox, physical, etc.): Ubuntu 16.04 LTS

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 52
  • Comments: 72 (13 by maintainers)

Most upvoted comments

I think the certificates have changed and are out of sync between the manager and the worker. I experienced this with multiple Docker versions. My initial workaround was to create a new swarm (if the problem was on the manager node) or to leave and rejoin (if it was on a worker node).

The fix which seems more like a proper solution, and which worked for me, was to execute on the manager node:

docker swarm ca --rotate

This reset the certificates between the nodes, and afterwards I could retrieve the logs from the services again. I could also redeploy services.
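For reference, a rough sketch of the leave/rejoin workaround, assuming the standard swarm CLI and placeholder node IDs and addresses:

# On the affected worker node:
docker swarm leave

# On a manager node: remove the stale node entry and print the join command
docker node rm <node-id>
docker swarm join-token worker

# Back on the worker, run the join command that was printed, e.g.:
docker swarm join --token <worker-token> <manager-ip>:2377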

Workaround: docker service logs -tf <service_name>

I’m seeing the same issue with 18.03.

Encountered the same issue on 19.03.8. Resolved via docker swarm ca --rotate.

Same question as @ProteanCode: is there anything that should discourage me from just executing docker swarm ca --rotate once a day to prevent this from happening?
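For what it's worth, if someone did want to schedule that, a minimal sketch would be a daily cron entry on a single manager (the path and schedule are hypothetical, and nothing in this thread confirms that daily rotation is advisable):

# /etc/cron.d/swarm-ca-rotate (hypothetical) - run on one manager only
0 3 * * * root docker swarm ca --rotate >> /var/log/swarm-ca-rotate.log 2>&1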

(Quoting @alexmnt's comment above recommending docker swarm ca --rotate.)

@alexmnt's solution worked for me.

Hi,

When you run docker service update, use the options --force and --image, and your logs will be live again 😃.

For example: docker service update --force --image nginx:latest yourservicename

I think it's likely that the CA rotation only makes logs work again as a side effect, probably because it updates the node. If other things work and logs don't, this is probably an issue in the log broker component on the manager. I haven't had time to look at it yet, for which I sincerely apologize, but I'll try to block out some time to fix it.

The bright side is that I've become a much better programmer compared to when we first shipped logs, so when I finally do get around to fixing it, it'll likely be much more stable 😄

Had the issue after experimenting with some network and port settings on a service within a stack.

docker service logs SERVICE # not working
docker service logs -f SERVICE # working
docker logs CONTAINER # working
docker logs -f CONTAINER # working
docker service update --force SERVICE # stuck, but service was restarted

This resolved my issue. Thanks @alexmnt!

docker swarm ca --rotate

"Since a single node cannot be out of sync with itself."

This is true. But I think rotating the certs resolves some invalid state inside Docker. That is just a presumption, though.

Can confirm this still occurs on a regular basis on Docker 19.03.9. I just noticed it has been open for 2.5 years; what's blocking this issue from moving forward?

A faster way to fix this issue (sketched below):

  • drain all your managers
  • reboot the managers
  • activate all your managers again

Mixed versions (17.12.1-ce and 18.03.1-ce), 14 nodes (6 managers) running Debian Linux.
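A rough sketch of that drain/reboot/activate cycle, one manager at a time (MANAGERS is a hypothetical list of manager hostnames; it assumes the commands run from a healthy manager with SSH access to the others):

for m in $MANAGERS; do
  docker node update --availability drain "$m"
  ssh "$m" sudo reboot
  # wait until the node reports ready again before re-activating it
  until [ "$(docker node inspect -f '{{ .Status.State }}' "$m")" = "ready" ]; do sleep 5; done
  docker node update --availability active "$m"
done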

It’s 2020 and this still happens.

docker swarm ca --rotate helped me, but I remember that I could also use docker stack rm and docker stack deploy (but this is only good for non-production envs).
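The stack-recreation route would look roughly like this (the stack name and compose file are placeholders; it recreates every task, hence non-production only):

docker stack rm mystack
docker stack deploy -c docker-compose.yml mystack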

Should we auto-rotate the CA cert?

(Quoting @alexmnt's comment above recommending docker swarm ca --rotate.)

The CA rotation works, but I have this problem on a single-node server, so it shows that you may be wrong about

"are out of sync between the manager and the worker"

since a single node cannot be out of sync with itself.

FYI, I created a stress-testing tool for SwarmKit which builds the latest version of it and puts it into an environment where network failures can be generated on purpose.

I was able to reproduce this issue with it. Here are some logs: https://pastebin.com/raw/8RW5AE4w and https://pastebin.com/raw/2tgz57nH

@dperny / @thaJeztah I can also see that there is now more information than earlier because of docker/swarmkit#2541

I hope that this helps you find and fix the root cause of the issue.

EDIT: Here is now also the full log from all five nodes: https://pastebin.com/raw/wvFL3Aas

Same thing. 3 nodes, 1 manager. sudo docker service logs kafka3 fails; the kafka3 service itself works fine.

Workarounds:

  • use the -f option
  • reboot node3
Client:
 Version:       17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built: Wed Dec 27 20:11:19 2017
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:09:53 2017
  OS/Arch:      linux/amd64
  Experimental: true

Same issue on a 10-node swarm running 17.09.0-ce. Errors vary by service and are a combination of illegal wireType X and received message length X exceeding the max size 134217728, so it looks like a protocol issue or data corruption.

Tried @olljanat's directions, but my single manager node was already availability=pause and had no service containers on it. docker service logs mysvc fails; docker service logs -f mysvc works. I've tried all combinations of restarts, affinity, etc. The results always hold true.

18.03.1-ce on 5 worker nodes, 18.06.1-ce on the manager node. Pi cluster. Debian 9 (stretch).

@olljanat I’ve tried the requested steps and it did not remove the error message/behavior.

error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node ymom7wdza0vf40rjabypg8ks2 is not available, node x9y7x0vc8o1jf863jxopsvfo2 is not available, node 5qliewb6axa12m6h1zorphzxg is not available, node sp7mbc191vradycld4e11mhqk is not available

This is with dedicated manager nodes. It seems to happen pretty consistently after we do upgrades.

Upgrades are done by bringing up a new autoscale group and connecting it to the existing swarm, setting all existing nodes to drain, demoting the managers, removing the previous autoscale group and then deleting the down nodes from the swarm.

After this process we start to get the reported error messages and logging is impacted.
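The Docker-side portion of that procedure boils down to something like the following (node names are placeholders; the autoscale-group changes themselves happen outside Docker):

docker node update --availability drain <old-node>
docker node demote <old-manager>
# once the old autoscale group is terminated and its nodes show as Down:
docker node rm <old-node>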

@dperny I did some tuning to my https://github.com/olljanat/swarmkit-stress-tester and now I’m able to reproduce this issue every time I run it.

The key things to get to that state were (a rough reproduction sketch follows the list):

  • Have test services which print a huge amount of data to the console all the time.
  • Take the workers' network connection down for one minute.
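A rough reproduction sketch of those two steps, assuming a throwaway swarm (the service name and manager IP are placeholders):

# 1) a service that floods its log stream on every node
docker service create --name noisy --mode global busybox sh -c 'while true; do date; done'

# 2) on each worker, cut connectivity to the manager for about a minute
iptables -I OUTPUT -d <manager-ip> -j DROP
sleep 60
iptables -D OUTPUT -d <manager-ip> -j DROP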

Here you can see how it looks when the swarm is in that state:

$ docker run -it -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/swarmkit-stress-tester:/tmp/swarmkit-stress-tester --rm --name debug swarmkit-stress-tester bash

root@87d35c5ae29f:/# swarmctl -s /tmp/swarmkit-stress-tester/manager1.sock node ls
ID                         Name      Membership  Status  Availability  Manager Status
--                         ----      ----------  ------  ------------  --------------
4akj54xwh8msyldmbppr8scw1  manager1  ACCEPTED    READY   ACTIVE        REACHABLE *
6w5dojd5vy9y1phadzngllzto  manager3  ACCEPTED    READY   ACTIVE        REACHABLE
cxlvr2wb37g7pqh6ix1r2i4zk  worker1   ACCEPTED    READY   ACTIVE
e4sc5h8mu30kmzc3wo2g7gt46  manager2  ACCEPTED    READY   ACTIVE        REACHABLE
gcn1uwjublun7a254lnjaz1dc  worker2   ACCEPTED    READY   ACTIVE

root@87d35c5ae29f:/# swarmctl -s /tmp/swarmkit-stress-tester/manager1.sock service ls
ID                         Name  Image         Replicas
--                         ----  -----         --------
i0r653gvx14q4ke4jy4vzfwh6  foo   test-service  global
teblcawm1zelrnn210tlaiius  bar   test-service  2/2

root@87d35c5ae29f:/# swarmctl -s /tmp/swarmkit-stress-tester/manager1.sock service logs bar
Error: failed receiving stream message: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node cxlvr2wb37g7pqh6ix1r2i4zk is not available, node gcn1uwjublun7a254lnjaz1dc is not available

root@87d35c5ae29f:/# swarmctl -s /tmp/swarmkit-stress-tester/manager1.sock service logs foo
Error: failed receiving stream message: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node gcn1uwjublun7a254lnjaz1dc is not available, node cxlvr2wb37g7pqh6ix1r2i4zk is not available

A force update of the services does not fix the issue.

swarmctl -s /tmp/swarmkit-stress-tester/manager1.sock service update --image test-service --force bar
swarmctl -s /tmp/swarmkit-stress-tester/manager1.sock service update --image test-service --force foo

But restarting the managers one by one does.

./workaround.sh

An important note is also that when the swarm is in this state you can still read logs using the --follow switch.

My theory is that the problem is in reaching the state where the log stream actually starts, and that is why restarting the managers fixes the issue.

@raarts The short answer is yes, we use it as well 😉. But you need to update service by service, which takes a while, but it works without errors.
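A minimal sketch of forcing an update for every service, run from a manager (assumes the default rolling-update settings are acceptable):

for svc in $(docker service ls -q); do
  docker service update --force "$svc"
done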

Updated to 18.02 and we still see this error. However, we did not scale any service up or down. It appears after some time without any influence or changes from outside. Restarting nodes (sometimes all of them) in the cluster helps for a few hours/days.

Feb 11 15:40:47 worker-node-3 systemd-udevd[23105]: Could not generate persistent MAC address for veth3f5f398: No such file or directory
Feb 11 15:40:47 worker-node-3 systemd-udevd[23106]: Could not generate persistent MAC address for vetha72eabc: No such file or directory
Feb 11 15:40:47 worker-node-3 kernel: veth11: renamed from vetha72eabc
Feb 11 15:40:47 worker-node-3 kernel: device veth11 entered promiscuous mode
Feb 11 15:40:47 worker-node-3 kernel: device vethc7d3eda entered promiscuous mode
Feb 11 15:40:47 worker-node-3 systemd-udevd[23164]: Could not generate persistent MAC address for vethc7d3eda: No such file or directory
Feb 11 15:40:47 worker-node-3 systemd-udevd[23163]: Could not generate persistent MAC address for veth9950dca: No such file or directory
Feb 11 15:40:47 worker-node-3 kernel: IPv6: ADDRCONF(NETDEV_UP): vethc7d3eda: link is not ready
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=error msg="time=\"2018-02-11T14:40:47Z\" level=error msg=\"resource not found\" host=\"unix:///var/run/rexray/522768172
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=error msg="time=\"2018-02-11T14:40:47Z\" level=error msg=\"error: api call failed\" error.resourceID=\"cr-test_registry
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=error msg="time=\"2018-02-11T14:40:47Z\" level=error msg=\"docker-legacy: Mount: cr-test_registrydb-data: failed: resou
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47.340950579+01:00" level=error msg="fatal task error" error="VolumeDriver.Mount: docker-legacy: Mount: cr-test_registrydb-data: failed
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/3f4bae9977ea0ada5d6d1fc31124cb7f0e1dc8ded
Feb 11 15:40:47 worker-node-3 kernel: IPVS: Creating netns size=2192 id=2276
Feb 11 15:40:47 worker-node-3 kernel: eth0: renamed from veth3f5f398
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered disabled state
Feb 11 15:40:47 worker-node-3 kernel: br0: port 2(veth11) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: br0: port 2(veth11) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: eth1: renamed from veth9950dca
Feb 11 15:40:47 worker-node-3 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc7d3eda: link becomes ready
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47.954310542+01:00" level=warning msg="failed to deactivate service binding for container cr-test_helmsmandb.1.rovepl7s9jp77pruaw0n1hey
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.753833142+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.754664008+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.755839959+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.756443373+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:52 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:52.363615491+01:00" level=error msg="fatal task error" error="create cr-test_pg-data: driver 'local' already has volume 'cr-test_pg-dat

I did fix it. I don't know what really happened, but to me it sounded like an iptables rules problem. I flushed the rules on the host and it started working again. Hope this helps. Thanks, all!
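For context, "flushing the rules" amounts to something like the following; note that it is destructive (it removes all filter-table rules on the host), and restarting the daemon is what recreates Docker's own rules:

iptables -F
systemctl restart docker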

We’re seeing an issue that seems similar, where a docker service logs prod-swarm_helloworld command returns some logs, but ends with an error about two nodes:

prod-swarm_helloworld.2.eqvltu7tn2fu@omni-prod-defaultworker-b4.prod.omni.carezen.net    | > node_helloworld@1.0.0 start /usr/src/app
prod-swarm_helloworld.2.eqvltu7tn2fu@omni-prod-defaultworker-b4.prod.omni.carezen.net    | > node server.js
prod-swarm_helloworld.2.eqvltu7tn2fu@omni-prod-defaultworker-b4.prod.omni.carezen.net    |
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node qsemjduuwxrl5yawc1ofgfinj is not available, node n5srg4wxoyqexvp1yazbicrlg is not available

This service only has one replica now, although it might have had more (and been updated down to 1) in the past. (Not 100% sure about that – it might have been removed and re-created since then.)

Another curious thing: Node n5srg4wxoyqexvp1yazbicrlg is one of our other nodes; but qsemjduuwxrl5yawc1ofgfinj isn’t, at least not currently. We have added nodes to this Swarm, and removed them, in the past, so it’s possible that this is a node that used to exist.

Anything else we can do to gather more information here? In particular, why does Swarm think that this service with one replica should have logs on three nodes, one of which doesn't even exist any more?