moby: cluster error is not so readable ("context deadline exceeded")

Description

Hi all,

I found that errors that originate in swarmkit, pass through the Docker daemon, and are finally returned to the client are not very readable.

We often get an RPC error such as "rpc error: code = 4 desc = context deadline exceeded", but we cannot tell which component returned it.

When a user calls an API, the handler may call swarmkit several times, and each call may return an error such as "rpc error: code = 4 desc = context deadline exceeded". As programmers, we need to find out which call returned it, but currently we cannot.

ping @aaronlehmann @tonistiigi

Output of docker version:

root@ubuntu:~# docker version
Client:
 Version:      1.13.0-rc4
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   88862e7
 Built:        Fri Dec 16 22:59:15 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.14.0-dev
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.4
 Git commit:   e75ca4f
 Built:        Mon Jan  9 02:24:51 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

root@ubuntu:~# docker info
Containers: 4
 Running: 0
 Paused: 0
 Stopped: 4
Images: 37
Server Version: 1.14.0-dev
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 194
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: 7ae07lyjg73yt09kdu2mrox49
 Error: rpc error: code = 4 desc = context deadline exceeded
 Is Manager: true
 ClusterID:
 Managers: 0
 Nodes: 0
 Orchestration:
  Task History Retention Limit: 0
 Raft:
  Snapshot Interval: 0
  Heartbeat Tick: 0
  Election Tick: 0
 Dispatcher:
  Heartbeat Period: Less than a second
 CA Configuration:
  Expiry Duration: Less than a second
 Node Address: 192.168.59.103
 Manager Addresses:
  0.0.0.0:2377
  192.168.59.103:2377
  192.168.59.104:2377
  192.168.59.105:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 51371867a01c467f08af739783b8beafc154c4d7
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 3.19.0-25-generic
Operating System: Ubuntu 14.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.954 GiB
Name: ubuntu
ID: Q2ZC:GWDN:27OH:GRMH:G6QU:W7QP:4TIX:Q5F6:YEVK:45XP:EXHC:HOB5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 31
 Goroutines: 96
 System Time: 2017-01-09T10:48:33.835141601+08:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://a.b.c/
Live Restore Enabled: false

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 18 (17 by maintainers)

Most upvoted comments

I think the situation has improved a lot since this issue was filed. There’s now a special error message when quorum has been lost for some time: https://github.com/docker/swarmkit/pull/2129

Most of the cases where a “context deadline exceeded” error showed up were actually loss-of-quorum situations, and now these show a more useful error.

I suppose there are further improvements we could make for other situations, like showing whether the deadline was exceeded in the client or in the swarm manager. This seems relatively unimportant, though.

What do you think? Should we close this issue?

Hi @allencloud, I tried to reproduce such a case but did not succeed. I then read the source code and made some changes, but I have no way to test them, and I got frustrated.

I will contact @dperny on Slack in the next few days to get some ideas on how to verify this change.

I don’t really like the current errors either.

I think whenever we encounter a context deadline exceeded error, we should rewrite it to a coherent explanation of what timed out, and perhaps list possible reasons that could cause the timeout (loss of quorum, etc).

I was chatting with @dperny and he was still busy with other things, hence me asking 😄 but ask away if help is needed with the implementation ❤️