moby: cluster error is not so readable ("context deadline exceeded")
Description
Hi all,
I found that the error returned from swarmkit to the docker daemon, and finally to the client, is not very readable.
We always get an rpc error like rpc error: code = 4 desc = context deadline exceeded, but we cannot tell which part of the system returned it.
When a user calls an API, the handler may call swarmkit several times, and each call may return an error like rpc error: code = 4 desc = context deadline exceeded. As programmers we need to debug where the error comes from, but currently we cannot.
ping @aaronlehmann @tonistiigi
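To make the ambiguity concrete, here is a minimal, self-contained sketch (plain Go, not actual moby code): two stand-in swarmkit calls share one context, and both fail with exactly the same "context deadline exceeded" text, so nothing in the error identifies which call timed out.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeSwarmkitCall stands in for any of the gRPC calls a daemon handler makes;
// like a real call, it only propagates the bare context error on timeout.
func fakeSwarmkitCall(ctx context.Context) error {
	select {
	case <-time.After(100 * time.Millisecond): // pretend the manager is slow
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	// A single handler may make several calls like these. Each one fails with
	// the identical "context deadline exceeded" text, so the message alone
	// never says which call (or which part of the cluster) timed out.
	if err := fakeSwarmkitCall(ctx); err != nil {
		fmt.Println("first call: ", err) // context deadline exceeded
	}
	if err := fakeSwarmkitCall(ctx); err != nil {
		fmt.Println("second call:", err) // context deadline exceeded
	}
}
```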
Output of docker version:
root@ubuntu:~# docker version
Client:
Version: 1.13.0-rc4
API version: 1.25
Go version: go1.7.3
Git commit: 88862e7
Built: Fri Dec 16 22:59:15 2016
OS/Arch: linux/amd64
Server:
Version: 1.14.0-dev
API version: 1.26 (minimum version 1.12)
Go version: go1.7.4
Git commit: e75ca4f
Built: Mon Jan 9 02:24:51 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
root@ubuntu:~# docker info
Containers: 4
Running: 0
Paused: 0
Stopped: 4
Images: 37
Server Version: 1.14.0-dev
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 194
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: 7ae07lyjg73yt09kdu2mrox49
Error: rpc error: code = 4 desc = context deadline exceeded
Is Manager: true
ClusterID:
Managers: 0
Nodes: 0
Orchestration:
Task History Retention Limit: 0
Raft:
Snapshot Interval: 0
Heartbeat Tick: 0
Election Tick: 0
Dispatcher:
Heartbeat Period: Less than a second
CA Configuration:
Expiry Duration: Less than a second
Node Address: 192.168.59.103
Manager Addresses:
0.0.0.0:2377
192.168.59.103:2377
192.168.59.104:2377
192.168.59.105:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 51371867a01c467f08af739783b8beafc154c4d7
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 3.19.0-25-generic
Operating System: Ubuntu 14.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.954 GiB
Name: ubuntu
ID: Q2ZC:GWDN:27OH:GRMH:G6QU:W7QP:4TIX:Q5F6:YEVK:45XP:EXHC:HOB5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 31
Goroutines: 96
System Time: 2017-01-09T10:48:33.835141601+08:00
EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://a.b.c/
Live Restore Enabled: false
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 18 (17 by maintainers)
I think the situation has improved a lot since this issue was filed. There’s now a special error message when quorum has been lost for some time: https://github.com/docker/swarmkit/pull/2129
Most of the cases where a “context deadline exceeded” error showed up were actually loss-of-quorum situations, and now these show a more useful error.
I suppose there are further improvements we could make for other situations, like showing whether the deadline was exceeded in the client or in the swarm manager. This seems relatively unimportant, though.
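For illustration, a rough sketch of that client-vs-manager distinction; describeDeadline is a hypothetical helper, not what docker/swarmkit#2129 implemented. It assumes that if the caller's own context has already expired the timeout happened locally, and otherwise the manager reported it.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// describeDeadline guesses where a DeadlineExceeded error originated by
// checking whether the caller's own context has already run out.
func describeDeadline(ctx context.Context, err error) string {
	if status.Code(err) != codes.DeadlineExceeded {
		return err.Error()
	}
	if ctx.Err() == context.DeadlineExceeded {
		// Our own deadline expired: the timeout happened on the client side
		// (manager unreachable, slow network, timeout set too low, ...).
		return "timed out waiting for a response from the swarm manager"
	}
	// Our context is still live, so the manager itself returned the error,
	// e.g. because it could not commit the request to the raft log in time.
	return "the swarm manager could not complete the request in time"
}

func main() {
	rpcErr := status.Error(codes.DeadlineExceeded, "context deadline exceeded")

	expired, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	<-expired.Done() // simulate a local timeout

	fmt.Println(describeDeadline(expired, rpcErr))              // attributed to the client side
	fmt.Println(describeDeadline(context.Background(), rpcErr)) // attributed to the manager
}
```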
What do you think? Should we close this issue?
Hi @allencloud, I was trying to reproduce such a case but didn't succeed, so I read the source code and made some changes, but I have no way to test them and got frustrated.
I will contact @dperny through Slack in the next few days to get some ideas on how to verify this change.
I don’t really like the current errors either.
I think whenever we encounter a "context deadline exceeded" error, we should rewrite it into a coherent explanation of what timed out, and perhaps list possible reasons for the timeout (loss of quorum, etc.).
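As a concrete illustration of that suggestion, a hedged sketch follows; explainTimeout and its wording are hypothetical, not an existing moby helper. It rewrites a bare DeadlineExceeded into a message that names the operation and lists likely causes, while keeping the original error wrapped for debugging.

```go
package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// explainTimeout rewrites a bare DeadlineExceeded into a human-readable
// message naming the operation and the likely causes, keeping the original
// error wrapped so callers can still inspect the gRPC code.
func explainTimeout(op string, err error) error {
	if err == nil || status.Code(err) != codes.DeadlineExceeded {
		return err
	}
	return fmt.Errorf(
		"%s timed out waiting for the swarm manager (possible causes: loss of raft quorum, overloaded manager, network partition): %w",
		op, err)
}

func main() {
	raw := status.Error(codes.DeadlineExceeded, "context deadline exceeded")
	err := explainTimeout("update of service web", raw)
	fmt.Println(err)

	// The original gRPC code is still reachable for programmatic checks.
	fmt.Println(status.Code(errors.Unwrap(err)) == codes.DeadlineExceeded) // true
}
```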
I was chatting with @dperny and he was still busy with other things, hence me asking 😄 but ask away if help is needed for implementing ❤️