moby: Services cannot be created on a swarm during registry outages, even if the necessary images have already been pulled

Description

I have been creating services on a swarm all day for development work I’m doing, and it suddenly stopped working: docker service create would hang and eventually exit without creating the service.

After much digging, netstat revealed connections to registry-1.docker.io, which I discovered was down at the time. However, it was surprising to me that the registry was being contacted at all, since all nodes already had the image in question pulled.

Steps to reproduce the issue:

  1. Ensure a test image is pulled locally to all nodes in a swarm
  2. Offline the registry from which that image is pulled
  3. Attempt to docker service create (or, I’m told, docker service update a running service); a sketch of these steps follows this list
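
A rough sketch of these steps, assuming alpine:3.5 stands in for the test image and root access on the nodes (silently dropping traffic to the registry is just one way to simulate an outage, and may not reproduce the exact failure mode):

# on every node: make sure the image is already present locally
$ docker pull alpine:3.5

# on every node: silently drop traffic to the registry to simulate the outage
$ sudo iptables -I OUTPUT -d registry-1.docker.io -j DROP

# on a manager node: try to create a service from the already-pulled image
$ docker service create --replicas 1 --name helloworld alpine:3.5 ping docker.com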

Describe the results you received:

$ docker service create --replicas 1 --name helloworld alpine:3.5 ping docker.com
... ~5 minutes elapses...
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

The docker daemon logs only:

Feb 28 18:38:48 localhost docker[2613]: time="2017-02-28T18:38:48.968662738Z" level=warning msg="unable to pin image alpine:3.5 to digest: manifest unknown: manifest unknown"
Feb 28 18:38:48 localhost docker[2613]: time="2017-02-28T18:38:48.968996137Z" level=error msg="Error creating service helloworld: rpc error: code = 4 desc = context deadline exceeded"
Feb 28 18:38:48 localhost docker[2613]: time="2017-02-28T18:38:48.969256177Z" level=error msg="Handler for POST /v1.25/services/create returned error: rpc error: code = 4 desc = context deadline exceeded"

Describe the results you expected:

At a minimum, more helpful log messages indicating the registry connection failure, so that it would be readily apparent what the problem was.

However, the requirement to connect to the registry on service creation or update is a fairly big limitation. The services I’m creating are transient and we’ll potentially be creating hundreds or thousands per day. Because the images are already pulled, the dependency on an external service just to run containers is less than ideal.

Additional information you deem important (e.g. issue happens only occasionally):

After discussion with @programmerq on IRC I learned that service create and service update contact the registry to pin the digest, so this process will fail whenever the registry for the service’s image is down.
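
For reference, the pinned reference ends up in the service spec, so on a system where creation succeeds it can be seen with something like the following (I’m assuming the field path here from the 1.13-era service inspect output):

$ docker service inspect --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' helloworld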

However, it can be worked around by creating the service using the digest form of the image specifier, e.g.:

$ docker service create --replicas 1 --restart-condition none -e FOO=foo $(docker image inspect --format '{{(index .RepoDigests 0)}}' alpine:3.5)
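
Applied to the helloworld example from above, the workaround amounts to roughly:

# IMAGE is just a throwaway shell variable for readability
$ IMAGE=$(docker image inspect --format '{{(index .RepoDigests 0)}}' alpine:3.5)
$ docker service create --replicas 1 --name helloworld "$IMAGE" ping docker.com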

(btw, docker images --format '{{.Digest}}' alpine:3.5 and docker image ls --format '{{.Digest}}' alpine:3.5 both return <none>, but the documentation (at least for docker images) claims this should return the digest; is this a bug?)
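
(For what it’s worth, in my experience the --digests flag does populate the DIGEST column for images that were pulled from a registry, so it may be a simpler way to look the digest up; whether the --format placeholder should also work on its own, I don’t know.)

$ docker images --digests alpine:3.5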

Output of docker version:

Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:36:34 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:36:34 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 4
 Running: 0
 Paused: 0
 Stopped: 4
Images: 2
Server Version: 1.13.1
Storage Driver: aufs
 Root Dir: /data/docker/aufs
 Backing Filesystem: extfs
 Dirs: 56
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: gmyub5nxs7v84npk768zcs5au
 Is Manager: true
 ClusterID: pxno36t9783hgb10ok466k91v
 Managers: 1
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.18.2.200
 Manager Addresses:
  172.18.2.200:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 494.5 MiB
Name: swarm0
ID: UBFQ:KAEY:LQPZ:Q3AJ:A6LW:LSR5:ZQOR:WZFY:RFCT:BPIE:2GBP:UXOC
Docker Root Dir: /data/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Both nodes in the swarm are Debian 8 KVM guests running under SmartOS.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 25 (15 by maintainers)

Most upvoted comments

It’s still happening on version 19.03.3

oops, I was mistaken, it’s not in 17.03.1, but will be in 17.04

@rommik there was an example in my original report:

$ docker service create --replicas 1 $(docker image inspect --format '{{(index .RepoDigests 0)}}' alpine:3.5)

Or the two-step process given by @thaJeztah above.

@rommik a fix for this is in docker 17.03.1, which I think should be available now (see https://github.com/docker/docker/issues/31586)

As discussed above, providing the digest manually is one way to bypass contacting the registry, provided the images were pulled from the registry beforehand. But even without providing a digest, the only thing that should happen when the registry is down is a short delay (a few seconds) while the registry is contacted, after which service creation should proceed normally using the locally available image.
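
A quick way to confirm that a given node really does have the image locally (and so shouldn’t need the registry at all) is to inspect it; a non-zero exit status means the image is missing:

$ docker image inspect alpine:3.5 > /dev/null && echo 'image present locally'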

@aaronlehmann and I discussed this briefly, and it seems like there is a problem with context timeouts here. When S3 was down, the registry could be reached, but there was no response. This led to a special situation where a different context timed out first, leading to the context deadline exceeded error originally reported. In my local testing with the internet turned off, this is not an issue, since the registry cannot be contacted in the first place.