moby: Docker Engine's Swarm failing to use credential helper when scaling.
Description
When using docker service scale or docker service update --replicas X, Docker Engine’s Swarm will not fetch a new authentication token via the credential helper if the one already stored on the service definition has expired. AWS auth tokens, for example, expire every 12 hours. As a result, new replicas can only start on nodes that already have the image downloaded.
This probably is also an issue when deploying a new version of an image, but I have not tested that yet.
I posted a work-around at the end of this issue: passing --with-registry-auth along with docker service update --replicas X. However, it has the side effect of restarting all running containers, which is disruptive.
Steps to reproduce the issue:
Setup
- Create a repo in ECR called ‘redis’. I’m using the new Ohio region (us-east-2).
- Ensure your AWS user has write/pull access to this repo. It is easiest to just set up a managed policy.
- Tag the official redis:latest image for the ECR repo.
docker tag redis:latest REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest
- Generate local ECR token and push it real good.
$ eval $(aws ecr get-login --region us-east-2)
$ docker push REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest
The push refers to a repository [REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis]
a58d4434732b: Pushed
741b78d804b7: Pushed
78731fd42c78: Pushed
c235d5b4caa3: Pushed
307248831aca: Pushed
387483b2c715: Pushed
a2ae92ffcd29: Pushed
latest: digest: sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 size: 1783
Create Service
- Set up a new Swarm with 1 Manager and 1 Worker, using Ubuntu 16.04 and Docker 1.13.1 (official steps to install). If you are creating these nodes in EC2, ensure they have an IAM Role you can use for testing.
- Your IAM Role or User should have read access to the ECR repo. I used the Managed Policy AmazonEC2ContainerRegistryReadOnly.
- Install aws-cli ONLY on the manager (needed for credential helper). Run as ROOT.
sudo su
apt-get install -y python-pip && pip install awscli
mkdir -p /home/ubuntu/.aws && \
printf "[default]\noutput = json\nregion = us-east-2" > /home/ubuntu/.aws/config
- If your Manager node is NOT in AWS, ensure you have your read-only IAM User set up with aws configure.
- Install Amazon ECR Credential Helper ONLY on the manager. Run as ROOT.
sudo su
apt-get install -y make
cd ~ && \
git clone https://github.com/awslabs/amazon-ecr-credential-helper.git && \
cd amazon-ecr-credential-helper && \
make docker && \
mv ./bin/local/docker-credential-ecr-login /usr/local/bin/
mkdir -p /home/ubuntu/.docker && printf '{\n "credsStore": "ecr-login"\n}' > /home/ubuntu/.docker/config.json
- Create the visualizer service. This ensures that the Manager already has a container running, which should push our Redis service to spawn on the Worker instead of the Manager later.
docker service create \
--name=viz \
--publish=8080:8080/tcp \
--constraint=node.role==manager \
--mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
manomarks/visualizer
- Create redis service. You can see it spawn on the Worker (The ip- hostname is not the same as the Manager we’re on). You can also see that ECR credential helper was used via its log files.
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker service create --with-registry-auth --name redis --replicas 1 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest
v6oozr2ki7tirslfgvqxzhyve
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker service ps redis
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
v9fy6b9t02c6 redis.1 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-98 Running Preparing 5 seconds ago
ubuntu@ip-10-2-0-38:~/.ecr/log$ ll
total 12
drwxrw-r-x 2 ubuntu ubuntu 4096 Feb 15 20:05 ./
drwxrw-r-x 3 ubuntu ubuntu 4096 Feb 15 20:05 ../
-rw-rw-r-- 1 ubuntu ubuntu 736 Feb 15 20:08 ecr-login.log.2017-02-15-20
ubuntu@ip-10-2-0-38:~/.ecr/log$
# Further verification
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
wpipkd3orljqyyrftntzy7rlg * ip-10-2-0-38 Ready Active Leader
zh7hmacxjor2uvfqvq0p0bdg3 ip-10-2-0-98 Ready Active
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker node ps wpipkd3orljqyyrftntzy7rlg zh7hmacxjor2uvfqvq0p0bdg3
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
v9fy6b9t02c6 redis.1 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-98 Running Running 2 minutes ago
rphsczqpqqqm viz.1 manomarks/visualizer:latest ip-10-2-0-38 Running Running 4 minutes ago
rphsczqpqqqm \_ viz.1 manomarks/visualizer:latest ip-10-2-0-38 Running Running 4 minutes ago
ubuntu@ip-10-2-0-38:~/.ecr/log$ cat ecr-login.log.2017-02-15-20
2017-02-15T20:05:19Z [DEBUG] Retrieving credentials for REDACTED in us-east-2 (REDACTED.dkr.ecr.us-east-2.amazonaws.com)
2017-02-15T20:05:19Z [DEBUG] GetCredentials for REDACTED
2017-02-15T20:05:19Z [DEBUG] Checking file cache for REDACTED
2017-02-15T20:05:19Z [DEBUG] Calling ECR.GetAuthorizationToken for REDACTED
2017-02-15T20:05:19Z [DEBUG] Saving credentials to file cache for REDACTED
2017-02-15T20:08:49Z [DEBUG] Retrieving credentials for REDACTED in us-east-2 (REDACTED.dkr.ecr.us-east-2.amazonaws.com)
2017-02-15T20:08:49Z [DEBUG] GetCredentials for REDACTED
2017-02-15T20:08:49Z [DEBUG] Checking file cache for REDACTED
2017-02-15T20:08:49Z [DEBUG] Using cached token for REDACTED
- At this point, the Manager has an ECR token in hand that won’t expire for 12 hours. You can wait 12 hours before proceeding to the next step, but I found another way to reproduce the issue: detach the AmazonEC2ContainerRegistryReadOnly policy from your Role or User (alternatively, you can use “Revoke Sessions” in IAM to temporarily disable the user/role). I’ve seen the same behavior whether I waited 12 hours or removed the policy.
- For good measure, back up or remove the credential helper’s own cache.
mv ~/.ecr/cache.json ~/.ecr/cache.json.bak
# After removing read access to ECR, verify
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker pull REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest
Error response from daemon: repository REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis not found: does not exist or no pull access
- Ensure the redis image does not exist on the Manager, in case you accidentally downloaded it in the verify step above.
- If you removed the policy, your Swarm will now not have access to download the image on the Manager. This is the same behavior experienced when your token expires. Try to scale up the redis service to 3 or more, which should make the Swarm try to load a copy on the Manager. It will fail.
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker service scale redis=3
redis scaled to 3
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker service ps redis --no-trunc
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
v9fy6b9t02c6530jf5pkrcmp0 redis.1 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 ip-10-2-0-98 Running Running 2 hours ago
ibib73k99fl32og49hniv7k8m redis.2 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 ip-10-2-0-98 Running Running 41 seconds ago
okpnq0vw6801g851o1q2v13sp \_ redis.2 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 ip-10-2-0-38 Shutdown Rejected 50 seconds ago "No such image: REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14"
kja20wvbs6mcowe61wza56e07 \_ redis.2 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 ip-10-2-0-38 Shutdown Rejected 55 seconds ago "No such image: REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14"
75cbvvf56qnq4hb21d2tf6pcp \_ redis.2 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 ip-10-2-0-38 Shutdown Rejected about a minute ago "No such image: REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14"
lemeqny01e6ffjdrbid8xx6m9 \_ redis.2 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 ip-10-2-0-38 Shutdown Rejected about a minute ago "No such image: REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14"
kp607hy0hqrupnfx0ggxtbkug redis.3 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 ip-10-2-0-98 Running Running about a minute ago
# Verify that the credential helper was not utilized in regenerating authentication.
ubuntu@ip-10-2-0-38:~/.ecr/log$ ll
total 12
drwxrw-r-x 2 ubuntu ubuntu 4096 Feb 15 22:36 ./
drwxrw-r-x 3 ubuntu ubuntu 4096 Feb 15 22:38 ../
-rw-rw-r-- 1 ubuntu ubuntu 1057 Feb 15 20:17 ecr-login.log.2017-02-15-20
ubuntu@ip-10-2-0-38:~/.ecr/log$ date
Wed Feb 15 22:42:46 UTC 2017
ubuntu@ip-10-2-0-38:~/.ecr/log$ cat ecr-login.log.2017-02-15-20
2017-02-15T20:05:19Z [DEBUG] Retrieving credentials for REDACTED in us-east-2 (REDACTED.dkr.ecr.us-east-2.amazonaws.com)
2017-02-15T20:05:19Z [DEBUG] GetCredentials for REDACTED
2017-02-15T20:05:19Z [DEBUG] Checking file cache for REDACTED
2017-02-15T20:05:19Z [DEBUG] Calling ECR.GetAuthorizationToken for REDACTED
2017-02-15T20:05:19Z [DEBUG] Saving credentials to file cache for REDACTED
2017-02-15T20:08:49Z [DEBUG] Retrieving credentials for REDACTED in us-east-2 (REDACTED.dkr.ecr.us-east-2.amazonaws.com)
2017-02-15T20:08:49Z [DEBUG] GetCredentials for REDACTED
2017-02-15T20:08:49Z [DEBUG] Checking file cache for REDACTED
2017-02-15T20:08:49Z [DEBUG] Using cached token for REDACTED
2017-02-15T20:17:30Z [DEBUG] Retrieving credentials for REDACTED in us-east-2 (REDACTED.dkr.ecr.us-east-2.amazonaws.com)
2017-02-15T20:17:30Z [DEBUG] GetCredentials for REDACTED
2017-02-15T20:17:30Z [DEBUG] Checking file cache for REDACTED
2017-02-15T20:17:30Z [DEBUG] Using cached token for REDACTED
ubuntu@ip-10-2-0-38:~/.ecr/log$ ll ~/.ecr
total 20
drwxrw-r-x 3 ubuntu ubuntu 4096 Feb 15 22:38 ./
drwxr-xr-x 7 ubuntu ubuntu 4096 Feb 15 22:39 ../
-rw------- 1 ubuntu ubuntu 4884 Feb 15 22:24 cache.json.bak
drwxrw-r-x 2 ubuntu ubuntu 4096 Feb 15 22:36 log/
As a side note, the Swarm does eventually run all replicas on the Worker after failing to launch them on the Manager. This is not what I want, but at least it doesn’t give up trying to scale.
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
wpipkd3orljqyyrftntzy7rlg * ip-10-2-0-38 Ready Active Leader
zh7hmacxjor2uvfqvq0p0bdg3 ip-10-2-0-98 Ready Active
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker node ps zh7hmacxjor2uvfqvq0p0bdg3
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
v9fy6b9t02c6 redis.1 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-98 Running Running 2 hours ago
ibib73k99fl3 redis.2 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-98 Running Running 3 minutes ago
kp607hy0hqru redis.3 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-98 Running Running 3 minutes ago
Describe the results you received:
Docker Engine’s Swarm did not attempt to use the credential helper when the credentials on the service definition were invalid. Instead, it output the error “No such image” (the error message could also be improved).
Describe the results you expected:
If Docker Engine Swarm’s authentication token stored on the service definition fails, it should use the installed credential helper again to generate a new authentication token and try again. It is assumed that all Managers will have the credential helper installed. If that new token fails (or no credential helper installed), THEN proceed with error messaging and distribute the replicas to workers who already have the image downloaded.
Additional information you deem important (e.g. issue happens only occasionally):
The same results happen with docker service update as with docker service scale, which is to be expected, as scale is just an alias.
However, if I pass --with-registry-auth when scaling (docker service update --with-registry-auth --replicas X), it does seem to fetch fresh authentication tokens. Then I can scale and watch the replicas spread across swarm nodes. This would be a valid work-around, but I don’t like that it seems to restart all currently running containers too. This could be disruptive.
ubuntu@ip-10-2-0-38:~/.ecr/log$ ls -l
total 8
-rw-rw-r-- 1 ubuntu ubuntu 1057 Feb 15 20:17 ecr-login.log.2017-02-15-20
-rw-rw-r-- 1 ubuntu ubuntu 415 Feb 16 21:19 ecr-login.log.2017-02-16-21
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
manomarks/visualizer <none> 137b9c6f7977 2 weeks ago 325 MB
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker service update --with-registry-auth --replicas 5 redis
redis
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis <none> 74d8f543ac97 2 weeks ago 184 MB
manomarks/visualizer <none> 137b9c6f7977 2 weeks ago 325 MB
ubuntu@ip-10-2-0-38:~/.ecr/log$ ll
total 20
drwxrw-r-x 2 ubuntu ubuntu 4096 Feb 17 17:59 ./
drwxrw-r-x 3 ubuntu ubuntu 4096 Feb 17 17:59 ../
-rw-rw-r-- 1 ubuntu ubuntu 1057 Feb 15 20:17 ecr-login.log.2017-02-15-20
-rw-rw-r-- 1 ubuntu ubuntu 415 Feb 16 21:19 ecr-login.log.2017-02-16-21
-rw-rw-r-- 1 ubuntu ubuntu 415 Feb 17 17:59 ecr-login.log.2017-02-17-17
# You can see credential helper was hit above
# And the containers are spread across the nodes below
ubuntu@ip-10-2-0-38:~/.ecr/log$ docker service ps redis
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
j1uct8p1wwpw redis.1 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-98 Running Running 2 minutes ago
t4eyy2ydm2xm \_ redis.1 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-98 Shutdown Shutdown 2 minutes ago
51w4qbbugpmm redis.2 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-98 Running Running 2 minutes ago
lublj14k8780 redis.3 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-38 Running Running 2 minutes ago
e8ad3wzgbahb redis.4 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-98 Running Running 2 minutes ago
1ru8wm46qf2r redis.5 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis:latest ip-10-2-0-38 Running Running 2 minutes ago
# On the worker, you can see it went from 1 container to 3, but restarted the original container which is not desired.
ubuntu@ip-10-2-0-98:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9af4f6ed97ad REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 "docker-entrypoint..." 20 hours ago Up 20 hours 6379/tcp redis.1.t4eyy2ydm2xmfknh8ftmboh8m
ubuntu@ip-10-2-0-98:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9681d803ba89 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 "docker-entrypoint..." 3 seconds ago Up 2 seconds 6379/tcp redis.1.j1uct8p1wwpwnsjr462wt3ysw
7ee493bfb6f8 REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 "docker-entrypoint..." 3 seconds ago Up 3 seconds 6379/tcp redis.2.51w4qbbugpmmgsvgqpce03aqw
605ae7b500fb REDACTED.dkr.ecr.us-east-2.amazonaws.com/redis@sha256:40f100b5d60bffceddd1a5635ce52fe0aa39c229feed8c2c6b641d85bc6baa14 "docker-entrypoint..." 3 seconds ago Up 3 seconds 6379/tcp redis.4.e8ad3wzgbahb4kd0hs2mje3zd
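For repeated use, the work-around above could be wrapped in a small shell helper. This is only a sketch: the function name scale_with_fresh_auth and the DRY_RUN variable are made up for illustration, and it inherits the side effect described above (already-running tasks appear to be restarted).

```shell
#!/bin/sh
# Hypothetical helper: scale a service while re-sending registry credentials.
# It runs the same command used in the work-around above:
#   docker service update --with-registry-auth --replicas N SERVICE
# Set DRY_RUN=1 to print the command instead of executing it.
scale_with_fresh_auth() {
    service="$1"
    replicas="$2"

    if [ -z "$service" ]; then
        echo "usage: scale_with_fresh_auth SERVICE REPLICAS" >&2
        return 2
    fi

    # Reject anything that is not a non-negative integer.
    case "$replicas" in
        ''|*[!0-9]*)
            echo "replicas must be a non-negative integer" >&2
            return 2
            ;;
    esac

    # Build the command as positional parameters so quoting is preserved.
    set -- docker service update --with-registry-auth --replicas "$replicas" "$service"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling scale_with_fresh_auth redis 5 on a manager runs the same command as in the transcript above; setting DRY_RUN=1 beforehand only prints it.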
Output of docker version:
ubuntu@ip-10-2-0-38:~$ docker version
Client:
Version: 1.13.1
API version: 1.26
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 06:50:14 2017
OS/Arch: linux/amd64
Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 06:50:14 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
ubuntu@ip-10-2-0-38:~$ docker info
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 1.13.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 15
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: wpipkd3orljqyyrftntzy7rlg
Is Manager: true
ClusterID: vws4u9zsjaug5c1xvfniflkgi
Managers: 1
Nodes: 2
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.2.0.38
Manager Addresses:
10.2.0.38:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 990.6 MiB
Name: ip-10-2-0-38
ID: 6P54:KGSB:NGZA:RCFO:BOE7:3TYQ:NFEB:CDON:YMTT:ZECH:IAZW:TTTK
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS, as described above, using an IAM Role with the Policy AmazonEC2ContainerRegistryReadOnly to gain pull access to the ECR repo.
About this issue
- State: open
- Created 7 years ago
- Reactions: 9
- Comments: 33 (9 by maintainers)
I think the problem is bigger than described here, because the scaling scenario you discuss involves the user. The user is present and can do something about it. However, there is another tricky use case that cannot be resolved with --with-registry-auth.
Consider a case where you have three nodes in a swarm and you create a service with a scale factor of two. At that point, the service is distributed across two of the three nodes and everything works without problems. A week later, the container on one of the nodes starts failing and Docker decides to move it to the third node, which has not run this container before. By now the old token has expired and there is no user around to provide a new one. Eventually, Docker goes crazy trying to bring the container up on different nodes to restore the scale factor.
I don’t really understand how to deal with this ECR concept of expiring tokens. It makes the swarm feature unusable with the ECR.
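For the unattended case, one option might be to re-send credentials on a schedule rather than waiting for a failure. Below is a minimal sketch of the timing check only, assuming the 12-hour token lifetime mentioned in the issue. The one-hour margin and the function name token_needs_refresh are hypothetical, and how the refresh itself is performed (for example docker service update --with-registry-auth, with the restart side effect reported in the issue) is left to the caller.

```shell
#!/bin/sh
# Token lifetime taken from the issue text (12 hours).
TOKEN_LIFETIME_SECONDS=$((12 * 60 * 60))
# Hypothetical safety margin: refresh one hour before expiry.
REFRESH_MARGIN_SECONDS=$((60 * 60))

# token_needs_refresh ISSUED_AT [NOW]
# Both arguments are Unix timestamps in seconds; NOW defaults to the
# current time. Returns 0 (true) when the token is within the margin
# of its assumed expiry, or already past it.
token_needs_refresh() {
    issued_at="$1"
    now="${2:-$(date +%s)}"
    age=$((now - issued_at))
    [ "$age" -ge $((TOKEN_LIFETIME_SECONDS - REFRESH_MARGIN_SECONDS)) ]
}
```

A scheduled job could record the time of the last refresh and call token_needs_refresh with it to decide whether another refresh is due.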
@hamiltont – I gave up trying to figure this out and went with deploying the aws ecr proxy in the swarm (https://hub.docker.com/r/esailors/aws-ecr-http-proxy/). That allows me to pull from localhost and never have to worry about the creds timing out, etc.
Also note that aws cli v2.0 now forces a new way of logging into ECR:
aws --region us-east-1 ecr get-login-password | docker login --username AWS --password-stdin <me>.dkr.ecr.us-east-1.amazonaws.com
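The v2 login command can be parameterized in a small wrapper. This is a sketch, not an official tool: the function names are made up, the account ID used in the example call is a placeholder, and the hostname pattern is assumed to be the standard account.dkr.ecr.region.amazonaws.com form seen elsewhere in this issue.

```shell
#!/bin/sh
# Build the ECR registry hostname for an account ID and region.
# Assumes the standard <account>.dkr.ecr.<region>.amazonaws.com pattern.
ecr_registry_host() {
    account_id="$1"
    region="$2"
    printf '%s.dkr.ecr.%s.amazonaws.com\n' "$account_id" "$region"
}

# Log Docker in to ECR using the AWS CLI v2 get-login-password command.
# Requires the aws and docker CLIs and valid AWS credentials.
ecr_login() {
    account_id="$1"
    region="$2"
    host=$(ecr_registry_host "$account_id" "$region")
    aws ecr get-login-password --region "$region" |
        docker login --username AWS --password-stdin "$host"
}
```

For example, ecr_login 123456789012 us-east-2 would log in to the registry for that (placeholder) account in the Ohio region.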