amazon-ecs-agent: 1.29.0 CannotPullContainerError pull access denied requiring multiple host destruction

Summary

After pushing a new image to ECR and attempting to deploy, the instances started failing to pull the image from ECR, even though no permissions changes had been made.

Description

Status reason | CannotPullContainerError: Error response from daemon: pull access denied for FOO.dkr.ecr.us-east-1.amazonaws.com/foo/app, repository does not exist or may require 'docker login'

3 different ECS services, across 2 different hosts, all started showing this error upon attempting to launch new tasks during a deployment.

I confirmed I could pull the image locally on my own workstation when performing a docker login to ECR.

No permissions changes were made to the instance.

The ECS tasks all have task execution roles with the proper ECS task execution role policy attached.

The ECS hosts themselves have the proper ECS permissions in their instance roles.

After re-creating the hosts, they pulled fine.

I’m trying to figure out what state the instance got into and how to resolve it in the future. It should not be failing to pull ECR images.

I’m opening this issue as a point for tracking these failures in the future, as extensive searching did not surface anyone else reporting the same problem.

Environment Details

Ubuntu 18.04 (Bionic), ECS agent 1.29.0

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 20
  • Comments: 27 (11 by maintainers)

Most upvoted comments

Hi, sorry you are facing this issue. Our speculation is that when a CFN/Terraform stack is used to delete and recreate resources, the task execution role is deleted and recreated with the same role ARN. Currently, when the agent caches the credentials of a role, it uses the combination of region, roleARN, registryID, and endpointOverride as the cache key. This means that in such a case the agent will use the credentials from the cache rather than the credentials of the newly created role (see the sketch below). This is a known issue and we will work on it in the future.

For now, as a workaround, if such an error occurs we suggest manually restarting the agent on the instance after deleting the stack, which will clear the cache, and then recreating the CFN/Terraform stack. Another workaround is to not specify an explicit execution role name when creating the role.

Thanks
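
To illustrate the failure mode described in the comment above, here is a minimal Go sketch of a credential cache keyed on (region, roleARN, registryID, endpointOverride). It is not the agent's actual code and every identifier in it is illustrative; the point is only that nothing in the key distinguishes the old role from a recreated role with the same ARN, so the lookup keeps returning the stale entry until the cache is cleared (e.g. by an agent restart).

```go
// Minimal sketch (not the agent's actual implementation) of a credential
// cache keyed only by region/roleARN/registryID/endpointOverride.
// All identifiers here are illustrative.
package main

import "fmt"

type cacheKey struct {
	region           string
	roleARN          string
	registryID       string
	endpointOverride string
}

type ecrAuth struct {
	token string // authorization token minted with the role's credentials
}

var cache = map[cacheKey]ecrAuth{}

// getAuth returns the cached token for key if present, otherwise fetches a
// fresh one. Nothing in the key changes when the role behind roleARN is
// deleted and recreated, so the stale token keeps winning.
func getAuth(key cacheKey, fetch func() ecrAuth) ecrAuth {
	if auth, ok := cache[key]; ok {
		return auth
	}
	auth := fetch()
	cache[key] = auth
	return auth
}

func main() {
	key := cacheKey{
		region:     "us-east-1",
		roleARN:    "arn:aws:iam::123456789012:role/task-exec-role", // hypothetical ARN
		registryID: "123456789012",
	}

	// First deploy: token obtained via the original execution role.
	getAuth(key, func() ecrAuth { return ecrAuth{token: "token-from-old-role"} })

	// Stack deleted and recreated: same role name, same ARN, but a new role.
	got := getAuth(key, func() ecrAuth { return ecrAuth{token: "token-from-new-role"} })
	fmt.Println(got.token) // prints "token-from-old-role" until the cache is cleared
}
```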

Upon further investigation we have come to know that when an IAM role is deleted and recreated with the same name, the EC2 instance associated with the role will no longer be able to use the permissions granted through that role; this is the expected behavior.
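
One way to see why "same name" is not "same role" is to compare the role's unique ID before and after recreation: the ARN string reads the same, but the RoleId changes, i.e. the recreated role is a different principal. A small sketch using the AWS SDK for Go (the role name below is hypothetical):

```go
// Prints the role's ARN and its unique RoleId. Deleting and recreating a role
// with the same name keeps the ARN string identical but yields a new RoleId,
// i.e. a different principal. The role name below is hypothetical.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/iam"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := iam.New(sess)

	out, err := svc.GetRole(&iam.GetRoleInput{
		RoleName: aws.String("my-task-execution-role"), // hypothetical name
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("ARN:    %s\n", aws.StringValue(out.Role.Arn))
	fmt.Printf("RoleId: %s\n", aws.StringValue(out.Role.RoleId))
}
```

This does not by itself settle whether the fault lies with the agent's cache or with EC2/IAM, but it shows why credentials vended for the deleted role cannot simply carry over to the recreated one.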

@shubham2892 how is this accurate when we can simply restart the ecs-agent to fix the issue?

Currently, when the agent caches the credentials of a role, it uses the combination of region, roleARN, registryID, and endpointOverride as the cache key. This means that in such a case the agent will use the credentials from the cache rather than the credentials of the newly created role. This is a known issue and we will work on it in the future.

The above explanation was given to us quite a while ago; is it no longer accurate?

Everything points to this being an issue with the ecs-agent caching IAM permissions, not with EC2/IAM. I’m happy to be corrected if I’m wrong here.

@kyleian For us, the reason it managed to start the task was that the image already existed on the instance. So it looks as if it still uses the existing image but tries to pull the latest one anyway, hence the error. At least that’s our assumption.

@fierlion Sent you some logs to that specified email address.

Conditions/workflow for reproducing:

  • Brand-new EC2 instance in an ECS cluster
  • Deploy the ECS service/task definition via CloudFormation; the task/execution roles (which contain the ECR permissions) for the service/task definition are contained within that same template; the image deploys and pulls from ECR successfully, no issues
  • Nuke the ECS service stack, which deletes the task/execution roles
  • Redeploy the ECS service via CloudFormation; this yields CannotPullContainerError in the status reason

Something I noticed in the ECS UI in my current working example is that the task occasionally does pull the image, e.g. state == “RUNNING”, yet the error is still shown in the Status Reason:

CannotPullContainerError: Error response from daemon: pull access denied for $REDACTED_REGISTRY_ID, repository does not exist or may require 'docker login': denied: The security token included in the request is in

It is unclear to me why it occasionally manages to pull the image and reach the RUNNING state while still displaying CannotPullContainerError, whereas the majority of the time we are not able to pull the image at all when that error is displayed (perhaps it depends on whether the token from the original role is still within its validity window?). In most of the cases I have seen, the token for the old, deleted role is likely already expired; I am attempting to replicate that case now and will send it separately as a follow-up.

Our workaround for now will be to move the IAM roles into a separate stack, to prevent them from being destroyed on an ECS service stack delete/rollback.

I am just going to add to the workaround suggested by @cyastella: if you have a named execution role and prefer to keep it that way, even a slight change in naming also works; for example, you can append v2 to the role name.