garden: In-cluster builds on EKS: credentials not found in native keychain
Bug
Current Behavior
In-cluster builds (both Kaniko and cluster-buildkit) fail on EKS with ECR when running garden build, with the following error:
[2022-11-17T15:24:37.385Z] Error: Unable to query registry for image status: time="2022-11-17T15:24:37Z" level=fatal msg="Error parsing image name \"docker://***********.dkr.ecr.us-east-1.amazonaws.com/****/repo/test:v-36b8e5e856\": getting username and password: 1 error occurred:\n\t* credentials not found in native keychain\n\n"
at skopeoBuildStatus (/snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/plugins/kubernetes/container/build/common.ts:263:13)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at /snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/actions.ts:1303:24
at ActionRouter.getBuildStatus (/snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/actions.ts:359:20)
at wrapped.process (/snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/tasks/build.ts:132:22)
at TaskNode.process (/snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/task-graph.ts:801:20)
at wrapped.processNode (/snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/task-graph.ts:436:18)
Error Details:
command:
- skopeo
- '--command-timeout=30s'
- inspect
- '--raw'
- '--authfile'
- /.docker/config.json
- >-
docker://***********.dkr.ecr.us-east-1.amazonaws.com/****/repo/test:v-36b8e5e856
output: >
time="2022-11-17T15:24:37Z" level=fatal msg="Error parsing image name
\"docker://***********.dkr.ecr.us-east-1.amazonaws.com/****/repo/test:v-36b8e5e856\":
getting username and password: 1 error occurred:\n\t* credentials not found in
native keychain\n\n"
[2022-11-17T15:24:37.400Z] Error: 1 build action(s) failed!
at handleProcessResults (/snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/commands/base.ts:532:19)
at BuildCommand.action (/snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/commands/build.ts:148:32)
at GardenCli.runCommand (/snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/cli/cli.ts:508:20)
at GardenCli.run (/snapshot/project/tmp/pkg/cli/node_modules/@garden-io/core/src/cli/cli.ts:667:26)
at Object.runCli (/snapshot/project/tmp/pkg/cli/src/cli.ts:41:14)
Error Details:
results:
build.test-image:
type: build
description: building test-image
key: build.test-image
name: test-image
error:
detail:
command:
- skopeo
- '--command-timeout=30s'
- inspect
- '--raw'
- '--authfile'
- /.docker/config.json
- >-
docker://***********.dkr.ecr.us-east-1.amazonaws.com/****/repo/test:v-36b8e5e856
output: >
time="2022-11-17T15:24:37Z" level=fatal msg="Error parsing image name
\"docker://***********.dkr.ecr.us-east-1.amazonaws.com/****/repo/test:v-36b8e5e856\":
getting username and password: 1 error occurred:\n\t* credentials not
found in native keychain\n\n"
type: runtime
startedAt: '2022-11-17T15:24:35.473Z'
completedAt: '2022-11-17T15:24:37.326Z'
batchId: b84dedc4-5292-47e7-ab98-a26b0a8fc485
version: v-36b8e5e856
My pod has IAM permissions to access ECR as it is being run on an instance that has the correct IAM role, and imagePullSecrets are set up according to instructions.
After some investigation, I found out that the error is being thrown by skopeo on the garden-utils pod, which is currently running a version of amazon-ecr-credential-helper (v0.4.0) that, for some reason, can’t load the instance credentials. I logged in as root into the garden-utils pod and updated amazon-ecr-credential-helper to 0.6.0 and skopeo now works as expected.
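To check for this failure from inside the garden-utils pod, something like the following sketch may help. It rebuilds the skopeo invocation shown in the error details above and looks for the keychain message in the output. The function names are hypothetical and not part of Garden; only the skopeo arguments are taken from the log.

```python
import subprocess
from typing import List

# Error text reported by skopeo in the log above.
KEYCHAIN_ERROR = "credentials not found in native keychain"


def skopeo_inspect_command(
    image_ref: str,
    authfile: str = "/.docker/config.json",
    timeout: str = "30s",
) -> List[str]:
    """Build the same argv that appears under 'Error Details' in the log.

    Hypothetical helper; the flags are copied from the logged command.
    """
    return [
        "skopeo",
        f"--command-timeout={timeout}",
        "inspect",
        "--raw",
        "--authfile",
        authfile,
        f"docker://{image_ref}",
    ]


def is_keychain_error(output: str) -> bool:
    """Return True if the output contains the credential-helper failure.

    Whitespace is normalized first because the logged output is wrapped
    across lines.
    """
    return KEYCHAIN_ERROR in " ".join(output.split())


def check_image(image_ref: str) -> bool:
    """Run skopeo (must be on PATH) and report whether the keychain error occurred."""
    result = subprocess.run(
        skopeo_inspect_command(image_ref),
        capture_output=True,
        text=True,
    )
    return is_keychain_error(result.stdout + result.stderr)
```

Running check_image with the registry path before and after updating amazon-ecr-credential-helper should show whether the upgrade resolved the credential lookup.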
Expected behavior
The in-cluster build would work without errors.
Reproducible example
Use the kaniko example, set up on EKS with an ECR registry.
Workaround
There are no known workarounds.
Suggested solution(s)
Update garden-util’s (gardendev/k8s-util) image to use the latest version of amazon-ecr-credential-helper (0.6.0) so that it can load the credentials from the instance’s attached IAM role.
Additional context
Your environment
- OS: macOS Ventura 13.0
- How I’m running Kubernetes: EKS running Kubernetes 1.23
garden version
0.12.46
About this issue
- State: closed
- Created 2 years ago
- Comments: 21 (20 by maintainers)
Commits related to this issue
- fix(k8s): update ecr-cred-helper for imdsv2 support amazon-ecr-credential-helper version 0.4.0 has no imdsv2 support (instance metadata service v2). AWS recmmends disabling imdsv1 for security reason... — committed to garden-io/garden by stefreak 2 years ago
- fix(k8s): update ecr-cred-helper for imdsv2 support (#3380) amazon-ecr-credential-helper version 0.4.0 has no imdsv2 support (instance metadata service v2). AWS recmmends disabling imdsv1 for secur... — committed to garden-io/garden by stefreak 2 years ago
@Walther @stefreak all working now! Just had the first build using Kaniko with no issues!
Thank you for your efforts in making this work. Happy to collaborate with this great tool in any other way you guys need.
@theoribeiro awesome 😃 Thanks for the offer, any collaboration+feedback is always appreciated! Feel free to reach out on our discord server and/or via github issues if you have any ideas https://discord.gg/gxeuDgp6Xt
I’m working on a more secure IAM setup using IRSA for in-cluster-building, if you want I can share the docs with you once it’s merged, would be very happy if you can try it out and give feedback. Hope I finish it within the next 7 days.
We’re using a simple port forward over ssh with -fN -L 5555:$CLUSTER:443 and setting up /etc/hosts to point $CLUSTER to 127.0.0.1. That setup works for us for all kubectl commands. We tried using a SOCKS proxy for that but could never get it working properly with kubectl.
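As a rough sketch of the port-forward setup described in this comment, the ssh arguments and the /etc/hosts line could be assembled as below. The bastion host name and the function names are assumptions made for illustration; the comment only gives the ssh flags, the local port 5555, and the 127.0.0.1 mapping.

```python
from typing import List


def tunnel_command(cluster_host: str, bastion: str, local_port: int = 5555) -> List[str]:
    """Build the ssh argv for a background local port forward.

    -f sends ssh to the background, -N skips running a remote command,
    -L forwards local_port to cluster_host:443 through the bastion.
    `bastion` is a hypothetical jump host, not named in the comment.
    """
    return ["ssh", "-fN", "-L", f"{local_port}:{cluster_host}:443", bastion]


def hosts_entry(cluster_host: str) -> str:
    """Return the /etc/hosts line that points the cluster host at localhost."""
    return f"127.0.0.1 {cluster_host}"


if __name__ == "__main__":
    # Placeholder values; substitute the real API server host and jump host.
    print(" ".join(tunnel_command("cluster.example.com", "user@bastion.example.com")))
    print(hosts_entry("cluster.example.com"))
```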
@stefreak You nailed it! We’re using EKS Managed Node Groups for our nodes and I just checked and they’re indeed using IMDSv2. Since we’re using Terraform to deploy our cluster, I believe the modules are now configured to disable IMDSv1 by default as per an AWS recommendation to conform to security best practices.
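For context on why disabling IMDSv1 matters here: IMDSv2 is session-based, so a client must first obtain a token with a PUT request and then send that token on every metadata request, which a client that only speaks IMDSv1 never does. The sketch below shows the two requests using only the Python standard library. The function names are hypothetical, and this illustrates the protocol rather than how amazon-ecr-credential-helper implements it.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service.
IMDS_BASE = "http://169.254.169.254"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """IMDSv2 step 1: a PUT that asks for a session token with the given TTL."""
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def role_credentials_request(token: str, role_name: str = "") -> urllib.request.Request:
    """IMDSv2 step 2: a GET carrying the session token.

    With an empty role_name the endpoint lists the attached role;
    with a role name it returns that role's temporary credentials.
    """
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_role_name(timeout: float = 2.0) -> str:
    """Perform both steps; only works from an EC2 instance that can reach IMDS."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(role_credentials_request(token), timeout=timeout) as resp:
        return resp.read().decode()
```

When IMDSv1 is disabled on the node, a plain GET without the token header is rejected, which is consistent with the helper at 0.4.0 being unable to load the instance credentials.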
@theoribeiro thank you for taking the time to investigate and report this bug!
As I am currently working on enabling IRSA for in-cluster builders (#2931) I am already looking into the relevant code paths. So I’ll look into this issue.