aws-ebs-csi-driver: Cannot Run with IAM Service Account and no metadata service

/kind bug

What happened? The ebs-plugin container on the ebs-csi-controller crashes repeatedly while talking to the metadata service:

I0325 19:25:18.560010       1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.6.0-dirty
panic: EC2 instance metadata is not available

goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newNodeService(0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go:83 +0x196
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver(0xc00016ff70, 0x3, 0x3, 0xc0000a88a0, 0xdcc3a0, 0xc0001edb00)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:87 +0x512
main.main()
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:31 +0x117

What you expected to happen?

When the AWS_REGION environment variable is set and an IAM role is bound to the service account, the ebs-csi-driver should not need to access the metadata service, and should run on its own.

How to reproduce it (as minimally and precisely as possible)?

  1. Create an IAM role with permissions for the aws-ebs-csi-driver
  2. Create an EKS cluster with an OIDC identity provider trust relationship with IAM (https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html), in particular following the final step of blocking pods from accessing the metadata service so they cannot assume the worker node's instance profile.
  3. Deploy aws-ebs-csi-driver using the alpha kustomize overlays, adding the eks.amazonaws.com/role-arn annotation to the service account
  4. The ebs-csi-controller pods start, then crash after about 20 seconds

Anything else we need to know?: As far as we can tell, the role and service account are set up correctly, but the code explicitly tries to instantiate the metadata service, which is firewalled off. Could this be made optional when the region is set and credentials are available via the service account?

Environment: EKS v1.14, ebs-csi-driver v0.5.0 and v0.6.0-dirty

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 11
  • Comments: 45 (11 by maintainers)

Most upvoted comments

Hi everyone. I’ve been looking into this issue a bit closer and can confirm that it is not a misconfiguration, nor is it related to whether the AWS_REGION environment variable is defined.

If you follow the stack trace [1], you realise that the driver relies heavily on the metadata service: it retrieves the current instance ID, the availability zone (for topology-aware dynamic provisioning), and the instance family, which is used to derive the maximum number of EBS volumes that can be attached.

The way I see it (keeping in mind I’m not a member of this project), this does not look like a bug that should be fixed, but rather a requirement of the driver that should be explicitly documented.

For the time being, I’m working around this issue with a slightly more specific iptables rule that leverages the string extension [2] to drop only packets containing “iam/security-credentials” [3] within their first 100 bytes:

iptables --insert FORWARD 1 --in-interface eni+ --destination 169.254.169.254/32 -m string --algo bm --to 100 --string 'iam/security-credentials' --jump DROP

I would not bet on this rule to stop someone who REALLY wants to reach that URL, but it should help in most cases. Eager to hear if anyone can think of a better solution.

[1] https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/driver/node.go
[2] http://ipset.netfilter.org/iptables-extensions.man.html#lbCE
[3] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials
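For illustration, the check the rule above performs can be approximated in a few lines of Go (illustration only; the kernel's string extension does a Boyer-Moore search over the packet payload, which is what --algo bm selects):

```go
package main

import (
	"bytes"
	"fmt"
)

// matchesCredentialPath approximates what the iptables string match does:
// look for "iam/security-credentials" within the first 100 bytes of a
// packet payload. Other metadata paths pass through untouched.
func matchesCredentialPath(payload []byte) bool {
	window := payload
	if len(window) > 100 {
		window = window[:100]
	}
	return bytes.Contains(window, []byte("iam/security-credentials"))
}

func main() {
	credReq := []byte("GET /latest/meta-data/iam/security-credentials/ HTTP/1.1\r\nHost: 169.254.169.254\r\n")
	azReq := []byte("GET /latest/meta-data/placement/availability-zone HTTP/1.1\r\nHost: 169.254.169.254\r\n")
	fmt.Println(matchesCredentialPath(credReq), matchesCredentialPath(azReq)) // true false
}
```

This is why the workaround still lets the driver read the instance ID and availability zone while dropping only the credential requests.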

Requiring metadata kinda sucks, considering the recommended EKS configuration with IRSA is to disable metadata API access…

Just an addition for those who landed here seeking a solution…

@wongma7 said in his #474 (comment) that if we were on a recent version of the SDK we would be safe.

However, if you are setting IMDSv2 as required, you may be facing the 401 issue reported by @gwvandesteeg in his #474 (comment): the hop limit is exceeded because the HTTP request is not sent directly from the EC2 instance. The solution is quite easy, just increase the hop limit and it will work 😉 (3 is working for me, but I haven’t tested 2)

From a least-privilege security standpoint, raising the hop limit with IMDSv2 is not recommended. It grants every workload on those worker nodes the same IAM permissions as the worker node itself, instead of granting each workload only the permissions it needs via IRSA and limiting the worker node to the permissions it alone requires.

Same problem here. We did the CSI and CNI installation by the book, following the official guides linked from the web console, and are now observing the same panic reported in this issue. Modifying the base files to include the AWS_REGION environment variable doesn’t seem to help either.

@groodt This was working for us in EKS 1.23 but is now failing in EKS 1.24

Failure in EKS 1.24:

The ebs-plugin containers in the ebs-csi-node pods fail to start because they cannot retrieve instance data.

ebs-csi-node/ebs-plugin reports that ec2 metadata is not available, then fails to retrieve the data from the kubernetes api, which times out

retrieving instance data from ec2 metadata
ec2 metadata is not available
retrieving instance data from kubernetes api
kubernetes api is available
...timeout...

Success in EKS 1.23:

ebs-csi-node/ebs-plugin successfully retrieves instance data from ec2 metadata

retrieving instance data from ec2 metadata
ec2 metadata is available

One difference introduced in 1.24 is that service account token Secrets are no longer auto-generated (automountServiceAccountToken itself still defaults to true), which might explain why the kubernetes api fallback is failing?

But why is the ec2 metadata not available?

I find it interesting that ebs-csi-node does not typically have an IAM role attached, and yet it is retrieving ec2 metadata.

Slack: https://kubernetes.slack.com/archives/C0LRMHZ1T/p1674883684976729


Failure happens here:

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/driver/node.go#L82 https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/cloud/metadata.go#L84 https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/cloud/metadata_ec2.go#L23 https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/cloud/metadata_k8s.go#L38

Explanation of how ebs-csi-node obtains instance data: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/821#issuecomment-863618419

I personally feel this issue can be closed. At $dayJob, we are running v1.3.0 and have fully removed hostNetwork from this workload and everything is working with IRSA and the default hop limit of 1.


@prashantokochavara Martin is referring to the Worker Nodes, where Metadata Endpoint Access is restricted (https://docs.aws.amazon.com/de_de/eks/latest/userguide/restrict-ec2-credential-access.html)