karpenter-provider-aws: Karpenter 0.32.0 fails in a private VPC due to IAM API requirement

Description

Observed Behavior: Karpenter 0.32.0 attempts to discover the instance profile through the iam.amazonaws.com API. In a private VPC this call times out, because there is no VPC endpoint available for IAM.

{"level":"ERROR","time":"2023-10-31T16:25:09.325Z","logger":"controller","message":"Reconciler error","commit":"3a61217","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"default"},"namespace":"","name":"default","reconcileID":"1b18814b-c28d-45fb-bb32-bd804070a03b","error":"resolving instance profile, getting instance profile \"REDACTED\", RequestError: send request failed\ncaused by: Post \"https://iam.amazonaws.com/\": dial tcp 52.46.159.95:443: i/o timeout"}

Expected Behavior: Accept the ARN of the instance profile as an alternative to discovery via the IAM API.

Reproduction Steps (Please include YAML):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: REDACTED

Versions:

  • Chart Version: 0.32.0
  • Kubernetes Version (kubectl version): v1.28.2

About this issue

  • State: closed
  • Created 8 months ago
  • Reactions: 11
  • Comments: 30 (15 by maintainers)

Most upvoted comments

To my knowledge there’s no private endpoint for IAM (unlike STS), hence the issue.

Got it. That’s annoying that IAM doesn’t surface a private endpoint. Given that requirement, we may have to consider adding the spec.instanceProfile back into the EC2NodeClass spec to ensure users with private clusters can use the Karpenter EC2NodeClass.
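As a rough sketch of what that proposal could look like, the reproduction manifest above might reference a pre-created instance profile instead of a role. The field name instanceProfile comes from the comment above; whether it takes a profile name or an ARN, and whether it is mutually exclusive with role, are assumptions here, and the other fields a real EC2NodeClass would need are omitted just as in the reproduction YAML.

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  # Assumption: the instance profile is created outside Karpenter
  # (for example with Terraform) and referenced directly, so the
  # controller does not need to reach iam.amazonaws.com to create one.
  instanceProfile: REDACTED
  # role is left out on the assumption that only one of
  # role / instanceProfile would be set at a time.
```

With something like this in place, a private cluster would only need the instance profile to exist before the EC2NodeClass is applied.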

We use Terraform to drive our changes (as well as to install the Karpenter Helm chart), so creating the instance profile is just a few lines of code. We have to create the IRSA role anyway, so this is just one extra resource. We have quite a strict policy on what type of permissions can be granted to users and services, and unfortunately CloudFormation is not one of them.

That’s not to say this would not be useful for someone else though! But for us I think the ability to supply the instance profile ARN would be the preferable option.

@jonathan-innis we met with our IAM team yesterday and explained how the instance profile actions are scoped to what is allowed by PassRole. They seemed to be on board. Thank you for the explanations, and thank you for the PR adding instanceProfile back to the spec. That will allow us to move forward while we work through the process of allowing those actions.

The argument given was that the only difference between service and self-managed is how much one trusts the component and one could reason using knowledge of that component’s algorithms. The counter-argument to that is that one cannot simply rely on knowledge of that component’s algorithms as there are practical paths by which actors can use the credentials outside of the algorithms. With a service, such an attack is much less practical: one would have to attack the service provider and if an actor were able to do so there would likely be higher-value targets.

actor gaining operator-level credentials to the account could replace the Karpenter code or obtain Karpenter credentials and perform actions outside of what Karpenter is coded to do

Can you explain this a little more? If you have scoped the permissions of the role appropriately for the controller, the actor should only be able to use the actions that are assigned to the role. In this case, that means creating instance profiles (which should be benign, just as creating roles is benign unless you can also attach policies to them), and if they want to add permissions to an instance profile they can only attach the roles allowed by PassRole.
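The scoping described in this comment can be sketched as a pair of policy statements. This is an illustrative fragment only, written in YAML for readability (IAM policies are natively JSON); the Sid values, ACCOUNT_ID, and NODE_ROLE_NAME are hypothetical placeholders, and it is not the policy Karpenter actually ships.

```yaml
# Hypothetical controller policy, not taken from the Karpenter project.
Version: "2012-10-17"
Statement:
  # Creating and managing instance profiles grants no permissions by itself.
  - Sid: AllowInstanceProfileManagement
    Effect: Allow
    Action:
      - iam:CreateInstanceProfile
      - iam:GetInstanceProfile
      - iam:AddRoleToInstanceProfile
      - iam:RemoveRoleFromInstanceProfile
      - iam:DeleteInstanceProfile
    Resource: "*"
  # Only this one role may be passed, and only to EC2, so it is the only
  # role that can end up attached to an instance profile.
  - Sid: AllowPassingNodeRoleOnly
    Effect: Allow
    Action: iam:PassRole
    Resource: arn:aws:iam::ACCOUNT_ID:role/NODE_ROLE_NAME
    Condition:
      StringEquals:
        iam:PassedToService: ec2.amazonaws.com
```

The design point is that the second statement is the effective limit: adding a role to an instance profile requires PassRole on that role, so the first statement cannot be used to grant nodes anything beyond the single permitted node role.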

IAM management is locked down at my company as well. And workloads are never allowed to create or mutate IAM resources. SLRs are different as we have an approval process to get a service approved. And once approved, the SLR allows the service to manage resources in our account. So as long as we are running Karpenter ourselves … we will not be able to allow it to create or mutate IAM resources.

It would be ideal to create one role for all nodes within an account, with permissions locked down for that account, and then to use that role directly, similar to how managed node groups currently use IAM roles.

Wanted to add a note that managed node groups also create IAM instance profiles through their service-linked role (SLR), similar to what Karpenter is doing here. You can see the permissions that the managed node groups SLR has here, specifically the PermissionsToCreateAndManageInstanceProfiles statement in the policy.