karpenter-provider-aws: Karpenter 0.32.0 fails in a private vpc due to IAM API requirement
Description
Observed Behavior: Karpenter 0.32.0 attempts to discover the instance profile through the iam.amazonaws.com API, which fails in a private VPC because no VPC endpoint is available for IAM.
{"level":"ERROR","time":"2023-10-31T16:25:09.325Z","logger":"controller","message":"Reconciler error","commit":"3a61217","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"default"},"namespace":"","name":"default","reconcileID":"1b18814b-c28d-45fb-bb32-bd804070a03b","error":"resolving instance profile, getting instance profile \"REDACTED\", RequestError: send request failed\ncaused by: Post \"https://iam.amazonaws.com/\": dial tcp 52.46.159.95:443: i/o timeout"}
Expected Behavior: Accept the ARN of the instance profile as an alternative to discovery via the IAM API.
Reproduction Steps (Please include YAML):
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: REDACTED
Versions:
- Chart Version: 0.32.0
- Kubernetes Version (kubectl version): v1.28.2
About this issue
- State: closed
- Created 8 months ago
- Reactions: 11
- Comments: 30 (15 by maintainers)
Got it. That’s annoying that IAM doesn’t surface a private endpoint. Given that requirement, we may have to consider adding spec.instanceProfile back into the EC2NodeClass spec to ensure users with private clusters can use the Karpenter EC2NodeClass.
We use Terraform to drive our changes (as well as to install the Karpenter Helm chart), so creating the instance profile is just a few lines of code; we have to create the IRSA anyway, so this is just one extra resource. We have quite a strict policy on which permissions can be granted to users and services, and unfortunately CloudFormation is not one of them.
That’s not to say this would not be useful for someone else though! But for us I think the ability to supply the instance profile ARN would be the preferable option.
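For illustration, the configuration requested here might look roughly like the sketch below. It assumes the field is named spec.instanceProfile, as mentioned in this thread, and that it is supplied in place of spec.role. Whether the field takes a profile name or a full ARN is an assumption to check against the Karpenter release that ships it.

```yaml
# Sketch only: assumes spec.instanceProfile exists in the installed Karpenter
# version and is used instead of spec.role (not confirmed by this thread).
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  # Pre-created instance profile (for example via Terraform), so the
  # controller does not need to create one through the IAM API.
  instanceProfile: REDACTED
```

With a pre-created profile, the instance profile lifecycle stays in whatever tooling already manages IAM for the cluster.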
@jonathan-innis we met with our IAM team yesterday and explained how the instance profile actions are scoped to what is allowed by passRole. They seemed to be on board. Thank you for the explanations, and thank you for the PR adding instanceProfile back to the spec. That will allow us to move forward while we work through the process of allowing those actions.
The argument given was that the only difference between a service and a self-managed component is how much one trusts the component, and that one could reason using knowledge of that component’s algorithms. The counter-argument is that one cannot simply rely on knowledge of that component’s algorithms, because there are practical paths by which actors can use the credentials outside of those algorithms. With a service, such an attack is much less practical: one would have to attack the service provider, and an actor able to do so would likely have higher-value targets.
Can you explain this a little more? If you have scoped the permissions of the role appropriately for the controller, the actor should only be able to perform the actions that are assigned to the role. In this case, that means creating instance profiles (which should be benign, just as creating roles is benign unless you can assign policies to the role), and the actor should only be able to assign the roles constrained by PassRole if they want to add permissions to the instance profile.
IAM management is locked down at my company as well, and workloads are never allowed to create or mutate IAM resources. SLRs are different, as we have an approval process to get a service approved, and once approved, the SLR allows the service to manage resources in our account. So as long as we are running Karpenter ourselves … we will not be able to allow it to create or mutate IAM resources.
Wanted to add a note that managed node groups also create IAM instance profiles through their service-linked role (SLR), similar to what Karpenter is doing here. You can see the permissions that the managed node groups SLR has here, specifically the PermissionsToCreateAndManageInstanceProfiles statement in the policy.
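To make the PassRole scoping discussed in this thread concrete, here is a hedged sketch of a controller policy written as a CloudFormation-style YAML policy document. It is not the policy Karpenter or the managed node groups SLR actually ships. The statement IDs, account ID, and role name are hypothetical placeholders, and a real policy would likely narrow the Resource and add conditions.

```yaml
# Illustrative only: statement IDs, the account ID, and the role name are
# placeholders and do not come from this thread or from Karpenter itself.
Version: "2012-10-17"
Statement:
  # The controller may create and manage instance profiles...
  - Sid: AllowInstanceProfileManagement
    Effect: Allow
    Action:
      - iam:GetInstanceProfile
      - iam:CreateInstanceProfile
      - iam:TagInstanceProfile
      - iam:AddRoleToInstanceProfile
      - iam:RemoveRoleFromInstanceProfile
      - iam:DeleteInstanceProfile
    Resource: "*"
  # ...but adding a role to a profile also requires iam:PassRole on that role,
  # so only the explicitly listed node role can ever be attached.
  - Sid: AllowPassingNodeRoleOnly
    Effect: Allow
    Action: iam:PassRole
    Resource: "arn:aws:iam::111122223333:role/ExampleKarpenterNodeRole"
    Condition:
      StringEquals:
        "iam:PassedToService": ec2.amazonaws.com
```

Under this shape, an actor holding the controller credentials could create empty instance profiles but could not attach any role other than the one named in the PassRole statement.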