karpenter-provider-aws: Karpenter pods never ready with hostNetwork: true
Description
Observed Behavior:
Updating from v0.25.0 to v0.28.0 is not successful: setting hostNetwork: true makes the Karpenter pods never become ready.
Expected Behavior: The pods should become ready after the upgrade.
When running Calico, is hostNetwork: true all that's needed?
Reproduction Steps (Please include YAML):
```yaml
karpenter:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: karpenter.sh/provisioner-name
                operator: Exists
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                  - arm64
  controller:
    hostNetwork: true
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
  settings:
    aws:
      clusterName: mine
      clusterEndpoint: mine
      defaultInstanceProfile: mine
      interruptionQueueName: mine
    featureGates:
      driftEnabled: true
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::mine:role/karpenter-controller
```
Versions:
- Chart Version: 0.28.0
- Kubernetes Version (kubectl version): Major:"1", Minor:"24+", GitVersion:"v1.24.14-eks-c12679a", GitCommit:"05d192f0de17608d98e17761ad3cffa9a6407f2f"
About this issue
- State: closed
- Created a year ago
- Comments: 28 (14 by maintainers)
Glad this got solved, thank you @engedaam, @Nashluffy!
That’s great to hear! The team will be doing a karpenter release next week, and it will include the fix.
@engedaam yes confirmed the original snapshot fixes the issue as well, thanks!
@Nashluffy can you try out the original snapshot fix, v0-8c760941a8a2099eec87e567a43c86d6c646af67? I just want to validate that the fix works for you. That version does not include the additional logging.

Sorry about that. Attached are updated logs where 8080 is closed, the webhook is up and running, and the pods are stable.
karpenter-logs.txt
BTW here was the helm values.yaml snippet I ended up using
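(The snippet itself isn't reproduced above. Purely as a hedged illustration of the kind of override discussed in this thread, host networking plus a non-default webhook port, it might have looked roughly like the sketch below; the webhook.port key name and the port numbers are assumptions, not taken from the original attachment.)

```yaml
# Illustration only, not the original attachment. webhook.port and the chosen
# port numbers are assumptions; check the values.yaml of your chart version.
karpenter:
  controller:
    hostNetwork: true   # typically needed when the CNI (e.g. Calico) requires host networking
  webhook:
    port: 8444          # move the webhook off its default so it does not collide
                        # with other services already bound on the host network
```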
In the deployment, I see that the sha256 of your deployment image matches v0.29.0? In the manifest I see helm.sh/chart: karpenter-v0.29.0. Make sure your pod image matches that chart version.

@Nashluffy I have been trying to replicate your issue without any luck. I have built a new Karpenter version that logs the health check attempts for both the webhook port and port 8080, which should give more context into the issue. Could you try this Karpenter version: v0-aef4bb9ae73cef3b9b668230d0f2e70093303c3e? Can you also provide your logs after running it?

After trying again with the latest chart (v0.29.0) I'm seeing the same error as @rarecrumb, and yes, we do have something else on 8080 (also on the host network). But we've specified a different port for the webhook to run on, as you can see in the manifests above.