karpenter-provider-aws: Karpenter pods never ready with hostNetwork: true

Description

Observed Behavior:

Upgrading from v0.25.0 to v0.28.0 is not successful.

With hostNetwork: true set, the Karpenter pods never become ready.

Expected Behavior: The Karpenter pods should become ready after the upgrade with hostNetwork: true set.

When running Calico, is hostNetwork: true all that's needed? (See the sketch after the reproduction steps below.)

Reproduction Steps (Please include YAML):

      karpenter:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: karpenter.sh/provisioner-name
                      operator: Exists
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              preference:
                matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values:
                  - arm64
        controller:
          hostNetwork: true
        nodeSelector:
          kubernetes.io/arch: arm64
        tolerations:
        - key: kubernetes.io/arch
          operator: Equal
          value: arm64
          effect: NoSchedule
        settings:
          aws:
            clusterName: mine
            clusterEndpoint: mine
            defaultInstanceProfile: mine
            interruptionQueueName: mine
          featureGates:
            driftEnabled: true
        serviceAccount:
          annotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::mine:role/karpenter-controller
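
Regarding the Calico question above: with a custom CNI the control plane often cannot reach pod IPs, so hostNetwork: true is used to make the webhook reachable from the API server. Because the pod then shares the node's port namespace, every port the controller and webhook listen on must be free on that node. A minimal sketch of the values that typically matter, assuming your chart version exposes webhook.port (the port number is only an example):

controller:
  hostNetwork: true   # controller and webhook listeners bind directly on the node
webhook:
  port: 8443          # assumed chart value; pick a port that is free on every node the pod can land on

This complements the reproduction snippet above, which only sets controller.hostNetwork.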

Versions:

  • Chart Version: 0.28.0
  • Kubernetes Version (kubectl version): Major:"1", Minor:"24+", GitVersion:"v1.24.14-eks-c12679a", GitCommit:"05d192f0de17608d98e17761ad3cffa9a6407f2f"

About this issue

  • State: closed
  • Created a year ago
  • Comments: 28 (14 by maintainers)

Most upvoted comments

Glad this got solved, thank you @engedaam, @Nashluffy!!

That’s great to hear! The team will be doing a karpenter release next week, and it will include the fix.

@engedaam yes confirmed the original snapshot fixes the issue as well, thanks!

@Nashluffy can you try out the original snapshot fix, v0-8c760941a8a2099eec87e567a43c86d6c646af67? I just want to validate that the fix works for you. That version does not include the additional logging.

Sorry about that. Attached are updated logs where 8080 is closed and the webhook is up and running; the pods are stable.

karpenter-logs.txt

BTW, here is the Helm values.yaml snippet I ended up using:

controller:
...
  image:
    repository: redacted/karpenter/controller
    tag: aef4bb9ae73cef3b9b668230d0f2e70093303c3e
    digest: sha256:dfa64043160e7f948f17ea002bf32ceed0d2f1b8d932b769af2c166e4b1a0361

In the deployment, I see that the sha256 of your deployment image is the same as v0.29.0? In the manifest I see helm.sh/chart: karpenter-v0.29.0. Make sure your pod image is set like this:

controller:
  image: redacted/karpenter/controller:aef4bb9ae73cef3b9b668230d0f2e70093303c3e@sha256:dfa64043160e7f948f17ea002bf32ceed0d2f1b8d932b769af2c166e4b1a0361

@Nashluffy I have been trying to replicate your issue, and I have not had any luck. I have built a new karpenter version that will log the health check attempts for both the Webhook port and port 8080. This should give more context into the issue. Could you try this karpenter version: v0-aef4bb9ae73cef3b9b668230d0f2e70093303c3e? Can you also provide your logs after running this version?

After trying again with the latest chart (v0.29.0) I'm seeing the same error as @rarecrumb, and yes, we do have something else listening on 8080 (also on the host network). But we've specified a different port for the webhook to run on, as you can see in the manifests above.

2023-07-11T19:29:10.648Z        ERROR   controller      error received after stop sequence was engaged  {"commit": "61cc8f7-dirty", "error": "leader election lost"}
2023-07-11T19:29:14.648Z        ERROR   webhook Error while running server      {"commit": "61cc8f7-dirty", "error": "listen tcp :8080: bind: address already in use"}
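
For context on that last error: with hostNetwork: true the controller binds its listeners directly in the node's network namespace, so any other hostNetwork workload on the same node that already holds :8080 produces exactly this "address already in use" failure. A minimal sketch of such a conflicting workload (the name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: other-host-agent            # hypothetical workload already running on the node
spec:
  hostNetwork: true                 # shares the node's port namespace with the Karpenter pod
  containers:
    - name: agent
      image: example.com/agent:v1   # hypothetical image whose process listens on :8080
      ports:
        - containerPort: 8080       # informational under hostNetwork; the conflict comes from the process binding :8080

Whichever process binds :8080 second loses, which matches the log above; per the earlier comment, after the snapshot fix 8080 is no longer held open by Karpenter, so the collision disappears.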