karpenter-provider-aws: Karpenter pods never ready with hostNetwork: true

Description

Observed Behavior:

Upgrading from v0.25.0 to v0.28.0 is not successful.

With hostNetwork: true set, the Karpenter pods never become ready.

Expected Behavior: The Karpenter pods should become ready after the upgrade with hostNetwork: true set.

When running Calico, is hostNetwork: true all that's needed? (See the sketch after the reproduction steps below.)

Reproduction Steps (Please include YAML):

      karpenter:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: karpenter.sh/provisioner-name
                      operator: Exists
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              preference:
                matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values:
                  - arm64
        controller:
          hostNetwork: true
        nodeSelector:
          kubernetes.io/arch: arm64
        tolerations:
        - key: kubernetes.io/arch
          operator: Equal
          value: arm64
          effect: NoSchedule
        settings:
          aws:
            clusterName: mine
            clusterEndpoint: mine
            defaultInstanceProfile: mine
            interruptionQueueName: mine
          featureGates:
            driftEnabled: true
        serviceAccount:
          annotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::mine:role/karpenter-controller
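
Regarding the Calico question above: with a custom CNI the control plane often cannot reach pod IPs, so hostNetwork: true is used to make the webhook reachable from the API server. Because the pod then shares the node's port namespace, every port the controller and webhook listen on must be free on that node. A minimal sketch of the values that typically matter, assuming your chart version exposes webhook.port (the port number is only an example):

controller:
  hostNetwork: true   # controller and webhook listeners bind directly on the node
webhook:
  port: 8443          # assumed chart value; pick a port that is free on every node the pod can land on

This complements the reproduction snippet above, which only sets controller.hostNetwork.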

Versions:

  • Chart Version: 0.28.0
  • Kubernetes Version (kubectl version): Major:"1", Minor:"24+", GitVersion:"v1.24.14-eks-c12679a", GitCommit:"05d192f0de17608d98e17761ad3cffa9a6407f2f"

About this issue

  • State: closed
  • Created a year ago
  • Comments: 28 (14 by maintainers)

Most upvoted comments

Glad this got solved, thank you @engedaam, @Nashluffy!!

That’s great to hear! The team will be doing a karpenter release next week, and it will include the fix.

@engedaam yes confirmed the original snapshot fixes the issue as well, thanks!

@Nashluffy can you try out the original snapshot fix, v0-8c760941a8a2099eec87e567a43c86d6c646af67? I just want to validate that the fix works for you. That version does not include the additional logging.

Sorry about that. Attached are updated logs where 8080 is closed and the webhook is up and running; the pods are stable.

karpenter-logs.txt

BTW, here is the Helm values.yaml snippet I ended up using:

controller:
...
  image:
    repository: redacted/karpenter/controller
    tag: aef4bb9ae73cef3b9b668230d0f2e70093303c3e
    digest: sha256:dfa64043160e7f948f17ea002bf32ceed0d2f1b8d932b769af2c166e4b1a0361

In the deployment, I see that the sha256 of your deployment image is the same as v0.29.0? In the manifest I see helm.sh/chart: karpenter-v0.29.0. Make sure your pod image is set like this:

controller:
  image: redacted/karpenter/controller:aef4bb9ae73cef3b9b668230d0f2e70093303c3e@sha256:dfa64043160e7f948f17ea002bf32ceed0d2f1b8d932b769af2c166e4b1a0361

@Nashluffy I have been trying to replicate your issue, and I have not had any luck. I have built a new karpenter version that will log the health check attempts for both the Webhook port and port 8080. This should give more context into the issue. Could you try this karpenter version: v0-aef4bb9ae73cef3b9b668230d0f2e70093303c3e? Can you also provide your logs after running this version?

After trying again with the latest chart (v0.29.0) I'm seeing the same error as @rarecrumb, and yes, we do have something else listening on 8080 (also on the host network). But we've specified a different port for the webhook to run on, as you can see in the manifests above.

2023-07-11T19:29:10.648Z        ERROR   controller      error received after stop sequence was engaged  {"commit": "61cc8f7-dirty", "error": "leader election lost"}
2023-07-11T19:29:14.648Z        ERROR   webhook Error while running server      {"commit": "61cc8f7-dirty", "error": "listen tcp :8080: bind: address already in use"}
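
For context on that last error: with hostNetwork: true the controller binds its listeners directly in the node's network namespace, so any other hostNetwork workload on the same node that already holds :8080 produces exactly this "address already in use" failure. A minimal sketch of such a conflicting workload (the name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: other-host-agent            # hypothetical workload already running on the node
spec:
  hostNetwork: true                 # shares the node's port namespace with the Karpenter pod
  containers:
    - name: agent
      image: example.com/agent:v1   # hypothetical image whose process listens on :8080
      ports:
        - containerPort: 8080       # informational under hostNetwork; the conflict comes from the process binding :8080

Whichever process binds :8080 second loses, which matches the log above; per the earlier comment, after the snapshot fix 8080 is no longer held open by Karpenter, so the collision disappears.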