rancher: [BUG] helm-operation failure - Waiting for Kubernetes API to be available

Rancher Server Setup

  • Rancher version: 2.7.3
  • Installation option (Docker install/Helm Chart): Helm Chart
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): k3s 1.25.9
  • Proxy/Cert Details: self-signed certificates

Information about the Cluster (Kubernetes v1.25.6)

  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Custom, created by running the docker command on nodes to install an RKE cluster

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): Admin

Describe the bug

Rancher keeps creating pods that fail.

Pod name: helm-operation-ddxl9

Container image: rancher/shell:v0.1.19

Pod logs:

Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Timeout waiting for kubernetes

The pod is then terminated.


The Recent Operations list is filled with failures.

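For anyone who wants to gather more detail before a failing pod disappears, here is a rough sketch of where to start (it assumes the helm-operation pods live in the usual cattle-system namespace; the pod name below is the one from this report, so adjust both as needed):

```sh
# List the helm-operation pods Rancher has spawned
kubectl -n cattle-system get pods | grep helm-operation

# Events often reveal scheduling, image-pull, or network problems
kubectl -n cattle-system describe pod helm-operation-ddxl9

# Logs from every container in the pod, not just the default one
kubectl -n cattle-system logs helm-operation-ddxl9 --all-containers
```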

To Reproduce

Do nothing; this just started happening after the upgrade to 2.7.2 and has persisted into 2.7.3.

Expected Result

Either the pod should not be created, or the pod should be able to communicate with the Kubernetes API.


About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 54 (4 by maintainers)

Most upvoted comments

for those of you experiencing this problem, check that your cluster’s k8s API is communicating properly.

How would you do this? I’ve now run into this scenario too and have no idea how to troubleshoot it.

Scott
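One rough way to check, as a sketch rather than an official procedure (it assumes you have a kubeconfig for the affected cluster; the image name is just an example):

```sh
# From a workstation with the cluster's kubeconfig: the API server's own
# health endpoint (on a default install this is readable without credentials)
kubectl get --raw='/readyz?verbose'
kubectl get nodes

# From inside the cluster: can a pod reach the in-cluster API endpoint at all?
# kubernetes.default.svc is the standard in-cluster service name; -k skips
# certificate verification because only reachability matters here.
kubectl run api-check --rm -it --restart=Never --image=curlimages/curl \
  --command -- curl -sk https://kubernetes.default.svc/healthz
```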

Rancher is good, but it is a leaky abstraction: the assumption is that you work with standard machines from cloud providers or enterprise procurement. My case turned out to be inside the Tigera Calico operator, which Rancher makes mostly opaque: https://docs.tigera.io/calico/latest/networking/ipam/ip-autodetection

Some of my machines have two NICs, where the secondary NIC sits on a local storage network in a separate, closed subnet. Unfortunately, the Tigera Calico operator's default IP autodetection method is "first found", i.e. whichever NIC it sees first, and on those machines it picked the secondary NIC. Hence some links are good while others are not, depending on where the node is located (a sketch of pinning the autodetection method is included below the quote). This has nothing to do with what Matt says:

For all of you who reported this, does the upgrade work eventually? I agree that this definitely looks bad, but it’s really just a look behind the curtain while these pods are waiting for other things to spin up, and they should eventually resolve. Any additional info is appreciated.

Or rather, it has everything to do with what Rancher tries to achieve: providing a mostly automated, default Kubernetes setup that just works.
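For the dual-NIC situation described above, the Tigera docs linked in that comment cover pinning the autodetection method instead of relying on "first found". A minimal sketch, assuming the operator's Installation resource has the usual name default and that eth0 stands in for whichever interface actually carries cluster traffic; on a Rancher-managed cluster you may need to set the equivalent through the cluster's Calico chart values rather than patching the resource directly:

```sh
# Pin Calico's IPv4 autodetection to the cluster-traffic NIC instead of the
# default "first found" behaviour (the interface name here is an example)
kubectl patch installation default --type merge -p \
  '{"spec":{"calicoNetwork":{"nodeAddressAutodetectionV4":{"interface":"eth0"}}}}'
```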

Another resource that might help is the nicolaka/netshoot diagnostic tool; you can run it as a DaemonSet to test connections and routes. Try dig-ging kubernetes.default inside the container to test CoreDNS, or ping-ing across nodes to test the pod VXLAN and service IPs.
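A minimal sketch of that approach (the names, namespace, and long sleep are arbitrary choices for illustration):

```sh
# One-off check: resolve the API service through cluster DNS from a throwaway
# netshoot pod (exercises CoreDNS and the service network)
kubectl run netshoot --rm -it --restart=Never --image=nicolaka/netshoot \
  --command -- dig +short kubernetes.default.svc.cluster.local

# Per-node checks: run netshoot on every node as a DaemonSet, then exec into
# the pod on a suspect node and dig/ping from there
kubectl create -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: netshoot
spec:
  selector:
    matchLabels:
      app: netshoot
  template:
    metadata:
      labels:
        app: netshoot
    spec:
      containers:
        - name: netshoot
          image: nicolaka/netshoot
          command: ["sleep", "3600000"]   # keep the container alive for exec
EOF
```

From there, kubectl exec into the netshoot pod on each node and ping pod IPs on other nodes to exercise the VXLAN overlay.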
