krustlet: Panic on EKS while watching nodes (401 unauthorized)

I don’t have repro steps other than standing up an EKS cluster and waiting for the node to show up as NotReady. Prior to the panic, the node was Ready and was successfully running WebAssembly applications.

I pulled this from the service log on one of the krustlet nodes:

Apr 15 00:38:56 ip-192-168-71-217.us-west-2.compute.internal krustlet[2260]: [2020-04-15T00:38:56Z WARN  kube::runtime::informer] Unexpected watch error: Api(ErrorResponse { status: "Failure", message: "Unauthorized", reason: "Unauthorized", code: 401 })
Apr 15 00:38:56 ip-192-168-71-217.us-west-2.compute.internal krustlet[2260]: thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: Api(ErrorResponse { status: "Failure", message: "Unauthorized", reason: "Unauthorized", code: 401 })', /home/ec2-user/.cargo/git/checkouts/krustlet-dd9f49a0b51f9977/31f4940/crates/kubelet/src/kubelet.rs:78:41

which points at this `unwrap()` call while polling for pods.
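
For what it's worth, here is a minimal sketch of what more forgiving handling could look like. The names below (`poll_pods_once`, `pod_watch_loop`) are placeholders, not krustlet's actual API, and this assumes the watch error is transient and worth retrying:

    use std::time::Duration;

    // Hypothetical stand-in for the watch call that panics at kubelet.rs:78.
    async fn poll_pods_once() -> Result<(), Box<dyn std::error::Error>> {
        // ...one round of the pod informer poll would go here...
        Ok(())
    }

    // Rather than calling `.unwrap()` on the watch result (which takes down the
    // tokio worker on a transient 401), log the error and retry after a short
    // backoff so the node can recover once credentials are valid again.
    async fn pod_watch_loop() {
        loop {
            if let Err(e) = poll_pods_once().await {
                eprintln!("pod watch error, retrying in 5s: {}", e);
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    }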

I don’t yet know why the EKS-managed API server is occasionally returning a 401, but I thought I’d file an issue here in case anyone else runs into this.

Unfortunately, after the panic, the krustlet service goes into a failure loop when attempting to reclaim the existing node registration (possibly related to this issue?):

Apr 15 00:51:32 ip-192-168-71-217.us-west-2.compute.internal krustlet[4144]: thread 'main' panicked at 'Unable to recreate node...aborting: Api(ErrorResponse { status: "Failure", message: "nodes \"ip-192-168-71-217.us-west-2.compute.internal\" is forbidden: User \"system:node:ip-192-168-71-217.us-west-2.compute.internal\" cannot delete resource \"nodes\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 })', /home/ec2-user/.cargo/git/checkouts/krustlet-dd9f49a0b51f9977/31f4940/crates/kubelet/src/node.rs:47:21
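
As a rough illustration only: the reclaim path could tolerate this 403 by keeping the existing Node object instead of aborting. `ApiError`, `delete_node`, `create_node`, and `update_node_status` below are hypothetical stubs, not krustlet's actual functions:

    #[derive(Debug)]
    enum ApiError {
        Forbidden,
        Other(String),
    }

    // Hypothetical stubs standing in for the node API calls made in node.rs.
    async fn delete_node(_name: &str) -> Result<(), ApiError> { Err(ApiError::Forbidden) }
    async fn create_node(_name: &str) -> Result<(), ApiError> { Ok(()) }
    async fn update_node_status(_name: &str) -> Result<(), ApiError> { Ok(()) }

    // Sketch: if the node identity is forbidden from deleting its own Node
    // object (as on EKS), reuse the existing registration instead of panicking.
    async fn reclaim_node(name: &str) -> Result<(), ApiError> {
        match delete_node(name).await {
            Ok(()) => create_node(name).await,
            Err(ApiError::Forbidden) => update_node_status(name).await,
            Err(e) => Err(e),
        }
    }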

Most upvoted comments

The nodes have been up for two days and appear to be healthy. I think we can declare that the fix works 🎉!

$ k get node -owide
NAME                                           STATUS   ROLES   AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE    KERNEL-VERSION   CONTAINER-RUNTIME
ip-192-168-13-165.us-west-2.compute.internal   Ready    agent   2d    v1.17.0   192.168.13.165   <none>        <unknown>   <unknown>        mvp
ip-192-168-52-161.us-west-2.compute.internal   Ready    agent   2d    v1.17.0   192.168.52.161   <none>        <unknown>   <unknown>        mvp

So I ended up taking a shot at fixing it in the upstream library. I’ll open a PR once I get a chance to test it.