amazon-eks-ami: AWS EKS - remote error: tls: internal error - CSR pending

What happened: We have an EKS cluster deployed with managed nodes. When we run kubectl logs or kubectl exec, we get Error from server: error dialing backend: remote error: tls: internal error. The admin console shows all Nodes as Ready and all Workloads as ready. Running kubectl get csr shows every request as Pending, and when I describe a CSR the details look correct. Please refer to the output below:

Name:               csr-zz882
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Sat, 13 Feb 2021 15:03:31 +0000
Requesting User:    system:node:ip-192-168-33-152.ec2.internal
Signer:             kubernetes.io/kubelet-serving
Status:             Pending
Subject:
  Common Name:    system:node:ip-192-168-33-152.ec2.internal
  Serial Number:  
  Organization:   system:nodes
Subject Alternative Names:
         DNS Names:     ec2-3-239-231-25.compute-1.amazonaws.com
                        ip-192-168-33-152.ec2.internal
         IP Addresses:  192.168.33.152
                        3.239.231.25
Events:  <none>
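
For anyone triaging the same symptom, a rough way to confirm that the serving certificate was never issued (the path assumes kubelet's default PKI directory):

# Inspect the stuck CSRs (csr-zz882 is the one described above)
kubectl get csr --sort-by=.metadata.creationTimestamp
kubectl describe csr csr-zz882

# On the affected node: the serving certificate file is missing while the CSR stays Pending
ls -l /var/lib/kubelet/pki/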

Anything else we need to know?: This issue appeared suddenly. Our guess is that it started after scaling.

Environment:

  • AWS Region: North Virginia
  • Instance Type(s): M5.Large
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.3
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.18
  • AMI Version: AL2_x86_64
  • Kernel (e.g. uname -a):Linux ip-192-168-33-152.ec2.internal 4.14.214-160.339.amzn2.x86_64 #1 SMP Sun Jan 10 05:53:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-002eb42333992c419"
BUILD_TIME="Mon Feb  8 20:17:23 UTC 2021"
BUILD_KERNEL="4.14.214-160.339.amzn2.x86_64"
ARCH="x86_64"

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 26
  • Comments: 18 (1 by maintainers)

Most upvoted comments

Check that the aws-auth configmap has no duplicated values in mapRoles and mapUsers; this was my case.

Thanks @Dr4il for the idea about the aws-auth configmap.

@rtripat, to reproduce the issue, here is what I did:

  • set a different username for the same rolearn in mapRoles of the aws-auth configmap:
...
- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "system:node:{{EC2PrivateDNSName}}"
- "groups":
  - "system:masters"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "test"
...
  • recycle one node of the node group associated with INSTANCE_ROLE_ARN

  • deploy a pod to that new node and verify the issue

output of kubectl logs POD_NAME -n NAMESPACE:

Error from server: Get "https://X.X.X.X:10250/containerLogs/...": remote error: tls: internal error

output of kubectl get csr -n NAMESPACE:

NAME        AGE     SIGNERNAME                      REQUESTOR                                                  REQUESTEDDURATION   CONDITION
csr-vgrfh   4m42s   kubernetes.io/kubelet-serving   test                                                       <none>              Pending

To work around the issue, I put the same username for the same rolearn in the aws-auth configmap:

...
- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "system:node:{{EC2PrivateDNSName}}"
- "groups":
  - "system:masters"
  "rolearn": "INSTANCE_ROLE_ARN"
  "username": "system:node:{{EC2PrivateDNSName}}"
...
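
A quick way to spot duplicated rolearn entries is to dump the configmap before editing it (standard kubectl commands; adjust to your context):

kubectl -n kube-system get configmap aws-auth -o yaml | grep -n rolearn
kubectl -n kube-system edit configmap aws-auth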

To anyone with a similar issue, be aware AWS will charge you for support cases, but fail to diagnose or help in any way.

Any update on this? We have experienced this three times now, each time having to delete and recreate the cluster. AWS support couldn’t reproduce it on their side, charged us for the support case they never solved, and then asked us to reproduce it for them, giving the following response:

AWS Support:

Also, I’ve tested it in my cluster by scaling the worker nodes from the eks console but in my case the node was launched successfully.

Therefore, please check once again if you can reproduce this issue, if so please share the steps and the logs/outputs that I’ve requested in my previous correspondence and I’ll investigate this further.

In our case, AWS terminated a node (without notifying or requesting):

W0708 15:12:35.439299 1 aws.go:1730] the instance i-04c7a**** is terminated
I0708 15:12:35.439314 1 node_lifecycle_controller.go:156] deleting node since it is no longer present in cloud provider: ip-********.eu-west-1.compute.internal

The node that came back up started with TLS issue, brought down parts of our system and now the cluster is again unhealthy.

CSRs from nodes have the following auto-approve config:

# Approve renewal CSRs for the group "system:nodes"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: auto-approve-renewals-for-nodes
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:certificates.k8s.io:certificatesigningrequests:selfnodeclient
  apiGroup: rbac.authorization.k8s.io

but remain in the pending state.
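
If I understand correctly, that binding only auto-approves client certificate renewals (the selfnodeclient signer); the stuck CSRs here use the kubernetes.io/kubelet-serving signer, which is approved on the EKS control-plane side, so this binding doesn't apply to them. You can check which signer the pending CSRs use with:

kubectl get csr -o custom-columns=NAME:.metadata.name,SIGNER:.spec.signerName,REQUESTOR:.spec.username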

Not exactly the same problem (instead of "Pending" we got stuck with an "Approved" but not "Issued" status), but maybe it can help someone.

In our case it was just bad timing. In the user data we apply some changes to the containerd config and then restart it. Sometimes the restart happens just after the CSR is created but before the actual certificate is issued and downloaded by kubelet (a rather small window of a couple of seconds). Restarting containerd seems to restart kubelet as well, which creates another CSR. For some reason EKS doesn't issue that second CSR (or any new CSR for that exact node) for about 10 minutes, which causes the "tls: internal error" on some new nodes for about 10 minutes.
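
If you run into the same race, one mitigation is to delay the containerd restart until kubelet has written its serving certificate. A minimal sketch for the user data, assuming the default kubelet PKI directory and server TLS bootstrap:

#!/bin/bash
# Wait up to ~5 minutes for the kubelet serving certificate before restarting containerd,
# so the restart doesn't race the CSR issuance described above.
for i in $(seq 1 60); do
  [ -e /var/lib/kubelet/pki/kubelet-server-current.pem ] && break
  sleep 5
done
systemctl restart containerd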

My issue was resolved after updating the AMI of the EKS cluster node group.

Use the following documentation to get the correct AMI for the node group: https://docs.aws.amazon.com/eks/latest/userguide/eks-linux-ami-versions.html
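
For managed node groups, the AMI update can also be triggered from the CLI; a rough example with placeholder names:

aws eks update-nodegroup-version --cluster-name <cluster-name> --nodegroup-name <nodegroup-name>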

This also might be useful for some people: https://docs.aws.amazon.com/eks/latest/userguide/cert-signing.html

kubectl get csr

In our case it came down to the "hostname type" setting of the subnet the nodes are created in. With the same cluster and all the same configs, when the subnet had the setting "Resource name" and nodes got names like "i-0977c7690f78d6d5f.eu-central-1.compute.internal", they were not able to join the cluster properly and hit this error. With the setting changed to "IP name", so nodes got names like "ip-10-1-35-198.eu-central-1.compute.internal", they worked just fine.

You can even see traces of both name types attempting to join the cluster in the output of kubectl get csr; the second ones succeeded and the first ones didn't:

NAME        AGE   SIGNERNAME                      REQUESTOR                                                       REQUESTEDDURATION   CONDITION
csr-kl722   16m   kubernetes.io/kubelet-serving   system:node:i-0977c7690f78d6d5f.eu-central-1.compute.internal   <none>              Approved
csr-mbbvp   50s   kubernetes.io/kubelet-serving   system:node:ip-10-1-35-198.eu-central-1.compute.internal        <none>              Approved,Issued
csr-r2n4z   16m   kubernetes.io/kubelet-serving   system:node:i-0f3e6bf012164f037.eu-central-1.compute.internal   <none>              Approved
csr-tr9b4   57s   kubernetes.io/kubelet-serving   system:node:ip-10-1-37-252.eu-central-1.compute.internal        <none>              Approved,Issued

It looks like the AWS-managed control plane does some pattern matching on the certificate DNS names when issuing the node certificates, and when the name doesn't match, it just fails.
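
For anyone who wants to check or fix this from the CLI, something like the following should work (the subnet ID is a placeholder; attribute names are to the best of my knowledge):

# Show the current hostname type for the subnet the nodes launch into
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 --query 'Subnets[].PrivateDnsNameOptionsOnLaunch'

# Switch new launches back to IP-based names
aws ec2 modify-subnet-attribute --subnet-id subnet-0123456789abcdef0 --private-dns-hostname-type-on-launch ip-name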

This was our case. I was pulling my hair out trying to figure out what had suddenly gone wrong with our EKS module (we didn't parameterize the DNS hostname type in our network module), and it turned out someone had changed the hostname type to Resource name. Changing it back to IP name fixed everything.

In my case, when I ran

kubectl get csr

all of them were Pending, and I manually approved one of them like this:

kubectl certificate approve csr-kkz2t

Then that certificate, csr-kkz2t, became Approved,Issued, and kubectl logs and kubectl exec started working.
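
If many CSRs are stuck, a rough one-liner to approve everything currently listed as Pending (review the list first, since this approves indiscriminately):

kubectl get csr | grep Pending | awk '{print $1}' | xargs kubectl certificate approve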

Damn EKS, it has been having stupid problems like this for years.