amazon-eks-ami: AWS EKS - remote error: tls: internal error - CSR pending
What happened: We have an EKS cluster deployed with managed nodes. When we try to run kubectl logs or kubectl exec, it gives Error from server: error dialing backend: remote error: tls: internal error. In the admin console, all the Nodes and Workloads show as ready. Then I ran kubectl get csr and it showed all requests as Pending. I then described a CSR and the details seem correct. Please refer to the output below:
Name: csr-zz882
Labels: <none>
Annotations: <none>
CreationTimestamp: Sat, 13 Feb 2021 15:03:31 +0000
Requesting User: system:node:ip-192-168-33-152.ec2.internal
Signer: kubernetes.io/kubelet-serving
Status: Pending
Subject:
Common Name: system:node:ip-192-168-33-152.ec2.internal
Serial Number:
Organization: system:nodes
Subject Alternative Names:
DNS Names: ec2-3-239-231-25.compute-1.amazonaws.com
ip-192-168-33-152.ec2.internal
IP Addresses: 192.168.33.152
3.239.231.25
Events: <none>
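For anyone triaging the same symptom, a minimal sketch of the commands used above to find stuck serving CSRs (csr-zz882 is just the name from this cluster):
```
# List CSRs and look for Pending kubelet-serving requests
kubectl get csr
kubectl get csr | grep Pending

# Inspect one of them; the signer should be kubernetes.io/kubelet-serving
kubectl describe csr csr-zz882
```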
Anything else we need to know?: This issue came up suddenly. Our guess is that it started after scaling.
Environment:
- AWS Region: North Virginia
- Instance Type(s): M5.Large
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.3
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.18
- AMI Version: AL2_x86_64
- Kernel (e.g. uname -a): Linux ip-192-168-33-152.ec2.internal 4.14.214-160.339.amzn2.x86_64 #1 SMP Sun Jan 10 05:53:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-002eb42333992c419"
BUILD_TIME="Mon Feb 8 20:17:23 UTC 2021"
BUILD_KERNEL="4.14.214-160.339.amzn2.x86_64"
ARCH="x86_64"
About this issue
- State: closed
- Created 3 years ago
- Reactions: 26
- Comments: 18 (1 by maintainers)
Check that the aws-auth configmap has no duplicated values in mapRoles and mapUsers; this was my case. Thanks @Dr4il for the idea about the aws-auth configmap.
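A quick way to do that check, as a minimal sketch (aws-auth lives in the kube-system namespace on EKS):
```
# Dump the aws-auth ConfigMap and look for repeated rolearn/username entries
# in mapRoles and mapUsers
kubectl get configmap aws-auth -n kube-system -o yaml
```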
@rtripat to reproduce the issue here is what I did:
- put a different username for the same rolearn into mapRoles of the aws-auth configmap
- recycle 1 node of the node group associated to INSTANCE_ROLE_ARN
- deploy a pod to that new node and verify the issue
- check the output of kubectl logs POD_NAME -n NAMESPACE
- check the output of kubectl get csr -n NAMESPACE

To work around the issue I put the same username for the same rolearn into the aws-auth configmap (see the sketch just below).

To anyone with a similar issue, be aware AWS will charge you for support cases, but fail to diagnose or help in any way.
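A rough shell sketch of those steps, with POD_NAME and NAMESPACE as placeholders (the aws-auth edit itself is done by hand):
```
# Introduce (to reproduce) or remove (to fix) the conflicting username
# for the same rolearn in mapRoles
kubectl -n kube-system edit configmap aws-auth

# After recycling one node of the node group and scheduling a pod onto it:
kubectl logs POD_NAME -n NAMESPACE   # fails with "remote error: tls: internal error" while broken
kubectl get csr                      # the new node's kubelet-serving CSRs stay Pending
```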
Any update on this? We have experienced this three times now, each time having to delete and recreate the cluster. AWS support couldn’t reproduce it on their side, charged us for the support case they never solved, and then asked us to reproduce it for them, giving the following response:
AWS Support:
In our case, AWS terminated a node (without notifying or requesting):
W0708 15:12:35.439299 1 aws.go:1730] the instance i-04c7a**** is terminated
I0708 15:12:35.439314 1 node_lifecycle_controller.go:156] deleting node since it is no longer present in cloud provider: ip-********.eu-west-1.compute.internal
The node that came back up started with TLS issue, brought down parts of our system and now the cluster is again unhealthy.
CSRs from nodes have the following auto-approve config:
but remain in the pending state.
Not exactly the same problem (instead of “pending” we got stuck with “approved” but not “issued” status), but maybe it can help someone.
In our case, it was just bad timing. In the instance user data we apply some changes to the containerd config and then restart it. Sometimes the restart happens just after a CSR is created but before the actual certificate gets issued and downloaded by kubelet (a rather small window of a couple of seconds). Restarting containerd seems to cause kubelet to restart too and create another CSR. For some reason, EKS doesn’t issue that second CSR (or any new CSR for that exact node) for about 10 minutes, which causes the “tls: internal error” on some new nodes for about 10 minutes.
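A minimal sketch of one way to avoid that window, assuming the default kubelet serving-certificate path; this is only an illustration, not what we actually run:
```
#!/usr/bin/env bash
# In user data: apply the containerd config changes first, then wait until
# kubelet's serving certificate has actually been issued before restarting
# containerd, so the restart does not land between CSR creation and issuance.
for _ in $(seq 1 60); do
  [ -s /var/lib/kubelet/pki/kubelet-server-current.pem ] && break
  sleep 5
done
systemctl restart containerd
```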
My issue was resolved after updating the AMI of the EKS cluster nodegroup.
Use the following documentation to get the correct AMI for the nodegroup: https://docs.aws.amazon.com/eks/latest/userguide/eks-linux-ami-versions.html
This also might be useful for some people: https://docs.aws.amazon.com/eks/latest/userguide/cert-signing.html
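For what it’s worth, the recommended EKS-optimized AMI can also be resolved from SSM instead of copying IDs from the table; a sketch (the Kubernetes version and region are just the ones from this issue):
```
# Latest recommended EKS-optimized Amazon Linux 2 AMI for Kubernetes 1.18
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.18/amazon-linux-2/recommended/image_id \
  --region us-east-1 \
  --query "Parameter.Value" \
  --output text
```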
This was our case. I was pulling my hair out trying to figure out what had suddenly gone wrong with our EKS module (we didn’t parameterize the DNS name type in our network module), and it turned out someone had updated the DNS name type to “resource name”. Changing it back to “IP name” fixed everything.
In my case, when I listed the CSRs I got all of them Pending, and I manually approved one of them like this:
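Presumably something along these lines, approving the CSR by name:
```
# Manually approve a single pending CSR
kubectl certificate approve csr-kkz2t
```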
Then this certificate csr-kkz2t became Approved,Issued, and kubectl logs and exec started working.

In our case there was a clear difference in the “hostname type” setting of the subnet the nodes are created in. With the same cluster and all the same configs, when the subnet the nodes were created in had the setting set to “resource name”, nodes got names like “i-0977c7690f78d6d5f.eu-central-1.compute.internal” and were not able to join the cluster properly, failing with that error. With the setting changed to “IP name”, nodes got names like “ip-10-1-35-198.eu-central-1.compute.internal” and worked just fine.
You can even see in the output of kubectl get csr traces of both of those name types attempting to join the cluster, where the second ones succeeded and the first ones didn’t. It looks like there is some pattern matching on certificate DNS names in the AWS-managed control plane when issuing the node certificates, and when the name doesn’t match it, it just fails.
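For reference, a sketch of checking and changing that subnet setting from the CLI (the subnet ID is a placeholder):
```
# Show the hostname type new instances in the subnet will get
aws ec2 describe-subnets \
  --subnet-ids subnet-0123456789abcdef0 \
  --query "Subnets[].PrivateDnsNameOptionsOnLaunch"

# Switch new launches back to IP-based names (ip-10-1-35-198....)
aws ec2 modify-subnet-attribute \
  --subnet-id subnet-0123456789abcdef0 \
  --private-dns-hostname-type-on-launch ip-name
```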
Damn EKS, for years it has been having stupid problems.