amazon-vpc-cni-k8s: EKS 1.16 / v1.6.x: "couldn't get current server API group list; will keep using cached value"
We see the `aws-node` pods sometimes crash on startup with this logged:
Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0708 16:29:03.884330 6 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout)
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start:
After starting and crashing, the pod is restarted and then runs fine. About half of the `aws-node` pods do this.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 26
- Comments: 68 (27 by maintainers)
Any news on that internal ticket? We keep running into this issue whenever our nodes start.
Currently, the workaround is adding a busybox init container that waits for kube-proxy to start.
^^
Same for `v1.6.3` 😐
We had the same problem when updating the control plane and nodegroups from 1.15 to 1.16. We had to pin kube-proxy to an earlier version (kube-proxy:v1.16.13 -> kube-proxy:v1.16.12) and recreate the nodes.
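For anyone looking for the mechanics: on EKS this usually comes down to changing the image tag on the kube-proxy DaemonSet in kube-system. A sketch of the relevant fragment (the registry account, region, and tag below are examples and differ per cluster):

```yaml
# Illustrative fragment of the kube-proxy DaemonSet in the kube-system namespace.
# The registry account/region are examples; use the EKS add-on registry for your
# region and whichever kube-proxy tag you want to pin.
spec:
  template:
    spec:
      containers:
        - name: kube-proxy
          image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/kube-proxy:v1.16.12
```

The equivalent one-liner is `kubectl -n kube-system set image daemonset/kube-proxy kube-proxy=<image>`, after which the affected nodes can be recreated as described above.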
We sometimes run into a race condition where aws-node is started before kube-proxy. Without kube-proxy, kubernetes.default.svc.cluster.local is not reachable, so aws-node fails to start and the container is not automatically restarted. To mitigate this, we added an initContainer to aws-node that waits for the API server's cluster IP to become reachable, along the lines of the sketch below.
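A minimal sketch of such an initContainer, added under the aws-node DaemonSet pod spec. The image and probe command here are assumptions (any small image that can open a TCP/TLS connection will do); the point is simply to block until the `kubernetes` service cluster IP answers, which only happens once kube-proxy has programmed iptables on the node:

```yaml
# Sketch only: block aws-node until the kubernetes service cluster IP is reachable,
# i.e. until kube-proxy has written the service iptables rules on this node.
# KUBERNETES_SERVICE_HOST/PORT are injected by the kubelet into every container.
initContainers:
  - name: wait-for-kube-proxy
    image: curlimages/curl:8.5.0   # assumed image; a busybox with a TCP probe works too
    command:
      - sh
      - -c
      - >
        until curl -k -s -o /dev/null --connect-timeout 3
        "https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/healthz";
        do echo "waiting for kube-proxy"; sleep 2; done
```

Any HTTP status (even 401/403) counts as success here; only a connection failure keeps the loop waiting.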
@tibin-mfl Yes, the CNI pod (aws-node) needs kube-proxy to set up the cluster IPs before it can start up.
We had the same issue after upgrading from EKS 1.15 to 1.16. We were just bumping the image version inside the DaemonSet to 1.6.x. What solved it for us was applying the full manifest provided in the AWS docs: https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.6/config/v1.6/aws-k8s-cni.yaml
It changes both the DaemonSet and the ClusterRole.
Good luck!
Hello everyone, I'm also having the same issue. My K8s cluster is still on v1.15, and before upgrading to v1.16 I wanted to make sure all my controllers are on the versions recommended by AWS.
My current controller versions: kube-proxy v1.6.12 (works), CoreDNS 1.6.6 (works), amazon-vpc-cni-k8s 1.7.5 (doesn't work).
The deployment is done exactly as in the release docs: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.7.5. I tried all the solutions mentioned in this repo:
1. Downgrading kube-proxy from 1.6.13 to 1.6.12, with and without adding the label suggested in an earlier answer.
2. Trying amazon-vpc-cni-k8s versions v1.7.5, v1.7.4, v1.7.3 and v1.7.2, and finally rolling back to 1.6.3, but the issue is always the same.
@mogren this is on EKS, version 1.17. We discovered this as part of adding custom PSPs to all components. No scripts locking iptables on startup, using the standard EKS AMIs.
The behaviour we were seeing was that the aws-node pod never became ready, and was crash-looping. Apologies if that caused any confusion. I think it’s not unreasonable to conclude that:
I found the root cause of this issue (at least for my use case), and it was my own fault 😃: I had set the DHCP option set incorrectly to `<eks_name>.compute.internal`. After setting it to `<region>.compute.internal`, nodes load correctly and fast (a CloudFormation sketch of a correct options set is shown below).

Another tidbit: we ran into this very issue when upgrading from `v1.15` to `v1.16`. Our current workaround: we're keeping kube-proxy at v1.15.11 even after upgrading the rest of the cluster to v1.16. The rest of the add-ons we were able to get to the recommended versions.
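For reference, a CloudFormation sketch of a DHCP options set with the domain name EC2 expects. The region is an example (us-east-1 uses `ec2.internal`), and `ClusterVpc` is a placeholder for an existing VPC resource:

```yaml
# Sketch: DHCP options set using the <region>.compute.internal domain name.
# The region and the ClusterVpc reference are placeholders.
Resources:
  ClusterDhcpOptions:
    Type: AWS::EC2::DHCPOptions
    Properties:
      DomainName: eu-west-1.compute.internal   # us-east-1 uses ec2.internal
      DomainNameServers:
        - AmazonProvidedDNS
  ClusterDhcpOptionsAssociation:
    Type: AWS::EC2::VPCDHCPOptionsAssociation
    Properties:
      VpcId: !Ref ClusterVpc                   # placeholder for your VPC resource
      DhcpOptionsId: !Ref ClusterDhcpOptions
```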
I got my issue resolved. In my case I manage the entire EKS stack using Terraform, and the problem happened while updating the code base to support the upgrade to 1.18 and AWS CNI 1.7.5 (from 1.6.4). Culprits in my case were
Sharing in case it helps anyone.
Hi @safaa-alnabulsi
This looks to be a different issue; I feel it is better tracked as a new one, since this thread is mainly about the delay in kube-proxy getting node information, with aws-node waiting for kube-proxy to come up (eventually aws-node does come up). Please open a new issue and share the logs collected by `sudo bash /opt/cni/bin/aws-cni-support.sh`. You can email the logs to varavaj@amazon.com. Thank you!

@mogren I attached the kube-proxy error log. The complete kube-proxy log is attached here: https://pastebin.pl/view/3d4cc276
Hi @mogggggg
Sorry for the delayed response. As you mentioned, it looks like kube-proxy is waiting to retrieve node info, and during that window aws-node starts and is unable to communicate with the API server because iptables isn't updated yet, hence it restarts. I will try to repro it and see how to mitigate this issue.
Thanks for your patience.
Hi @mogggggg
Thanks for letting us know. Please share the full logs from the log collector script (https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni) and also the kube-proxy pod logs. You can email them to varavaj@amazon.com. Thanks.
Hi, just wanted to chime in: we're seeing the same thing. Like others have mentioned, the pod seems to restart once when the node first starts up and it's fine after that. We're not using any custom PSPs.
- EKS version: 1.17
- AMI version: v1.17.9-eks-4c6976
- kube-proxy version: 1.17.7
- CNI version: 1.6.3

I can see errors in the `kube-proxy` logs on one of the nodes where `aws-node` restarted, as well as related errors in the `aws-node` logs. It seems like this started happening for us as part of the 1.17 upgrade; we haven't restarted all our nodes since the upgrade, and on the nodes still running the old AMI (`v1.16.12-eks-904af05`) the `aws-node` pod didn't restart. I'm happy to share the full logs if they're helpful, just give me an email address to send them!
@max-rocket-internet Hey, sorry for the lack of updates on this. Been out for a bit without much network access, so haven’t been able to track this one down. I agree that there is no config change between v1.6.2 and v1.6.3, but since v1.5.x, we have updated the readiness and liveness probe configs.
Between Kubernetes 1.15 and 1.16 kube-proxy has changed, so that could be related. We have not yet been able to reproduce this by doing master upgrades.
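For context, a sketch of the shape of those probes on the aws-node container; the commands, port, and timings here are illustrative and vary between releases, so compare against the aws-k8s-cni.yaml you actually deploy:

```yaml
# Illustrative only: readiness/liveness probes on the aws-node container.
# Exact commands, ports, and timings differ between CNI releases.
readinessProbe:
  exec:
    command: ["/app/grpc-health-probe", "-addr=:50051"]
  initialDelaySeconds: 1
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  exec:
    command: ["/app/grpc-health-probe", "-addr=:50051"]
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 3
```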