kops: M5 nodes sometimes fail to connect to the network
- What `kops` version are you running? The command `kops version` will display this information.
1.8.0
- What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a `kops` flag.
1.8.4
- What cloud provider are you using?
AWS
- What commands did you run? What is the simplest way to reproduce this issue?
Create a bunch of M5 nodes using the Stretch image, see many of them never connect to the network.
- What happened after the commands executed?
Some nodes start ok, others fail. Manually logging in to the failed node and restarting kubelet fixes the issue.
- What did you expect to happen?
All nodes connect to the network.
This looks like a timing issue somewhere. Comparing the logs from a failed boot and a successful one, this stands out:
Failed: Cloud-init v. 0.7.9 running ‘init-local’ at Thu, 07 Dec 2017 09:48:31 +0000. Up 9.35 seconds.
Successful: Cloud-init v. 0.7.9 running ‘init-local’ at Thu, 07 Dec 2017 07:50:00 +0000. Up 19.02 seconds.
Why did cloud-init start roughly 10 seconds later in the successful case?
Logs attached.
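To make the comparison between boots repeatable, the cloud-init stage banners can be pulled out of each log and the reported uptimes compared. The following is only a rough sketch: the helper name `stage_uptimes` is made up for illustration, and the regular expression assumes the banner format shown in the two lines quoted above (other cloud-init versions may print it differently).

```python
import re

# Assumed banner format, taken from the two log lines quoted in this issue:
# "Cloud-init v. 0.7.9 running 'init-local' at <date>. Up 9.35 seconds."
# Both straight and curly quotes are accepted around the stage name.
BANNER = re.compile(
    r"Cloud-init v\. (?P<version>\S+) running ['‘’](?P<stage>[^'‘’]+)['‘’] at .*?\. "
    r"Up (?P<uptime>\d+(?:\.\d+)?) seconds\."
)


def stage_uptimes(log_text):
    """Map each cloud-init stage to the uptime (in seconds) at which it first started."""
    uptimes = {}
    for match in BANNER.finditer(log_text):
        # Keep only the first banner seen for each stage.
        uptimes.setdefault(match.group("stage"), float(match.group("uptime")))
    return uptimes


failed_log = (
    "Cloud-init v. 0.7.9 running 'init-local' at "
    "Thu, 07 Dec 2017 09:48:31 +0000. Up 9.35 seconds."
)
successful_log = (
    "Cloud-init v. 0.7.9 running 'init-local' at "
    "Thu, 07 Dec 2017 07:50:00 +0000. Up 19.02 seconds."
)

delta = stage_uptimes(successful_log)["init-local"] - stage_uptimes(failed_log)["init-local"]
print(f"init-local started {delta:.2f}s later on the successful boot")
```

Running the same function over full console logs from several nodes would show whether failed boots consistently reach `init-local` earlier than successful ones.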
About this issue
- State: closed
- Created 7 years ago
- Comments: 15 (2 by maintainers)
Thanks @jlaswell. This is a fairly big thing for us as well, and getting the fix into a stable release would help a great deal (I have been manually replacing the kubelet URL in the launch configurations that care deeply about this).
I consciously left this issue open. In issue #57382, @chrislovecnm said he thought this is a bug on the installer side (kubelet should not be started before the tags are set), and as far as I understand, that side has not been worked on.
I have forwarded the issue here: https://github.com/kubernetes/kubernetes/issues/57382
I’d leave this open in kops as well, in case the issue does end up being here.