kops: M5 nodes sometimes fail to connect to the network

  1. What kops version are you running? The command kops version will display this information.

1.8.0

  2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.8.4

  3. What cloud provider are you using?

AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?

Create a batch of M5 nodes using the Stretch image; many of them never connect to the network.

  5. What happened after the commands executed?

Some nodes start OK, others fail. Manually logging in to a failed node and restarting kubelet fixes the issue.

  6. What did you expect to happen?

All nodes connect to the network.

This seems to be caused by a timing issue somewhere. Comparing the logs from a failed boot and a successful one, this stands out:

Failed: Cloud-init v. 0.7.9 running 'init-local' at Thu, 07 Dec 2017 09:48:31 +0000. Up 9.35 seconds.

Successful: Cloud-init v. 0.7.9 running 'init-local' at Thu, 07 Dec 2017 07:50:00 +0000. Up 19.02 seconds.

Why did it take 10 seconds longer for cloud-init to start in the successful case?
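To put a number on that gap, the "Up N seconds" figure can be pulled out of each journal. Here is a minimal Python sketch, assuming the attached journals contain cloud-init banner lines in the same format as the two quoted above. The helper name `init_local_uptime` is made up for illustration.

```python
import re

# Matches the cloud-init banner quoted above and captures the uptime in seconds.
# The quote characters around the stage name are matched loosely because they
# differ depending on how the log was rendered.
BANNER = re.compile(
    r"Cloud-init v\. \S+ running .init-local. at .*?\. Up (\d+(?:\.\d+)?) seconds\."
)

def init_local_uptime(journal_text):
    """Return the uptime (seconds) at which cloud-init ran 'init-local', or None."""
    match = BANNER.search(journal_text)
    return float(match.group(1)) if match else None

# The two lines quoted in this report, standing in for the attached journals.
bad = "Cloud-init v. 0.7.9 running 'init-local' at Thu, 07 Dec 2017 09:48:31 +0000. Up 9.35 seconds."
ok = "Cloud-init v. 0.7.9 running 'init-local' at Thu, 07 Dec 2017 07:50:00 +0000. Up 19.02 seconds."

delta = init_local_uptime(ok) - init_local_uptime(bad)
print(f"successful boot reached init-local {delta:.2f}s later")
```

Running the same function over journal.bad.txt and journal.ok.txt from several nodes would show whether the early start consistently lines up with the failures.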

Logs attached.

journal.bad.txt journal.ok.txt

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

Thanks @jlaswell. This is a fairly big thing for us as well, and getting the fix into a stable release would help a great deal. (I have been manually replacing the kubelet URL in the launch configurations that care deeply about this.)

I consciously left this issue open: in issue #57382, @chrislovecnm said he thought this is a bug on the installer side (kubelet should not be started before the tags are set), and as far as I understand that side has not been worked on.
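The ordering described there (hold kubelet back until the instance tags are readable) could be sketched roughly as below. This is only an illustration, not how kops or the installer actually does it: `wait_for`, `tags_ready` and the timeout values are hypothetical, and a real check would have to query the EC2 API for the instance's tags.

```python
import time

def wait_for(check, timeout=120.0, interval=2.0, clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds have passed.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Hypothetical stand-in for "the instance tags are visible":
# it succeeds on the third poll.
attempts = {"n": 0}

def tags_ready():
    attempts["n"] += 1
    return attempts["n"] >= 3

if wait_for(tags_ready, timeout=5.0, interval=0.01):
    print("tags visible, safe to start kubelet")
else:
    print("gave up waiting for tags")
```

Returning False on timeout rather than raising leaves it to the caller to decide whether to start kubelet anyway or fail the boot.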

I have forwarded the issue here: https://github.com/kubernetes/kubernetes/issues/57382

I’d leave this open in kops as well, in case the issue does end up being here.