kops: Kubelet fails to register node arbitrarily when using Spot Instances
kops: 1.8.0-alpha.1, kubernetes: 1.7.8, provider: AWS
To reproduce it, create a cluster and add an instance group with a maxPrice value.
Some instances do not register with the apiserver while others work fine, so it looks like some kind of race condition. If you kill the kubelet process, everything works fine after it respawns.
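For reference, that manual recovery is just a restart of the systemd-managed kubelet on the affected node (assuming the default kops/kope.io image setup):

# On an affected node, restarting kubelet makes it register with the apiserver.
sudo systemctl restart kubelet.service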
The relevant part of the kubelet logs I could find is these lines, repeated in a loop:
Oct 11 08:39:47 ip-172-20-104-75 kubelet[1281]: I1011 08:39:47.987315 1281 kubelet.go:1894] SyncLoop (DELETE, "api"): "kube-proxy-ip-172-20-104-75.eu-west-1.compute.internal_kube-system(b1664e35-ae5f-11e7-8c91-062a9a3fee2c)"
Oct 11 08:39:47 ip-172-20-104-75 kubelet[1281]: W1011 08:39:47.990681 1281 kubelet.go:1596] Deleting mirror pod "kube-proxy-ip-172-20-104-75.eu-west-1.compute.internal_kube-system(b1664e35-ae5f-11e7-8c91-062a9a3fee2c)" because it is outdated
Oct 11 08:39:47 ip-172-20-104-75 kubelet[1281]: I1011 08:39:47.990693 1281 mirror_client.go:85] Deleting a mirror pod "kube-proxy-ip-172-20-104-75.eu-west-1.compute.internal_kube-system"
Oct 11 08:39:47 ip-172-20-104-75 kubelet[1281]: I1011 08:39:47.994312 1281 kubelet.go:1888] SyncLoop (REMOVE, "api"): "kube-proxy-ip-172-20-104-75.eu-west-1.compute.internal_kube-system(b1664e35-ae5f-11e7-8c91-062a9a3fee2c)"
Oct 11 08:39:48 ip-172-20-104-75 kubelet[1281]: I1011 08:39:48.008283 1281 kubelet.go:1878] SyncLoop (ADD, "api"): "kube-proxy-ip-172-20-104-75.eu-west-1.compute.internal_kube-system(bd58ce30-ae5f-11e7-8c91-062a9a3fee2c)"
[...]
Cluster yaml
kind: Cluster
metadata:
  creationTimestamp: 2017-10-11T08:54:21Z
  name: test
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kubernetes-artifacts/test
  dnsZone: test
  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    name: main
    version: 3.1.10
  - enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    name: events
    version: 3.1.10
  iam:
    legacy: false
  kubeAPIServer:
    auditLogMaxAge: 10
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
  kubelet:
    featureGates:
      ExperimentalCriticalPodAnnotation: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.7.8
  masterInternalName: api.internal.test
  masterPublicName: api.test
  networkCIDR: 172.20.0.0/16
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 172.20.64.0/19
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 172.20.96.0/19
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 172.20.0.0/22
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 172.20.4.0/22
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 172.20.8.0/22
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  topology:
    bastion:
      bastionPublicName: bastion.test
    dns:
      type: Public
    masters: private
    nodes: private
Instance group yaml (spot requests)
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-10-11T09:21:48Z
  labels:
    kops.k8s.io/cluster: test
  name: ephimeral
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28
  machineType: m4.2xlarge
  maxSize: 7
  minSize: 3
  maxPrice: "0.15"
  role: Node
  subnets:
  - eu-west-1b
  - eu-west-1c
  - eu-west-1a
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 5
- Comments: 24 (14 by maintainers)
Commits related to this issue
- Workaround for spot price We ran into the following bug https://github.com/kubernetes/kops/issues/3605#issuecomment-360068615 and implemented the workaround for it — committed to skyscrapers/terraform-kubernetes by MattiasGees 6 years ago
- Add possibility for spot pricing of the workers (#57) This adds spot pricing functionality for the workers. This is based on upstream docs https://github.com/kubernetes/kops/blob/master/docs/instance... — committed to skyscrapers/terraform-kubernetes by MattiasGees 6 years ago
The script @ese provided didn’t work for me, but then I found 2 small errors in the hook. After fixing those it seems to work nicely. The hook should have been:
The ExecStartPost had a trailing double quote that wasn't supposed to be there, and the hook name shouldn't be part of the manifest, but part of the actual hook. Otherwise the service doesn't register properly and doesn't work.

I have this workaround in the cluster spec that seems to work.
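The hook itself isn't reproduced above, but a rough sketch of that kind of cluster-spec hook, with the two corrections applied, might look like the following. The unit name, image, and commands here are placeholders, not the exact hook posted in this thread:

spec:
  hooks:
  # The hook name lives here, not inside the manifest (one of the fixes mentioned above).
  - name: spot-kubelet-register-fix.service
    manifest: |
      Type=oneshot
      # Container that polls the EC2 API until this instance's tags are visible.
      ExecStart=/usr/bin/docker run --rm --net=host example/await-ec2-tags
      # No trailing quote here (the other fix mentioned above).
      ExecStartPost=/bin/systemctl restart kubelet.service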
@alok87 It is basically an image with curl and awscli running this bash script.

@chrislovecnm It runs a simple sh script which waits for the ec2 tags and then restarts kubelet:
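That script isn't included above either; a minimal sketch of the approach described (poll the EC2 API until the instance's tags show up, then restart kubelet) could look like this, where the KubernetesCluster tag key is an assumption:

#!/bin/sh
# Hypothetical sketch of the "wait for EC2 tags, then restart kubelet" approach;
# the tag key polled for (KubernetesCluster) is an assumption.
set -e

# Instance ID and region from the EC2 instance metadata service.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/.$//')

# Poll until the tag kops relies on is attached to this instance.
until aws ec2 describe-tags --region "$REGION" \
    --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=KubernetesCluster" \
    --query 'Tags[0].Value' --output text | grep -vq '^None$'; do
  echo "EC2 tags not visible yet, retrying..."
  sleep 5
done

# Tags are present; restart kubelet so it registers with the apiserver.
systemctl restart kubelet.service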
@ese @chrislovecnm could you provide the Dockerfile for this? https://github.com/kubernetes/kops/issues/3605#issuecomment-351674234
Unfortunately the workaround that @ese posted does not work for me. My spot instances still fail to register about half the time during a rolling update. I’m not sure if the root cause of my issue is with tagging (could be, just don’t know for sure), but restarting kubelet manually during a rolling update fixes it for me. Not ideal…
Edit: There's what appears to be a typo in the workaround above (there's a stray double quote after the service name that I think shouldn't be there). Removing it seems to have improved (but not totally resolved) the issue. On my five node test cluster only one node failed to come back up during a rolling update. Maybe I just got lucky that time.

Edit 2: I've done a few more rolling updates since that comment and it seems like the workaround IS working for me after removing the typo. I have only seen one failure in rolling updates since applying the corrected workaround, and in that case it was something very different (kubelet wasn't even installed on the node after 10 minutes; no idea what went wrong there and I haven't seen it before).