kops: nodes fail to join, nodeup/protokube failing during rolling update 1.19 => 1.20
1. What kops version are you running? The command kops version will display
this information.
Version 1.20.0 (git-8ea83c6d233a15dacfcc769d4d82bea3f530cf72)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.19.9 and 1.20.5
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --yes
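For context, a version bump normally involves a couple of steps before the rolling update; only the final command is quoted above, so the earlier steps here are assumptions rather than what the reporter necessarily ran:

kops edit cluster                  # assumed step: set spec.kubernetesVersion to 1.20.5
kops update cluster --yes          # assumed step: push the new configuration to AWS
kops rolling-update cluster --yes  # the command from the report: replace instances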
5. What happened after the commands executed?
New nodes are not joining the cluster
6. What did you expect to happen?
New nodes to join the cluster
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2018-08-22T23:09:40Z"
  generation: 15
  name: cluster-dev-2.k8s.local
spec:
  additionalPolicies:
    master: '[{"Effect":"Allow", "Action":["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents", "logs:DescribeLogGroups", "logs:DescribeLogStreams"], "Resource":["*"]}]'
    node: '[{"Effect":"Allow", "Action":["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents", "logs:DescribeLogGroups", "logs:DescribeLogStreams"], "Resource":["*"]}, {"Effect": "Allow", "Action": ["autoscaling:DescribeAutoScalingGroups", "autoscaling:DescribeAutoScalingInstances", "autoscaling:SetDesiredCapacity", "autoscaling:DescribeLaunchConfigurations", "autoscaling:DescribeTags", "autoscaling:TerminateInstanceInAutoScalingGroup"], "Resource": ["*"]}]'
  api:
    loadBalancer:
      class: Classic
      type: Public
  authorization:
    rbac: {}
  certManager:
    enabled: true
  channel: stable
  cloudProvider: aws
  clusterAutoscaler:
    enabled: true
  configBase: s3://cluster-store/cluster-dev-2.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    featureGates:
      TTLAfterFinished: "true"
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    featureGates:
      TTLAfterFinished: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.20.5
  masterInternalName: api.internal.cluster-dev-2.k8s.local
  masterPublicName: api.cluster-dev-2.k8s.local
  metricsServer:
    enabled: true
  networkCIDR: 170.42.0.0/16
  networking:
    kopeio: {}
  nodeTerminationHandler:
    enabled: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 170.42.32.0/19
    name: us-east-2a
    type: Private
    zone: us-east-2a
  - cidr: 170.42.64.0/19
    name: us-east-2b
    type: Private
    zone: us-east-2b
  - cidr: 170.42.96.0/19
    name: us-east-2c
    type: Private
    zone: us-east-2c
  - cidr: 170.42.0.0/22
    name: utility-us-east-2a
    type: Utility
    zone: us-east-2a
  - cidr: 170.42.4.0/22
    name: utility-us-east-2b
    type: Utility
    zone: us-east-2b
  - cidr: 170.42.8.0/22
    name: utility-us-east-2c
    type: Utility
    zone: us-east-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-10-22T16:31:17Z"
  generation: 6
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: bastions
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-us-east-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-08-22T23:09:41Z"
  generation: 9
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: master-us-east-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-2a
  role: Master
  subnets:
  - us-east-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-08-22T23:09:41Z"
  generation: 9
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: master-us-east-2b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-2b
  role: Master
  subnets:
  - us-east-2b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-08-22T23:09:41Z"
  generation: 9
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: master-us-east-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-2c
  role: Master
  subnets:
  - us-east-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-08-22T23:09:41Z"
  generation: 13
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 10
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-east-2a
  - us-east-2b
  - us-east-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-02-19T01:06:07Z"
  generation: 7
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: persistent
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 5
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: persistent
  role: Node
  subnets:
  - us-east-2a
  taints:
  - persistent:NoSchedule
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
from protokube:
I0423 16:25:21.985382 60125 aws_volume.go:72] AWS API Request: ec2metadata/GetToken
I0423 16:25:21.989141 60125 aws_volume.go:72] AWS API Request: ec2metadata/GetDynamicData
I0423 16:25:21.993221 60125 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
I0423 16:25:21.995463 60125 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
I0423 16:25:22.009031 60125 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
I0423 16:25:22.095308 60125 main.go:230] cluster-id: cluster-dev-2.k8s.local
W0423 16:25:22.095344 60125 main.go:308] Unable to fetch HOSTNAME for use as node identifier
I0423 16:25:22.095353 60125 gossip.go:60] gossip dns connection limit is:0
I0423 16:25:22.097368 60125 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
W0423 16:25:22.160869 60125 cluster.go:150] couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided
E0423 16:25:22.163122 60125 main.go:324] Error initializing secondary gossip: create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided
protokube.service: Main process exited, code=exited, status=1/FAILURE
protokube.service: Failed with result 'exit-code'.
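The protokube failure above comes from the secondary gossip layer (memberlist), whose advertise-address auto-detection only accepts private addresses; with networkCIDR 170.42.0.0/16 the node's primary IP sits in public address space, so nothing is found. A quick way to confirm what address a node actually has (assuming standard EC2 instance metadata access) is:

# Prints the instance's primary IPv4; for this cluster it falls inside
# 170.42.0.0/16, which is outside the RFC1918 ranges
# (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
curl -s http://169.254.169.254/latest/meta-data/local-ipv4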
from nodeup:
nodeup[33788]: W0423 16:56:30.298471 33788 executor.go:139] error running task "BootstrapClient/BootstrapClient" (4m21s remaining to succeed): lookup kops-controller.internal.cluster-dev-2.k8s.local on 127.0.0.53:53: server misbehaving
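The nodeup error is likely downstream of the protokube failure: in a gossip (k8s.local) cluster the internal names are published via gossip into the hosts file rather than real DNS, so while protokube is crash-looping the kops-controller name never resolves and the bootstrap client cannot register the node. A quick check on an affected node (assuming the usual gossip hosts mechanism) is:

# On a node that fails to join, the gossip-managed hosts entries are expected to be missing:
grep -E 'kops-controller|api\.internal' /etc/hosts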
9. Anything else do we need to know?
When I went to add cert-manager for this update, I hit an issue where my instance group manifests contained kubernetes.io/cluster/cluster-dev-2.k8s.local=owned in the cloudLabels section (I believe left over from when I first set up a cluster autoscaler; I'm now using the one kops includes). kops now reported this label as "reserved". I worked around it by manually editing the manifests to remove these labels, which allowed me to save the cluster manifest edits and deploy.
About this issue
- State: closed
- Created 3 years ago
- Comments: 23 (2 by maintainers)
We are also experiencing this in an upgrade from 1.19 -> 1.20. It looks like our VPC CIDR, like those of the other commenters here, is set to a non-private range. We never ran into issues on lower versions of k8s/kops; this only started with the upgrade to 1.20. As a mitigation for now, we set the following in our kops cluster yaml definition:
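The snippet the commenter refers to is not preserved in this copy of the thread. As an illustration only (not the commenter's actual configuration), a spec-level gossipConfig intended to leave just the primary mesh protocol running would look roughly like:

spec:
  gossipConfig:
    protocol: mesh
    secondary:
      protocol: ""   # hypothetical: empty secondary protocol, so only the primary gossip layer runs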
It looks like only the secondary gossip protocol has issues with a non-private CIDR.
This isn’t an ideal solution. I don’t have enough context from the commit/pr messages to discern the effects of only having one gossip protocol enabled. It would be good to have someone comment on how we should proceed going forward. If clusters must be in a specific CIDR range, then we should have kops validate that and error out with a helpful message.
@dbachrach, do you know if this line still shows up when upgrading past 1.19:
I get this with upgrades up to and including 1.19, and it seems like it’s related to your config block. Maybe that is the old default …
It should probably check for a valid CIDR range when you make a cluster with the k8s.local "TLD". I found it happens if you have a private cluster in a non-private CIDR range. If you can change your network, fix the CIDR to be private. If you can't (cough AWS VPC cough) then you just have to remake the cluster.
Paul Davis (github.com/dangersalad)
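To make the suggestion above concrete, the question is simply whether the VPC's primary CIDR falls inside an RFC1918 range; a hypothetical private equivalent of this cluster's range would look like:

spec:
  networkCIDR: 170.42.0.0/16   # as reported: public address space, so memberlist finds no private IP
  # Illustrative RFC1918 alternative (changing it means a new VPC, hence "remake the cluster"):
  # networkCIDR: 10.42.0.0/16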
I’ve run some additional tests and stood up some clusters on newer versions. Clusters created on 1.20 and 1.21 did not become healthy using gossip. A cluster created on 1.19.3 did become healthy, and it uses the same Ubuntu images, which suggests the change in images isn’t the culprit.
I then upgraded the 1.19.3 cluster to 1.19.13, which also succeeded with all nodes becoming healthy.
A subsequent upgrade of that cluster from 1.19.13 to 1.20.2 fails with the nodes not joining the cluster successfully.
This suggests that something changed in how nodes are provisioned when moving to 1.20.