kops: nodes fail to join; nodeup/protokube failing during rolling update 1.19 => 1.20

1. What kops version are you running? The command kops version will display this information.

Version 1.20.0 (git-8ea83c6d233a15dacfcc769d4d82bea3f530cf72)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.19.9 and 1.20.5

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops rolling-update cluster --yes

5. What happened after the commands executed?

New nodes are not joining the cluster

6. What did you expect to happen?

New nodes to join the cluster

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2018-08-22T23:09:40Z"
  generation: 15
  name: cluster-dev-2.k8s.local
spec:
  additionalPolicies:
    master: '[{"Effect":"Allow", "Action":["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents", "logs:DescribeLogGroups", "logs:DescribeLogStreams"], "Resource":["*"]}]'
    node: '[{"Effect":"Allow", "Action":["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents", "logs:DescribeLogGroups", "logs:DescribeLogStreams"], "Resource":["*"]}, {"Effect": "Allow", "Action": ["autoscaling:DescribeAutoScalingGroups", "autoscaling:DescribeAutoScalingInstances", "autoscaling:SetDesiredCapacity", "autoscaling:DescribeLaunchConfigurations", "autoscaling:DescribeTags", "autoscaling:TerminateInstanceInAutoScalingGroup"], "Resource": ["*"]}]'
  api:
    loadBalancer:
      class: Classic
      type: Public
  authorization:
    rbac: {}
  certManager:
    enabled: true
  channel: stable
  cloudProvider: aws
  clusterAutoscaler:
    enabled: true
  configBase: s3://cluster-store/cluster-dev-2.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    featureGates:
      TTLAfterFinished: "true"
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    featureGates:
      TTLAfterFinished: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.20.5
  masterInternalName: api.internal.cluster-dev-2.k8s.local
  masterPublicName: api.cluster-dev-2.k8s.local
  metricsServer:
    enabled: true
  networkCIDR: 170.42.0.0/16
  networking:
    kopeio: {}
  nodeTerminationHandler:
    enabled: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 170.42.32.0/19
    name: us-east-2a
    type: Private
    zone: us-east-2a
  - cidr: 170.42.64.0/19
    name: us-east-2b
    type: Private
    zone: us-east-2b
  - cidr: 170.42.96.0/19
    name: us-east-2c
    type: Private
    zone: us-east-2c
  - cidr: 170.42.0.0/22
    name: utility-us-east-2a
    type: Utility
    zone: us-east-2a
  - cidr: 170.42.4.0/22
    name: utility-us-east-2b
    type: Utility
    zone: us-east-2b
  - cidr: 170.42.8.0/22
    name: utility-us-east-2c
    type: Utility
    zone: us-east-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-10-22T16:31:17Z"
  generation: 6
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: bastions
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-us-east-2a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-08-22T23:09:41Z"
  generation: 9
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: master-us-east-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-2a
  role: Master
  subnets:
  - us-east-2a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-08-22T23:09:41Z"
  generation: 9
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: master-us-east-2b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-2b
  role: Master
  subnets:
  - us-east-2b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-08-22T23:09:41Z"
  generation: 9
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: master-us-east-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-2c
  role: Master
  subnets:
  - us-east-2c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-08-22T23:09:41Z"
  generation: 13
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 10
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-east-2a
  - us-east-2b
  - us-east-2c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-02-19T01:06:07Z"
  generation: 7
  labels:
    kops.k8s.io/cluster: cluster-dev-2.k8s.local
  name: persistent
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210315
  machineType: t2.medium
  maxSize: 5
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: persistent
  role: Node
  subnets:
  - us-east-2a
  taints:
  - persistent:NoSchedule

8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or into a gist and provide the gist link here.

from protokube:

I0423 16:25:21.985382   60125 aws_volume.go:72] AWS API Request: ec2metadata/GetToken
I0423 16:25:21.989141   60125 aws_volume.go:72] AWS API Request: ec2metadata/GetDynamicData
I0423 16:25:21.993221   60125 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
I0423 16:25:21.995463   60125 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
I0423 16:25:22.009031   60125 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
I0423 16:25:22.095308   60125 main.go:230] cluster-id: cluster-dev-2.k8s.local
W0423 16:25:22.095344   60125 main.go:308] Unable to fetch HOSTNAME for use as node identifier
I0423 16:25:22.095353   60125 gossip.go:60] gossip dns connection limit is:0
I0423 16:25:22.097368   60125 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
W0423 16:25:22.160869   60125 cluster.go:150] couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided
E0423 16:25:22.163122   60125 main.go:324] Error initializing secondary gossip: create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided
protokube.service: Main process exited, code=exited, status=1/FAILURE
protokube.service: Failed with result 'exit-code'.

from nodeup:

nodeup[33788]: W0423 16:56:30.298471   33788 executor.go:139] error running task "BootstrapClient/BootstrapClient" (4m21s remaining to succeed): lookup kops-controller.internal.cluster-dev-2.k8s.local on 127.0.0.53:53: server misbehaving
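
For what it's worth, the nodeup failure appears to follow from the protokube one: on gossip clusters (the k8s.local suffix), names like kops-controller.internal.&lt;cluster&gt; are published by protokube's gossip layer, typically by rewriting /etc/hosts on each node, so while protokube is crash-looping the lookup can never resolve. A quick sanity check on a failing node, assuming SSH access through the bastion:

  grep internal /etc/hosts
  systemctl status protokube --no-pager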

9. Anything else we need to know?

When I went to enable cert-manager for this update, there was an issue where my instance group manifests contained kubernetes.io/cluster/cluster-dev-2.k8s.local=owned in the cloudLabels section (I believe left over from when I first set up a cluster autoscaler; I'm now using the built-in one). This label was now being reported as "reserved". I worked around it by manually editing the manifests to remove these labels, which allowed me to save the edits to the cluster manifest and deploy.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 23 (2 by maintainers)

Most upvoted comments

We are also experiencing this in an upgrade from 1.19 -> 1.20. Like the other commenters here, our VPC CIDR is set to a non-private range. We had never run into issues on lower versions of k8s/kops; it only started when we upgraded to 1.20. As a mitigation, we set the following in our kops cluster YAML definition:

  gossipConfig:
    secondary:
      protocol: ""
  dnsControllerGossipConfig:
    secondary:
      protocol: ""

It looks like only the secondary gossip protocol has issues with a non-private CIDR.

This isn't an ideal solution. I don't have enough context from the commit/PR messages to discern the effects of running with only one gossip protocol enabled. It would be good to have someone comment on how we should proceed. If clusters must be in a specific CIDR range, then kops should validate that and error out with a helpful message.
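
In case it helps anyone else, here is roughly how we rolled the workaround out. This is a sketch assuming the cluster is managed purely through kops state; my-cluster.k8s.local is a placeholder for your cluster name:

  # add the gossipConfig / dnsControllerGossipConfig blocks above to the cluster spec
  kops edit cluster --name my-cluster.k8s.local
  # push the change and cycle the instance groups
  kops update cluster --name my-cluster.k8s.local --yes
  kops rolling-update cluster --name my-cluster.k8s.local --yes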

@dbachrach, do you know if this line still shows up when upgrading past 1.19:

I0817 10:38:41.244708   42154 apply_cluster.go:483] Gossip DNS: skipping DNS validation

I get this with upgrades up to and including 1.19, and it seems like it’s related to your config block. Maybe that is the old default …

kops should probably check for a valid CIDR range when you create a cluster with the k8s.local "TLD".

I found it happens if you have a private cluster in a non-private CIDR range. If you can change your network, fix the CIDR to be private. If you can't (cough AWS VPC cough), you just have to remake the cluster.
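
For comparison, a sketch of what the relevant cluster spec fields look like in an RFC 1918 range, mirroring the layout above (illustrative values only; as noted, changing the CIDR of an existing AWS VPC generally means recreating the cluster):

  networkCIDR: 10.42.0.0/16
  subnets:
  - cidr: 10.42.32.0/19
    name: us-east-2a
    type: Private
    zone: us-east-2a
  - cidr: 10.42.0.0/22
    name: utility-us-east-2a
    type: Utility
    zone: us-east-2a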

Paul Davis (github.com/dangersalad)

On Aug 6, 2021, at 1:36 PM, Matthias Johnson wrote:

This just keeps looping on the nodes that fail to join:

Aug 6 19:34:56 ip-172-200-20-231 systemd[1]: protokube.service: Main process exited, code=exited, status=1/FAILURE
Aug 6 19:34:56 ip-172-200-20-231 systemd[1]: protokube.service: Failed with result 'exit-code'.
Aug 6 19:34:59 ip-172-200-20-231 systemd[1]: protokube.service: Scheduled restart job, restart counter is at 1451.
Aug 6 19:34:59 ip-172-200-20-231 systemd[1]: Stopped Kubernetes Protokube Service.
Aug 6 19:34:59 ip-172-200-20-231 systemd[1]: Started Kubernetes Protokube Service.
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: protokube version 0.1
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: I0806 19:34:59.737663 17703 aws_volume.go:72] AWS API Request: ec2metadata/GetToken
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: I0806 19:34:59.739653 17703 aws_volume.go:72] AWS API Request: ec2metadata/GetDynamicData
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: I0806 19:34:59.741628 17703 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: I0806 19:34:59.743062 17703 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: I0806 19:34:59.745601 17703 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: I0806 19:34:59.892147 17703 main.go:217] cluster-id: kops-dev-749-2.k8s.local
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: W0806 19:34:59.892184 17703 main.go:295] Unable to fetch HOSTNAME for use as node identifier
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: I0806 19:34:59.892192 17703 gossip.go:60] gossip dns connection limit is:0
Aug 6 19:34:59 ip-172-200-20-231 protokube[17703]: I0806 19:34:59.892419 17703 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
Aug 6 19:35:00 ip-172-200-20-231 protokube[17703]: W0806 19:35:00.004451 17703 cluster.go:150] couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided
Aug 6 19:35:00 ip-172-200-20-231 protokube[17703]: E0806 19:35:00.013573 17703 main.go:311] Error initializing secondary gossip: create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided
Aug 6 19:35:00 ip-172-200-20-231 systemd[1]: protokube.service: Main process exited, code=exited, status=1/FAILURE
Aug 6 19:35:00 ip-172-200-20-231 systemd[1]: protokube.service: Failed with result 'exit-code'.

I've run some additional tests and stood up clusters on newer versions. Clusters based on 1.20 and 1.21 did not become healthy using gossip. With 1.19.3 the cluster becomes healthy, and it is using the Ubuntu images, so the change in images doesn't appear to be the culprit.

I then upgraded the 1.19.3 cluster to 1.19.13, which also succeeded with all nodes becoming healthy.

A subsequent upgrade of that cluster from 1.19.13 to 1.20.2 fails with the nodes not joining the cluster successfully.

This suggests that something changed in how nodes are provisioned going to 1.20.