kops: dns-controller fails to run when upgrading 1.15 -> 1.16
1. What kops version are you running? The command kops version will display this information.
Version 1.16.0 (git-4b0e62b82)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.15.6, attempting to upgrade to 1.16.8
3. What cloud provider are you using? AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops replace -f - (with the cluster manifest rendered via kops toolbox template)
kops update cluster --yes
kops rolling-update cluster --yes
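Spelled out, the render-and-apply flow with kops toolbox template is typically piped like this (the template and values file names here are placeholders):
$ kops toolbox template --template cluster.tmpl.yaml --values values.yaml | kops replace -f -
$ kops update cluster --yes
$ kops rolling-update cluster --yes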
5. What happened after the commands executed? The bastion was restarted, but then the rolling-update was halted since the dns-controller pod wouldn’t come up.
$ kops rolling-update cluster --yes
NAME STATUS NEEDUPDATE READY MIN MAX NODES
bastions NeedsUpdate 1 0 1 1 0
burst NeedsUpdate 1 0 1 1 1
compute NeedsUpdate 3 0 3 20 3
master-us-west-2a NeedsUpdate 1 0 1 1 1
master-us-west-2b NeedsUpdate 1 0 1 1 1
master-us-west-2c NeedsUpdate 1 0 1 1 1
I0318 12:30:44.079045 32595 instancegroups.go:304] Stopping instance "i-01b7c1444fad56416", in group "bastions.<redacted>.k8s.local" (this may take a while).
I0318 12:30:44.365072 32595 instancegroups.go:189] waiting for 15s after terminating instance
I0318 12:30:59.365501 32595 instancegroups.go:193] Deleted a bastion instance, i-01b7c1444fad56416, and continuing with rolling-update.
W0318 12:31:00.324502 32595 aws_cloud.go:671] ignoring instance as it is terminating: i-01b7c1444fad56416 in autoscaling group: bastions.<redacted>.k8s.local
master not healthy after update, stopping rolling-update: "cluster \"<redacted>.k8s.local\" did not pass validation: InstanceGroup \"bastions\" did not have enough nodes 0 vs 1, kube-system pod \"coredns-7f59d7f88f-ncbvr\" is not ready (coredns), kube-system pod \"dns-controller-8d8645cb4-t6xm2\" is not ready (dns-controller), kube-system pod \"weave-net-7x78x\" is not ready (weave,weave-npc)"
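At this point the individual failing pods can be inspected directly with standard kubectl, e.g. (pod name taken from the validation message above):
$ kubectl -n kube-system get pods -o wide | grep -E 'dns-controller|coredns|weave'
$ kubectl -n kube-system describe pod dns-controller-8d8645cb4-t6xm2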
Further attempts to continue the rolling-update failed:
$ kops rolling-update cluster --yes
NAME STATUS NEEDUPDATE READY MIN MAX NODES
bastions Ready 0 1 1 1 0
burst NeedsUpdate 1 0 1 1 1
compute NeedsUpdate 3 0 3 20 3
master-us-west-2a NeedsUpdate 1 0 1 1 1
master-us-west-2b NeedsUpdate 1 0 1 1 1
master-us-west-2c NeedsUpdate 1 0 1 1 1
master not healthy after update, stopping rolling-update: "cluster \"<redacted>.k8s.local\" did not pass validation: kube-system pod \"dns-controller-8d8645cb4-t6xm2\" is not ready (dns-controller)"
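The same validation that rolling-update performs can also be run on its own, which is handy when retrying:
$ kops validate cluster --name <redacted>.k8s.local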
6. What did you expect to happen? The cluster to complete a rolling update successfully.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
Cluster YAML
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
creationTimestamp: null
name: <redacted>.k8s.local
spec:
additionalPolicies:
master: |
[
redacted
]
node: |
[
redacted
]
api:
loadBalancer:
type: Public
authorization:
rbac: {}
channel: stable
cloudLabels:
cluster: <redacted>.k8s.local
cloudProvider: aws
configBase: s3://ct-k8s-<redacted>/<redacted>.k8s.local
encryptionConfig: true
etcdClusters:
- etcdMembers:
- instanceGroup: master-us-west-2a
name: a
- instanceGroup: master-us-west-2b
name: b
- instanceGroup: master-us-west-2c
name: c
name: main
- etcdMembers:
- instanceGroup: master-us-west-2a
name: a
- instanceGroup: master-us-west-2b
name: b
- instanceGroup: master-us-west-2c
name: c
name: events
fileAssets:
- content: "# This is the policy used by Google Cloud Engine\n# https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L739\napiVersion:
audit.k8s.io/v1beta1\nkind: Policy\nrules:\n # The following requests were
manually identified as high-volume and low-risk,\n # so drop them.\n - level:
None\n users: [\"system:kube-proxy\"]\n verbs: [\"watch\"]\n resources:\n
\ - group: \"\" # core\n resources: [\"endpoints\", \"services\",
\"services/status\"]\n - level: None\n # Ingress controller reads 'configmaps/ingress-uid'
through the unsecured port.\n # TODO(#46983): Change this to the ingress
controller service account.\n users: [\"system:unsecured\"]\n namespaces:
[\"kube-system\"]\n verbs: [\"get\"]\n resources:\n - group: \"\"
# core\n resources: [\"configmaps\"]\n - level: None\n users: [\"kubelet\"]
# legacy kubelet identity\n verbs: [\"get\"]\n resources:\n - group:
\"\" # core\n resources: [\"nodes\", \"nodes/status\"]\n - level: None\n
\ userGroups: [\"system:nodes\"]\n verbs: [\"get\"]\n resources:\n -
group: \"\" # core\n resources: [\"nodes\", \"nodes/status\"]\n - level:
None\n users:\n - system:kube-controller-manager\n - system:kube-scheduler\n
\ - system:serviceaccount:kube-system:endpoint-controller\n verbs: [\"get\",
\"update\"]\n namespaces: [\"kube-system\"]\n resources:\n - group:
\"\" # core\n resources: [\"endpoints\"]\n - level: None\n users:
[\"system:apiserver\"]\n verbs: [\"get\"]\n resources:\n - group:
\"\" # core\n resources: [\"namespaces\", \"namespaces/status\", \"namespaces/finalize\"]\n
\ # Don't log HPA fetching metrics.\n - level: None\n users:\n - system:kube-controller-manager\n
\ verbs: [\"get\", \"list\"]\n resources:\n - group: \"metrics.k8s.io\"\n
\ # Don't log these read-only URLs.\n - level: None\n nonResourceURLs:\n
\ - /healthz*\n - /version\n - /swagger*\n # Don't log events
requests.\n - level: None\n resources:\n - group: \"\" # core\n resources:
[\"events\"]\n # node and pod status calls from nodes are high-volume and can
be large, don't log responses for expected updates from nodes\n - level: Request\n
\ users: [\"kubelet\", \"system:node-problem-detector\", \"system:serviceaccount:kube-system:node-problem-detector\"]\n
\ verbs: [\"update\",\"patch\"]\n resources:\n - group: \"\" # core\n
\ resources: [\"nodes/status\", \"pods/status\"]\n omitStages:\n -
\"RequestReceived\"\n - level: Request\n userGroups: [\"system:nodes\"]\n
\ verbs: [\"update\",\"patch\"]\n resources:\n - group: \"\" # core\n
\ resources: [\"nodes/status\", \"pods/status\"]\n omitStages:\n -
\"RequestReceived\"\n # deletecollection calls can be large, don't log responses
for expected namespace deletions\n - level: Request\n users: [\"system:serviceaccount:kube-system:namespace-controller\"]\n
\ verbs: [\"deletecollection\"]\n omitStages:\n - \"RequestReceived\"\n
\ # Secrets, ConfigMaps, and TokenReviews can contain sensitive & binary data,\n
\ # so only log at the Metadata level.\n - level: Metadata\n resources:\n
\ - group: \"\" # core\n resources: [\"secrets\", \"configmaps\"]\n
\ - group: authentication.k8s.io\n resources: [\"tokenreviews\"]\n
\ omitStages:\n - \"RequestReceived\"\n # Get repsonses can be large;
skip them.\n - level: Request\n verbs: [\"get\", \"list\", \"watch\"]\n
\ resources:\n - group: \"\" # core\n - group: \"admissionregistration.k8s.io\"\n
\ - group: \"apiextensions.k8s.io\"\n - group: \"apiregistration.k8s.io\"\n
\ - group: \"apps\"\n - group: \"authentication.k8s.io\"\n - group:
\"authorization.k8s.io\"\n - group: \"autoscaling\"\n - group: \"batch\"\n
\ - group: \"certificates.k8s.io\"\n - group: \"extensions\"\n - group:
\"metrics.k8s.io\"\n - group: \"networking.k8s.io\"\n - group: \"policy\"\n
\ - group: \"rbac.authorization.k8s.io\"\n - group: \"scheduling.k8s.io\"\n
\ - group: \"settings.k8s.io\"\n - group: \"storage.k8s.io\"\n omitStages:\n
\ - \"RequestReceived\"\n # Default level for known APIs\n - level: RequestResponse\n
\ resources:\n - group: \"\" # core\n - group: \"admissionregistration.k8s.io\"\n
\ - group: \"apiextensions.k8s.io\"\n - group: \"apiregistration.k8s.io\"\n
\ - group: \"apps\"\n - group: \"authentication.k8s.io\"\n - group:
\"authorization.k8s.io\"\n - group: \"autoscaling\"\n - group: \"batch\"\n
\ - group: \"certificates.k8s.io\"\n - group: \"extensions\"\n - group:
\"metrics.k8s.io\"\n - group: \"networking.k8s.io\"\n - group: \"policy\"\n
\ - group: \"rbac.authorization.k8s.io\"\n - group: \"scheduling.k8s.io\"\n
\ - group: \"settings.k8s.io\"\n - group: \"storage.k8s.io\" \n omitStages:\n
\ - \"RequestReceived\"\n # Default level for all other requests.\n -
level: Metadata\n omitStages:\n - \"RequestReceived\"\n"
name: audit-policy.yaml
path: /srv/kubernetes/audit-policy.yaml
roles:
- Master
hooks:
- before:
- kubelet.service
manifest: |
[Unit]
Description=Download AWS Authenticator configs from S3
[Service]
Type=oneshot
ExecStart=/bin/mkdir -p /srv/kubernetes/aws-iam-authenticator
ExecStart=/usr/local/bin/aws s3 cp --recursive s3://ct-k8s-<redacted>/<redacted>.k8s.local/addons/authenticator /srv/kubernetes/aws-iam-authenticator/
name: kops-hook-authenticator-config.service
roles:
- Master
iam:
allowContainerRegistry: true
legacy: false
kubeAPIServer:
admissionControl:
- NamespaceLifecycle
- LimitRanger
- ServiceAccount
- PersistentVolumeLabel
- PersistentVolumeClaimResize
- DefaultStorageClass
- DefaultTolerationSeconds
- MutatingAdmissionWebhook
- ValidatingAdmissionWebhook
- NodeRestriction
- ResourceQuota
- AlwaysPullImages
- PodSecurityPolicy
- DenyEscalatingExec
auditLogMaxAge: 30
auditLogMaxBackups: 10
auditLogMaxSize: 100
auditLogPath: /var/log/kube-apiserver-audit.log
auditPolicyFile: /srv/kubernetes/audit-policy.yaml
authenticationTokenWebhookConfigFile: /srv/kubernetes/aws-iam-authenticator/kubeconfig.yaml
featureGates:
ExpandPersistentVolumes: "true"
TTLAfterFinished: "true"
runtimeConfig:
api/all: "true"
kubeControllerManager:
featureGates:
ExpandPersistentVolumes: "true"
TTLAfterFinished: "true"
horizontalPodAutoscalerUseRestClients: true
kubeDNS:
provider: CoreDNS
kubelet:
featureGates:
ExpandPersistentVolumes: "true"
ExperimentalCriticalPodAnnotation: "true"
TTLAfterFinished: "true"
kubernetesApiAccess:
- 0.0.0.0/0
kubernetesVersion: 1.16.8
masterInternalName: api.internal.<redacted>.k8s.local
masterPublicName: api.<redacted>.k8s.local
networkCIDR: 172.20.0.0/16
networking:
weave:
mtu: 8912
nonMasqueradeCIDR: 100.64.0.0/10
sshAccess:
- 0.0.0.0/0
subnets:
- cidr: 172.20.32.0/19
name: us-west-2a
type: Private
zone: us-west-2a
- cidr: 172.20.64.0/19
name: us-west-2b
type: Private
zone: us-west-2b
- cidr: 172.20.96.0/19
name: us-west-2c
type: Private
zone: us-west-2c
- cidr: 172.20.0.0/22
name: utility-us-west-2a
type: Utility
zone: us-west-2a
- cidr: 172.20.4.0/22
name: utility-us-west-2b
type: Utility
zone: us-west-2b
- cidr: 172.20.8.0/22
name: utility-us-west-2c
type: Utility
zone: us-west-2c
topology:
dns:
type: Public
masters: private
nodes: private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:20Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: bastions
spec:
image: ami-07484b38968c888a3
machineType: t2.micro
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: bastions
kops.k8s.io/instancegroup: bastions
node.kubernetes.io/instancegroup: bastions
role: Bastion
subnets:
- us-west-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:22Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: burst
spec:
image: ami-07484b38968c888a3
machineType: t2.2xlarge
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: burst
kops.k8s.io/instancegroup: burst
node.kubernetes.io/instancegroup: burst
role: Node
rootVolumeSize: 500
rootVolumeType: gp2
subnets:
- us-west-2a
- us-west-2b
- us-west-2c
taints:
- InstanceGroup=burst:NoSchedule
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:22Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: compute
spec:
image: ami-07484b38968c888a3
machineType: c5.2xlarge
maxSize: 20
minSize: 3
nodeLabels:
InstanceGroup: compute
cluster-autoscaler/<redacted>: "true"
kops.k8s.io/instancegroup: compute
node.kubernetes.io/instancegroup: compute
role: Node
rootVolumeSize: 500
rootVolumeType: gp2
subnets:
- us-west-2a
- us-west-2b
- us-west-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:20Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: master-us-west-2a
spec:
image: ami-07484b38968c888a3
machineType: t2.large
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: master-us-west-2a
kops.k8s.io/instancegroup: master-us-west-2a
node.kubernetes.io/instancegroup: master-us-west-2a
role: Master
subnets:
- us-west-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:21Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: master-us-west-2b
spec:
image: ami-07484b38968c888a3
machineType: t2.large
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: master-us-west-2b
kops.k8s.io/instancegroup: master-us-west-2b
node.kubernetes.io/instancegroup: master-us-west-2b
role: Master
subnets:
- us-west-2b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:21Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: master-us-west-2c
spec:
image: ami-07484b38968c888a3
machineType: t2.large
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: master-us-west-2c
kops.k8s.io/instancegroup: master-us-west-2c
node.kubernetes.io/instancegroup: master-us-west-2c
role: Master
subnets:
- us-west-2c
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
I’m not sure there’s anything useful here; the issue is with the dns-controller. But I can provide details if necessary.
9. Anything else we need to know?
I noticed that the dns-controller was updated as part of this change. From kops update cluster I saw this:
ManagedFile/kooper.k8s.local-addons-dns-controller.addons.k8s.io-k8s-1.12
Contents
...
k8s-addon: dns-controller.addons.k8s.io
k8s-app: dns-controller
+ version: v1.16.0
- version: v1.15.0
name: dns-controller
namespace: kube-system
...
k8s-addon: dns-controller.addons.k8s.io
k8s-app: dns-controller
+ version: v1.16.0
- version: v1.15.0
spec:
containers:
...
- --dns=gossip
- --gossip-seed=127.0.0.1:3999
+ - --gossip-protocol-secondary=memberlist
+ - --gossip-listen-secondary=0.0.0.0:3993
+ - --gossip-seed-secondary=127.0.0.1:4000
- --zone=*/*
- -v=2
+ image: kope/dns-controller:1.16.0
- image: kope/dns-controller:1.15.0
name: dns-controller
resources:
...
nodeSelector:
node-role.kubernetes.io/master: ""
+ priorityClassName: system-cluster-critical
serviceAccount: dns-controller
tolerations:
...
I think what is relevant here specifically is the --gossip-seed-secondary=127.0.0.1:4000 addition. Here are the logs of the dns-controller pod that fails to come up:
dns-controller version 1.16.0
I0318 19:01:19.748976 1 gossip.go:60] gossip dns connection limit is:0
I0318 19:01:19.749078 1 cluster.go:145] resolved peers to following addresses peers=127.0.0.1:4000
I0318 19:01:19.754448 1 cluster.go:157] setting advertise address explicitly addr=172.20.58.18 port=3993
I0318 19:01:19.754879 1 delegate.go:227] received NotifyJoin node=01E3QGAZRA7DQEPV2HCSPC8887 addr=172.20.58.18:3993
I0318 19:01:19.754943 1 main.go:209] initializing the watch controllers, namespace: ""
I0318 19:01:19.754960 1 main.go:233] Ingress controller disabled
I0318 19:01:19.754975 1 dnscontroller.go:105] starting DNS controller
I0318 19:01:19.754992 1 dnscontroller.go:158] scope not yet ready: node
I0318 19:01:19.755200 1 node.go:57] starting node controller
I0318 19:01:19.755476 1 pod.go:60] starting pod controller
I0318 19:01:19.755631 1 service.go:59] starting service controller
I0318 19:01:19.755787 1 gossip.go:120] Querying for seeds
I0318 19:01:19.755801 1 gossip.go:129] Got seeds: [127.0.0.1:3999]
I0318 19:01:19.755815 1 gossip.go:144] Seeding successful
I0318 19:01:19.755846 1 glogger.go:31] ->[127.0.0.1:3999] attempting connection
I0318 19:01:19.756173 1 cluster.go:337] memberlist 2020/03/18 19:01:19 [DEBUG] memberlist: Failed to join 127.0.0.1: dial tcp 127.0.0.1:4000: connect: connection refused
W0318 19:01:19.756185 1 cluster.go:223] failed to join cluster: 1 error occurred:
* Failed to join 127.0.0.1: dial tcp 127.0.0.1:4000: connect: connection refused
I0318 19:01:19.756193 1 cluster.go:225] will retry joining cluster every 10s
F0318 19:01:19.756202 1 main.go:172] gossip exited unexpectedly: 1 error occurred:
* Failed to join 127.0.0.1: dial tcp 127.0.0.1:4000: connect: connection refused
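For reference, logs like the ones above can be fetched via the deployment's label selector (k8s-app=dns-controller, per the manifest diff above), e.g.:
$ kubectl -n kube-system logs -l k8s-app=dns-controller --tail=100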
That 4000 port seems to be causing issues. I hopped onto the node to see if the port was in use, but it’s not:
$ sudo netstat -tulpn | grep LISTEN
tcp 0 0 127.0.0.1:10248 0.0.0.0:* LISTEN 2281/kubelet
tcp 0 0 127.0.0.1:10249 0.0.0.0:* LISTEN 3226/kube-proxy
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 245/rpcbind
tcp 0 0 127.0.0.1:8080 0.0.0.0:* LISTEN 3628/kube-apiserver
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 574/sshd
tcp 0 0 172.20.58.18:3996 0.0.0.0:* LISTEN 2915/etcd-manager
tcp 0 0 172.20.58.18:3997 0.0.0.0:* LISTEN 2974/etcd-manager
tcp 0 0 0.0.0.0:3998 0.0.0.0:* LISTEN 4847/dns-controller
tcp 0 0 0.0.0.0:6783 0.0.0.0:* LISTEN 8512/weaver
tcp 0 0 0.0.0.0:3999 0.0.0.0:* LISTEN 2245/protokube
tcp 0 0 127.0.0.1:6784 0.0.0.0:* LISTEN 8512/weaver
tcp 0 0 127.0.0.1:32769 0.0.0.0:* LISTEN 2281/kubelet
tcp6 0 0 :::10250 :::* LISTEN 2281/kubelet
tcp6 0 0 :::10251 :::* LISTEN 3146/kube-scheduler
tcp6 0 0 :::2380 :::* LISTEN 3575/etcd
tcp6 0 0 :::10252 :::* LISTEN 2747/kube-controlle
tcp6 0 0 :::2381 :::* LISTEN 3564/etcd
tcp6 0 0 :::10255 :::* LISTEN 2281/kubelet
tcp6 0 0 :::111 :::* LISTEN 245/rpcbind
tcp6 0 0 :::10256 :::* LISTEN 3226/kube-proxy
tcp6 0 0 :::10257 :::* LISTEN 2747/kube-controlle
tcp6 0 0 :::10259 :::* LISTEN 3146/kube-scheduler
tcp6 0 0 :::22 :::* LISTEN 574/sshd
tcp6 0 0 :::443 :::* LISTEN 3628/kube-apiserver
tcp6 0 0 :::6781 :::* LISTEN 8451/weave-npc
tcp6 0 0 :::6782 :::* LISTEN 8512/weaver
tcp6 0 0 :::4001 :::* LISTEN 3575/etcd
tcp6 0 0 :::4002 :::* LISTEN 3564/etcd
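Putting the netstat output next to the pod log: the new --gossip-seed-secondary=127.0.0.1:4000 flag makes dns-controller dial a memberlist peer on port 4000 (presumably a secondary protokube listener), but protokube on this master is only bound to the primary gossip port 3999, so the dial is refused. A quick check from the node (assuming nc is available) confirms nothing answers there:
$ nc -zv 127.0.0.1 4000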
Some quick Google searching doesn’t turn up any results for this issue, so I’m hoping I can get some help here. Thanks for any and all assistance.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 6
- Comments: 25 (4 by maintainers)
So I did a workaround for this until the issue is sorted out. You can run:
$ kubectl rollout history deployment.v1.apps/dns-controller -n kube-system
and check how many revisions there are. If there are, say, 5, roll back to 4 (which will be 1.15):
$ kubectl rollout undo deployment/dns-controller --to-revision=4 -n kube-system
This should bring the dns-controller back up, and the master node should be healthy.
I haven’t seen any direct consequences of running this; if someone else knows of any, I’d appreciate guidance on what to do going forward.
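If it isn’t obvious which revision still runs the 1.15 image, each revision’s pod template (including the image tag) can be inspected before undoing, e.g.:
$ kubectl rollout history deployment/dns-controller -n kube-system --revision=4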
Yes, I can verify @juris’s conclusion.
Running
$ kops upgrade cluster --yes && kops update cluster --yes
makes an early cluster modification (a new version of dns-controller), but this shouldn’t be a problem as long as the “old” version of dns-controller is still running (rolling update). Just do a --cloudonly update of a single master instance group:
$ kops rolling-update cluster --instance-group masterXXXX --cloudonly --yes
and wait until the new master (1.16.7) joins the cluster. Restart the new dns-controller (probably in CrashLoopBackOff) and things should work now.
@akhmadfld Once you do a rolling update with kops 1.16.3, it will replace the default dns-controller image with 1.16.3 as well, and you should be able to proceed without issues. I have just done the same myself to fix the certificate issues in etcd.
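Putting the two comments above together, a rough recovery sequence looks like the following (the instance-group name is just an example from this cluster):
$ kops rolling-update cluster --instance-group master-us-west-2a --cloudonly --yes
# wait for the new master to join, then bounce the crashing pod so the Deployment recreates it:
$ kubectl -n kube-system delete pod -l k8s-app=dns-controller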