kops: dns-controller fails to run when upgrading 1.15 -> 1.16
1. What kops version are you running? The command kops version will display this information.
Version 1.16.0 (git-4b0e62b82)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.15.6, attempting to upgrade to 1.16.8
3. What cloud provider are you using? AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops replace -f - (with the cluster manifest rendered via kops toolbox template)
kops update cluster --yes
kops rolling-update cluster --yes
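Spelled out, the render-and-apply flow with kops toolbox template is typically piped like this (the template and values file names here are placeholders):
$ kops toolbox template --template cluster.tmpl.yaml --values values.yaml | kops replace -f -
$ kops update cluster --yes
$ kops rolling-update cluster --yes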
5. What happened after the commands executed? The bastion was restarted, but then the rolling-update was halted since the dns-controller pod wouldn’t come up.
$ kops rolling-update cluster --yes
NAME STATUS NEEDUPDATE READY MIN MAX NODES
bastions NeedsUpdate 1 0 1 1 0
burst NeedsUpdate 1 0 1 1 1
compute NeedsUpdate 3 0 3 20 3
master-us-west-2a NeedsUpdate 1 0 1 1 1
master-us-west-2b NeedsUpdate 1 0 1 1 1
master-us-west-2c NeedsUpdate 1 0 1 1 1
I0318 12:30:44.079045 32595 instancegroups.go:304] Stopping instance "i-01b7c1444fad56416", in group "bastions.<redacted>.k8s.local" (this may take a while).
I0318 12:30:44.365072 32595 instancegroups.go:189] waiting for 15s after terminating instance
I0318 12:30:59.365501 32595 instancegroups.go:193] Deleted a bastion instance, i-01b7c1444fad56416, and continuing with rolling-update.
W0318 12:31:00.324502 32595 aws_cloud.go:671] ignoring instance as it is terminating: i-01b7c1444fad56416 in autoscaling group: bastions.<redacted>.k8s.local
master not healthy after update, stopping rolling-update: "cluster \"<redacted>.k8s.local\" did not pass validation: InstanceGroup \"bastions\" did not have enough nodes 0 vs 1, kube-system pod \"coredns-7f59d7f88f-ncbvr\" is not ready (coredns), kube-system pod \"dns-controller-8d8645cb4-t6xm2\" is not ready (dns-controller), kube-system pod \"weave-net-7x78x\" is not ready (weave,weave-npc)"
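At this point the individual failing pods can be inspected directly with standard kubectl, e.g. (pod name taken from the validation message above):
$ kubectl -n kube-system get pods -o wide | grep -E 'dns-controller|coredns|weave'
$ kubectl -n kube-system describe pod dns-controller-8d8645cb4-t6xm2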
Further attempts to continue the rolling-update failed:
$ kops rolling-update cluster --yes
NAME STATUS NEEDUPDATE READY MIN MAX NODES
bastions Ready 0 1 1 1 0
burst NeedsUpdate 1 0 1 1 1
compute NeedsUpdate 3 0 3 20 3
master-us-west-2a NeedsUpdate 1 0 1 1 1
master-us-west-2b NeedsUpdate 1 0 1 1 1
master-us-west-2c NeedsUpdate 1 0 1 1 1
master not healthy after update, stopping rolling-update: "cluster \"<redacted>.k8s.local\" did not pass validation: kube-system pod \"dns-controller-8d8645cb4-t6xm2\" is not ready (dns-controller)"
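The same validation that rolling-update performs can also be run on its own, which is handy when retrying:
$ kops validate cluster --name <redacted>.k8s.local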
6. What did you expect to happen? The cluster to complete a rolling update successfully.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
Cluster YAML
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
creationTimestamp: null
name: <redacted>.k8s.local
spec:
additionalPolicies:
master: |
[
redacted
]
node: |
[
redacted
]
api:
loadBalancer:
type: Public
authorization:
rbac: {}
channel: stable
cloudLabels:
cluster: <redacted>.k8s.local
cloudProvider: aws
configBase: s3://ct-k8s-<redacted>/<redacted>.k8s.local
encryptionConfig: true
etcdClusters:
- etcdMembers:
- instanceGroup: master-us-west-2a
name: a
- instanceGroup: master-us-west-2b
name: b
- instanceGroup: master-us-west-2c
name: c
name: main
- etcdMembers:
- instanceGroup: master-us-west-2a
name: a
- instanceGroup: master-us-west-2b
name: b
- instanceGroup: master-us-west-2c
name: c
name: events
fileAssets:
- content: "# This is the policy used by Google Cloud Engine\n# https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L739\napiVersion:
audit.k8s.io/v1beta1\nkind: Policy\nrules:\n # The following requests were
manually identified as high-volume and low-risk,\n # so drop them.\n - level:
None\n users: [\"system:kube-proxy\"]\n verbs: [\"watch\"]\n resources:\n
\ - group: \"\" # core\n resources: [\"endpoints\", \"services\",
\"services/status\"]\n - level: None\n # Ingress controller reads 'configmaps/ingress-uid'
through the unsecured port.\n # TODO(#46983): Change this to the ingress
controller service account.\n users: [\"system:unsecured\"]\n namespaces:
[\"kube-system\"]\n verbs: [\"get\"]\n resources:\n - group: \"\"
# core\n resources: [\"configmaps\"]\n - level: None\n users: [\"kubelet\"]
# legacy kubelet identity\n verbs: [\"get\"]\n resources:\n - group:
\"\" # core\n resources: [\"nodes\", \"nodes/status\"]\n - level: None\n
\ userGroups: [\"system:nodes\"]\n verbs: [\"get\"]\n resources:\n -
group: \"\" # core\n resources: [\"nodes\", \"nodes/status\"]\n - level:
None\n users:\n - system:kube-controller-manager\n - system:kube-scheduler\n
\ - system:serviceaccount:kube-system:endpoint-controller\n verbs: [\"get\",
\"update\"]\n namespaces: [\"kube-system\"]\n resources:\n - group:
\"\" # core\n resources: [\"endpoints\"]\n - level: None\n users:
[\"system:apiserver\"]\n verbs: [\"get\"]\n resources:\n - group:
\"\" # core\n resources: [\"namespaces\", \"namespaces/status\", \"namespaces/finalize\"]\n
\ # Don't log HPA fetching metrics.\n - level: None\n users:\n - system:kube-controller-manager\n
\ verbs: [\"get\", \"list\"]\n resources:\n - group: \"metrics.k8s.io\"\n
\ # Don't log these read-only URLs.\n - level: None\n nonResourceURLs:\n
\ - /healthz*\n - /version\n - /swagger*\n # Don't log events
requests.\n - level: None\n resources:\n - group: \"\" # core\n resources:
[\"events\"]\n # node and pod status calls from nodes are high-volume and can
be large, don't log responses for expected updates from nodes\n - level: Request\n
\ users: [\"kubelet\", \"system:node-problem-detector\", \"system:serviceaccount:kube-system:node-problem-detector\"]\n
\ verbs: [\"update\",\"patch\"]\n resources:\n - group: \"\" # core\n
\ resources: [\"nodes/status\", \"pods/status\"]\n omitStages:\n -
\"RequestReceived\"\n - level: Request\n userGroups: [\"system:nodes\"]\n
\ verbs: [\"update\",\"patch\"]\n resources:\n - group: \"\" # core\n
\ resources: [\"nodes/status\", \"pods/status\"]\n omitStages:\n -
\"RequestReceived\"\n # deletecollection calls can be large, don't log responses
for expected namespace deletions\n - level: Request\n users: [\"system:serviceaccount:kube-system:namespace-controller\"]\n
\ verbs: [\"deletecollection\"]\n omitStages:\n - \"RequestReceived\"\n
\ # Secrets, ConfigMaps, and TokenReviews can contain sensitive & binary data,\n
\ # so only log at the Metadata level.\n - level: Metadata\n resources:\n
\ - group: \"\" # core\n resources: [\"secrets\", \"configmaps\"]\n
\ - group: authentication.k8s.io\n resources: [\"tokenreviews\"]\n
\ omitStages:\n - \"RequestReceived\"\n # Get repsonses can be large;
skip them.\n - level: Request\n verbs: [\"get\", \"list\", \"watch\"]\n
\ resources:\n - group: \"\" # core\n - group: \"admissionregistration.k8s.io\"\n
\ - group: \"apiextensions.k8s.io\"\n - group: \"apiregistration.k8s.io\"\n
\ - group: \"apps\"\n - group: \"authentication.k8s.io\"\n - group:
\"authorization.k8s.io\"\n - group: \"autoscaling\"\n - group: \"batch\"\n
\ - group: \"certificates.k8s.io\"\n - group: \"extensions\"\n - group:
\"metrics.k8s.io\"\n - group: \"networking.k8s.io\"\n - group: \"policy\"\n
\ - group: \"rbac.authorization.k8s.io\"\n - group: \"scheduling.k8s.io\"\n
\ - group: \"settings.k8s.io\"\n - group: \"storage.k8s.io\"\n omitStages:\n
\ - \"RequestReceived\"\n # Default level for known APIs\n - level: RequestResponse\n
\ resources:\n - group: \"\" # core\n - group: \"admissionregistration.k8s.io\"\n
\ - group: \"apiextensions.k8s.io\"\n - group: \"apiregistration.k8s.io\"\n
\ - group: \"apps\"\n - group: \"authentication.k8s.io\"\n - group:
\"authorization.k8s.io\"\n - group: \"autoscaling\"\n - group: \"batch\"\n
\ - group: \"certificates.k8s.io\"\n - group: \"extensions\"\n - group:
\"metrics.k8s.io\"\n - group: \"networking.k8s.io\"\n - group: \"policy\"\n
\ - group: \"rbac.authorization.k8s.io\"\n - group: \"scheduling.k8s.io\"\n
\ - group: \"settings.k8s.io\"\n - group: \"storage.k8s.io\" \n omitStages:\n
\ - \"RequestReceived\"\n # Default level for all other requests.\n -
level: Metadata\n omitStages:\n - \"RequestReceived\"\n"
name: audit-policy.yaml
path: /srv/kubernetes/audit-policy.yaml
roles:
- Master
hooks:
- before:
- kubelet.service
manifest: |
[Unit]
Description=Download AWS Authenticator configs from S3
[Service]
Type=oneshot
ExecStart=/bin/mkdir -p /srv/kubernetes/aws-iam-authenticator
ExecStart=/usr/local/bin/aws s3 cp --recursive s3://ct-k8s-<redacted>/<redacted>.k8s.local/addons/authenticator /srv/kubernetes/aws-iam-authenticator/
name: kops-hook-authenticator-config.service
roles:
- Master
iam:
allowContainerRegistry: true
legacy: false
kubeAPIServer:
admissionControl:
- NamespaceLifecycle
- LimitRanger
- ServiceAccount
- PersistentVolumeLabel
- PersistentVolumeClaimResize
- DefaultStorageClass
- DefaultTolerationSeconds
- MutatingAdmissionWebhook
- ValidatingAdmissionWebhook
- NodeRestriction
- ResourceQuota
- AlwaysPullImages
- PodSecurityPolicy
- DenyEscalatingExec
auditLogMaxAge: 30
auditLogMaxBackups: 10
auditLogMaxSize: 100
auditLogPath: /var/log/kube-apiserver-audit.log
auditPolicyFile: /srv/kubernetes/audit-policy.yaml
authenticationTokenWebhookConfigFile: /srv/kubernetes/aws-iam-authenticator/kubeconfig.yaml
featureGates:
ExpandPersistentVolumes: "true"
TTLAfterFinished: "true"
runtimeConfig:
api/all: "true"
kubeControllerManager:
featureGates:
ExpandPersistentVolumes: "true"
TTLAfterFinished: "true"
horizontalPodAutoscalerUseRestClients: true
kubeDNS:
provider: CoreDNS
kubelet:
featureGates:
ExpandPersistentVolumes: "true"
ExperimentalCriticalPodAnnotation: "true"
TTLAfterFinished: "true"
kubernetesApiAccess:
- 0.0.0.0/0
kubernetesVersion: 1.16.8
masterInternalName: api.internal.<redacted>.k8s.local
masterPublicName: api.<redacted>.k8s.local
networkCIDR: 172.20.0.0/16
networking:
weave:
mtu: 8912
nonMasqueradeCIDR: 100.64.0.0/10
sshAccess:
- 0.0.0.0/0
subnets:
- cidr: 172.20.32.0/19
name: us-west-2a
type: Private
zone: us-west-2a
- cidr: 172.20.64.0/19
name: us-west-2b
type: Private
zone: us-west-2b
- cidr: 172.20.96.0/19
name: us-west-2c
type: Private
zone: us-west-2c
- cidr: 172.20.0.0/22
name: utility-us-west-2a
type: Utility
zone: us-west-2a
- cidr: 172.20.4.0/22
name: utility-us-west-2b
type: Utility
zone: us-west-2b
- cidr: 172.20.8.0/22
name: utility-us-west-2c
type: Utility
zone: us-west-2c
topology:
dns:
type: Public
masters: private
nodes: private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:20Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: bastions
spec:
image: ami-07484b38968c888a3
machineType: t2.micro
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: bastions
kops.k8s.io/instancegroup: bastions
node.kubernetes.io/instancegroup: bastions
role: Bastion
subnets:
- us-west-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:22Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: burst
spec:
image: ami-07484b38968c888a3
machineType: t2.2xlarge
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: burst
kops.k8s.io/instancegroup: burst
node.kubernetes.io/instancegroup: burst
role: Node
rootVolumeSize: 500
rootVolumeType: gp2
subnets:
- us-west-2a
- us-west-2b
- us-west-2c
taints:
- InstanceGroup=burst:NoSchedule
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:22Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: compute
spec:
image: ami-07484b38968c888a3
machineType: c5.2xlarge
maxSize: 20
minSize: 3
nodeLabels:
InstanceGroup: compute
cluster-autoscaler/<redacted>: "true"
kops.k8s.io/instancegroup: compute
node.kubernetes.io/instancegroup: compute
role: Node
rootVolumeSize: 500
rootVolumeType: gp2
subnets:
- us-west-2a
- us-west-2b
- us-west-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:20Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: master-us-west-2a
spec:
image: ami-07484b38968c888a3
machineType: t2.large
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: master-us-west-2a
kops.k8s.io/instancegroup: master-us-west-2a
node.kubernetes.io/instancegroup: master-us-west-2a
role: Master
subnets:
- us-west-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:21Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: master-us-west-2b
spec:
image: ami-07484b38968c888a3
machineType: t2.large
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: master-us-west-2b
kops.k8s.io/instancegroup: master-us-west-2b
node.kubernetes.io/instancegroup: master-us-west-2b
role: Master
subnets:
- us-west-2b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2020-03-18T18:47:21Z"
generation: 2
labels:
kops.k8s.io/cluster: <redacted>.k8s.local
name: master-us-west-2c
spec:
image: ami-07484b38968c888a3
machineType: t2.large
maxSize: 1
minSize: 1
nodeLabels:
InstanceGroup: master-us-west-2c
kops.k8s.io/instancegroup: master-us-west-2c
node.kubernetes.io/instancegroup: master-us-west-2c
role: Master
subnets:
- us-west-2c
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
I’m not sure there’s anything useful here; the issue is with the dns-controller. But I can provide details if necessary.
9. Anything else we need to know?
I noticed that the dns-controller was updated as part of this change. From kops update cluster I saw this:
ManagedFile/kooper.k8s.local-addons-dns-controller.addons.k8s.io-k8s-1.12
Contents
...
k8s-addon: dns-controller.addons.k8s.io
k8s-app: dns-controller
+ version: v1.16.0
- version: v1.15.0
name: dns-controller
namespace: kube-system
...
k8s-addon: dns-controller.addons.k8s.io
k8s-app: dns-controller
+ version: v1.16.0
- version: v1.15.0
spec:
containers:
...
- --dns=gossip
- --gossip-seed=127.0.0.1:3999
+ - --gossip-protocol-secondary=memberlist
+ - --gossip-listen-secondary=0.0.0.0:3993
+ - --gossip-seed-secondary=127.0.0.1:4000
- --zone=*/*
- -v=2
+ image: kope/dns-controller:1.16.0
- image: kope/dns-controller:1.15.0
name: dns-controller
resources:
...
nodeSelector:
node-role.kubernetes.io/master: ""
+ priorityClassName: system-cluster-critical
serviceAccount: dns-controller
tolerations:
...
I think what is relevant here specifically is the --gossip-seed-secondary=127.0.0.1:4000 addition. Here are the logs of the dns-controller pod that fails to come up:
dns-controller version 1.16.0
I0318 19:01:19.748976 1 gossip.go:60] gossip dns connection limit is:0
I0318 19:01:19.749078 1 cluster.go:145] resolved peers to following addresses peers=127.0.0.1:4000
I0318 19:01:19.754448 1 cluster.go:157] setting advertise address explicitly addr=172.20.58.18 port=3993
I0318 19:01:19.754879 1 delegate.go:227] received NotifyJoin node=01E3QGAZRA7DQEPV2HCSPC8887 addr=172.20.58.18:3993
I0318 19:01:19.754943 1 main.go:209] initializing the watch controllers, namespace: ""
I0318 19:01:19.754960 1 main.go:233] Ingress controller disabled
I0318 19:01:19.754975 1 dnscontroller.go:105] starting DNS controller
I0318 19:01:19.754992 1 dnscontroller.go:158] scope not yet ready: node
I0318 19:01:19.755200 1 node.go:57] starting node controller
I0318 19:01:19.755476 1 pod.go:60] starting pod controller
I0318 19:01:19.755631 1 service.go:59] starting service controller
I0318 19:01:19.755787 1 gossip.go:120] Querying for seeds
I0318 19:01:19.755801 1 gossip.go:129] Got seeds: [127.0.0.1:3999]
I0318 19:01:19.755815 1 gossip.go:144] Seeding successful
I0318 19:01:19.755846 1 glogger.go:31] ->[127.0.0.1:3999] attempting connection
I0318 19:01:19.756173 1 cluster.go:337] memberlist 2020/03/18 19:01:19 [DEBUG] memberlist: Failed to join 127.0.0.1: dial tcp 127.0.0.1:4000: connect: connection refused
W0318 19:01:19.756185 1 cluster.go:223] failed to join cluster: 1 error occurred:
* Failed to join 127.0.0.1: dial tcp 127.0.0.1:4000: connect: connection refused
I0318 19:01:19.756193 1 cluster.go:225] will retry joining cluster every 10s
F0318 19:01:19.756202 1 main.go:172] gossip exited unexpectedly: 1 error occurred:
* Failed to join 127.0.0.1: dial tcp 127.0.0.1:4000: connect: connection refused
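For reference, logs like the ones above can be fetched via the deployment's label selector (k8s-app=dns-controller, per the manifest diff above), e.g.:
$ kubectl -n kube-system logs -l k8s-app=dns-controller --tail=100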
That 4000 port seems to be causing issues. I hopped onto the node to see if the port was in use, but it’s not:
$ sudo netstat -tulpn | grep LISTEN
tcp 0 0 127.0.0.1:10248 0.0.0.0:* LISTEN 2281/kubelet
tcp 0 0 127.0.0.1:10249 0.0.0.0:* LISTEN 3226/kube-proxy
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 245/rpcbind
tcp 0 0 127.0.0.1:8080 0.0.0.0:* LISTEN 3628/kube-apiserver
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 574/sshd
tcp 0 0 172.20.58.18:3996 0.0.0.0:* LISTEN 2915/etcd-manager
tcp 0 0 172.20.58.18:3997 0.0.0.0:* LISTEN 2974/etcd-manager
tcp 0 0 0.0.0.0:3998 0.0.0.0:* LISTEN 4847/dns-controller
tcp 0 0 0.0.0.0:6783 0.0.0.0:* LISTEN 8512/weaver
tcp 0 0 0.0.0.0:3999 0.0.0.0:* LISTEN 2245/protokube
tcp 0 0 127.0.0.1:6784 0.0.0.0:* LISTEN 8512/weaver
tcp 0 0 127.0.0.1:32769 0.0.0.0:* LISTEN 2281/kubelet
tcp6 0 0 :::10250 :::* LISTEN 2281/kubelet
tcp6 0 0 :::10251 :::* LISTEN 3146/kube-scheduler
tcp6 0 0 :::2380 :::* LISTEN 3575/etcd
tcp6 0 0 :::10252 :::* LISTEN 2747/kube-controlle
tcp6 0 0 :::2381 :::* LISTEN 3564/etcd
tcp6 0 0 :::10255 :::* LISTEN 2281/kubelet
tcp6 0 0 :::111 :::* LISTEN 245/rpcbind
tcp6 0 0 :::10256 :::* LISTEN 3226/kube-proxy
tcp6 0 0 :::10257 :::* LISTEN 2747/kube-controlle
tcp6 0 0 :::10259 :::* LISTEN 3146/kube-scheduler
tcp6 0 0 :::22 :::* LISTEN 574/sshd
tcp6 0 0 :::443 :::* LISTEN 3628/kube-apiserver
tcp6 0 0 :::6781 :::* LISTEN 8451/weave-npc
tcp6 0 0 :::6782 :::* LISTEN 8512/weaver
tcp6 0 0 :::4001 :::* LISTEN 3575/etcd
tcp6 0 0 :::4002 :::* LISTEN 3564/etcd
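Putting the netstat output next to the pod log: the new --gossip-seed-secondary=127.0.0.1:4000 flag makes dns-controller dial a memberlist peer on port 4000 (presumably a secondary protokube listener), but protokube on this master is only bound to the primary gossip port 3999, so the dial is refused. A quick check from the node (assuming nc is available) confirms nothing answers there:
$ nc -zv 127.0.0.1 4000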
Some quick Google searching doesn’t turn up any results for this issue, so I’m hoping I can get some help here. Thanks for any and all assistance.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 6
- Comments: 25 (4 by maintainers)
So I did a workaround for this until the issue is sorted out. You can run:
$ kubectl rollout history deployment.v1.apps/dns-controller -n kube-system
and check how many revisions there are. If there are, say, 5, roll back to 4 (which will be 1.15):
$ kubectl rollout undo deployment/dns-controller --to-revision=4 -n kube-system
This should bring the dns-controller back up, and the master node should be healthy.
I haven’t seen any direct consequences of running this; if someone else knows of any, I’d appreciate guidance on what to do going forward.
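If it isn’t obvious which revision still runs the 1.15 image, each revision’s pod template (including the image tag) can be inspected before undoing, e.g.:
$ kubectl rollout history deployment/dns-controller -n kube-system --revision=4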
Yes, I can verify @juris’s conclusion.
Running
$ kops upgrade cluster --yes && kops update cluster --yes
makes an early cluster modification (a new version of dns-controller), but this shouldn’t be a problem as long as the “old” version of dns-controller is still running (rolling update). Just do a --cloudonly update of a single master instance group:
$ kops rolling-update cluster --instance-group masterXXXX --cloudonly --yes
and wait until the new master (1.16.7) joins the cluster. Restart the new dns-controller (probably in CrashLoopBackOff) and things should work now.
@akhmadfld Once you do a rolling update with kops 1.16.3, it will replace the default dns-controller image with 1.16.3 as well, and you should be able to proceed without issues. I have just done the same myself to fix the certificate issues in etcd.
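Putting the two comments above together, a rough recovery sequence looks like the following (the instance-group name is just an example from this cluster):
$ kops rolling-update cluster --instance-group master-us-west-2a --cloudonly --yes
# wait for the new master to join, then bounce the crashing pod so the Deployment recreates it:
$ kubectl -n kube-system delete pod -l k8s-app=dns-controller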