amazon-vpc-cni-k8s: EKS 1.16 / v1.6.x: "couldn't get current server API group list; will keep using cached value"

We sometimes see the aws-node pods crash on startup with this logged:

Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0708 16:29:03.884330       6 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout)
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

After crashing, the pod is restarted and then runs fine. About half of the aws-node pods do this.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 26
  • Comments: 68 (27 by maintainers)

Most upvoted comments

Any news on that internal ticket? We keep running into this issue whenever our nodes start.

Currently, our workaround is adding a busybox init container that waits for kube-proxy to start:

  initContainers:
  - name: init-kubernetes-api
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup kubernetes.default.svc.cluster.local ${KUBE_DNS_PORT_53_TCP_ADDR}; do echo waiting for kubernetes Service endpoint; sleep 2; done"]
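
For reference, a minimal sketch of applying that workaround with kubectl patch, assuming a v1.6.x aws-node DaemonSet that has no other init containers (if yours already defines init containers, check how the strategic merge combines them before applying):

    kubectl -n kube-system patch daemonset aws-node --patch \
      '{"spec":{"template":{"spec":{"initContainers":[{"name":"init-kubernetes-api","image":"busybox:1.28","command":["sh","-c","until nslookup kubernetes.default.svc.cluster.local ${KUBE_DNS_PORT_53_TCP_ADDR}; do echo waiting for kubernetes Service endpoint; sleep 2; done"]}]}}}}'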

^^

kubectl set image daemonset.apps/kube-proxy \
    -n kube-system \
    kube-proxy=602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.16.12
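
If you pin the image like this, you can confirm the DaemonSet finished rolling out and which image it now runs (generic kubectl checks, not specific to EKS):

    kubectl -n kube-system rollout status daemonset kube-proxy
    kubectl -n kube-system get daemonset kube-proxy \
      -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'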

Same for v1.6.3 😐

We had the same problem when updating the control plane and node groups from 1.15 to 1.16. We had to pin the kube-proxy version (kube-proxy:v1.16.13 -> kube-proxy:v1.16.12) and recreate the nodes.

We are sometimes running into a race condition where aws-node is started before kube-proxy. Without kube-proxy, kubernetes.default.svc.cluster.local is not available. aws-node will fail to start, and the container is not automatically restarted. To mitigate this, we added the following initContainer to aws-node:

      "initContainers":
        - "name": "wait-for-kubernetes-api"
          "image": "curlimages/curl:7.77.0"
          "command":
            - "sh"
            - "-c"
            - |
              while ! timeout 2s curl --fail --silent --cacert /run/secrets/kubernetes.io/serviceaccount/ca.crt "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/healthz"; do
                echo Waiting on Kubernetes API to respond...
                sleep 1
              done
          "securityContext":
            "runAsUser": 100
            "runAsGroup": 101

@tibin-mfl Yes, the CNI pod (aws-node) needs kube-proxy to set up the cluster IPs before it can start up.

We had the same issue after upgrading from EKS 1.15 to 1.16; we had just been bumping the image version inside the DaemonSet to 1.6.x. What solved our issue was applying the full YAML provided in the AWS docs: https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.6/config/v1.6/aws-k8s-cni.yaml

It makes changes to both the DaemonSet and the ClusterRole.
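
For anyone else who only bumped the image tag, applying the full manifest from that URL is a one-liner (review the diff first, since it overwrites the DaemonSet and the related RBAC objects):

    kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.6/config/v1.6/aws-k8s-cni.yaml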

Good luck!

Hello everyone, I'm also having the same issue. My K8s cluster is still on v1.15, and before upgrading to v1.16 I wanted to make sure all my controllers are on the versions recommended by AWS on this page.

My main controller versions now: kube-proxy v1.16.12 (works), CoreDNS 1.6.6 (works), amazon-vpc-cni-k8s 1.7.5 (doesn't work).

$ kubectl version --short
Client Version: v1.15.0
Server Version: v1.15.11-eks-065dce

The deployment is done exactly as in the release docs: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.7.5. I tried all the solutions mentioned in this repo:

  1. Downgrading kube-proxy from 1.16.13 to 1.16.12, with and without adding this label. Check this answer.
  2. Trying the following versions of amazon-vpc-cni-k8s: v1.7.5, v1.7.4, v1.7.3, v1.7.2, and finally rolling back to 1.6.3, but the issue is always the same.


  Type     Reason     Age                   From                                                   Message
  ----     ------     ----                  ----                                                   -------
  Normal   Scheduled  3m56s                 default-scheduler                                      Successfully assigned kube-system/aws-node-98fnz to ip-some-ip..eu-west-1.compute.internal
  Normal   Pulling    3m54s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5"
  Normal   Pulled     3m39s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5"
  Normal   Created    3m38s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Created container aws-vpc-cni-init
  Normal   Started    3m38s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Started container aws-vpc-cni-init
  Normal   Started    3m29s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Started container aws-node
  Warning  Unhealthy  3m25s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:04.975Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  3m15s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:14.998Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  3m6s                  kubelet, ip-some-ip.eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:24.966Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m55s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:34.954Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m45s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:45.006Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m36s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:54.957Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m25s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:58:04.959Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m25s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Liveness probe failed: {"level":"info","ts":"2020-10-20T13:58:05.694Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m16s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:58:14.960Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Normal   Killing    2m5s                  kubelet, ip-some-ip..eu-west-1.compute.internal  Container aws-node failed liveness probe, will be restarted
  Warning  Unhealthy  116s (x4 over 2m15s)  kubelet, ip-some-ip..eu-west-1.compute.internal  (combined from similar events): Readiness probe failed: {"level":"info","ts":"2020-10-20T13:58:34.963Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Normal   Pulling    115s (x2 over 3m37s)  kubelet, ip-some-ip.eu-west-1.compute.internal  Pulling image "account-id.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5"
  Normal   Created    113s (x2 over 3m29s)  kubelet, ip-some-ip..eu-west-1.compute.internal  Created container aws-node
  Normal   Pulled     113s (x2 over 3m30s)  kubelet, ip-some-ip..eu-west-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5"

@mogren this is on EKS, version 1.17. We discovered this as part of adding custom PSPs to all components. No scripts locking iptables on startup, using the standard EKS AMIs.

The behaviour we were seeing was that the aws-node pod never became ready, and was crash-looping. Apologies if that caused any confusion. I think it’s not unreasonable to conclude that:

  • kube-proxy sets up iptables rules that are required by aws-node
  • setting the filesystem to read-only on kube-proxy causes these rules to never be set up, so aws-node crash-loops
  • a race condition between kube-proxy and aws-node could cause aws-node to come up before the iptables rules have been configured, causing an initial crash before working as normal (when kube-proxy creates the rules).
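
One way to check whether those iptables rules are actually in place is to inspect the node directly. A rough sketch using the standard NAT chain names created by kube-proxy in iptables mode:

    # Before kube-proxy syncs, the KUBE-SERVICES chain is missing or empty,
    # so the kubernetes Service cluster IP (e.g. 172.20.0.1) is unreachable.
    sudo iptables -t nat -L KUBE-SERVICES -n | head
    # Count how many rules kube-proxy has programmed so far:
    sudo iptables-save -t nat | grep -c '^-A KUBE-' || echo "no kube-proxy rules yet"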

Facing the same issue with the following 1.18 EKS components. CNI version:

user@user-work-laptop:~$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.5-eksbuild.1
amazon-k8s-cni:v1.7.5-eksbuild.1

Kube Proxy version:

user@user-work-laptop:~$ kubectl describe daemonset kube-proxy --namespace kube-system | grep Image | cut -d "/" -f 3
kube-proxy:v1.18.8-eksbuild.1

Nodes:

user@user-work-laptop:~$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE    VERSION
<host1>.<region>.compute.internal   Ready    <none>   6d6h   v1.18.9-eks-d1db3c
<host2>.<region>.compute.internal   Ready    <none>   6d6h   v1.18.9-eks-d1db3c
<host3>.<region>.compute.internal   Ready    <none>   6d5h   v1.18.9-eks-d1db3c
<host4>.<region>.compute.internal   Ready    <none>   6d5h   v1.18.9-eks-d1db3c

AMI: ami-0a3d7ac8c4302b317

Pods:

aws-node-rktgg                        1/1     Running   1          3m45s
aws-node-c89ph                        1/1     Running   1          3m20s
kube-proxy-x8t7m                      1/1     Running   0          3m20s
kube-proxy-bfd7x                      1/1     Running   0          3m45s

kube-proxy-bfd7x log:

portRange: ""
udpIdleTimeout: 250ms: v1alpha1.KubeProxyConfiguration.Conntrack: v1alpha1.KubeProxyConntrackConfiguration.ReadObject: found unknown field: max, error found in #10 byte of ...|ck":{"max":0,"maxPer|..., bigger context ...|":"","configSyncPeriod":"15m0s","conntrack":{"max":0,"maxPerCore":32768,"min":131072,"tcpCloseWaitTi|...
I0209 13:41:30.395728       1 feature_gate.go:243] feature gates: &{map[]}
I0209 13:41:30.395787       1 feature_gate.go:243] feature gates: &{map[]}
E0209 13:41:30.988955       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:31.999074       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:34.259534       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:38.355840       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:47.022235       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:42:05.550145       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
I0209 13:42:05.550167       1 server_others.go:178] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag

aws-node-rktgg log:

{"level":"info","ts":"2021-02-09T13:41:33.346Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-02-09T13:41:33.356Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-02-09T13:41:33.357Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
ERROR: logging before flag.Parse: E0209 13:42:03.383486       9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://xxx.xx.x.x:443/api?timeout=32s: dial tcp xxx.xx.x.x:443: i/o timeout)

One thing that catches my eye is that kube-proxy is looking for <host>.<eks_name>.compute.internal while the actual hostname is <host1>.<region>.compute.internal.

I found the root cause of this issue (at least for my use case) - my own fault 😃 I had configured the DHCP option set incorrectly, with the domain name <eks_name>.compute.internal. After setting it to <region>.compute.internal, nodes come up correctly and quickly.
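
In case it helps anyone else debugging the same misconfiguration, a hedged sketch of inspecting the domain name on the VPC's DHCP options set with the AWS CLI (vpc-xxxxxxxx and dopt-xxxxxxxx are placeholders):

    # Find which DHCP options set the cluster VPC uses
    aws ec2 describe-vpcs --vpc-ids vpc-xxxxxxxx \
      --query 'Vpcs[0].DhcpOptionsId' --output text
    # Inspect its domain-name; it should be <region>.compute.internal
    # (ec2.internal in us-east-1), not a custom value like <eks_name>.compute.internal
    aws ec2 describe-dhcp-options --dhcp-options-ids dopt-xxxxxxxx \
      --query 'DhcpOptions[0].DhcpConfigurations' --output table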

Another tidbit: We ran into this very issue when upgrading from v1.15 to v1.16

Our current workaround: we’re keeping kube-proxy at v1.15.11 even after upgrading the rest of the cluster to v1.16.

We were able to get the rest of the add-ons to the recommended versions:

(screenshot of add-on versions, taken 2021-04-27)

I got my issue resolved. In my case I am managing the entire EKS stack using Terraform, and the problem happened while updating the code base to support the upgrade to 1.18 and AWS CNI 1.7.5 (from 1.6.4). The culprits in my case were:

  1. The PSP (we use a specific PSP tailored for the aws-node DaemonSet); it looks like 1.7.5 needs two additional privileges: a. NET_ADMIN, b. hostPath /var/run/aws-node.
  2. Terraform was updating the node groups to the new AMI before the CNI was updated; I had to change the order so the CNI version is updated before the node groups.

Sharing in case it helps anyone.
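
For illustration only, the PSP additions described in point 1 might look roughly like the fragment below (a sketch against the policy/v1beta1 PodSecurityPolicy fields, not the exact policy used above):

    # Hypothetical fragment of a PSP for the aws-node DaemonSet
    allowedCapabilities:
      - NET_ADMIN
    volumes:
      - hostPath          # plus whatever volume types the policy already allowed
    allowedHostPaths:
      - pathPrefix: /var/run/aws-node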

Hi @safaa-alnabulsi

This looks to be a different issue; I feel it is better tracked as a new issue, since this thread mainly tracks the delay in kube-proxy getting the node information while aws-node waits for kube-proxy to come up (but eventually aws-node does come up). Please open a new issue and share the logs from sudo bash /opt/cni/bin/aws-cni-support.sh. You can email the logs to varavaj@amazon.com. Thank you!

@tibin-mfl thanks for reporting this, that is definitely concerning. Do you have kube-proxy logs from any of these nodes? It would be very interesting to see why kube-proxy was taking that long to start up!

@mogren I attached the kube-proxy error log:

{"log":"udpIdleTimeout: 250ms: v1alpha1.KubeProxyConfiguration.Conntrack: v1alpha1.KubeProxyConntrackConfiguration.ReadObject: found unknown field: max, error found in #10 byte of ...|ck\":{\"max\":0,\"maxPer|..., bigger context ...|\":\"\",\"configSyncPeriod\":\"15m0s\",\"conntrack\":{\"max\":0,\"maxPerCore\":32768,\"min\":131072,\"tcpCloseWaitTi|...\n","stream":"stderr","time":"2020-09-04T13:53:07.16323908Z"}
{"log":"I0904 13:53:07.163201       1 feature_gate.go:243] feature gates: \u0026{map[]}\n","stream":"stderr","time":"2020-09-04T13:53:07.164205369Z"}
{"log":"E0904 13:53:07.657821       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:07.657910856Z"}
{"log":"E0904 13:53:08.817715       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:08.817826081Z"}
{"log":"E0904 13:53:11.028140       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:11.028236757Z"}
{"log":"E0904 13:53:15.789086       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:15.789186713Z"}
{"log":"E0904 13:53:24.954484       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:24.954581764Z"}
{"log":"E0904 13:53:43.712021       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:43.712148589Z"}
{"log":"I0904 13:53:43.712044       1 server_others.go:140] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag\n","stream":"stderr","time":"2020-09-04T13:53:43.712191645Z"}
{"log":"I0904 13:53:43.712063       1 server_others.go:145] Using iptables Proxier.\n","stream":"stderr","time":"2020-09-04T13:53:43.712196787Z"}
{"log":"W0904 13:53:43.712165       1 proxier.go:286] clusterCIDR not specified, unable to distinguish between internal and external traffic\n","stream":"stderr","time":"2020-09-04T13:53:43.712204824Z"}
{"log":"I0904 13:53:43.712828       1 server.go:571] Version: v1.17.9\n","stream":"stderr","time":"2020-09-04T13:53:43.712928066Z"}
{"log":"I0904 13:53:43.713270       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072\n","stream":"stderr","time":"2020-09-04T13:53:43.713356276Z"}
{"log":"I0904 13:53:43.713305       1 conntrack.go:52] Setting nf_conntrack_max to 131072\n","stream":"stderr","time":"2020-09-04T13:53:43.713403074Z"}
{"log":"I0904 13:53:43.713538       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400\n","stream":"stderr","time":"2020-09-04T13:53:43.714641593Z"}
{"log":"I0904 13:53:43.713592       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600\n","stream":"stderr","time":"2020-09-04T13:53:43.714670464Z"} a

The complete kube-proxy log is attached here: https://pastebin.pl/view/3d4cc276

Hi @mogggggg

Sorry for the delayed response. As you have mentioned, it looks like kube-proxy is waiting to retrieve the node info, and during that time frame aws-node starts and is unable to communicate with the API server because iptables isn't updated yet, hence it restarts. I will try to repro, and we will see how to mitigate this issue.

Thanks for your patience.

Hi @mogggggg

Thanks for letting us know. Please kindly share the full logs from the log collector script https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni and also kube-proxy pod logs. You can email it varavaj@amazon.com.

Thanks.

Hi, just wanted to chime in that we’re seeing the same thing. Like others have mentioned, the pod seems to restart once when the node first starts up and it’s fine after that. We’re not using any custom PSPs.

EKS version: 1.17, AMI version: v1.17.9-eks-4c6976, kube-proxy version: 1.17.7, CNI version: 1.6.3

I can see these errors in kube-proxy logs on one of the nodes where aws-node restarted:

udpIdleTimeout: 250ms: v1alpha1.KubeProxyConfiguration.Conntrack: v1alpha1.KubeProxyConntrackConfiguration.ReadObject: found unknown field: max, error found in #10 byte of ...|ck":{"max":0,"maxPer|..., bigger context ...|":"","configSyncPeriod":"15m0s","conntrack":{"max":0,"maxPerCore":32768,"min":131072,"tcpCloseWaitTi|...
I0905 23:12:39.826265       7 feature_gate.go:243] feature gates: &{map[]}
E0905 23:12:40.388938       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:41.516857       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:43.567271       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:48.167166       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:56.325941       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:13:14.684106       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
I0905 23:13:14.684134       7 server_others.go:140] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag
I0905 23:13:14.684150       7 server_others.go:145] Using iptables Proxier.
W0905 23:13:14.684259       7 proxier.go:286] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0905 23:13:14.684410       7 server.go:571] Version: v1.17.7
I0905 23:13:14.684773       7 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0905 23:13:14.684803       7 conntrack.go:52] Setting nf_conntrack_max to 131072
I0905 23:13:14.684850       7 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0905 23:13:14.684894       7 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0905 23:13:14.685092       7 config.go:313] Starting service config controller
I0905 23:13:14.685101       7 shared_informer.go:197] Waiting for caches to sync for service config
I0905 23:13:14.685139       7 config.go:131] Starting endpoints config controller
I0905 23:13:14.685149       7 shared_informer.go:197] Waiting for caches to sync for endpoints config
I0905 23:13:14.785879       7 shared_informer.go:204] Caches are synced for service config
I0905 23:13:14.785932       7 shared_informer.go:204] Caches are synced for endpoints config

And this is in the aws-node logs:

{"log":"Copying portmap binary ... Starting IPAM daemon in the background ... ok.\n","stream":"stdout","time":"2020-09-03T11:06:26.418457689Z"}
{"log":"Checking for IPAM connectivity ... ok.\n","stream":"stdout","time":"2020-09-03T11:06:46.458122639Z"}
{"log":"Copying additional CNI plugin binaries and config files ... ok.\n","stream":"stdout","time":"2020-09-03T11:06:46.474182395Z"}
{"log":"Foregrounding IPAM daemon ... \n","stream":"stdout","time":"2020-09-03T11:06:46.474202946Z"}
{"log":"ERROR: logging before flag.Parse: W0903 14:22:54.564615       9 reflector.go:341] pkg/mod/k8s.io/client-go@v0.0.0-20180806134042-1f13a808da65/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 109452564 (109453466)\n","stream":"stderr","time":"2020-09-03T14:22:54.564769814Z"}
{"log":"ERROR: logging before flag.Parse: W0903 18:30:26.713005       9 reflector.go:341] pkg/mod/k8s.io/client-go@v0.0.0-20180806134042-1f13a808da65/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 109555468 (109679596)\n","stream":"stderr","time":"2020-09-03T18:30:26.713161405Z"}
{"log":"ERROR: logging before flag.Parse: W0903 18:45:56.655601       9 reflector.go:341] pkg/mod/k8s.io/client-go@v0.0.0-20180806134042-1f13a808da65/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 109679596 (109687399)\n","stream":"stderr","time":"2020-09-03T18:45:56.655715674Z"}

It seems like this started happening for us as part of the 1.17 upgrade. We haven’t restarted all our nodes since the upgrade, and I can see that on the nodes still running the previous AMI (v1.16.12-eks-904af05) the aws-node pods didn’t restart:

aws-node-26tfq                               1/1     Running   1          10h
aws-node-2pnwq                               1/1     Running   1          3h33m
aws-node-4f52v                               1/1     Running   1          4d22h
aws-node-5qsll                               1/1     Running   1          5d22h
aws-node-6z6wq                               1/1     Running   0          40d
aws-node-92hvs                               1/1     Running   0          40d
aws-node-c8srx                               1/1     Running   1          5d22h
aws-node-chkhb                               1/1     Running   1          5d4h
aws-node-djlkb                               1/1     Running   0          40d
aws-node-g7drp                               1/1     Running   1          5d5h
aws-node-g9rgn                               1/1     Running   0          40d
aws-node-gbdq5                               1/1     Running   1          2d22h
aws-node-gc5zl                               1/1     Running   1          2d22h
aws-node-hc48d                               1/1     Running   1          5d22h
aws-node-hx9bl                               1/1     Running   1          24d
aws-node-j9dcn                               1/1     Running   1          39d
aws-node-jj4qs                               1/1     Running   1          2d22h
aws-node-kwbjl                               1/1     Running   1          153m
aws-node-ljcv8                               1/1     Running   1          39d
aws-node-lv74f                               1/1     Running   1          12d
aws-node-q2w2w                               1/1     Running   1          2d22h
aws-node-s7qw4                               1/1     Running   1          2d22h
aws-node-tck8w                               1/1     Running   1          5d4h
aws-node-tjhtf                               1/1     Running   1          2d22h
aws-node-tzpb2                               1/1     Running   0          40d
aws-node-vm4nh                               1/1     Running   1          2d22h
aws-node-xnnj2                               1/1     Running   2          153m
aws-node-zchs9                               1/1     Running   1          2d22h

I’m happy to share the full logs if they’re helpful, just give me an email address to send them!

@max-rocket-internet Hey, sorry for the lack of updates on this. I’ve been out for a bit without much network access, so I haven’t been able to track this one down. I agree that there is no config change between v1.6.2 and v1.6.3, but since v1.5.x we have updated the readiness and liveness probe configs.

Between Kubernetes 1.15 and 1.16 kube-proxy has changed, so that could be related. We have not yet been able to reproduce this when doing master upgrades.