amazon-vpc-cni-k8s: Failing liveness probe
We are recording a high number of AWS CNI container restarts.
Searching our Splunk logs didn't lead us to any particular error.
Looking at the Kubernetes events for the kube-system namespace, we can see that the pods are failing their liveness and readiness probes:
107s Warning Unhealthy pod/aws-node-2p68j Readiness probe failed: {"level":"info","ts":"2020-06-16T15:58:23.812Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
107s Warning Unhealthy pod/aws-node-2p68j Liveness probe failed: {"level":"info","ts":"2020-06-16T15:58:23.867Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
18m Warning Unhealthy pod/aws-node-4zjf7 Readiness probe failed: {"level":"info","ts":"2020-06-16T15:41:27.975Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
18m Warning Unhealthy pod/aws-node-4zjf7 Liveness probe failed: {"level":"info","ts":"2020-06-16T15:41:30.484Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
25m Warning Unhealthy pod/aws-node-7c6zn Liveness probe failed: {"level":"info","ts":"2020-06-16T15:34:21.820Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
25m Warning Unhealthy pod/aws-node-7c6zn Readiness probe failed: {"level":"info","ts":"2020-06-16T15:34:23.817Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
24m Warning Unhealthy pod/aws-node-7c6zn Liveness probe failed: {"level":"info","ts":"2020-06-16T15:35:11.820Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
24m Warning Unhealthy pod/aws-node-7c6zn Readiness probe failed: {"level":"info","ts":"2020-06-16T15:35:13.828Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
23m Warning Unhealthy pod/aws-node-7c6zn Liveness probe failed: {"level":"info","ts":"2020-06-16T15:36:11.821Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
23m Warning Unhealthy pod/aws-node-7c6zn Readiness probe failed: {"level":"info","ts":"2020-06-16T15:36:13.821Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
32s Warning Unhealthy pod/aws-node-b6szl Readiness probe failed: {"level":"info","ts":"2020-06-16T15:59:38.295Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Our setup looks like this:
- Kubernetes 1.16.9 provisioned via kops (~850 nodes)
- AWS CNI 1.6.1 running without resource limits
- liveness and readiness probes:
Requests:
cpu: 10m
Liveness: exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
The liveness probe seems to behave correctly when invoked from inside the container:
/app/grpc-health-probe -addr=:50051
{"level":"info","ts":"2020-06-16T16:12:25.209Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"status: SERVING"}
Would really appreciate any help in continuing this investigation. Shall we just relax the timeout specs? (Would love to hear from other folks.)
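For reference, relaxing the probe timing on the DaemonSet could look roughly like the sketch below; it assumes aws-node is the first container in the pod spec, and the 10s values are purely illustrative, not a recommendation:

# raise kubelet probe timeouts on the aws-node DaemonSet (illustrative values)
kubectl -n kube-system patch daemonset aws-node --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10}
]'

Note that kubelet's timeoutSeconds and the grpc-health-probe -connect-timeout flag are separate settings; a later comment in this thread raises both.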
I am able to reproduce a similar issue.
kubectl describe pod aws-node-sgp4c -n kube-system
kubectl logs aws-node-sgp4c -n kube-system
Debugging from the node:
tail -f /var/log/aws-routed-eni/ipamd.log | grep error
{"level":"error","ts":"2022-03-17T13:06:14.594Z","caller":"ipamd/ipamd.go:456","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-0b8d9XXXXX0ff09 eni-0681XXXXX2427]: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://oidc.eks.eu-west-1.amazonaws.com/id/97043D9XXXXXXXXC65E8E946\n\tstatus code: 400, request id: d71dd6e4-a65f-4247-b21b-625d6d1d2787"}
Solution:
aws iam list-open-id-connect-providers | grep 97043D9XXXXXXXXC65E8E946
If output is returned from the previous command, then you already have a provider for your cluster. If no output is returned, then you must create an IAM OIDC provider:
eksctl utils associate-iam-oidc-provider --cluster EKS-cluster-name --approve
Delete the pods so their DaemonSet recreates them:
kubectl delete pods -n kube-system -l k8s-app=aws-node
Confirm they are running.
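For example, using the same k8s-app=aws-node label as in the delete command above:

kubectl get pods -n kube-system -l k8s-app=aws-node -o wide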
This is what the ipamd log states:
cat ipamd.log | grep error
{"level":"error","ts":"2021-04-20T14:17:02.797Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
{"level":"error","ts":"2021-04-20T14:32:17.001Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
{"level":"error","ts":"2021-04-20T16:32:31.709Z","caller":"ipamd/ipamd.go:931","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2021-04-20T16:32:31.709Z","caller":"ipamd/ipamd.go:494","msg":"IP pool reconcile: Failed to get attached ENI info: get attached ENIs: failed to retrieve interfaces data: RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2021-04-21T03:07:53.689Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
FYI: the nodes are self-managed nodes.
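For reference, every error above is the node failing to reach the EC2 instance metadata service, which ipamd depends on. A quick check from the affected node (a sketch assuming IMDSv2; the 60-second token TTL is arbitrary):

# request an IMDSv2 token, then query the interface list ipamd is asking for
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" "http://169.254.169.254/latest/meta-data/network/interfaces/macs/"

If these calls hang or time out on the host itself, ipamd cannot populate its IP pool and the failing probes are just the downstream symptom.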
Hi, I have the same problem today. In my case the cause can be seen in
/var/log/aws-routed-eni/ipamd.log
on the node containing the affected pod. So, I made sure I have enough free addresses in the subnet; perhaps it can help other people who have the same problem.
Any update on this issue? I am also facing a similar issue.
I am deploying the EKS add-on using a CloudFormation script; below is the snippet:
EksClusterAddon:
  Type: AWS::EKS::Addon
  Properties:
    AddonName: !Ref EksAddonName
    ClusterName: !Ref EksClusterName
    ResolveConflicts: OVERWRITE
    ServiceAccountRoleArn: !Ref EksServiceRoleArn
I am not providing any AddonVersion, so AWS picks up the default version.
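For reference, if you would rather pin the version than take the default, the available versions can be listed and then set through the AddonVersion property (the <k8s-version> placeholder stands for your cluster's Kubernetes version):

# list the vpc-cni add-on versions published for a given Kubernetes version
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version <k8s-version> \
  --query 'addons[].addonVersions[].addonVersion' --output text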
Logs for the aws-node pod:
{"level":"info","ts":"2021-12-08T13:01:00.915Z","caller":"entrypoint.sh","msg":"Install CNI binary ..."}
{"level":"info","ts":"2021-12-08T13:01:00.940Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ..."}
{"level":"info","ts":"2021-12-08T13:01:00.948Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ..."}
ERROR: logging before flag.Parse: E1208 13:01:01.348599 9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: E1208 13:02:01.442917 9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
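For reference, the metrics.k8s.io errors above usually mean the aggregated metrics API (metrics-server) is unavailable rather than pointing at the CNI itself; a quick check, assuming metrics-server is deployed with its usual k8s-app=metrics-server label:

# is the aggregated API registered and available?
kubectl get apiservice v1beta1.metrics.k8s.io
# is metrics-server itself running?
kubectl get pods -n kube-system -l k8s-app=metrics-server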
Hi Team,
I am facing the same issue: liveness and readiness probe failures on aws-node cause it to crash-loop, and the pods/services hosted on the node become unavailable.
Events:
Type     Reason     Age                  From               Message
Normal   Scheduled  39m                  default-scheduler  Successfully assigned kube-system/aws-node-qct2m to xxxxxxxxxxxxxx
Normal   Pulling    37m                  kubelet            Pulling image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.7.10"
Normal   Pulled     37m                  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.7.10" in 690.175934ms
Normal   Created    37m                  kubelet            Created container aws-vpc-cni-init
Normal   Started    36m                  kubelet            Started container aws-vpc-cni-init
Normal   Pulled     36m                  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10" in 300.589066ms
Warning  Unhealthy  35m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:01.143Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  35m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:31.943Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  35m                  kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-20T14:53:47.560Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  34m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:57.051Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  34m                  kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-20T14:54:16.743Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  34m                  kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-20T14:54:42.655Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Normal   Killing    34m                  kubelet            Container aws-node failed liveness probe, will be restarted
Warning  Unhealthy  34m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:54:39.438Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  33m                  kubelet            Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: read init-p: connection reset by peer: unknown
Warning  Unhealthy  33m                  kubelet            Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: process_linux.go:103: executing setns process caused: exit status 1: unknown
Normal   Pulling    33m (x2 over 36m)    kubelet            Pulling image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10"
Normal   Pulled     33m                  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10" in 619.512059ms
Normal   Created    33m (x2 over 36m)    kubelet            Created container aws-node
Normal   Started    33m (x2 over 36m)    kubelet            Started container aws-node
Warning  Unhealthy  33m                  kubelet            Readiness probe errored: rpc error: code = Unknown desc = container not running (a79647cd4b740690c9fb3211b40f9f183e3b4faf8e3efdcd7302eebdbb699b6c)
Warning  Unhealthy  2m6s (x77 over 33m)  kubelet            (combined from similar events): Liveness probe failed: {"level":"info","ts":"2021-04-20T15:26:42.360Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
EKS version: 1.19
Node version: v1.19.6-eks-49a6c0
Instance type: t3a.large
kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.10
amazon-k8s-cni:v1.7.10

kubectl describe daemonset aws-node -n kube-system | grep -i "ENABLE_POD_ENI"

ENABLE_POD_ENI: false

As a workaround I tried to increase the connect timeout for the liveness and readiness probe exec, but it still doesn't work:
Liveness: exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=10s] delay=60s timeout=10s period=10s #success=1 #failure=3
Readiness: exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=10s] delay=15s timeout=10s period=10s #success=1 #failure=3
The issue is not related to #1360, as I checked the loopback interface on a node with a failing aws-node pod:
ifconfig -a
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 133982  bytes 9506349 (9.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 133982  bytes 9506349 (9.0 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Kindly let me know if you need more information.
After a week of monitoring I'm not recording any restarts or liveness probe failures anymore, so I'm considering this issue closed. Thank you @mogren for the support.
@mogren I've continued the investigation on my end and noticed that my server was hitting a lot of ARP cache overflows. It has been 2 days that I've been running with the changed settings (I have a VPC with two /16 CIDRs):
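(The exact settings were lost from this comment; purely as an illustration, raising the kernel neighbor-table garbage-collection thresholds via sysctl is the usual knob for ARP cache overflow. The values below are assumptions, not the commenter's configuration.)

# check whether the node is actually overflowing its neighbor table
dmesg | grep -i "table overflow"
# illustrative thresholds only
sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
sysctl -w net.ipv4.neigh.default.gc_thresh2=16384
sysctl -w net.ipv4.neigh.default.gc_thresh3=32768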
Since then I've seen no sign of failing liveness probes. I will keep monitoring until the end of next week and will report back here.