amazon-vpc-cni-k8s: Failing liveness probe

We are recording a high number of AWS CNI container restarts. [Screenshot: Screen Shot 2020-06-16 at 9 41 05 AM] Searching our Splunk logs didn't lead us to any particular error. Looking at the Kubernetes events for the kube-system namespace, we can see that the pods are failing their liveness and readiness probes:

107s        Warning   Unhealthy       pod/aws-node-2p68j   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:58:23.812Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
107s        Warning   Unhealthy       pod/aws-node-2p68j   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:58:23.867Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
18m         Warning   Unhealthy       pod/aws-node-4zjf7   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:41:27.975Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
18m         Warning   Unhealthy       pod/aws-node-4zjf7   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:41:30.484Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
25m         Warning   Unhealthy       pod/aws-node-7c6zn   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:34:21.820Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
25m         Warning   Unhealthy       pod/aws-node-7c6zn   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:34:23.817Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
24m         Warning   Unhealthy       pod/aws-node-7c6zn   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:35:11.820Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
24m         Warning   Unhealthy       pod/aws-node-7c6zn   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:35:13.828Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
23m         Warning   Unhealthy       pod/aws-node-7c6zn   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:36:11.821Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
23m         Warning   Unhealthy       pod/aws-node-7c6zn   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:36:13.821Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
32s         Warning   Unhealthy       pod/aws-node-b6szl   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:59:38.295Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
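(For reference, a minimal sketch of how events like the ones above can be pulled; the grep pattern is only illustrative:)

# List recent kube-system events and keep only the aws-node probe failures.
kubectl get events -n kube-system --sort-by=.lastTimestamp \
  | grep -E 'aws-node.*(Liveness|Readiness) probe failed'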

Our setup looks like this:

  • Kubernetes 1.16.9 provisioned via kops (~850 nodes)
  • AWS CNI 1.6.1 running without resource limits
  • liveness and readiness probes:
      Requests:
      cpu:      10m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3

The health check behaves correctly when invoked manually from inside the container:

/app/grpc-health-probe -addr=:50051
{"level":"info","ts":"2020-06-16T16:12:25.209Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"status: SERVING"}
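(For completeness, a sketch of running the same check via kubectl exec; the pod name is one of ours and the container name aws-node is an assumption based on the default DaemonSet:)

# Run the gRPC health probe inside the aws-node container of a failing pod.
kubectl -n kube-system exec aws-node-2p68j -c aws-node -- /app/grpc-health-probe -addr=:50051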

We would really appreciate any help in continuing this investigation. Should we simply relax the probe timeouts? (We would love to hear from other folks.)
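(In case relaxing the timeouts is the way to go, a minimal sketch of how that could be applied; the container index 0 and the 10s value are assumptions, not a recommendation:)

# Hypothetical patch: bump the probe timeouts on the aws-node container of the DaemonSet.
kubectl -n kube-system patch daemonset aws-node --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":10},
  {"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":10}
]'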

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 25 (9 by maintainers)

Most upvoted comments

I am able to reproduce a similar issue.

kubectl describe pod aws-node-sgp4c -n kube-system

Warning  Unhealthy  10m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:07.661Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  10m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:14.960Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  10m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:24.957Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  10m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:34.957Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m52s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:44.970Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m42s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:54.965Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m32s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:43:04.966Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m22s                 kubelet            Liveness probe failed: {"level":"info","ts":"2022-03-17T12:43:14.983Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m21s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:43:15.045Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}

kubectl logs aws-node-sgp4c -n kube-system

{"level":"info","ts":"2022-03-17T12:58:06.329Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-03-17T12:58:06.330Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-03-17T12:58:06.350Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-03-17T12:58:06.352Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-03-17T12:58:08.362Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:10.370Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:12.378Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:14.387Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:16.397Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:18.406Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

Debugging from the node with tail -f /var/log/aws-routed-eni/ipamd.log | grep error:

{"level":"error","ts":"2022-03-17T13:06:14.594Z","caller":"ipamd/ipamd.go:456","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-0b8d9XXXXX0ff09 eni-0681XXXXX2427]: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://oidc.eks.eu-west-1.amazonaws.com/id/97043D9XXXXXXXXC65E8E946\n\tstatus code: 400, request id: d71dd6e4-a65f-4247-b21b-625d6d1d2787"}

Solution:

aws iam list-open-id-connect-providers | grep 97043D9XXXXXXXXC65E8E946

If output is returned from the previous command, then you already have a provider for your cluster. If no output is returned, then you must create an IAM OIDC provider:

eksctl utils associate-iam-oidc-provider --cluster EKS-cluster-name --approve
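(If you are unsure which OIDC ID to grep for, the issuer URL can be read from the cluster itself; EKS-cluster-name is a placeholder:)

# Print the cluster's OIDC issuer URL; the trailing path segment is the ID used in the grep above.
aws eks describe-cluster --name EKS-cluster-name \
  --query "cluster.identity.oidc.issuer" --output text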

Delete the pods so they are recreated by the DaemonSet:

kubectl delete pods -n kube-system -l k8s-app=aws-node

Confirm they are running:

kubectl get pods -n kube-system  -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-cnpq8   1/1     Running   0          19s

This is what the ipamd log states:

cat ipamd.log | grep error

{"level":"error","ts":"2021-04-20T14:17:02.797Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
{"level":"error","ts":"2021-04-20T14:32:17.001Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
{"level":"error","ts":"2021-04-20T16:32:31.709Z","caller":"ipamd/ipamd.go:931","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2021-04-20T16:32:31.709Z","caller":"ipamd/ipamd.go:494","msg":"IP pool reconcile: Failed to get attached ENI info: get attached ENIs: failed to retrieve interfaces data: RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2021-04-21T03:07:53.689Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}

FYI: the nodes are self-managed nodes.
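(Since those errors point at the instance metadata service, a hedged sketch of a quick check from the affected node; the instance ID is a placeholder and the hop-limit remark is only a common cause, not a diagnosis:)

# From the node: fetch an IMDSv2 token and query the interfaces path that ipamd uses.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/network/interfaces/macs/

# If the node can reach IMDS only intermittently, also check the metadata-options hop limit.
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query "Reservations[].Instances[].MetadataOptions"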

Hi, I hit the same problem today. In my case the cause is visible in /var/log/aws-routed-eni/ipamd.log on the node hosting the affected pod:

{"level":"error","ts":"2021-04-13T16:34:05.272Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: 
failed to allocate one IP addresses on ENI eni-04263466a7d86c036, err: allocate IP address:
failed to allocate a private IP address: 
InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.\n\tstatus code: 400, request id: f32133ee-7bc9-4259-b1fa-5a21b39e7165: allocate IP address:
failed to allocate a private IP address: 
InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.\n\tstatus code: 400, request id: f32133ee-7bc9-4259-b1fa-5a21b39e7165"}

So, in my case the subnet did not have enough free addresses. Perhaps this can help other people who have the same problem.
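(A hedged sketch of how to check the free address count per subnet; the VPC ID is a placeholder:)

# List each subnet in the VPC with its CIDR and remaining free IP addresses.
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
  --query "Subnets[].{Subnet:SubnetId,CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}" \
  --output table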

Any update on this issue? I am also facing a similar issue.

  • EKS version: 1.20.7
  • AWS CNI versions: amazon-k8s-cni-init:v1.7.5-eksbuild.1 and amazon-k8s-cni:v1.7.5-eksbuild.1

I am deploying the EKS add-on using a CloudFormation template; below is the snippet:

EksClusterAddon:
  Type: AWS::EKS::Addon
  Properties:
    AddonName: !Ref EksAddonName
    ClusterName: !Ref EksClusterName
    ResolveConflicts: OVERWRITE
    ServiceAccountRoleArn: !Ref EksServiceRoleArn

I am not specifying an AddonVersion, so the version above is picked up by AWS by default.
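(A hedged sketch of how an add-on version could be looked up and then pinned explicitly instead of relying on the default; the Kubernetes version flag matches the cluster version above:)

# List the vpc-cni add-on versions available for Kubernetes 1.20, so AddonVersion can be set explicitly.
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.20 \
  --query "addons[].addonVersions[].addonVersion" --output text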

Logs for the aws-node pod:

{"level":"info","ts":"2021-12-08T13:01:00.915Z","caller":"entrypoint.sh","msg":"Install CNI binary..."}
{"level":"info","ts":"2021-12-08T13:01:00.940Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-12-08T13:01:00.948Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
ERROR: logging before flag.Parse: E1208 13:01:01.348599 9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: E1208 13:02:01.442917 9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request


Hi Team,

I am facing the same issue: liveness and readiness probe failures on aws-node cause it to crash-loop, and the pods/services hosted on the node become unavailable.

Events:

  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  39m                  default-scheduler  Successfully assigned kube-system/aws-node-qct2m to xxxxxxxxxxxxxx
  Normal   Pulling    37m                  kubelet            Pulling image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.7.10"
  Normal   Pulled     37m                  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.7.10" in 690.175934ms
  Normal   Created    37m                  kubelet            Created container aws-vpc-cni-init
  Normal   Started    36m                  kubelet            Started container aws-vpc-cni-init
  Normal   Pulled     36m                  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10" in 300.589066ms
  Warning  Unhealthy  35m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:01.143Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
  Warning  Unhealthy  35m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:31.943Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
  Warning  Unhealthy  35m                  kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-20T14:53:47.560Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
  Warning  Unhealthy  34m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:57.051Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
  Warning  Unhealthy  34m                  kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-20T14:54:16.743Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
  Warning  Unhealthy  34m                  kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-20T14:54:42.655Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
  Normal   Killing    34m                  kubelet            Container aws-node failed liveness probe, will be restarted
  Warning  Unhealthy  34m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:54:39.438Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
  Warning  Unhealthy  33m                  kubelet            Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: read init-p: connection reset by peer: unknown
  Warning  Unhealthy  33m                  kubelet            Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: process_linux.go:103: executing setns process caused: exit status 1: unknown
  Normal   Pulling    33m (x2 over 36m)    kubelet            Pulling image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10"
  Normal   Pulled     33m                  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10" in 619.512059ms
  Normal   Created    33m (x2 over 36m)    kubelet            Created container aws-node
  Normal   Started    33m (x2 over 36m)    kubelet            Started container aws-node
  Warning  Unhealthy  33m                  kubelet            Readiness probe errored: rpc error: code = Unknown desc = container not running (a79647cd4b740690c9fb3211b40f9f183e3b4faf8e3efdcd7302eebdbb699b6c)
  Warning  Unhealthy  2m6s (x77 over 33m)  kubelet            (combined from similar events): Liveness probe failed: {"level":"info","ts":"2021-04-20T15:26:42.360Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}

EKS version: 1.19

Node version: v1.19.6-eks-49a6c0

Instance type: t3a.large

kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.10
amazon-k8s-cni:v1.7.10

kubectl describe daemonset aws-node -n kube-system | grep -i "ENABLE_POD_ENI"
ENABLE_POD_ENI: false

As a workaround I tried to increase the connect timeout for the liveness and readiness probe exec commands, but it still doesn't work:

Liveness:   exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=10s] delay=60s timeout=10s period=10s #success=1 #failure=3
Readiness:  exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=10s] delay=15s timeout=10s period=10s #success=1 #failure=3

The issue is not related to #1360, as I checked the loopback interface on a node with a failing aws-node pod:

ifconfig -a

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 133982  bytes 9506349 (9.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 133982  bytes 9506349 (9.0 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

Kindly let me know if you need more information.

After a week of monitoring I'm not recording any restarts or liveness probe failures anymore, so I'm considering this issue closed. Thank you @mogren for the support.

@mogren I've continued the investigation on my end and noticed that my servers were hitting a lot of ARP cache overflows. For the past two days I have been running with the following settings (I have a VPC with two /16 CIDRs):

net.ipv4.neigh.default.gc_thresh2 = 65534 (50% of 2 VPC)
net.ipv4.neigh.default.gc_thresh3 = 104854 (80% of 2 VPC)

and I've seen no sign of failing liveness probes. I will keep monitoring until the end of next week and report back here.
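(For anyone wanting to try the same thing, a minimal sketch of applying and persisting those values on a node; the drop-in file name is arbitrary:)

# Apply the neighbor-table thresholds from above immediately.
sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=65534
sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=104854

# Persist them across reboots.
cat <<'EOF' | sudo tee /etc/sysctl.d/99-arp-cache.conf
net.ipv4.neigh.default.gc_thresh2 = 65534
net.ipv4.neigh.default.gc_thresh3 = 104854
EOF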