amazon-vpc-cni-k8s: Failing liveness probe
We are recording a high number of AWS CNI container restarts.
Searching our Splunk logs didn't lead us to any particular error.
Looking at the Kubernetes events for the kube-system namespace, we can see that the pods are failing their liveness and readiness probes:
107s Warning Unhealthy pod/aws-node-2p68j Readiness probe failed: {"level":"info","ts":"2020-06-16T15:58:23.812Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
107s Warning Unhealthy pod/aws-node-2p68j Liveness probe failed: {"level":"info","ts":"2020-06-16T15:58:23.867Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
18m Warning Unhealthy pod/aws-node-4zjf7 Readiness probe failed: {"level":"info","ts":"2020-06-16T15:41:27.975Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
18m Warning Unhealthy pod/aws-node-4zjf7 Liveness probe failed: {"level":"info","ts":"2020-06-16T15:41:30.484Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
25m Warning Unhealthy pod/aws-node-7c6zn Liveness probe failed: {"level":"info","ts":"2020-06-16T15:34:21.820Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
25m Warning Unhealthy pod/aws-node-7c6zn Readiness probe failed: {"level":"info","ts":"2020-06-16T15:34:23.817Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
24m Warning Unhealthy pod/aws-node-7c6zn Liveness probe failed: {"level":"info","ts":"2020-06-16T15:35:11.820Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
24m Warning Unhealthy pod/aws-node-7c6zn Readiness probe failed: {"level":"info","ts":"2020-06-16T15:35:13.828Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
23m Warning Unhealthy pod/aws-node-7c6zn Liveness probe failed: {"level":"info","ts":"2020-06-16T15:36:11.821Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
23m Warning Unhealthy pod/aws-node-7c6zn Readiness probe failed: {"level":"info","ts":"2020-06-16T15:36:13.821Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
32s Warning Unhealthy pod/aws-node-b6szl Readiness probe failed: {"level":"info","ts":"2020-06-16T15:59:38.295Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Our setup looks like this:
- Kubernetes 1.16.9 provisioned via kops (~850 nodes)
- AWS CNI 1.6.1 running without resource limits
- liveness and readiness probes:
Requests:
cpu: 10m
Liveness: exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
The liveness probe seems to behave correctly when invoked from inside the container:
/app/grpc-health-probe -addr=:50051
{"level":"info","ts":"2020-06-16T16:12:25.209Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"status: SERVING"}
Would really appreciate any help in continuing this investigation. Shall we just relax the timeout specs? (Would love to hear from other folks.)
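For reference, relaxing the probe timing on the DaemonSet could look roughly like the sketch below; it assumes aws-node is the first container in the pod spec, and the 10s values are purely illustrative, not a recommendation:

# raise kubelet probe timeouts on the aws-node DaemonSet (illustrative values)
kubectl -n kube-system patch daemonset aws-node --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10}
]'

Note that kubelet's timeoutSeconds and the grpc-health-probe -connect-timeout flag are separate settings; a later comment in this thread raises both.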
I am able to reproduce a similar issue.
kubectl describe pod aws-node-sgp4c -n kube-system
kubectl logs aws-node-sgp4c -n kube-system
Debugging from the node:
tail -f /var/log/aws-routed-eni/ipamd.log | grep error
{"level":"error","ts":"2022-03-17T13:06:14.594Z","caller":"ipamd/ipamd.go:456","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-0b8d9XXXXX0ff09 eni-0681XXXXX2427]: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://oidc.eks.eu-west-1.amazonaws.com/id/97043D9XXXXXXXXC65E8E946\n\tstatus code: 400, request id: d71dd6e4-a65f-4247-b21b-625d6d1d2787"}
Solution:
aws iam list-open-id-connect-providers | grep 97043D9XXXXXXXXC65E8E946
If output is returned from the previous command, then you already have a provider for your cluster. If no output is returned, then you must create an IAM OIDC provider:
eksctl utils associate-iam-oidc-provider --cluster EKS-cluster-name --approve
Delete the pods so their DaemonSet recreates them:
kubectl delete pods -n kube-system -l k8s-app=aws-node
Confirm they are running.
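For example, using the same k8s-app=aws-node label as in the delete command above:

kubectl get pods -n kube-system -l k8s-app=aws-node -o wide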
This is what the ipamd log states:
cat ipamd.log | grep error
{"level":"error","ts":"2021-04-20T14:17:02.797Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
{"level":"error","ts":"2021-04-20T14:32:17.001Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
{"level":"error","ts":"2021-04-20T16:32:31.709Z","caller":"ipamd/ipamd.go:931","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2021-04-20T16:32:31.709Z","caller":"ipamd/ipamd.go:494","msg":"IP pool reconcile: Failed to get attached ENI info: get attached ENIs: failed to retrieve interfaces data: RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2021-04-21T03:07:53.689Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
FYI: the nodes are self-managed nodes.
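For reference, every error above is the node failing to reach the EC2 instance metadata service, which ipamd depends on. A quick check from the affected node (a sketch assuming IMDSv2; the 60-second token TTL is arbitrary):

# request an IMDSv2 token, then query the interface list ipamd is asking for
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" "http://169.254.169.254/latest/meta-data/network/interfaces/macs/"

If these calls hang or time out on the host itself, ipamd cannot populate its IP pool and the failing probes are just the downstream symptom.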
Hi, I have the same problem today. In my case the cause can be seen in
/var/log/aws-routed-eni/ipamd.log
on the node containing the affected pod. So, I made sure I have enough free addresses in the subnet; perhaps it can help other people who have the same problem.
Any update on this issue? I am also facing a similar issue.
I am deploying the EKS add-on using a CloudFormation script; below is the snippet:
EksClusterAddon:
  Type: AWS::EKS::Addon
  Properties:
    AddonName: !Ref EksAddonName
    ClusterName: !Ref EksClusterName
    ResolveConflicts: OVERWRITE
    ServiceAccountRoleArn: !Ref EksServiceRoleArn
I am not providing any AddonVersion, so AWS picks up the default version.
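For reference, if you would rather pin the version than take the default, the available versions can be listed and then set through the AddonVersion property (the <k8s-version> placeholder stands for your cluster's Kubernetes version):

# list the vpc-cni add-on versions published for a given Kubernetes version
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version <k8s-version> \
  --query 'addons[].addonVersions[].addonVersion' --output text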
Logs for the aws-node pod:
{"level":"info","ts":"2021-12-08T13:01:00.915Z","caller":"entrypoint.sh","msg":"Install CNI binary ..."}
{"level":"info","ts":"2021-12-08T13:01:00.940Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ..."}
{"level":"info","ts":"2021-12-08T13:01:00.948Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ..."}
ERROR: logging before flag.Parse: E1208 13:01:01.348599 9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: E1208 13:02:01.442917 9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
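For reference, the metrics.k8s.io errors above usually mean the aggregated metrics API (metrics-server) is unavailable rather than pointing at the CNI itself; a quick check, assuming metrics-server is deployed with its usual k8s-app=metrics-server label:

# is the aggregated API registered and available?
kubectl get apiservice v1beta1.metrics.k8s.io
# is metrics-server itself running?
kubectl get pods -n kube-system -l k8s-app=metrics-server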
Hi Team,
I am facing the same issue: liveness and readiness probe failures on aws-node cause it to crash-loop, and the pods/services hosted on the node become unavailable.
Events:
Type     Reason     Age                  From               Message
Normal   Scheduled  39m                  default-scheduler  Successfully assigned kube-system/aws-node-qct2m to xxxxxxxxxxxxxx
Normal   Pulling    37m                  kubelet            Pulling image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.7.10"
Normal   Pulled     37m                  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.7.10" in 690.175934ms
Normal   Created    37m                  kubelet            Created container aws-vpc-cni-init
Normal   Started    36m                  kubelet            Started container aws-vpc-cni-init
Normal   Pulled     36m                  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10" in 300.589066ms
Warning  Unhealthy  35m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:01.143Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  35m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:31.943Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  35m                  kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-20T14:53:47.560Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  34m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:57.051Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  34m                  kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-20T14:54:16.743Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  34m                  kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-20T14:54:42.655Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Normal   Killing    34m                  kubelet            Container aws-node failed liveness probe, will be restarted
Warning  Unhealthy  34m                  kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-20T14:54:39.438Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
Warning  Unhealthy  33m                  kubelet            Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: read init-p: connection reset by peer: unknown
Warning  Unhealthy  33m                  kubelet            Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: process_linux.go:103: executing setns process caused: exit status 1: unknown
Normal   Pulling    33m (x2 over 36m)    kubelet            Pulling image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10"
Normal   Pulled     33m                  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10" in 619.512059ms
Normal   Created    33m (x2 over 36m)    kubelet            Created container aws-node
Normal   Started    33m (x2 over 36m)    kubelet            Started container aws-node
Warning  Unhealthy  33m                  kubelet            Readiness probe errored: rpc error: code = Unknown desc = container not running (a79647cd4b740690c9fb3211b40f9f183e3b4faf8e3efdcd7302eebdbb699b6c)
Warning  Unhealthy  2m6s (x77 over 33m)  kubelet            (combined from similar events): Liveness probe failed: {"level":"info","ts":"2021-04-20T15:26:42.360Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 10s"}
EKS version: 1.19
Node version: v1.19.6-eks-49a6c0
Instance type: t3a.large
kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.10
amazon-k8s-cni:v1.7.10

kubectl describe daemonset aws-node -n kube-system | grep -i "ENABLE_POD_ENI"

ENABLE_POD_ENI: false

As a workaround I tried to increase the connect timeout for the liveness and readiness probe exec, but it still doesn't work:
Liveness: exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=10s] delay=60s timeout=10s period=10s #success=1 #failure=3
Readiness: exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=10s] delay=15s timeout=10s period=10s #success=1 #failure=3
The issue is not related to #1360, as I checked the loopback interface on a node with a failing aws-node pod:
ifconfig -a
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 133982  bytes 9506349 (9.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 133982  bytes 9506349 (9.0 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Kindly let me know if you need more information.
After a week of monitoring I'm not recording any restarts or liveness probe failures anymore, so I'm considering this issue closed. Thank you @mogren for the support.
@mogren I've continued the investigation on my end and noticed that my server was hitting a lot of ARP cache overflows. It has been 2 days that I've been running with the changed settings (I have a VPC with two /16 CIDRs):
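(The exact settings were lost from this comment; purely as an illustration, raising the kernel neighbor-table garbage-collection thresholds via sysctl is the usual knob for ARP cache overflow. The values below are assumptions, not the commenter's configuration.)

# check whether the node is actually overflowing its neighbor table
dmesg | grep -i "table overflow"
# illustrative thresholds only
sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
sysctl -w net.ipv4.neigh.default.gc_thresh2=16384
sysctl -w net.ipv4.neigh.default.gc_thresh3=32768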
Since then I've seen no sign of failing liveness probes. I will keep monitoring until the end of next week and will report back here.