amazon-vpc-cni-k8s: Timeout waiting for IPAM to start causing perpetual failure (1.6.1, EKS 1.15)

We’ve been plagued for a while (at least through 1.5.x, maybe even before) by the CNI dying (not sure how) and never coming back up. This sort of issue causes a node to keep accepting workloads that can’t provide any value because networking is broken.

We also have #1054 open at the moment, which covers successful restarts and our attempt to understand why we’re restarting in the first place.

For one of the nodes where this happened, we cordoned it and blocked the cluster autoscaler from touching it so we could observe. All information below relates to the aws-node instance running on that node, aws-node-mq5bk.

We had case 7079142701 open with AWS, where they suggested #1019 (issue #1011) would fix the problem but couldn’t assist beyond that. We’ve also reached out to our TAM to set up a meeting with your CNI team, and he hasn’t responded in over a week.

Therefore, I’m writing this ticket to consolidate the available information from what I’m seeing, because I don’t think this problem is quite like the others (see the list at the bottom). Please @mogren, you’re my only hope!

I’d really like to verify with engineers that our problem is indeed solved by #1019. We’ve been told by support to “just upgrade™” before and still saw issues.

Logs K8s provides (kubectl logs):

Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0624 20:52:52.156133       6 memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

Errors the support engineer found in the logs (provided via eks-logs-collector.sh), from ipamd.log:

- {"level":"error","ts":"2020-06-04T13:37:58.660Z","caller":"ipamd/ipamd.go:962","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/:  dial tcp 169.254.169.254:80: connect: connection refused"}
- {"level":"error","ts":"2020-06-04T19:07:03.780Z","caller":"aws-k8s-agent/main.go:30","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info"}

From plugin.log:

- {"level":"error","ts":"2020-06-09T04:16:03.253Z","caller":"routed-eni-cni-plugin/cni.go:252","msg":"Error received from DelNetwork gRPC call for pod edge-platform-gateway-66f6864f48-5wbvn namespace default sandbox c7bda6671561a7bfa978017c7f66b440df7991a2efe15b8ebc2ff3bf9c6a26b8: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}

A unique readiness probe failure from around the time I think the container crashed and never came back (~Jun 3/4):

Events from the Pod kube-system/aws-node-mq5bk
1 Unhealthy: Readiness probe failed: runtime: failed to create new OS thread (have 8 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x5653e61c1951, 0x9)
/usr/local/go/src/runtime/panic.go:774 +0x74
runtime.newosproc(0xc000111500)
/usr/local/go/src/runtime/os_linux.go:153 +0x1bc
runtime.newm1(0xc000111500)
/usr/local/go/src/runtime/proc.go:1853 +0xde
runtime.newm(0x0, 0xc000030500)
/usr/local/go/src/runtime/proc.go:1832 +0x93
runtime.startm(0xc000030500, 0xc000111100)
/usr/local/go/src/runtime/proc.go:1969 +0x12c
runtime.handoffp(0xc000030500)
/usr/local/go/src/runtime/proc.go:1996 +0x54
runtime.entersyscallblock_handoff()
/usr/local/go/src/runtime/proc.go:2937 +0x32
runtime.systemstack(0x0)
/usr/local/go/src/runtime/asm_amd64.s:370 +0x63
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1146

goroutine 23 [syscall]:
runtime.notetsleepg(0x5653e6964660, 0x3b5d979a, 0x0)
/usr/local/go/src/runtime/lock_futex.go:227 +0x38 fp=0xc00004cf60 sp=0xc00004cf30 pc=0x5653e5da1658
runtime.timerproc(0x5653e6964640)
/usr/local/go/src/runtime/time.go:311 +0x2f5 fp=0xc00004cfd8 sp=0xc00004cf60 pc=0x5653e5de2565
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004cfe0 sp=0xc00004cfd8 pc=0x5653e5df1821
created by runtime.(*timersBucket).addtimerLocked
/usr/local/go/src/runtime/time.go:169 +0x110

goroutine 1 [select]:
google.golang.org/grpc.(*ClientConn).WaitForStateChange(0xc000110e00, 0x5653e6643e60, 0xc00008e840, 0x0, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.28.0/clientconn.go:502 +0x103
google.golang.org/grpc.DialContext(0x5653e6643e60, 0xc00008e840, 0x7ffff42f3510, 0x6, 0xc000090800, 0x3, 0x4, 0x0, 0x0, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.28.0/clientconn.go:322 +0xca2
main.main()
/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/grpc-health-probe/main.go:99 +0x344

goroutine 20 [syscall]:
os/signal.signal_recv(0x0)
/usr/local/go/src/runtime/sigqueue.go:147 +0x9e
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:23 +0x24
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:29 +0x43

goroutine 22 [chan receive]:
main.main.func1(0xc00008e780, 0xc000095980)
/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/grpc-health-probe/main.go:79 +0x43
created by main.main
/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/grpc-health-probe/main.go:78 +0x11d

goroutine 24 [runnable]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc000090940)
/go/pkg/mod/google.golang.org/grpc@v1.28.0/balancer_conn_wrappers.go:67
created by google.golang.org/grpc.newCCBalancerWrapper
/go/pkg/mod/google.golang.org/grpc@v1.28.0/balancer_conn_wrappers.go:60 +0x16f

goroutine 25 [chan receive]:
net.(*netFD).connect.func1(0xc000092900, 0xc000092960, 0xc0000829b0, 0xc000104380)
/usr/local/go/src/net/fd_unix.go:117 +0x51
net.(*netFD).connect(0xc000104380, 0x5653e6643e60, 0xc00008ea20, 0x0, 0x0, 0x5653e663b8a0, 0xc0000b66c0, 0x5653e663b8a0, 0xc0000261c0, 0x0, ...)
/usr/local/go/src/net/fd_unix.go:173 +0x42e
net.(*netFD).dial(0xc000104380, 0x5653e6643e60, 0xc00008ea20, 0x5653e6646160, 0x0, 0x5653e6646160, 0xc00011e6f0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/sock_posix.go:149 +0x101
net.socket(0x5653e6643e60, 0xc00008ea20, 0x5653e61bd657, 0x3, 0x2, 0x1, 0x0, 0x0, 0x5653e6646160, 0x0, ...)
/usr/local/go/src/net/sock_posix.go:70 +0x1c2
net.internetSocket(0x5653e6643e60, 0xc00008ea20, 0x5653e61bd657, 0x3, 0x5653e6646160, 0x0, 0x5653e6646160, 0xc00011e6f0, 0x1, 0x0, ...)
/usr/local/go/src/net/ipsock_posix.go:141 +0x143
net.(*sysDialer).doDialTCP(0xc000104300, 0x5653e6643e60, 0xc00008ea20, 0x0, 0xc00011e6f0, 0x5653e659e440, 0x5653e697ce80, 0x0)
/usr/local/go/src/net/tcpsock_posix.go:65 +0xc4
net.(*sysDialer).dialTCP(0xc000104300, 0x5653e6643e60, 0xc00008ea20, 0x0, 0xc00011e6f0, 0xc000095bc0, 0x10, 0x10)
/usr/local/go/src/net/tcpsock_posix.go:61 +0xd9
net.(*sysDialer).dialSingle(0xc000104300, 0x5653e6643e60, 0xc0000

Wed Jun 03 2020 09:29:22 GMT-0400 (EDT)

I then see this liveness probe failure about 20 seconds later:

 1 Unhealthy: Liveness probe failed: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused \"exit status 1\"": unknown
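
Since the readiness probe output above hints at max user processes (ulimit -u), these are the limits I’d check on the node; a sketch assuming shell access and Docker as the runtime (the daemon.json path is an assumption about where ulimits would be configured):

# On the affected node: check the process/thread limits the Go runtime complained about.
ulimit -u                          # max user processes for the current shell/user
cat /proc/sys/kernel/threads-max   # system-wide thread ceiling
cat /proc/sys/kernel/pid_max       # system-wide PID ceiling

# If Docker enforces default ulimits for containers, they would typically be set here
# (path is an assumption; adjust for your AMI/runtime).
grep -A3 'default-ulimits' /etc/docker/daemon.json 2>/dev/null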

Here are more container details in its current state:

Containers:
  aws-node:
    Container ID:   docker://b1fc0367d03d6172a225b9d34018d8532e49aa2437a4b608a655487519b712e3
    Image:          602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.6.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni@sha256:d50a182475c5ee6c18c3b81b01aa649367f30fb0dc60f7a619dcdbf45e10b3a3
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 24 Jun 2020 16:52:52 -0400
      Finished:     Wed, 24 Jun 2020 16:53:28 -0400
    Ready:          False
    Restart Count:  5203
    Requests:
      cpu:      10m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
    Environment:
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:  prng
      MINIMUM_IP_TARGET:              15
      WARM_IP_TARGET:                 7
      AWS_VPC_K8S_CNI_LOGLEVEL:       DEBUG
      AWS_VPC_K8S_CNI_VETHPREFIX:     eni
      AWS_VPC_ENI_MTU:                9001
      MY_NODE_NAME:                    (v1:spec.nodeName)
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /host/var/log from log-dir (rw)
      /var/run/docker.sock from dockersock (rw)
      /var/run/dockershim.sock from dockershim (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from aws-node-token-mkxkm (ro)
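
Since both probes exec /app/grpc-health-probe -addr=:50051 inside the container, the same check can be run by hand to capture the full error instead of the truncated probe event; a sketch using the pod from this report (the exec only works while the container is briefly up between crashes):

# Run the exact command the kubelet probes run, interactively.
kubectl -n kube-system exec aws-node-mq5bk -c aws-node -- /app/grpc-health-probe -addr=:50051

# Pull logs from the previous (crashed) container instance as well.
kubectl -n kube-system logs aws-node-mq5bk -c aws-node --previous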

Other suggested issues:

  1. https://github.com/aws/amazon-vpc-cni-k8s/issues/950
  2. https://github.com/aws/amazon-vpc-cni-k8s/issues/872
  3. https://github.com/aws/amazon-vpc-cni-k8s/issues/486, specifically https://github.com/aws/amazon-vpc-cni-k8s/issues/486#issuecomment-640854414 and the follow-up https://github.com/aws/amazon-vpc-cni-k8s/issues/486#issuecomment-641546823 - Could this really just be IAM? (We use eksctl, so this is hard to believe; see the identity check sketched after this list.)
  4. https://github.com/aws/amazon-vpc-cni-k8s/issues/1008
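
To rule the IAM theory in or out (item 3), this is the kind of check I’d run; a sketch that assumes the AWS CLI is available on the node (or in a host-networked debug pod):

# Which IAM identity does the node actually resolve to?
aws sts get-caller-identity

# If IRSA is in play, check whether the aws-node service account is annotated with a role ARN.
kubectl -n kube-system get sa aws-node \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'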

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 9
  • Comments: 27 (13 by maintainers)

Most upvoted comments

@nigelellis Yes, as soon as we make it the default for new clusters. 🙂

Thanks a lot for confirming @nigelellis!

There will be a v1.7.2 out very soon with a lot of fixes. The v1.7.x versions now have an init container as well, so the config has changed a lot since v1.6.x.

Thank you, I’ll try this out tomorrow.

On Sun, Sep 13, 2020 at 3:18 PM, Claes Mogren <notifications@github.com> wrote:

@nigelellis (https://github.com/nigelellis) Hey, sorry for the late reply on this. In the pod spec you pasted, I don’t see unix:///var/run/dockershim.sock being mounted into the aws-node pod. The docker.sock is a Docker-specific API that was only used by CNI v1.5.x and earlier.

The missing lines are aws-k8s-cni.yaml#L141-L142 (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/config/v1.6/aws-k8s-cni.yaml#L141-L142) and aws-k8s-cni.yaml#L156-L158 (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/config/v1.6/aws-k8s-cni.yaml#L156-L158).

This was mentioned in the release notes for v1.6.0 (https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.6.0) and the next three releases when updating from earlier versions. Sorry for not noticing this right away. Could you please try adding /var/run/dockershim.sock?

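For anyone hitting the same thing: the missing dockershim mount from the v1.6 manifest can be added to an existing daemonset with a patch along these lines; a sketch only, and re-applying the official aws-k8s-cni.yaml linked above is the safer route:

# Add the dockershim.sock hostPath volume and its mount to the aws-node daemonset
# (sketch; mirrors the lines referenced from config/v1.6/aws-k8s-cni.yaml).
kubectl -n kube-system patch daemonset aws-node --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/volumes/-",
   "value": {"name": "dockershim", "hostPath": {"path": "/var/run/dockershim.sock"}}},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-",
   "value": {"name": "dockershim", "mountPath": "/var/run/dockershim.sock"}}
]'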

@mogren yes, that’s the last line in the log before the pod terminates with exit code 137. My read is it dies while reading from Docker.

I’m running the stock AWS image amazon-k8s-cni:v1.6.4. The kubelet is running EKS v1.17.9-eks-4c6976.

Looks like it’s running Docker docker://19.3.6 (found via kubectl get node | grep containerVersion).

Does that help?
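
(For reference, the runtime version can also be read straight from the node status; a small sketch where NODE_NAME is a placeholder:)

# Read the container runtime version from the node status (NODE_NAME is a placeholder).
kubectl get node NODE_NAME -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'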

Thanks, I’ll see if I can reproduce it tomorrow.

On Thu, Sep 03, 2020 at 5:20 PM, Claes Mogren <notifications@github.com> wrote:

@nigelellis (https://github.com/nigelellis) https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni is the relevant doc.

If you have access to the node, running sudo bash /opt/cni/bin/aws-cni-support.sh should give you an archive with all the relevant logs. (Note: you might want to send this through AWS support or email, depending on the size of the log file.)


Looks like I have a similar issue. In /var/log/aws-routed-eni/ipamd.log I see:

{"level":"error","ts":"2020-07-31T09:47:45.014Z","caller":"ipamd/ipamd.go:322","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-01f111154 eni-0822227]: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 111....111"}

despite having both ec2:DescribeNetworkInterfaces and sts:AssumeRoleWithWebIdentity explicitly attached to the node role. Please note “caused by: [nothing here]”.
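
When AssumeRoleWithWebIdentity returns 403 even though the permissions are attached, the usual culprit is the role’s trust policy rather than its permission policies. A sketch of how to inspect it, where ROLE_NAME is a placeholder for the role in the service account’s role-arn annotation:

# Inspect the trust policy of the IRSA role; it must trust the cluster's OIDC provider
# and allow the system:serviceaccount:kube-system:aws-node subject (ROLE_NAME is a placeholder).
aws iam get-role --role-name ROLE_NAME \
  --query 'Role.AssumeRolePolicyDocument' --output json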