cilium: Upgrading/reverting vpc-cni addon [v1.12.6-eksbuild.1 <-> v1.12.5-eksbuild.2] on EKS v1.24 breaks underlying network with Cilium as CNI
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Upgrading or reverting the vpc-cni addon [v1.12.6-eksbuild.1 <-> v1.12.5-eksbuild.2] on an EKS v1.24 cluster breaks networking: pods being deleted get stuck in the Terminating state, and new pods stay stuck in Pending.
This is the error message we received after upgrading the vpc-cni plugin, on any workload that was trying to be scheduled:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6e7f82eed3c3c40752c08e8fa672029640c64eb5b966f22597154d9cd1ac6bcb": plugin type="egress-v4-cni" name="egress-v4-cni" failed (add): netplugin failed: "panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x5598b4660a2c]\n\ngoroutine 1 [running, locked to thread]:\nmain.cmdAdd(0xc000180850)\n\t/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/egress-v4-cni-plugin/cni.go:256 +0xbec\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc0000d5ed8, 0xc000180850, {0x5598b4798d18, 0xc0000b8e40}, 0x5598b4796be8)\n\t/go/pkg/mod/github.com/containernetworking/cni@v0.8.1/pkg/skel/skel.go:166 +0x20a\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc0000d5ed8, 0x5598b48e5000?, 0xc0000d5ec0?, 0x5598b44be809?, {0x5598b4798d18, 0xc0000b8e40}, {0xc000018108, 0x15})\n\t/go/pkg/mod/github.com/containernetworking/cni@v0.8.1/pkg/skel/skel.go:218 +0x245\ngithub.com/containernetworking/cni/pkg/skel.PluginMainWithError(...)\n\t/go/pkg/mod/github.com/containernetworking/cni@v0.8.1/pkg/skel/skel.go:275\ngithub.com/containernetworking/cni/pkg/skel.PluginMain(0x5598b4667db5?, 0x17?, 0xc0000d5f60?, {0x5598b4798d18?, 0xc0000b8e40?}, {0xc000018108?, 0x0?})\n\t/go/pkg/mod/github.com/containernetworking/cni@v0.8.1/pkg/skel/skel.go:290 +0xd1\nmain.main()\n\t/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/egress-v4-cni-plugin/cni.go:250 +0x8e\n"
This is the error message we received after upgrading the vpc-cni plugin, for any pods that were stuck in the Terminating phase:
error killing pod: failed to "KillPodSandbox" for "a10c54e7-3cad-41d0-aaa4-440db2df3ceb" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"3e96ad4dcb4183d542d0bbaecc5bef1d31cca718b95d3f796f6de3805bad5fd5\": plugin type=\"egress-v4-cni\" name=\"egress-v4-cni\" failed (delete): failed to parse config: json: cannot unmarshal string into Go struct field NetConf.mtu of type int"
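For context on that second error: it looks like a plain JSON type mismatch between the generated conflist and what the egress-v4-cni plugin expects. A minimal sketch (the `NetConf` struct below is hypothetical, just mirroring the field named in the error) reproduces the same unmarshal failure when `mtu` is written as a string instead of a number:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical stand-in for the plugin's NetConf: the error message says
// the real struct declares mtu as an int.
type NetConf struct {
	Type string `json:"type"`
	MTU  int    `json:"mtu"`
}

func main() {
	// mtu rendered as a string, as a stale or mixed CNI config might contain.
	bad := []byte(`{"type": "egress-v4-cni", "mtu": "9001"}`)

	var conf NetConf
	if err := json.Unmarshal(bad, &conf); err != nil {
		// Prints: json: cannot unmarshal string into Go struct field NetConf.mtu of type int
		fmt.Println(err)
	}
}
```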
Note: Known workaround: deleting the Cilium DaemonSet pod on the affected node resolves the issue.
Cilium Version
v1.13.2
Kernel Version
Bottlerocket OS 1.13.3 (aws-k8s-1.24)
5.15.102
Kubernetes Version
v1.24.10-eks-48e63af
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 6
- Comments: 15 (7 by maintainers)
For people who may land here: I confirm that rolling out the Cilium DaemonSet fixes the issue while we wait for the real fix.
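For anyone scripting that workaround: a rollout restart is just a patch that bumps the pod template's restartedAt annotation. A minimal client-go sketch of the same thing (assuming the DaemonSet is named `cilium` in `kube-system` and a standard kubeconfig is available) could look roughly like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a standard kubeconfig; adjust for in-cluster config as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Equivalent of `kubectl -n kube-system rollout restart daemonset/cilium`:
	// bump the restartedAt annotation on the pod template so every Cilium
	// agent pod is recreated and regenerates its CNI configuration.
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339),
	)
	if _, err := client.AppsV1().DaemonSets("kube-system").Patch(
		context.TODO(), "cilium", types.StrategicMergePatchType,
		[]byte(patch), metav1.PatchOptions{},
	); err != nil {
		panic(err)
	}
}
```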
@squeed just to be sure I understand your last comment correctly: if one faces the issue reported here, triggering a rollout of the Cilium DaemonSet should allow it to regenerate the config properly and fix the issue?
The problem is that in v1.13 and prior, we only generate the CNI configuration file on agent startup. If the “source” configuration file (i.e. 10-aws-cni.conflist) changes after startup, we miss that.
The “easiest” fix is to exit the agent when the file changes, but maybe there’s a better way. Hmm.
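Not speaking for any actual implementation, but for illustration, the "exit when the source file changes" idea could look roughly like the following fsnotify sketch; the path and the decision to simply exit are placeholders, not Cilium's real code:

```go
package main

import (
	"log"
	"os"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// Placeholder path: the "source" config the agent reads at startup.
	const src = "/etc/cni/net.d/10-aws-cni.conflist"

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory rather than the file, since writers often replace
	// the conflist atomically (rename), which would invalidate a file-level watch.
	if err := watcher.Add("/etc/cni/net.d"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev := <-watcher.Events:
			if ev.Name == src && ev.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Rename) != 0 {
				// Simplest possible reaction, as suggested above: exit and let
				// the DaemonSet restart the agent, which then regenerates the
				// Cilium CNI configuration from the fresh source file.
				log.Printf("%s changed (%s), exiting to force regeneration", src, ev.Op)
				os.Exit(0)
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```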
I’ve raised a similar issue on the VPC CNI plugin, since the stack trace seems to be coming from there: https://github.com/aws/amazon-vpc-cni-k8s/issues/2364
FWIW – all the files in my net.d dump are valid JSON (`jq` parses them anyway).