cilium: Upgrading/reverting vpc-cni addon [v1.12.6-eksbuild.1 <-> v1.12.5-eksbuild.2] on EKS v1.24 breaks underlying network with Cilium as CNI
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Upgrading or reverting the vpc-cni addon [v1.12.6-eksbuild.1 <-> v1.12.5-eksbuild.2] on an EKS v1.24 cluster breaks networking: pods being deleted get stuck in the Terminating state, and new pods stay stuck in Pending.
This is the error message we received after upgrading the vpc-cni plugin, on any workload that was trying to be scheduled:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6e7f82eed3c3c40752c08e8fa672029640c64eb5b966f22597154d9cd1ac6bcb": plugin type="egress-v4-cni" name="egress-v4-cni" failed (add): netplugin failed: "panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x5598b4660a2c]\n\ngoroutine 1 [running, locked to thread]:\nmain.cmdAdd(0xc000180850)\n\t/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/egress-v4-cni-plugin/cni.go:256 +0xbec\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc0000d5ed8, 0xc000180850, {0x5598b4798d18, 0xc0000b8e40}, 0x5598b4796be8)\n\t/go/pkg/mod/github.com/containernetworking/cni@v0.8.1/pkg/skel/skel.go:166 +0x20a\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc0000d5ed8, 0x5598b48e5000?, 0xc0000d5ec0?, 0x5598b44be809?, {0x5598b4798d18, 0xc0000b8e40}, {0xc000018108, 0x15})\n\t/go/pkg/mod/github.com/containernetworking/cni@v0.8.1/pkg/skel/skel.go:218 +0x245\ngithub.com/containernetworking/cni/pkg/skel.PluginMainWithError(...)\n\t/go/pkg/mod/github.com/containernetworking/cni@v0.8.1/pkg/skel/skel.go:275\ngithub.com/containernetworking/cni/pkg/skel.PluginMain(0x5598b4667db5?, 0x17?, 0xc0000d5f60?, {0x5598b4798d18?, 0xc0000b8e40?}, {0xc000018108?, 0x0?})\n\t/go/pkg/mod/github.com/containernetworking/cni@v0.8.1/pkg/skel/skel.go:290 +0xd1\nmain.main()\n\t/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/egress-v4-cni-plugin/cni.go:250 +0x8e\n"
This is the error message we received after upgrading the vpc-cni plugin, for any pods that were stuck in the Terminating phase:
error killing pod: failed to "KillPodSandbox" for "a10c54e7-3cad-41d0-aaa4-440db2df3ceb" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"3e96ad4dcb4183d542d0bbaecc5bef1d31cca718b95d3f796f6de3805bad5fd5\": plugin type=\"egress-v4-cni\" name=\"egress-v4-cni\" failed (delete): failed to parse config: json: cannot unmarshal string into Go struct field NetConf.mtu of type int"
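For context on that second error: it looks like a plain JSON type mismatch between the generated conflist and what the egress-v4-cni plugin expects. A minimal sketch (the `NetConf` struct below is hypothetical, just mirroring the field named in the error) reproduces the same unmarshal failure when `mtu` is written as a string instead of a number:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical stand-in for the plugin's NetConf: the error message says
// the real struct declares mtu as an int.
type NetConf struct {
	Type string `json:"type"`
	MTU  int    `json:"mtu"`
}

func main() {
	// mtu rendered as a string, as a stale or mixed CNI config might contain.
	bad := []byte(`{"type": "egress-v4-cni", "mtu": "9001"}`)

	var conf NetConf
	if err := json.Unmarshal(bad, &conf); err != nil {
		// Prints: json: cannot unmarshal string into Go struct field NetConf.mtu of type int
		fmt.Println(err)
	}
}
```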
Note: Known workaround: deleting the Cilium DaemonSet pod on the affected node resolves the issue.
Cilium Version
v1.13.2
Kernel Version
Bottlerocket OS 1.13.3 (aws-k8s-1.24)
5.15.102
Kubernetes Version
v1.24.10-eks-48e63af
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 6
- Comments: 15 (7 by maintainers)
For people who may land here: I confirm that rolling out the Cilium DaemonSet fixes the issue while we wait for the real fix.
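For anyone scripting that workaround: a rollout restart is just a patch that bumps the pod template's restartedAt annotation. A minimal client-go sketch of the same thing (assuming the DaemonSet is named `cilium` in `kube-system` and a standard kubeconfig is available) could look roughly like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a standard kubeconfig; adjust for in-cluster config as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Equivalent of `kubectl -n kube-system rollout restart daemonset/cilium`:
	// bump the restartedAt annotation on the pod template so every Cilium
	// agent pod is recreated and regenerates its CNI configuration.
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339),
	)
	if _, err := client.AppsV1().DaemonSets("kube-system").Patch(
		context.TODO(), "cilium", types.StrategicMergePatchType,
		[]byte(patch), metav1.PatchOptions{},
	); err != nil {
		panic(err)
	}
}
```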
@squeed just to be sure I understand your last comment correctly: if one faces the issue reported here, triggering a rollout of the Cilium DaemonSet should allow it to regenerate the config properly and fix the issue?
The problem is that in v1.13 and prior, we only generate the CNI configuration file on agent startup. If the “source” configuration file (i.e. 10-aws-cni.conflist) changes after startup, we miss that.
The “easiest” fix is to exit the agent when the file changes, but maybe there’s a better way. Hmm.
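Not speaking for any actual implementation, but for illustration, the "exit when the source file changes" idea could look roughly like the following fsnotify sketch; the path and the decision to simply exit are placeholders, not Cilium's real code:

```go
package main

import (
	"log"
	"os"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// Placeholder path: the "source" config the agent reads at startup.
	const src = "/etc/cni/net.d/10-aws-cni.conflist"

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory rather than the file, since writers often replace
	// the conflist atomically (rename), which would invalidate a file-level watch.
	if err := watcher.Add("/etc/cni/net.d"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev := <-watcher.Events:
			if ev.Name == src && ev.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Rename) != 0 {
				// Simplest possible reaction, as suggested above: exit and let
				// the DaemonSet restart the agent, which then regenerates the
				// Cilium CNI configuration from the fresh source file.
				log.Printf("%s changed (%s), exiting to force regeneration", src, ev.Op)
				os.Exit(0)
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```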
I’ve raised a similar issue on the VPC CNI plugin, since the stack trace seems to be coming from there: https://github.com/aws/amazon-vpc-cni-k8s/issues/2364
FWIW – all the files in my net.d dump are valid JSON (`jq` parses them anyway).