cilium: During deployment upgrade with high max surge, some pods are stuck in ContainerCreating state
Bug report
General Information
- Cilium version (run cilium version)
# cilium version
Client: 1.9.8 3fcfff7 2021-05-28T02:03:28+02:00 go version go1.15.12 linux/amd64
Daemon: 1.9.8 3fcfff7 2021-05-28T02:03:28+02:00 go version go1.15.12 linux/amd64
- Kernel version (run uname -a)
# uname -a
Linux ip-10-80-179-178.ec2.internal 4.14.232-176.381.amzn2.x86_64 #1 SMP Wed May 19 00:31:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Orchestration system version in use (e.g. kubectl version, …)
# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
- Link to relevant artifacts (policies, deployments scripts, …)
- Generate and upload a system zip: attached (cilium-sysdump-20210616-103544.zip). I added logs for one node with cilium-8bmkj, which is currently exhibiting the issue (it is random which cilium pod it happens on).
How to reproduce the issue
- Install cilium using the following helm command
helm install cilium cilium/cilium --version 1.9.8 \
  --namespace kube-system \
  --set egressMasqueradeInterfaces=eth0 \
  --set nodeinit.enabled=true
- Install a deployment with around 100 pods, with a max surge of 50% and max unavailable of 25% (see the sketch after this list)
- Change the image tag of the deployment to trigger pod replacement
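For illustration, a minimal sketch of the last two steps is below, assuming a placeholder deployment named surge-repro running nginx (the name, labels, and image are not from the original report):

```sh
# Create a deployment with 100 replicas, maxSurge 50% and maxUnavailable 25%
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: surge-repro            # hypothetical name, for illustration only
spec:
  replicas: 100
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%            # up to ~50 extra pods scheduled during the rollout
      maxUnavailable: 25%
  selector:
    matchLabels:
      app: surge-repro
  template:
    metadata:
      labels:
        app: surge-repro
    spec:
      containers:
      - name: app
        image: nginx:1.20
EOF

# Change the image tag to trigger the rollout; with maxSurge 50% roughly
# 50 new pods are created at once, and some of them end up stuck in
# ContainerCreating on the affected nodes.
kubectl set image deployment/surge-repro app=nginx:1.21

# Watch for pods stuck in ContainerCreating
kubectl get pods -l app=surge-repro -o wide | grep ContainerCreating
```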
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 15 (4 by maintainers)
After upgrading to the latest release, I can confirm that we have not hit the issue for the last 2-3 weeks, so it is resolved. Thank you all for the hard work.
Thanks @christarazi, we already played with the limits without any success. We did see major improvements in 1.10.3 on our staging cluster. I hope to upgrade the version on the production cluster this Sunday and see if it fixes the issue for good.
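For anyone tuning the same knob: assuming the limits referred to are the cilium agent's resource limits, they can be adjusted through the chart's resources values, for example (illustrative numbers only, not the values actually tried):

```sh
# Illustrative only: raise the cilium agent container's resource limits via Helm.
# Check the chart's values.yaml for your version to confirm the resources key.
helm upgrade cilium cilium/cilium --version 1.9.8 \
  --namespace kube-system \
  --reuse-values \
  --set resources.limits.cpu=2000m \
  --set resources.limits.memory=2Gi
```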