cilium: During deployment upgrade with high max surge, some pods are stuck in ContainerCreating state
Bug report
General Information
- Cilium version (run cilium version)
# cilium version
Client: 1.9.8 3fcfff7 2021-05-28T02:03:28+02:00 go version go1.15.12 linux/amd64
Daemon: 1.9.8 3fcfff7 2021-05-28T02:03:28+02:00 go version go1.15.12 linux/amd64
- Kernel version (run uname -a)
# uname -a
Linux ip-10-80-179-178.ec2.internal 4.14.232-176.381.amzn2.x86_64 #1 SMP Wed May 19 00:31:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Orchestration system version in use (e.g. kubectl version, …)
# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
- Link to relevant artifacts (policies, deployments scripts, …)
- Generate and upload a system zip: attached (cilium-sysdump-20210616-103544.zip). I added logs for one node with cilium-8bmkj, which is currently exhibiting the issue (it is random which cilium pod it happens on).
How to reproduce the issue
- Install cilium using the following helm command
helm install cilium cilium/cilium --version 1.9.8 \
  --namespace kube-system \
  --set egressMasqueradeInterfaces=eth0 \
  --set nodeinit.enabled=true
- Install a deployment with around 100 pods, with a max surge of 50% and max unavailable of 25% (see the sketch after this list)
- Change the image tag of the deployment to trigger pod replacement
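For illustration, a minimal sketch of the last two steps is below, assuming a placeholder deployment named surge-repro running nginx (the name, labels, and image are not from the original report):

```sh
# Create a deployment with 100 replicas, maxSurge 50% and maxUnavailable 25%
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: surge-repro            # hypothetical name, for illustration only
spec:
  replicas: 100
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%            # up to ~50 extra pods scheduled during the rollout
      maxUnavailable: 25%
  selector:
    matchLabels:
      app: surge-repro
  template:
    metadata:
      labels:
        app: surge-repro
    spec:
      containers:
      - name: app
        image: nginx:1.20
EOF

# Change the image tag to trigger the rollout; with maxSurge 50% roughly
# 50 new pods are created at once, and some of them end up stuck in
# ContainerCreating on the affected nodes.
kubectl set image deployment/surge-repro app=nginx:1.21

# Watch for pods stuck in ContainerCreating
kubectl get pods -l app=surge-repro -o wide | grep ContainerCreating
```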
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 15 (4 by maintainers)
After upgrading to the latest release, I can confirm that we have not hit the issue for the last 2-3 weeks, so it is resolved. Thank you all for the hard work.
Thanks @christarazi, we already played with the limits without any success. We did see major improvements in 1.10.3 on our staging cluster. I hope to upgrade the version on the production cluster this Sunday and see if it fixes the issue for good.
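For anyone tuning the same knob: assuming the limits referred to are the cilium agent's resource limits, they can be adjusted through the chart's resources values, for example (illustrative numbers only, not the values actually tried):

```sh
# Illustrative only: raise the cilium agent container's resource limits via Helm.
# Check the chart's values.yaml for your version to confirm the resources key.
helm upgrade cilium cilium/cilium --version 1.9.8 \
  --namespace kube-system \
  --reuse-values \
  --set resources.limits.cpu=2000m \
  --set resources.limits.memory=2Gi
```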