kubernetes: Bug: ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
- Nodes in NotReady status
- Pods in Unknown status
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Debian 8 (jessie)
- Kernel (e.g. uname -a): Linux ... 4.4.115-k8s #1 SMP Thu Feb 8 15:37:40 UTC 2018 x86_64 GNU/Linux
- Install tools: kops
- Others:
- Logs:
==> /var/log/daemon.log <==
May 15 12:02:19 ip-10-133-93-99 kubelet[2524]: E0515 12:02:19.150508 2524 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)
May 15 12:02:19 ip-10-133-93-99 kubelet[2524]: E0515 12:02:19.150586 2524 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)
May 15 12:02:19 ip-10-133-93-99 kubelet[2524]: E0515 12:02:19.150613 2524 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)
==> /var/log/syslog <==
May 15 12:02:21 ip-10-133-93-99 kubelet[2524]: E0515 12:02:21.554728 2524 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)
May 15 12:02:21 ip-10-133-93-99 kubelet[2524]: E0515 12:02:21.554842 2524 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)
May 15 12:02:21 ip-10-133-93-99 kubelet[2524]: E0515 12:02:21.554873 2524 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)
May 15 12:02:21 ip-10-133-93-99 kubelet[2524]: I0515 12:02:21.788754 2524 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 1h3m27.338264092s ago; threshold is 3m0s]
- Possible prerequisites: a large number of jobs (1657) that failed (state: Error) because of internal problems
- Cluster size: 3 masters, 4 worker nodes
- Workaround (how to get the node back):
docker system prune
...
service docker restart && service kubelet restart
- Possible related to https://github.com/kubernetes/kubernetes/issues/51099
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 31 (16 by maintainers)
Commits related to this issue
- Merge pull request #63894 from dims/bump-grpc-max-message-size-for-docker-service Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instru... — committed to kubernetes/kubernetes by deleted user 6 years ago
- Create kubelet_error — committed to mysunshine92/k8s-study by mysunshine92 6 years ago
For anyone operating a k8s cluster, running into this error and not being sure what the actual problem is… if you run
docker inspect $(docker ps -a | cut -d ' ' -f1 | tail -n+2) | wc -c
you'll see that a message describing all containers currently known to Docker is bigger than the limit the kubelet has set for gRPC messages (16MB at the time of writing). This either means you have a LOT of (possibly dead) containers, or there's a LOT of metadata attached to your containers. That metadata probably comes in the form of custom labels attached to the container image by whatever tool you're using to build the images. Check docker inspect for a couple of your containers and the issue should become apparent.

@dims yes, it would work, but then it's 1GB and we are in a runaway situation here: have 1GB today and you'll need 2GB tomorrow. 16MB is definitely an improvement and your change to 16MB is LGTM (thanks!), but the root problem still remains. The two solutions I can imagine are (1) an API change to add paging (or streaming), or (2) making the kubelet enforce a cap on the total number of containers (running + exited).
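To make the diagnosis above concrete, here are a few follow-up checks. This is only a sketch: it assumes the Docker CLI is available on the affected node, and <container-id> is a placeholder.
# Total number of containers Docker knows about (running and exited)
docker ps -a -q | wc -l
# How many of them are dead/exited
docker ps -a -q --filter status=exited | wc -l
# Size of the metadata for a single container (custom labels show up here)
docker inspect <container-id> | wc -c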
Since the issue also impacts ctr, I'm participating in the following upstream issue: https://github.com/containerd/containerd/issues/2320.

@dims I can already tell that it probably won't solve the problem (with containerd at least).
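For reference, a quick way to see how many containers containerd itself is tracking; a sketch, assuming the kubelet-managed containers live in the usual k8s.io namespace:
# Count containers known to containerd (output includes one header line)
ctr --namespace k8s.io containers list | wc -l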
The solution here for me: install nerdctl, then prune the dead containers. Once containers are pruned, you need to restart containerd for the node to show Ready again.
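The exact commands were not quoted above; a plausible sequence, assuming nerdctl is installed and the containers live in the k8s.io namespace, would be:
# Remove stopped containers (and other unused data) via nerdctl
nerdctl --namespace k8s.io system prune
# Restart containerd so the kubelet's CRI calls succeed again
systemctl restart containerd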
@oldthreefeng I’m glad it helped.
In my case, after adding --maximum-dead-containers=1000 it still happened once, so I had to add a docker-prune service as double insurance. Just for reference:
/etc/systemd/system/docker-prune.service
/etc/systemd/system/docker-prune.timer
It will run once a week, on Sunday. Enable and start both with systemctl, and check the status with systemctl list-timers.
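The unit files themselves were not captured above; a minimal sketch of what such units might look like (the schedule, prune options, and paths are assumptions, not the poster's exact files):
# /etc/systemd/system/docker-prune.service
[Unit]
Description=Prune unused Docker data
[Service]
Type=oneshot
ExecStart=/usr/bin/docker system prune -f

# /etc/systemd/system/docker-prune.timer
[Unit]
Description=Run docker-prune weekly
[Timer]
OnCalendar=Sun *-*-* 03:00:00
Persistent=true
[Install]
WantedBy=timers.target

Enable it with systemctl enable --now docker-prune.timer; the service unit stays inactive until the timer fires.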
My case was caused by too many dead Tekton pods (over 3000 dead containers). I will try adding --maximum-dead-containers=1000 to the kubelet to solve the issue. See https://kubernetes.io/docs/concepts/architecture/garbage-collection/#container-image-garbage-collection and https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/.

Thanks @MG2R, your comment was really helpful!! However, is there any systematic solution to resolve this, rather than running docker system prune on such a node or removing dead containers manually? I mean, is there any chance to increase the limit on the gRPC message size?

Really? There's an absolute universe more information in 1GB than in 4MB, just as it made sense not to make the limit 4096. Although the above is true, can we agree there is a huge difference, in human terms, between 4MB and, say, 1GB.
Pretty please change the limit to a round 1GB, or something that is configurable at run-time by the service publisher or as a connection option. Maybe an option flag that waives this limit at runtime?
This is a scary issue for us because now we have to write a segmenter/desegmenter in our client? Fear. Please let consumers of pb tech not have to do this. Everyone will just have to write their own segmenter/desegmenter, hurting pb's rep as the one-stop shop for interoperability.