kubernetes: Bug: ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

  • Nodes in NotReady status
  • Pods in Unknown status

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Debian 8 (jessie)
  • Kernel (e.g. uname -a): Linux ... 4.4.115-k8s #1 SMP Thu Feb 8 15:37:40 UTC 2018 x86_64 GNU/Linux
  • Install tools: kops
  • Others:
    • Logs:
==> /var/log/daemon.log <==
May 15 12:02:19 ip-10-133-93-99 kubelet[2524]: E0515 12:02:19.150508    2524 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)

May 15 12:02:19 ip-10-133-93-99 kubelet[2524]: E0515 12:02:19.150586    2524 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)

May 15 12:02:19 ip-10-133-93-99 kubelet[2524]: E0515 12:02:19.150613    2524 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)


==> /var/log/syslog <==
May 15 12:02:21 ip-10-133-93-99 kubelet[2524]: E0515 12:02:21.554728    2524 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)

May 15 12:02:21 ip-10-133-93-99 kubelet[2524]: E0515 12:02:21.554842    2524 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)

May 15 12:02:21 ip-10-133-93-99 kubelet[2524]: E0515 12:02:21.554873    2524 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)

May 15 12:02:21 ip-10-133-93-99 kubelet[2524]: I0515 12:02:21.788754    2524 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 1h3m27.338264092s ago; threshold is 3m0s]
  • Possible prerequisites: a lot of jobs (1657) that failed (state: Error) because of internal problems
  • Cluster size: 3 masters, 4 worker nodes
  • Workaround: how to get the node back
docker system prune
...
service docker restart && service kubelet restart
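A hedged, less destructive variant of that workaround (these exact commands are not from the report; plain docker system prune also removes dangling images, networks and build cache): remove only the exited and dead containers, since it is their accumulated metadata that inflates the ListContainers response, then restart the runtime and the kubelet as above.

# Remove only stopped/dead containers; leaves images, networks and volumes alone.
docker ps -aq --filter status=exited --filter status=dead | xargs -r docker rm
service docker restart && service kubelet restart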

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 31 (16 by maintainers)

Most upvoted comments

For anyone operating a k8s cluster, running into this error, and not being sure what the actual problem is: if you run docker inspect $(docker ps -a | cut -d ' ' -f1 | tail -n+2) | wc -c, you'll see that a message describing all containers currently known to Docker is bigger than the limit the kubelet sets for gRPC messages (16MB at the time of writing).

This either means you have a LOT of (possibly dead) containers, or there's a LOT of metadata attached to your containers. That metadata probably comes in the form of custom labels attached to the container image by whatever tool you're using to create the container images. Run docker inspect on a couple of your containers and the issue should become apparent.
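To narrow down which containers carry the most metadata, here is a minimal sketch (not from the original comment) that ranks containers by the size of their docker inspect output, largest first:

# Print "<bytes> <container id>" for every container, biggest offenders on top.
for id in $(docker ps -aq); do
  printf '%10d %s\n' "$(docker inspect "$id" | wc -c)" "$id"
done | sort -rn | head -n 20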

@dims yes it would work but it’s 1GB so we are in a runaway situation here; have 1GB today and you’ll need 2GB tomorrow. 16MB is definitely an improvement and your change to 16MB is LGTM (thanks!), but the root problem still remains. The 2 solutions I can imagine are (1) an API change to include paging (or streaming), or (2) make the kubelet ensure there’s less than a total number of containers (running+exited).

Since the issue also impacts ctr I’m participating in the following upstream issue: https://github.com/containerd/containerd/issues/2320.

@dims I can already tell that it probably won’t solve the problem (with containerd at least):

# ctr -n k8s.io c ls
ctr: grpc: trying to send message larger than max (925727095 vs. 16777216): unknown

The solution here, for me, was to install nerdctl, then:

sudo nerdctl -n k8s.io container prune
sudo systemctl restart containerd

Once the containers are pruned, you need to restart containerd for the node to show Ready again.
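A hedged post-restart sanity check (these commands are assumptions, not part of the original comment): once PLEG can list containers again, the node should return to Ready and the kubelet log should stop showing the ResourceExhausted error.

kubectl get nodes
# On a systemd-based node; adjust if your kubelet logs to /var/log/syslog instead.
journalctl -u kubelet --since "10 minutes ago" | grep -c ResourceExhausted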

@oldthreefeng I’m glad it helped.

In my case, after adding --maximum-dead-containers=1000, it still happened once afterwards.

So I had to add a docker-prune service as extra insurance.

Just for reference

/etc/systemd/system/docker-prune.service

[Unit]
Description=Docker System prune everything

[Service]
Type=oneshot
ExecStart=/usr/bin/sh -c '/usr/bin/docker system prune -a --force'

[Install]
WantedBy=multi-user.target

/etc/systemd/system/docker-prune.timer

[Unit]
Description=Run docker-prune Weekly on Sunday

[Timer]
OnCalendar=Sun

[Install]
WantedBy=multi-user.target

It will run once a week, on Sunday. Enable and start both units with systemctl enable and systemctl start, and check the status with systemctl list-timers.
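For completeness, a sketch of the commands to wire this up (unit names match the files above):

systemctl daemon-reload
systemctl enable docker-prune.timer && systemctl start docker-prune.timer
# Optionally run the prune once immediately:
systemctl start docker-prune.service
systemctl list-timers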

My case was caused by too many dead Tekton pods (over 3,000 dead containers). I will try adding --maximum-dead-containers=1000 to the kubelet to solve the issue: https://kubernetes.io/docs/concepts/architecture/garbage-collection/#container-image-garbage-collection https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
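As a rough sketch of where such a flag could go: the drop-in path and the KUBELET_EXTRA_ARGS variable below are assumptions following the kubeadm convention, not something from this thread; use whatever mechanism your installer provides for passing kubelet flags.

# /etc/systemd/system/kubelet.service.d/90-container-gc.conf  (hypothetical path)
# Only takes effect if the kubelet unit's ExecStart expands $KUBELET_EXTRA_ARGS.
[Service]
Environment="KUBELET_EXTRA_ARGS=--maximum-dead-containers=1000"

# Then reload and restart the kubelet:
systemctl daemon-reload && systemctl restart kubelet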

Thanks @MG2R, your comment was really helpful! However, is there any systematic way to resolve this, rather than running docker system prune on the affected node or removing dead containers manually? I mean, is there any chance to increase the limit on the gRPC message size?

Really? There’s an absolute universe more information in 1GB than in 4MB, just as it made sense not to make the limit 4096. Although the above is true, can we agree there is a huge difference, in human terms, between 4MB and, say, 1GB?

Pretty please change the limit to a round 1GB, or something that is configurable at run-time by the service publisher or as a connection option. Maybe an option flag that waives this limit at runtime?

This is a scary issue for us, because now we have to write a segmenter/desegmenter in our client. Please don’t make consumers of protobuf tech do this; everyone will just have to write their own segmenter/desegmenter, hurting protobuf’s reputation as the one-stop shop for interoperability.