helm: Tiller causing Kubelet to leak memory
On the node that is running Tiller, the Kubelet leaks memory. How much memory is leaked depends on how frequently the Helm client polls Tiller.
This has been verified on 3 different clusters running Kubernetes 1.6.8. Talking to other users, we have confirmed the problem also exists on 1.5.x, 1.7.x and 1.8.x clusters.
On a cluster where Helm commands were being issued 20 times per minute (across over 100 releases), the Kubelet grew to consume over 70GB of memory within a day.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13268 root 20 0 78.295g 4.521g 48336 S 24.6 30.3 12:53.83 kubelet
Further investigation identified an extremely large number of socat processes running on the node (over 1000). Here is a sample of the process list.
root 27500 0.0 0.0 21740 2976 ? S 07:52 0:00 /usr/bin/socat - TCP4:localhost:44134
root 27501 0.0 0.0 21740 3004 ? S 07:52 0:00 /usr/bin/socat - TCP4:localhost:44134
root 27502 0.0 0.0 21740 3004 ? S 07:52 0:00 /usr/bin/socat - TCP4:localhost:44134
root 27503 0.0 0.0 21740 2920 ? S 07:52 0:00 /usr/bin/socat - TCP4:localhost:44134
root 27504 0.0 0.0 21740 2812 ? S 07:52 0:00 /usr/bin/socat - TCP4:localhost:44134
root 27505 0.0 0.0 21740 2976 ? S 07:52 0:00 /usr/bin/socat - TCP4:localhost:44134
I reached out to @justinsb who verified this on a 1.8.x cluster on a fresh installation of Helm.
Justin identified the likely cause as the following block of code (and the comment explaining the edge case): https://github.com/kubernetes/kubernetes/blame/master/pkg/kubelet/dockershim/docker_streaming.go#L191-L192
// If we use Stdin, command.Run() won't return until the goroutine that's copying
// from stream finishes. Unfortunately, if you have a client like telnet connected
// via port forwarding, as long as the user's telnet client is connected to the user's
// local listener that port forwarding sets up, the telnet session never exits. This
// means that even if socat has finished running, command.Run() won't ever return
// (because the client still has the connection and stream open).
//
// The work around is to use StdinPipe(), as Wait() (called by Run()) closes the pipe
// when the command (socat) exits.
This code was added in the following PR: https://github.com/kubernetes/kubernetes/pull/12283
Which was designed to fix: https://github.com/kubernetes/kubernetes/issues/8766
Further discussion about this issue: https://groups.google.com/forum/#!topic/grpc-io/69k_6HKVai0
It seems likely there are two issues:
- Tiller is keeping its connections open.
- Kubelet is not handling bad actors who don’t close their connection.
kubectl logs and kubectl exec use this capability but clearly close the connection, which is why we do not see this problem on every cluster. Only large-scale Helm users are likely to issue enough queries to make this a noticeable problem.
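For the first issue, the fix referenced in the related commits below forces idle gRPC connections to drop after 10 minutes. The following is a minimal sketch of that approach, assuming grpc-go server keepalive options; the exact settings and wiring in Tiller may differ, and service registration is omitted.

package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// MaxConnectionIdle makes the server send a GOAWAY and close any
	// connection that has been idle for the given duration. The 10-minute
	// value mirrors the fix commit's description.
	server := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle: 10 * time.Minute,
	}))

	lis, err := net.Listen("tcp", ":44134") // Tiller's gRPC port
	if err != nil {
		panic(err)
	}
	// Register Tiller's ReleaseService on server here (omitted), then serve.
	_ = server.Serve(lis)
}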
About this issue
- State: closed
- Created 7 years ago
- Reactions: 8
- Comments: 40 (19 by maintainers)
Commits related to this issue
- fix(tiller): Forces close of idle gRPC connections Possibly fixes #3121. This forces idle connections to drop after 10 minutes — committed to thomastaylor312/helm by thomastaylor312 7 years ago
- Fix for HELM so it can't impact other nodes/pods HELM has a bug that makes kubelet use a lot of memory https://github.com/kubernetes/helm/issues/3121 To mitigate this until it is solved, we will run ... — committed to skyscrapers/terraform-kubernetes by MattiasGees 7 years ago
- Fix for HELM so it can't impact other nodes/pods (#33) HELM has a bug that makes kubelet use a lot of memory https://github.com/kubernetes/helm/issues/3121 To mitigate this until it is solved, we wi... — committed to skyscrapers/terraform-kubernetes by MattiasGees 7 years ago
Created https://github.com/kubernetes/kubernetes/issues/57992 for the port-forwarding memory leak.
@SamClinckspoor I think multiple people have seen the same behavior with a similar command, so I think it is a port forwarding issue. Do you think you can file an issue against kubernetes with the details from your last comment? (cc @mikelorant as well since he has been deep in this before and might be able to help)
I’ve just restarted the kubelet and am running fresh checks. I’ll see where I am at after 24 hours.