kubernetes: Memory leak in controller manager

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

Apologies for this not being more specific than “in the controller manager” – I’ve attached a heap dump which will hopefully clarify, but I haven’t yet figured out how to track this down to a specific controller. Right now we’re restarting the controller manager every hour to mitigate the issue.

What happened:

The controller manager leaked 32GB of memory before being OOM killed. Here’s a screenshot from our monitoring tool:

(screenshot: monitoring tool, 2017-09-18 9:00 AM)

We have quite a lot of pod churn in our cluster because we run cron jobs, which I believe is related to this. There are not very many active pods in our cluster at any given time – at most ~300.

Here’s a heap profile from pprof:

I’m not super experienced with reading pprof files, but this looks to me like pods are being leaked somewhere. There are about 1.2GB of pod objects, which is far more than the ~300 active pods in the cluster should account for – 300 pods would come to a few tens of MB at most, so the heap must be retaining a huge number of stale pod objects.

Here’s a summary of the pprof file above:

(pprof) top
Showing nodes accounting for 1914.61MB, 96.89% of 1976.01MB total
Dropped 328 nodes (cum <= 9.88MB)
Showing top 10 nodes out of 49
      flat  flat%   sum%        cum   cum%
  778.09MB 39.38% 39.38%   778.09MB 39.38%  runtime.rawstringtmp
  597.51MB 30.24% 69.62%  1211.27MB 61.30%  k8s.io/kubernetes/pkg/api/v1.(*PodSpec).Unmarshal
  183.11MB  9.27% 78.88%   494.25MB 25.01%  k8s.io/kubernetes/pkg/api/v1.(*Container).Unmarshal
  160.57MB  8.13% 87.01%   170.78MB  8.64%  runtime.mapassign
   88.07MB  4.46% 91.46%    88.07MB  4.46%  reflect.unsafe_New
      34MB  1.72% 93.19%       56MB  2.83%  k8s.io/kubernetes/pkg/api/v1.(*VolumeSource).Unmarshal
   27.01MB  1.37% 94.55%    66.01MB  3.34%  k8s.io/kubernetes/pkg/api/v1.(*PodStatus).Unmarshal
   19.05MB  0.96% 95.52%    19.05MB  0.96%  runtime.makemap
      17MB  0.86% 96.38%   541.01MB 27.38%  k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
   10.21MB  0.52% 96.89%    10.21MB  0.52%  runtime.hashGrow
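
For reference, a heap profile like the one above can be pulled straight from the controller manager’s debug endpoint, assuming profiling is enabled (the default) and the port is reachable; the port and auth requirements differ by version, so treat the commands below as a sketch rather than exact steps.

# Capture a heap profile from kube-controller-manager.
# On 1.7-era clusters /debug/pprof was served on the insecure port 10252;
# newer releases serve it only on the secure port (10257) and require auth.
go tool pprof http://127.0.0.1:10252/debug/pprof/heap

# In the interactive session, `top` prints the heaviest allocators (as in the
# summary above) and `web` renders the full call graph.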

Environment:

  • Kubernetes version: 1.7.3
  • Cloud provider: AWS
  • OS: Ubuntu 16.04
  • Kernel: 4.4

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 6
  • Comments: 36 (25 by maintainers)


Most upvoted comments

If you are rapidly creating and deleting pods containing tolerations, a memory leak was just found in the node lifecycle taint manager. See https://github.com/kubernetes/kubernetes/pull/65339 for details.

Running a DaemonSet with an unhealthy node will also trigger this issue: the kubelet rejects/deletes the pod because the node doesn’t have capacity, and the DaemonSet controller immediately creates a replacement pod for that node.

v1.9.8 -> v1.9.9 seems to have fixed kube-controller-manager RSS – the line has been roughly flat since the upgrade. (screenshot: dev_kube-controller-manager_rss, 2018-07-18)

Our use case creates on the order of ~1k Pods per hour from app-driven, scheduled Kubernetes Jobs, and these Pods do indeed carry tolerations (to land them on specific nodes).
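
For illustration only (the name, schedule, image, and toleration values below are made up, not taken from this thread), the churn pattern being described is roughly a frequently scheduled CronJob whose pods carry tolerations – the combination the taint-manager fix in #65339 addresses. Note the apiVersion varies by cluster version (batch/v1 on current clusters, batch/v1beta1 or batch/v2alpha1 on older ones).

kubectl apply -f - <<'EOF'
apiVersion: batch/v1                 # batch/v2alpha1 on 1.7-era clusters
kind: CronJob
metadata:
  name: example-churny-cron          # hypothetical name
spec:
  schedule: "*/1 * * * *"            # one new Job (and Pod) every minute
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          tolerations:               # pods with tolerations are what the
          - key: dedicated           # node lifecycle taint manager tracks
            operator: Equal
            value: batch
            effect: NoSchedule
          containers:
          - name: worker
            image: busybox
            command: ["sh", "-c", "sleep 10"]
EOF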

@julia-stripe can you confirm whether the pods that were being created/deleted contained .spec.tolerations?

@justinsb How can we set resource restrictions on the controller manager using kops? We are also facing this leak.

$ kubectl top po -n kube-system | grep controller-man
kube-controller-manager-ip-A.ap-southeast-1.compute.internal   0m           15Mi
kube-controller-manager-ip-B.ap-southeast-1.compute.internal   0m           15Mi
kube-controller-manager-ip-C.ap-southeast-1.compute.internal   108m         3478Mi

As a workaround for this we put a 16Gi memory limit in the kube-controller-manager config and that has worked well for over a month now. Hope this helps.
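
A minimal sketch of that workaround, assuming kube-controller-manager runs as a static pod and the manifest sits at the conventional path (paths and exact field placement vary by installer; kops exposes equivalent settings in its cluster spec):

# Edit the static pod manifest on the master (path may differ per installer).
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml

# Under the kube-controller-manager container, add a memory limit so the
# kubelet restarts the container when it hits the cap instead of letting it
# grow until the node runs out of memory:
#
#   resources:
#     limits:
#       memory: "16Gi"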

@dims

We do have successfulJobsHistoryLimit and failedJobsHistoryLimit set.

If the problem were due to the actual number of jobs in the cluster, I’d expect the memory usage to immediately go back up once the controller manager is restarted. The fact that restarting it seems to fix the problem makes me think it’s a leak. Does that reasoning make sense to you?
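
One way to sanity-check that reasoning (generic commands, not from the original thread): compare the number of Job/Pod objects actually stored in the cluster against controller-manager memory over time. If the object counts stay roughly flat while RSS keeps climbing between restarts, a leak is the likelier explanation.

# Count stored Job and Pod objects – roughly what the controller manager
# legitimately needs to hold in its informer caches.
kubectl get jobs --all-namespaces --no-headers | wc -l
kubectl get pods --all-namespaces --no-headers | wc -l

# Watch the controller manager's actual memory usage.
kubectl top pod -n kube-system | grep controller-manager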