source-controller: source-controller pod restarting (OOMKilled)

I noticed that the source-controller pod of my gotk deployment restarted a huge number of times over the weekend (148 times on version 0.1.1). I've redeployed a newer version (0.2.1), but the restarts keep happening (about 2 every half hour).

$> k describe po -n gotk-system source-controller-5cc54c757c-ccwz8
Name:         source-controller-5cc54c757c-ccwz8
Namespace:    gotk-system
Priority:     0
Node:         my-node/10.0.10.11
Start Time:   Mon, 02 Nov 2020 13:57:18 +0000
Labels:       app=source-controller
              pod-template-hash=5cc54c757c
Annotations:  prometheus.io/port: 8080
              prometheus.io/scrape: true
Status:       Running
IP:           10.0.10.12
IPs:
  IP:           10.0.10.12
Controlled By:  ReplicaSet/source-controller-5cc54c757c
Containers:
  manager:
    Container ID:  docker://6b4a1a89311360cb832fe1d540b4f4cb96c9b8a6591fb01349390ffcdfc99b90
    Image:         my-registry.com/fluxcd/source-controller:v0.2.1
    Image ID:      docker-pullable://my-registry.com/fluxcd/source-controller@sha256:e8b708159f6d651a9577695af14bf3291ef844ca5cd7e85f182416b76561d27c
    Ports:         9090/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --events-addr=
      --watch-all-namespaces=true
      --log-level=info
      --log-json
      --enable-leader-election
      --storage-path=/data
    State:          Running
      Started:      Mon, 02 Nov 2020 14:34:16 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 02 Nov 2020 14:13:59 +0000
      Finished:     Mon, 02 Nov 2020 14:34:15 +0000
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      50m
      memory:   64Mi
    Liveness:   http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  gotk-system (v1:metadata.namespace)
      HTTPS_PROXY:        http://http.my-proxy.com:8000
      NO_PROXY:           10.0.0.0/8,172.0.0.0/8
    Mounts:
      /data from data (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bfr47 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-bfr47:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bfr47
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/arch=amd64
                 kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age                 From                                       Message
  ----    ------     ----                ----                                       -------
  Normal  Scheduled  <unknown>           default-scheduler                          Successfully assigned gotk-system/source-controller-5cc54c757c-ccwz8 to my-node
  Normal  Pulled     4m6s (x3 over 41m)  kubelet, my-node  Container image "my-registry.com/fluxcd/source-controller:v0.2.1" already present on machine
  Normal  Created    4m6s (x3 over 41m)  kubelet, my-node  Created container manager
  Normal  Started    4m6s (x3 over 41m)  kubelet, my-node  Started container manager

As a result, the helm-controller cannot reconcile HelmReleases, because it cannot reach the source-controller to fetch the chart artifacts:

$> k get hr --all-namespaces
NAMESPACE                NAME                                       READY   STATUS                                                                                                                                                                                                                                      AGE
namespace1           chart1        False   Get "http://source-controller.gotk-system/helmchart/namespace1/chart1/chart1-v0.15.5.tgz": dial tcp 172.20.225.87:80: connect: connection refused              2d18h
( . . .)
( . . .)
( . . .)
namespace2          chart11       False   Get "http://source-controller.gotk-system/helmchart/namespace2/chart11/chart11-v0.1.3.tgz": dial tcp 172.20.225.87:80: connect: connection refused              2d18h

The source-controller manages one GitRepository and two HelmRepositories. The helm-controller takes care of 11 HelmReleases, each with a similar configuration:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-release
  namespace: namespace1
spec:
  install:
    remediation:
      retries: -1
  upgrade:
    remediation:
      retries: -1
  interval: 1m0s
  releaseName: my-release
  chart:
    spec:
      version: 1.0.2
      chart: my-chart
      sourceRef:
        kind: HelmRepository
        name: my-repository
        namespace: namespace1
  valuesFrom:
  - kind: ConfigMap
    name: my-values
    valuesKey: environment
    targetPath: myEnv
  values:
    my-value: 30

While I was writing up this issue, the source-controller restarted 3 more times. Logs from the source-controller don't indicate any errors:

{"level":"info","ts":"2020-11-02T15:00:17.646Z","logger":"controllers.HelmChart","msg":"Reconciliation finished in 364.572799ms, next run in 1m0s","controller":"helmchart","request":"namespace1/chart1"}
( . . . )
( . . . )
( . . . )
{"level":"info","ts":"2020-11-02T15:00:17.646Z","logger":"controllers.HelmChart","msg":"Reconciliation finished in 364.572799ms, next run in 1m0s","controller":"helmchart","request":"namespace1/chart11"}
{"level":"info","ts":"2020-11-02T15:01:12.527Z","logger":"controllers.GitRepository","msg":"Reconciliation finished in 1.631398488s, next run in 3m0s","controller":"gitrepository","request":"namespace1/my-git-repo"}
{"level":"info","ts":"2020-11-02T15:01:12.870Z","logger":"controllers.HelmRepository","msg":"Reconciliation finished in 1.165803995s, next run in 3m0s","controller":"helmrepository","request":"namespace1/my-repository"}

About this issue

State: closed
Created 4 years ago
Reactions: 4
Comments: 47 (22 by maintainers)

Most upvoted comments

You can change any field of the Flux manifests with Kustomize patches without interfering with bootstrap. Please read the docs: https://toolkit.fluxcd.io/guides/installation/#customize-flux-manifests

stefanprodan on Jan 12, 2021
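
A minimal sketch of the Kustomize patch approach stefanprodan refers to, here raising the source-controller memory limit. It assumes the bootstrap-generated files are named gotk-components.yaml and gotk-sync.yaml, that the manager container is the first (and only) container in the Deployment, and that 2Gi is a suitable value; check the linked docs for the layout your Flux version generates.

```yaml
# kustomization.yaml placed next to the bootstrap-generated manifests
# (file names below are assumptions; adjust to your repository layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # JSON 6902 patch: replace only the memory limit, leave everything else as generated
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 2Gi
```

Because the patch lives in kustomization.yaml rather than in the generated components file, re-running bootstrap should not overwrite it.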

@hiddeco Thanks for the update! We've been running 0.19.0 in 3 of our environments for a few days now and can report no OOM issues. We've even reverted the memory requirements to the default, from a max of 2Gi back to 1Gi.

matt-woodruff-f3 on Dec 8, 2021

The changes have indeed been released in 0.19.x, but I would like to see a confirmation from e.g. @matt-woodruff-f3 around resource usage reduction before I think this can be closed.

hiddeco on Dec 8, 2021

@brianpham make sure you use .sourceignore to exclude everything but the YAML manifests, or consider keeping the manifests in a dedicated branch.

stefanprodan on Jan 12, 2021
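
A rough sketch of the exclusion advice above, written as the spec.ignore field of a GitRepository, which accepts patterns in the same format as a .sourceignore file. The repository URL and the /deploy directory are hypothetical, and the apiVersion is assumed to match the Flux release in use at the time; the same patterns could instead be placed in a .sourceignore file at the repository root.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: my-git-repo
  namespace: namespace1
spec:
  interval: 3m0s
  # hypothetical URL, replace with your own repository
  url: https://example.com/my-org/my-git-repo
  ignore: |
    # exclude everything at the repository root
    /*
    # re-include only the directory holding the manifests (assumed name)
    !/deploy
    # drop non-manifest files from that directory
    /deploy/**/*.md
    /deploy/**/*.txt
```

Keeping the artifact limited to manifests means the source-controller has less data to package and store on each reconciliation.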