source-controller: source-controller pod restarting (OOMKilled)

I noticed that the source-controller pod of my gotk deployment restarted a huge number of times over the weekend (148 times, on version 0.1.1). I’ve re-deployed a newer version (0.2.1), but the restarts keep happening (about 2 every half hour).

$> k describe po -n gotk-system source-controller-5cc54c757c-ccwz8
Name:         source-controller-5cc54c757c-ccwz8
Namespace:    gotk-system
Priority:     0
Node:         my-node/10.0.10.11
Start Time:   Mon, 02 Nov 2020 13:57:18 +0000
Labels:       app=source-controller
              pod-template-hash=5cc54c757c
Annotations:  prometheus.io/port: 8080
              prometheus.io/scrape: true
Status:       Running
IP:           10.0.10.12
IPs:
  IP:           10.0.10.12
Controlled By:  ReplicaSet/source-controller-5cc54c757c
Containers:
  manager:
    Container ID:  docker://6b4a1a89311360cb832fe1d540b4f4cb96c9b8a6591fb01349390ffcdfc99b90
    Image:         my-registry.com/fluxcd/source-controller:v0.2.1
    Image ID:      docker-pullable://my-registry.com/fluxcd/source-controller@sha256:e8b708159f6d651a9577695af14bf3291ef844ca5cd7e85f182416b76561d27c
    Ports:         9090/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --events-addr=
      --watch-all-namespaces=true
      --log-level=info
      --log-json
      --enable-leader-election
      --storage-path=/data
    State:          Running
      Started:      Mon, 02 Nov 2020 14:34:16 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 02 Nov 2020 14:13:59 +0000
      Finished:     Mon, 02 Nov 2020 14:34:15 +0000
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      50m
      memory:   64Mi
    Liveness:   http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  gotk-system (v1:metadata.namespace)
      HTTPS_PROXY:        http://http.my-proxy.com:8000
      NO_PROXY:           10.0.0.0/8,172.0.0.0/8
    Mounts:
      /data from data (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bfr47 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-bfr47:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bfr47
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/arch=amd64
                 kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age                 From                                       Message
  ----    ------     ----                ----                                       -------
  Normal  Scheduled  <unknown>           default-scheduler                          Successfully assigned gotk-system/source-controller-5cc54c757c-ccwz8 to my-node
  Normal  Pulled     4m6s (x3 over 41m)  kubelet, my-node  Container image "my-registry.com/fluxcd/source-controller:v0.2.1" already present on machine
  Normal  Created    4m6s (x3 over 41m)  kubelet, my-node  Created container manager
  Normal  Started    4m6s (x3 over 41m)  kubelet, my-node  Started container manager

As a result, the helm-controller cannot reconcile HelmReleases, because it cannot fetch the chart artifacts from the source-controller while that pod is restarting:

$> k get hr --all-namespaces
NAMESPACE                NAME                                       READY   STATUS                                                                                                                                                                                                                                      AGE
namespace1           chart1        False   Get "http://source-controller.gotk-system/helmchart/namespace1/chart1/chart1-v0.15.5.tgz": dial tcp 172.20.225.87:80: connect: connection refused              2d18h
( . . .)
( . . .)
( . . .)
namespace2          chart11       False   Get "http://source-controller.gotk-system/helmchart/namespace2/chart11/chart11-v0.1.3.tgz": dial tcp 172.20.225.87:80: connect: connection refused              2d18h

The source controller manages one GitRepository and two HelmRepositories. The helm controller takes care of 11 HelmReleases, each with similar configuration:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-release
  namespace: namespace1
spec:
  install:
    remediation:
      retries: -1
  upgrade:
    remediation:
      retries: -1
  interval: 1m0s
  releaseName: my-release
  chart:
    spec:
      version: 1.0.2
      chart: my-chart
      sourceRef:
        kind: HelmRepository
        name: my-repository
        namespace: namespace1
  valuesFrom:
  - kind: ConfigMap
    name: my-values
    valuesKey: environment
    targetPath: myEnv
  values:
    my-value: 30

While I was writing up this issue, the source-controller restarted 3 more times. Logs from the source-controller don’t indicate any errors:

{"level":"info","ts":"2020-11-02T15:00:17.646Z","logger":"controllers.HelmChart","msg":"Reconciliation finished in 364.572799ms, next run in 1m0s","controller":"helmchart","request":"namespace1/chart1"}
( . . . )
( . . . )
( . . . )
{"level":"info","ts":"2020-11-02T15:00:17.646Z","logger":"controllers.HelmChart","msg":"Reconciliation finished in 364.572799ms, next run in 1m0s","controller":"helmchart","request":"namespace1/chart11"}
{"level":"info","ts":"2020-11-02T15:01:12.527Z","logger":"controllers.GitRepository","msg":"Reconciliation finished in 1.631398488s, next run in 3m0s","controller":"gitrepository","request":"namespace1/my-git-repo"}
{"level":"info","ts":"2020-11-02T15:01:12.870Z","logger":"controllers.HelmRepository","msg":"Reconciliation finished in 1.165803995s, next run in 3m0s","controller":"helmrepository","request":"namespace1/my-repository"}

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 47 (22 by maintainers)

Most upvoted comments

You can change any field of the Flux manifests with Kustomize patches without interfering with bootstrap. Please read the docs: https://toolkit.fluxcd.io/guides/installation/#customize-flux-manifests
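As a rough illustration of such a patch, the sketch below raises the source-controller memory limit from the kustomization.yaml that bootstrap generates. It assumes the default bootstrap file names (gotk-components.yaml and gotk-sync.yaml) and that the container's memory limit is already set, as in the pod description above. The 2Gi value is only an example taken from a later comment in this thread, not a recommendation.

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Assumed default file names written by bootstrap; adjust if yours differ.
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # JSON 6902 patch applied to the source-controller Deployment only.
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      # "replace" requires the field to exist already (it is 1Gi by default here).
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 2Gi   # example value, tune to your workload
```

Because the patch lives next to the generated manifests rather than inside them, re-running bootstrap should leave it in place.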

@hiddeco Thanks for the update! We’ve been running 0.19.0 in 3 of our environments for a few days now and can report no OOM issues. We’ve even reverted the memory limit from a maximum of 2Gi back to the default of 1Gi.

The changes have indeed been released in 0.19.x, but I would like to see a confirmation from e.g. @matt-woodruff-f3 around resource usage reduction before I think this can be closed.

@brianpham make sure you use .sourceignore to exclude everything except the YAML manifests, or consider keeping the manifests in a dedicated branch.
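A minimal sketch of that exclusion, written as the `spec.ignore` field of a GitRepository, which takes patterns in the .sourceignore (gitignore-style) format; the same patterns could instead go in a .sourceignore file at the repository root. The URL, the branch, the `/deploy` directory and the `v1beta1` API version are assumptions for illustration. Only the object name, namespace and interval are taken from the logs above.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1   # assumed API version
kind: GitRepository
metadata:
  name: my-git-repo
  namespace: namespace1
spec:
  interval: 3m0s
  url: https://example.com/my-org/my-git-repo.git   # hypothetical URL
  ref:
    branch: main                                    # hypothetical branch
  ignore: |
    # exclude everything at the repository root
    /*
    # re-include only the (hypothetical) directory holding the manifests
    !/deploy
    # drop non-manifest files inside that directory
    /deploy/**/*.md
    /deploy/**/*.txt
```

The fewer files that end up in the artifact, the less the source-controller has to read and package on each reconciliation.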