source-controller: source-controller pod restarting (OOMKilled)
I have noticed that the source-controller pod of my gotk deployment restarted a huge number of times over the weekend (148 times, on version 0.1.1). I’ve re-deployed a newer version (0.2.1), but the restarts keep happening (about 2 every half hour).
$> k describe po -n gotk-system source-controller-5cc54c757c-ccwz8
Name: source-controller-5cc54c757c-ccwz8
Namespace: gotk-system
Priority: 0
Node: my-node/10.0.10.11
Start Time: Mon, 02 Nov 2020 13:57:18 +0000
Labels: app=source-controller
pod-template-hash=5cc54c757c
Annotations: prometheus.io/port: 8080
prometheus.io/scrape: true
Status: Running
IP: 10.0.10.12
IPs:
IP: 10.0.10.12
Controlled By: ReplicaSet/source-controller-5cc54c757c
Containers:
manager:
Container ID: docker://6b4a1a89311360cb832fe1d540b4f4cb96c9b8a6591fb01349390ffcdfc99b90
Image: my-registry.com/fluxcd/source-controller:v0.2.1
Image ID: docker-pullable://my-registry.com/fluxcd/source-controller@sha256:e8b708159f6d651a9577695af14bf3291ef844ca5cd7e85f182416b76561d27c
Ports: 9090/TCP, 8080/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--events-addr=
--watch-all-namespaces=true
--log-level=info
--log-json
--enable-leader-election
--storage-path=/data
State: Running
Started: Mon, 02 Nov 2020 14:34:16 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 02 Nov 2020 14:13:59 +0000
Finished: Mon, 02 Nov 2020 14:34:15 +0000
Ready: True
Restart Count: 2
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 50m
memory: 64Mi
Liveness: http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
RUNTIME_NAMESPACE: gotk-system (v1:metadata.namespace)
HTTPS_PROXY: http://http.my-proxy.com:8000
NO_PROXY: 10.0.0.0/8,172.0.0.0/8
Mounts:
/data from data (rw)
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bfr47 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
default-token-bfr47:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bfr47
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/arch=amd64
kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned gotk-system/source-controller-5cc54c757c-ccwz8 to my-node
Normal Pulled 4m6s (x3 over 41m) kubelet, my-node Container image "my-registry.com/fluxcd/source-controller:v0.2.1" already present on machine
Normal Created 4m6s (x3 over 41m) kubelet, my-node Created container manager
Normal Started 4m6s (x3 over 41m) kubelet, my-node Started container manager
This causes the helm-controller to not be able to reconcile HelmReleases:
$> k get hr --all-namespaces
NAMESPACE NAME READY STATUS AGE
namespace1 chart1 False Get "http://source-controller.gotk-system/helmchart/namespace1/chart1/chart1-v0.15.5.tgz": dial tcp 172.20.225.87:80: connect: connection refused 2d18h
( . . .)
( . . .)
( . . .)
namespace2 chart11 False Get "http://source-controller.gotk-system/helmchart/namespace2/chart11/chart11-v0.1.3.tgz": dial tcp 172.20.225.87:80: connect: connection refused 2d18h
The source controller manages one GitRepository and two HelmRepositories. The helm controller takes care of 11 HelmReleases, each with similar configuration:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-release
  namespace: namespace1
spec:
  install:
    remediation:
      retries: -1
  upgrade:
    remediation:
      retries: -1
  interval: 1m0s
  releaseName: my-release
  chart:
    spec:
      version: 1.0.2
      chart: my-chart
      sourceRef:
        kind: HelmRepository
        name: my-repository
        namespace: namespace1
  valuesFrom:
    - kind: ConfigMap
      name: my-values
      valuesKey: environment
      targetPath: myEnv
  values:
    my-value: 30
While writing up this issue, the source-controller restarted 3 more times. Logs from the source controller don’t indicate any errors:
{"level":"info","ts":"2020-11-02T15:00:17.646Z","logger":"controllers.HelmChart","msg":"Reconciliation finished in 364.572799ms, next run in 1m0s","controller":"helmchart","request":"namespace1/chart1"}
( . . . )
( . . . )
( . . . )
{"level":"info","ts":"2020-11-02T15:00:17.646Z","logger":"controllers.HelmChart","msg":"Reconciliation finished in 364.572799ms, next run in 1m0s","controller":"helmchart","request":"namespace1/chart11"}
{"level":"info","ts":"2020-11-02T15:01:12.527Z","logger":"controllers.GitRepository","msg":"Reconciliation finished in 1.631398488s, next run in 3m0s","controller":"gitrepository","request":"namespace1/my-git-repo"}
{"level":"info","ts":"2020-11-02T15:01:12.870Z","logger":"controllers.HelmRepository","msg":"Reconciliation finished in 1.165803995s, next run in 3m0s","controller":"helmrepository","request":"namespace1/my-repository"}
About this issue
- State: closed
- Created 4 years ago
- Reactions: 4
- Comments: 47 (22 by maintainers)
You can change any field of the Flux manifests with Kustomize patches without interfering with bootstrap; please read the docs: https://toolkit.fluxcd.io/guides/installation/#customize-flux-manifests
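As a rough illustration of that suggestion, a patch raising the source-controller memory limit might look like the sketch below. It assumes the `kustomization.yaml` sits next to the `gotk-components.yaml` and `gotk-sync.yaml` files generated by bootstrap, that `manager` is the first (and only) container in the Deployment, and that 2Gi is the limit you want; adjust all three to your setup.

```yaml
# kustomization.yaml (assumed to live in the directory created by bootstrap)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      # JSON 6902 patch; index 0 assumes "manager" is the first container
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 2Gi   # illustrative value, not a recommendation
```

Because the patch lives in a separate `kustomization.yaml`, the generated component manifests stay untouched and can be regenerated by a later bootstrap.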
@hiddeco Thanks for the update! We’ve been running 0.19.0 in 3 of our environments for a few days now and can report no OOM issues. We’ve even reverted the memory limit from 2Gi back to the default of 1Gi.
The changes have indeed been released in 0.19.x, but I would like to see a confirmation from e.g. @matt-woodruff-f3 around resource usage reduction before I think this can be closed.
@brianpham make sure you use .sourceignore and exclude everything but the YAML manifests, or consider having the manifests in a dedicated branch.
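For reference, the same exclusion rules can also be set inline on the GitRepository through `spec.ignore`, which uses the same pattern format as a `.sourceignore` file (that is, `.gitignore` syntax). The sketch below is only an illustration: the object name, namespace, and interval are taken from the logs above, while the API version, URL, branch, and the `deploy` directory are assumptions to replace with your own values.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1   # assumed; use the version your cluster serves
kind: GitRepository
metadata:
  name: my-git-repo
  namespace: namespace1
spec:
  interval: 3m0s
  url: https://example.com/my-org/my-repo      # hypothetical URL
  ref:
    branch: main                               # hypothetical branch
  ignore: |
    # exclude everything at the repository root
    /*
    # re-include only the directory holding the manifests (hypothetical name)
    !/deploy
    # drop non-manifest files inside that directory
    /deploy/**/*.md
    /deploy/**/*.txt
```

The same patterns, without the YAML wrapper, can be placed in a `.sourceignore` file at the repository root. Either way the artifact that source-controller builds and serves should be smaller.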