argo-workflows: workflow-controller: invalid config map object received in config watcher. Ignored processing
Pre-requisites
- I have double-checked my configuration
- I can confirm the issue exists when I tested with :latest
- I’d like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
We’re migrating our Argo Workflows installation from v3.0.2, hosted in an old Rancher cluster, to the latest version in an AWS EKS cluster.
Randomly, the workflow controller’s CPU usage increases and it starts to show this error:
level=error msg="invalid config map object received in config watcher. Ignored processing"
If we remove the pod, the new one starts fine, but after 1 or 2 days (at random) the error reappears.
The most recent occurrence started yesterday, at a moment with no workflows running and no scheduled tasks on the infrastructure side.
We don’t see any errors in the k8s API logs that would indicate an EKS outage.
The Argo server component doesn’t have errors.
We don’t have this error in the old version.
Env:
- Argo: v3.4.10
- Installation method: official Helm chart (0.33.1)
- K8s cluster: AWS EKS 1.27
Workflow-controller k8s config map:
Name:         argo-workflow-controller-configmap
Namespace:    argo
Labels:       app.kubernetes.io/component=workflow-controller
              app.kubernetes.io/instance=argo
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=argo-workflows-cm
              app.kubernetes.io/part-of=argo-workflows
              helm.sh/chart=argo-workflows-0.33.1
Annotations:  meta.helm.sh/release-name: argo
              meta.helm.sh/release-namespace: argo
Data
config:
metricsConfig:
  enabled: true
  path: /metrics
  port: 9090
  ignoreErrors: false
  secure: false
persistence:
  archive: true
  archiveTTL: 60d
  postgresql:
    database: argo
    host: <REDACTED>
    passwordSecret:
      key: password
      name: argo-postgres-config
    port: 5432
    ssl: false
    sslMode: disabled
    tableName: argo_workflows
    userNameSecret:
      key: username
      name: argo-postgres-config
workflowDefaults:
  spec:
    ttlStrategy:
      secondsAfterFailure: 604800
sso:
  issuer: https://<REDACTED>
  clientId:
    name: argo-server-clientid
    key: clientid
  clientSecret:
    name: argo-server-clientsecret
    key: clientsecret
  redirectUrl: https://<REDACTED>/oauth2/callback
  rbac:
    enabled: true
  scopes:
    - email
    - profile
    - openid
  issuerAlias: https://<REDACTED>
  sessionExpiry: 240h
nodeEvents:
  enabled: false
Resource Usage
Version
v3.4.10
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflow that uses private images.
N/A
Since it starts randomly, we usually don’t have any workflows running when it happens.
Logs from the workflow controller
Before:
----
After (lines like this appear several times; I only have them in the container output):
time="2023-08-23T09:41:01.370Z" level=error msg="invalid config map object received in config watcher. Ignored processing"
Logs from your workflow’s wait container
N/A
About this issue
- Original URL
- State: open
- Created 10 months ago
- Reactions: 15
- Comments: 29 (13 by maintainers)
Commits related to this issue
- fix: Refactor the func runConfigMapWatcher to use Informers. Fixes #11657 Signed-off-by: juranir <juranir.santos@gmail.com> — committed to juranir/argo-workflows by juranir 9 months ago
- fix: Refactor newConfigMapInformer to handle ctrl cm. Fixes #11657 Signed-off-by: juranir <juranir.santos@gmail.com> — committed to juranir/argo-workflows by juranir 9 months ago
- feat(controller): option to not watch configmap Fixes https://github.com/argoproj/argo-workflows/issues/11657 Without setting this env variable argo-workflows is unusable with k8s >= 1.27 as logs ... — committed to tooptoop4/argo-workflows by tooptoop4 6 months ago
It seems this is only happening with kubernetes version 1.27 or later. Does anyone have evidence of it happening on earlier versions?
We are still using client-go from 0.24, which should not be compatible with 1.27, so perhaps that’s the cause here. https://github.com/argoproj/argo-workflows/blob/6c0697928f92160f84e2fc6581551d26e13c0dc5/go.mod#L70
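As a quick way to confirm the skew in a given environment, something like the sketch below could log the server version next to the vendored client-go version at startup (client-go generally only supports clusters within about one minor version of its own, so 0.24 against k8s 1.27 is well outside that window). logServerVersion is a hypothetical helper for illustration, not existing controller code:

```go
package configwatcher

import (
	log "github.com/sirupsen/logrus"
	"k8s.io/client-go/kubernetes"
)

// logServerVersion fetches and logs the cluster's version via the discovery
// client, to make client/server version skew visible at controller startup.
func logServerVersion(clientset kubernetes.Interface) {
	info, err := clientset.Discovery().ServerVersion()
	if err != nil {
		log.WithError(err).Warn("could not fetch server version")
		return
	}
	log.Infof("kubernetes server version: %s.%s", info.Major, info.Minor)
}
```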
There was some problem with my local environment (I don’t know why). I downloaded the latest version of the code and made the same change; now the code is working fine, without the weird message.
I’ll wait some time and create a PR.
I’m starting to think I’m doing something wrong in the image build process.
I tried cloning the repo and running the command
make workflow-controller-image
(without any file changes), and the error message
listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist
keeps appearing, but if I use the official image it doesn’t. I tried using Windows and a Mac M1 (with the --platform linux/amd64 parameter) as the host.
PS. The build task ends without error.
Not sure about the exact error message, but maybe the resync period duration of 0 from your previous code comment is too short? I think most of the existing informers in the codebase are longer (and also, from a quick search, the resync period parameter seems to be very confusing to informer users; it is not the same as a re-list). Also, there is a separate issue mentioned above re: the managed namespace that potentially impacts it. If you change it to use a regular NS and that fixes it, then the error would be due to #11463.

I waited a bit to see if the error would appear again. So far this code has solved the initial problem, but a new message has started to appear in the log (every 1 min):
listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist
Even with this message, everything still works well, but I believe I made some mistakes that generated this error.
@agilgur5 or does anyone have an idea about that?
PS. Without my change, this message does not appear.
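For reference, that listers.go line is a warning from client-go’s cache listers, not a fatal error: it appears when a namespaced lister lookup hits an informer store that was built without a “namespace” index, and client-go then falls back to a slow full scan (which would match “everything still works well”). A minimal sketch of registering the index when constructing the informer follows; the function name and parameters are illustrative, not the actual controller code:

```go
package configwatcher

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// newConfigMapInformer registers the "namespace" index on the informer's
// store so that namespaced lister lookups use the index instead of falling
// back to a full scan (which is what triggers the warning above).
func newConfigMapInformer(lw cache.ListerWatcher, resync time.Duration) cache.SharedIndexInformer {
	return cache.NewSharedIndexInformer(
		lw,
		&apiv1.ConfigMap{},
		resync, // see the resync-period discussion above; 0 disables resyncs
		cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc},
	)
}
```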
@agilgur5 I’m testing the code below, so far everything is fine, but I’m not sure if it’s the best approach:
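The snippet originally attached to this comment isn’t preserved in this scrape. Judging from the fix commits above (“Refactor the func runConfigMapWatcher to use Informers”), a minimal sketch of that direction could look like the following; the function name, parameters, and the 10-minute resync value are assumptions for illustration, not the actual patch:

```go
package configwatcher

import (
	"time"

	log "github.com/sirupsen/logrus"
	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchControllerConfigMap replaces a raw watch with a SharedInformer scoped
// to a single ConfigMap, so client-go handles reconnects and re-lists.
func watchControllerConfigMap(clientset kubernetes.Interface, namespace, configMapName string, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset,
		10*time.Minute, // non-zero resync period (0 disables resyncs)
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			// watch only the controller's ConfigMap, not the whole namespace
			opts.FieldSelector = "metadata.name=" + configMapName
		}),
	)
	informer := factory.Core().V1().ConfigMaps().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			cm, ok := newObj.(*apiv1.ConfigMap)
			if !ok {
				log.Error("invalid config map object received in config watcher. Ignored processing")
				return
			}
			log.Infof("configmap %s updated, reloading controller config", cm.Name)
			// ...re-parse cm.Data and apply the new config here...
		},
	})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, informer.HasSynced)
}
```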
We’re running 3.4.9 and have seen this happen twice in the last month: the controller starts logging, multiple times per second, the line
level=error msg="invalid config map object received in config watcher. Ignored processing"
and starts consuming a lot more resources, and the problem does not go away until we delete the pod. The config map itself is properly configured, and after a few days of running OK, it suddenly begins to spam that error log.

For context, the watcher that logs that error is actually watching the wrong namespace when both --namespaced and --managed-namespace=<other-namespace> are used (see https://github.com/argoproj/argo-workflows/issues/11463). I have no idea whether that is related to the nil event or not, but it leads to the watcher actually watching a non-existent config map.

Maybe this issue helps explain this behavior (I also found other complaints similar to this for projects other than Argo); if so, I think we would add a conditional check before casting the object:
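The proposed snippet is also missing from this page. Going by the description (a conditional check before casting the object, with the deferred restart mentioned in the next comment), a sketch of that guard might look like this; processEvents, onUpdate, and the channel wiring are hypothetical names, not the actual controller code:

```go
package configwatcher

import (
	log "github.com/sirupsen/logrus"
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// processEvents sketches the suggested guard: check for a closed channel or
// nil event before casting, and return so the caller's restart logic (the
// deferred re-run mentioned below) can create a fresh watcher.
func processEvents(resultCh <-chan watch.Event, onUpdate func(*apiv1.ConfigMap), stopCh <-chan struct{}) {
	for {
		select {
		case <-stopCh:
			return
		case event, open := <-resultCh:
			if !open || event.Object == nil {
				// nil event / closed channel: bail out instead of spinning
				// on the cast failure and spamming the error log
				log.Warn("config watcher channel closed or nil event received; restarting watcher")
				return
			}
			cm, ok := event.Object.(*apiv1.ConfigMap)
			if !ok {
				log.Error("invalid config map object received in config watcher. Ignored processing")
				continue
			}
			onUpdate(cm)
		}
	}
}
```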
I’m assuming that with this change the defer will be executed and a new instance of the watcher will be created.
Does it make sense to you @agilgur5 ?
I updated the code to try to log the “invalid object” that is causing the error. What is happening is that the event variable is nil. I don’t have any other error logs. I’m using AWS EKS, so I also checked its logs, and there are no errors.
So I have no idea about this 😦