argo-workflows: workflow-controller: invalid config map object received in config watcher. Ignored processing

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I’d like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We’re migrating our Argo Workflows installation from v3.0.2, hosted in an old Rancher cluster, to the latest version in an AWS EKS cluster.

Randomly, the workflow controller’s CPU usage increases and it starts to show this error: level=error msg="invalid config map object received in config watcher. Ignored processing"

If we remove the pod, the new one starts well, but after 1 or 2 days (random), the error reappears.

The last occurrence started yesterday, at a moment with no workflows running and no scheduled tasks on the infra side.

We don’t have any errors in the k8s API logs indicating an EKS outage.

The Argo server component doesn’t have errors.

We don’t have this error in the old version.

Env:
Argo: v3.4.10
Installation method: Official Helm Chart (0.33.1)
K8s Cluster: AWS EKS 1.27


Workflow-controller k8s config map:

Name:         argo-workflow-controller-configmap
Namespace:    argo
Labels:       app.kubernetes.io/component=workflow-controller
              app.kubernetes.io/instance=argo
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=argo-workflows-cm
              app.kubernetes.io/part-of=argo-workflows
              helm.sh/chart=argo-workflows-0.33.1
Annotations:  meta.helm.sh/release-name: argo
              meta.helm.sh/release-namespace: argo

Data

config:

metricsConfig:
  enabled: true
  path: /metrics
  port: 9090
  ignoreErrors: false
  secure: false
persistence:
  archive: true
  archiveTTL: 60d
  postgresql:
    database: argo
    host: <REDACTED>
    passwordSecret:
      key: password
      name: argo-postgres-config
    port: 5432
    ssl: false
    sslMode: disabled
    tableName: argo_workflows
    userNameSecret:
      key: username
      name: argo-postgres-config
workflowDefaults:
  spec:
    ttlStrategy:
      secondsAfterFailure: 604800
sso:
  issuer: https://<REDACTED>
  clientId:
    name: argo-server-clientid
    key: clientid
  clientSecret:
    name: argo-server-clientsecret
    key: clientsecret
  redirectUrl: https://<REDACTED>/oauth2/callback
  rbac:
    enabled: true
  scopes:
    - email
    - profile
    - openid
  issuerAlias: https://<REDACTED>
  sessionExpiry: 240h
nodeEvents:
  enabled: false


Resource Usage (image)

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflow that uses private images.

N/A
Since it starts randomly, we usually don't have any workflows running

Logs from the workflow controller

Before:



----

After (lines like this one appear several times; I only have them in the container output):

time="2023-08-23T09:41:01.370Z" level=error msg="invalid config map object received in config watcher. Ignored processing"

Logs from your workflow’s wait container

N/A

About this issue

  • Original URL
  • State: open
  • Created 10 months ago
  • Reactions: 15
  • Comments: 29 (13 by maintainers)

Commits related to this issue

Most upvoted comments

It seems this is only happening with Kubernetes version 1.27 or later. Does anyone have evidence of it happening on earlier versions?

We are still using client-go from 0.24, which should not be compatible with 1.27, so perhaps that’s the cause here. https://github.com/argoproj/argo-workflows/blob/6c0697928f92160f84e2fc6581551d26e13c0dc5/go.mod#L70

There was some problem with my local environment (I don’t know why). I downloaded the latest version of the code and made the same change; now the code is working fine, without the weird message.

I’ll wait some time and create a PR.

I’m starting to think I’m doing something wrong in the image build process.

I tried cloning the repo and running make workflow-controller-image (without any file changes), and the error message listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist keeps appearing; if I use the official image, it doesn’t.

I tried using both Windows and a Mac M1 (with the --platform linux/amd64 parameter) as hosts.

PS. The build task completes without errors.

Not sure about the exact error message, but maybe the resync period of 0 from your previous code comment is too short? I think most of the existing informers in the codebase use a longer one (and, from a quick search, the resync period parameter seems to be quite confusing to informer users in general; it is not the same as a re-list). There is also the separate issue mentioned above re: the managed namespace that potentially impacts it. If you change it to use the regular namespace and that fixes it, then the error would be due to #11463
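
For concreteness, a minimal sketch of passing a non-zero resync period to the shared informer factory (the 20-minute value and the helper name are assumptions for illustration, not a measured recommendation):

package sketch

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// newConfigMapInformer is a hypothetical variant of the informer snippet
// elsewhere in this thread, built with a non-zero resync period.
func newConfigMapInformer(client kubernetes.Interface, namespace string) cache.SharedIndexInformer {
    factory := informers.NewSharedInformerFactoryWithOptions(
        client,
        20*time.Minute, // resync period; 0 disables periodic resyncs entirely
        informers.WithNamespace(namespace),
    )
    return factory.Core().V1().ConfigMaps().Informer()
}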

I waited a bit to see if the error would appear again. So far this code has solved the initial problem, but a new message has started to appear in the log (every 1min): listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist

Even with this message, everything still works well, but I believe I made some mistakes that generated this error.

@agilgur5 or does anyone have an idea about that?

PS. Without my change, this message does not appear.
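
For reference, that message is produced by client-go’s listers when they ask the informer’s store for a "namespace" index that was never registered; they then fall back to a slower full scan, which is why everything keeps working. A minimal sketch of registering that index on a shared informer (an assumption about the cause here, not a confirmed fix):

package sketch

import "k8s.io/client-go/tools/cache"

// ensureNamespaceIndex is a hypothetical helper: it registers the standard
// "namespace" index on a shared informer, before the informer is started, so
// that listers querying by namespace can find it.
func ensureNamespaceIndex(informer cache.SharedIndexInformer) error {
    return informer.AddIndexers(cache.Indexers{
        cache.NamespaceIndex: cache.MetaNamespaceIndexFunc,
    })
}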

So in that issue they recommend (https://github.com/kubernetes/client-go/issues/334#issuecomment-370561966) using a Reflector or Informer instead to handle several kinds of edge cases, including this one. The Argo codebase already uses Informers very frequently, so maybe we should just convert this one. The RetryWatcher source code also mentions using Informers instead for certain edge cases.

@agilgur5 I’m testing the code below, so far everything is fine, but I’m not sure if it’s the best approach:

func (wfc *WorkflowController) runConfigMapWatcher() {
    defer runtimeutil.HandleCrash(runtimeutil.PanicHandlers...)
    ctx := context.Background()

    // Shared informer factory scoped to the managed namespace, with resync disabled (0).
    factory := informers.NewSharedInformerFactoryWithOptions(
        wfc.kubeclientset,
        0,
        informers.WithNamespace(wfc.managedNamespace),
    )

    informer := factory.Core().V1().ConfigMaps().Informer()

    stopper := make(chan struct{})
    defer close(stopper)

    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(obj interface{}, newObj interface{}) {
            cm := newObj.(*apiv1.ConfigMap)

            log.Infof("CM Name: %s", cm.GetName())

            if cm.GetName() == wfc.configController.GetName() && wfc.namespace == cm.GetNamespace() {
                log.Infof("Received Workflow Controller config map %s/%s update", cm.Namespace, cm.Name)
                wfc.UpdateConfig(ctx)
            }
            wfc.notifySemaphoreConfigUpdate(cm)
        },
    })

    go informer.Run(stopper)
    if !cache.WaitForCacheSync(stopper, informer.HasSynced) {
        log.Error(fmt.Errorf("Timed out waiting for caches to sync"))
        return
    }
    // Block indefinitely so the informer keeps running for the lifetime of the controller.
    <-stopper
}

We’re running 3.4.9 and have seen this happen twice in the last month: the controller starts logging the line level=error msg="invalid config map object received in config watcher. Ignored processing" multiple times per second and consuming a lot more resources, and the problem does not go away until we delete the pod. The config map itself is properly configured; after a few days of running OK, the controller suddenly begins to spam that error log.

For context, the watcher that logs that error is actually watching the wrong namespace when both --namespaced and --managed-namespace=<other-namespace> are used (see https://github.com/argoproj/argo-workflows/issues/11463). I have no idea whether that is related to the nil event or not, but it leads to the watcher actually watching a non-existent config map.
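
For illustration, a minimal sketch of scoping that watch to the controller’s own namespace (an assumption about what a fix would roughly look like, not the actual controller code or the fix in #11463):

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/watch"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// configMapListWatch is a hypothetical helper: it builds the ListWatch for the
// controller ConfigMap against the namespace the controller itself runs in
// (controllerNamespace), rather than the --managed-namespace value.
func configMapListWatch(ctx context.Context, client kubernetes.Interface, controllerNamespace string) *cache.ListWatch {
    return &cache.ListWatch{
        WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
            // Watch the controller's own namespace, not the managed one.
            return client.CoreV1().ConfigMaps(controllerNamespace).Watch(ctx, opts)
        },
    }
}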

Maybe this issue helps explain this behavior (I also found similar complaints for projects other than Argo). If so, I think we could add a conditional check before casting the object:

[...]
select {
case event := <-retryWatcher.ResultChan():
    if event.Object == nil {
        return
    }
    cm, ok := event.Object.(*apiv1.ConfigMap)
[...]

I’m assuming that with this change the deferred cleanup will run and a new instance of the watcher will be started.
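
A fuller sketch of that guard as a standalone helper (the function name, the onUpdate callback, and the return-value convention are assumptions for illustration, not the actual controller code):

package sketch

import (
    log "github.com/sirupsen/logrus"
    apiv1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/watch"
)

// handleConfigMapEvent shows the guard order: nil check first, then the type
// assertion. It returns false when the watcher should be torn down (and, per
// the idea above, re-created).
func handleConfigMapEvent(event watch.Event, onUpdate func(*apiv1.ConfigMap)) bool {
    // A closed result channel yields the zero-value event, whose Object is
    // nil, so this check must come before any type assertion.
    if event.Object == nil {
        return false
    }
    cm, ok := event.Object.(*apiv1.ConfigMap)
    if !ok {
        log.Error("invalid config map object received in config watcher. Ignored processing")
        return true // keep watching; just skip this event
    }
    onUpdate(cm)
    return true
}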

Does it make sense to you @agilgur5 ?

I updated the code to try to log the “invalid object” that is causing the error. What is happening is that the event variable is nil. I don’t have any other error logs. I’m using AWS EKS, so I also checked its logs and found no errors.

So I have no idea about this 😦