prometheus-operator: Problem with secret refresh in v0.42.1

What happened?

We were running v0.39.0 with additionalScrapeConfigs. Runtime changes to the referenced secret were picked up by prometheus-operator and the resulting config changes were loaded into Prometheus. After upgrading to v0.42.1, when prom-op starts, the config is correctly loaded from the additionalScrapeConfigs secret and Prometheus is updated. However, after that initial update, the configuration in Prometheus is no longer updated when the configuration in the secret later changes.

When the secret is updated and its resourceVersion changes, the prom-op logs show:

level=debug ts=2020-09-28T13:08:47.038331903Z caller=operator.go:1583 component=prometheusoperator msg="updating Prometheus configuration secret"
level=debug ts=2020-09-28T13:08:47.042560064Z caller=operator.go:909 component=prometheusoperator msg="Secret updated"

But /etc/prometheus/config/prometheus.yaml.gz in the prometheus-config-reloader is not updated, nor is it updated after the resyncPeriod. Restarting the prometheus-operator binary does update the configuration, but only when it first initialises; after that it goes back to the behaviour described above.

Did you expect to see something different?

I expected changes to the additionalScrapeConfigs secret to continue to be loaded into Prometheus.

How to reproduce it (as minimally and precisely as possible):

Run prom-op v0.42.1 with additionalScrapeConfigs. Once everything is running, alter the configuration in the secret. The changes will not be loaded into Prometheus.

Environment

Prometheus-operator v0.42.1, Prometheus v2.20.1, K8s v1.18.6

Prometheus Operator version:

Prometheus-operator v0.42.1

Kubernetes version information:

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T21:52:18Z", GoVersion:"go1.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Kubernetes cluster kind:

insert how you created your cluster: kops, bootkube, etc.

Manifests:

I can send full configs if relevant, but I think there is enough in the description.

Prometheus Operator Logs:

see above

Anything else we need to know?:

I assumed the problem had something to do with https://github.com/prometheus-operator/prometheus-operator/pull/3355 so I tried setting --secret-field-selector='type=Opaque'

About this issue

State: closed
Created 4 years ago
Reactions: 4
Comments: 15 (7 by maintainers)

Most upvoted comments

@kirkharris1 yes thanks! I think I understand the issue now. Instead of reconciling all Prometheus objects that may be affected by a secret, service/pod monitor or probe update, we stop after the first Prometheus that matches. The regression was introduced by #3440 and I guess that we don’t have an end-to-end test that validates this scenario.

https://github.com/prometheus-operator/prometheus-operator/blob/4484e4961d8dd92cbff06516228388bdc9a36a78/pkg/prometheus/operator.go#L960-L967

Compare this with the code before #3440, which uses cache.ListAll() instead:

https://github.com/prometheus-operator/prometheus-operator/blob/312d675008306b13c24d241bf4f0a882dbfa90d8/pkg/prometheus/operator.go#L1003-L1010

simonpasquier on Oct 14, 2020
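
The early-exit behaviour described in the comment above can be illustrated with a small self-contained Go sketch. This is not the operator's actual code: the Prometheus struct, its WatchedNamespace field, and both function names are hypothetical simplifications, and the real operator matches objects through namespace selectors rather than a single string. The sketch only contrasts stopping at the first matching object with visiting every matching object.

```go
package main

import "fmt"

// Prometheus is a hypothetical, simplified stand-in for the Prometheus
// custom resource. WatchedNamespace is an assumption made for this sketch.
type Prometheus struct {
	Namespace        string
	Name             string
	WatchedNamespace string
}

// key returns the "namespace/name" identifier used to enqueue an object.
func (p Prometheus) key() string {
	return p.Namespace + "/" + p.Name
}

// enqueueFirstMatch mimics the regression: it returns as soon as one
// matching object is found, so later matches are never reconciled.
func enqueueFirstMatch(proms []Prometheus, changedNS string) []string {
	var queued []string
	for _, p := range proms {
		if p.WatchedNamespace == changedNS {
			queued = append(queued, p.key())
			return queued // stops after the first match
		}
	}
	return queued
}

// enqueueAllMatches shows the intended behaviour: every object that may be
// affected by the change is enqueued for reconciliation.
func enqueueAllMatches(proms []Prometheus, changedNS string) []string {
	var queued []string
	for _, p := range proms {
		if p.WatchedNamespace == changedNS {
			queued = append(queued, p.key())
		}
	}
	return queued
}

func main() {
	proms := []Prometheus{
		{Namespace: "monitoring", Name: "a", WatchedNamespace: "apps"},
		{Namespace: "monitoring", Name: "b", WatchedNamespace: "apps"},
		{Namespace: "monitoring", Name: "c", WatchedNamespace: "other"},
	}
	fmt.Println("first match only:", enqueueFirstMatch(proms, "apps"))
	fmt.Println("all matches:     ", enqueueAllMatches(proms, "apps"))
}
```

With the early return, whether a given Prometheus object gets reconciled depends on its position in the iteration order. That would be consistent with the report that the configuration is refreshed at operator start-up but not after later secret changes.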