prometheus-engine: TLS Error configuring a PodMonitoring resource in GKE Autopilot cluster
Hi,
I am attempting to follow the steps here to configure managed collection on GKE Autopilot.
When attempting to apply any PodMonitoring resource, I get the following error:
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc
In the logs for the gmp-operator in the gke-gmp-system namespace, I see the following errors:
"validatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
"Setting CA bundle for ValidatingWebhookConfiguration failed"
"mutatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
"Setting CA bundle for MutatingWebhookConfiguration failed"
This seems in some ways similar to the following issues:
- https://github.com/GoogleCloudPlatform/prometheus-engine/issues/151
- https://github.com/GoogleCloudPlatform/prometheus-engine/issues/178
- https://github.com/GoogleCloudPlatform/prometheus-engine/issues/186
but it is notably different since it is a certificate error and not a timeout.
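To make the x509 error above concrete, here is a minimal Python sketch of the hostname check the API server is effectively performing: the requested service name has to match one of the DNS names listed in the certificate. The function `san_matches` is a hypothetical helper written for illustration, not part of prometheus-engine or Kubernetes, and it only approximates real TLS verification (exact match plus a single-label wildcard). The names in `cert_sans` and `requested` are copied from the error message.

```python
def san_matches(hostname: str, sans: list[str]) -> bool:
    """Return True if hostname matches any DNS SAN (exact or one-label wildcard)."""
    host = hostname.lower().rstrip(".")
    for san in sans:
        pattern = san.lower().rstrip(".")
        if pattern == host:
            return True
        # A leading "*." wildcard covers exactly one left-most label.
        if pattern.startswith("*."):
            _, _, host_rest = host.partition(".")
            if host_rest and host_rest == pattern[2:]:
                return True
    return False


# Names the certificate is valid for, taken from the error message.
cert_sans = [
    "gmp-operator",
    "gmp-operator.gmp-system",
    "gmp-operator.gmp-system.svc",
]
# Host the webhook call was actually sent to.
requested = "gmp-operator.gke-gmp-system.svc"

print(san_matches(requested, cert_sans))                      # False
print(san_matches("gmp-operator.gmp-system.svc", cert_sans))  # True
```

The certificate was issued for the `gmp-system` namespace, while the webhook is served from `gke-gmp-system`, so the check fails regardless of network reachability. That is why this shows up as a certificate error rather than a timeout.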
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 33
I understand - fixing this on AP is our top priority.
Update here: we have 1.25+ support on Autopilot clusters.
Will keep this issue open as we work on 1.24…
Yes, still working on this. AP is tricky, as you’ve encountered. We’re working on making this on by default so all this struggle goes away.
This is still broken in GKE Autopilot 1.24. Not sure why this was closed.
Autopilot support is now released and working in production! I’ve tested it using the Rapid channel and confirmed it works. https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#gke-autopilot
It’s on by default in all clusters running 1.25 and greater. 1.25 is slated to enter the Regular channel next week, and clusters are slated to be upgraded by end of March. We aren’t able to backport to 1.24, but given that AP clusters get auto-updated, this will resolve itself in due time.
Closing this as fixed.
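As a rough way to tell whether a given cluster falls under the "1.25 and greater" statement above, the following Python sketch compares a GKE version string against a minimum minor version. Both function names are hypothetical helpers, and the default threshold of 1.25 is taken from this comment only; another comment in this thread mentions 1.23, so the minimum is left as a parameter. The sketch assumes version strings of the form `v1.23.12-gke.100`.

```python
import re


def parse_gke_version(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as 'v1.23.12-gke.100'."""
    match = re.match(r"^v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def has_default_managed_collection(version: str, minimum=(1, 25)) -> bool:
    """Compare only major.minor against the assumed minimum version."""
    return parse_gke_version(version)[:2] >= minimum


# Version reported by a commenter in this thread.
print(has_default_managed_collection("v1.23.12-gke.100"))  # False
# Hypothetical 1.25 version string, for illustration only.
print(has_default_managed_collection("1.25.3-gke.800"))    # True
```

The cluster's control plane version can be read from `kubectl version` or the Cloud Console and passed to the helper as a string.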
AP still not supported - the latest news is we are almost done with 1.25 support, and then will make it work on 1.24. Stay tuned.
It is especially nefarious in GKE Autopilot 1.24 because workload metrics have been deprecated.
Is this still being worked on? I have the same error.
I’m running v1.23.12-gke.100 on my cluster, with a few workloads that follow this template
Is there anything I can do to work around this and get Prometheus to scrape the metrics from /metrics on port 80?
What I already tried:
Keeping an eye on this thread, since I want to use GMP on a private AP cluster on the Regular channel.
It’s not fully rolled out yet.
It is a bug - that checkbox doesn’t do anything on AP clusters besides put you in a broken state. We’re going to disable it.
All AP clusters >=1.23 should have GMP on by default by end of this week.
The checkbox to enable GMP in Autopilot isn’t functional. When the rollout is completed, GMP will be enabled by default.