prometheus-engine: TLS Error configuring a PodMonitoring resource in GKE Autopilot cluster

Hi,

I am attempting to follow the steps here to configure managed collection on GKE Autopilot.

When attempting to apply any PodMonitoring resource, I get the following error: Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc
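The x509 message above comes down to a hostname check: the API server calls the webhook at gmp-operator.gke-gmp-system.svc, but the serving certificate only lists names under gmp-system. Below is a minimal Python sketch of that comparison, using the names copied from the error. The helper `hostname_matches` is hypothetical and simplified (exact match plus a single-label wildcard). It is not the verification code Kubernetes or Go actually runs.

```python
def hostname_matches(hostname: str, sans: list[str]) -> bool:
    """Return True if hostname matches any SAN entry.

    Simplified illustration: exact (case-insensitive) match, or a
    wildcard entry such as "*.example.com" covering exactly one
    leftmost label.
    """
    host = hostname.lower().rstrip(".")
    for san in sans:
        san = san.lower().rstrip(".")
        if san == host:
            return True
        if san.startswith("*."):
            # Drop the leftmost label of the host and compare the rest.
            _, _, rest = host.partition(".")
            if rest and rest == san[2:]:
                return True
    return False


# Names the certificate is valid for, as reported in the error message.
CERT_SANS = [
    "gmp-operator",
    "gmp-operator.gmp-system",
    "gmp-operator.gmp-system.svc",
]

# Host the API server dials for the webhook, from the same error message.
WEBHOOK_HOST = "gmp-operator.gke-gmp-system.svc"

print(hostname_matches(WEBHOOK_HOST, CERT_SANS))                   # False
print(hostname_matches("gmp-operator.gmp-system.svc", CERT_SANS))  # True
```

The first check fails because none of the certificate names include the gke-gmp-system namespace, which is consistent with the operator being unable to update the webhook configuration's CA bundle.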

In the logs for the gmp-operator in the gke-gmp-system namespace I see the following errors:

"validatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
- "Setting CA bundle for ValidatingWebhookConfiguration failed"
"mutatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
- "Setting CA bundle for MutatingWebhookConfiguration failed"

This seems in some ways similar to the following issues:

but it is notably different since it is a certificate error and not a timeout.

About this issue

State: closed
Created 2 years ago
Reactions: 1
Comments: 33

Most upvoted comments

I understand - fixing this on AP is our top priority.

lyanco on Sep 20, 2022

Update here: we have 1.25+ support on Autopilot clusters.

Will keep this issue open as we work on 1.24…

pintohutch on Nov 11, 2022

Yes, still working on this. AP is tricky, as you’ve encountered. We’re working on making this on by default so all this struggle goes away.

lyanco on Oct 11, 2022

This is still broken in GKE autopilot 1.24. Not sure why this was closed?

philip-harvey on Sep 20, 2022

Autopilot support is now released and working in production! I’ve tested it using the Rapid channel and confirmed it works. https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#gke-autopilot

It’s on by default in all clusters running 1.25 and greater. 1.25 is slated to enter the Regular channel next week, and clusters are slated to be upgraded by end of March. We aren’t able to backport to 1.24, but given that AP clusters get auto-updated, this will resolve itself in due time.

Closing this as fixed.
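As a rough illustration of the version cutoff described above, the following Python sketch checks whether a GKE version string is at or above 1.25. The function names and the sample version `1.25.3-gke.800` are hypothetical, and the threshold is taken only from the comment above, not from any GKE API.

```python
import re


def gke_minor_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a GKE version such as 'v1.23.12-gke.100'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def gmp_on_by_default(version: str, minimum: tuple[int, int] = (1, 25)) -> bool:
    """Return True if the version meets the cutoff stated in the thread."""
    return gke_minor_version(version) >= minimum


print(gmp_on_by_default("v1.23.12-gke.100"))  # False
print(gmp_on_by_default("1.25.3-gke.800"))    # True (hypothetical version)
```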

lyanco on Jan 4, 2023

AP still not supported - the latest news is we are almost done with 1.25 support, and then will make it work on 1.24. Stay tuned.

lyanco on Nov 3, 2022

It is especially nefarious in GKE Autopilot 1.24 because workload metrics have been deprecated.

brokenjacobs on Sep 20, 2022

Is this still being worked on? I have the same error.

I’m running v1.23.12-gke.100 on my cluster, with a few workloads that follow this template:

---
apiVersion: v1
kind: Service
metadata:
  name: accounts-api
  namespace: app
  labels:
    component: accounts-api
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  selector:
    component: accounts-api
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: accounts-api
  namespace: app
  annotations:
    kubernetes.io/ingress.class: "gce-internal"
spec:
  rules:
    - host: aaaa-accounts-api.clg.nos.internal
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: aaaa-accounts-api
                port:
                  number: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: accounts-api
  namespace: app
  labels:
    app: accounts-api
spec:
  selector:
    matchLabels:
      component: accounts-api
  template:
    metadata:
      labels:
        component: accounts-api
        istio-injection: enabled
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "80"
        prometheus.io/scrape: "true"
        prometheus.io/alarmgroup: "users"
    spec:
      volumes:
        - name: accounts-configmap
          configMap:
            name: accounts-configmap
      containers:
        - name: accounts-api
          image: my-image
          ports:
            - name: tcp80
              containerPort: 80
              protocol: TCP
          volumeMounts:
            - name: accounts-configmap
              mountPath: /app/appsettings.json
              subPath: appsettings.json
              readOnly: true
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: accounts-api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: accounts-api
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: accounts-api
  namespace: app
spec:
  selector:
    matchLabels:
      component: accounts-api
  endpoints:
    - port: 80
      interval: 30s
      path: /metrics

Is there anything I can do to work around this and get Prometheus to read the metrics from /metrics on port 80?

What I have already tried:

- Recreating the cluster
- Changing the services' namespace to gke-gmp-system
- Adding the component label to the deployments
- Disabling managed Prometheus and running it as a manual service

phenriques740 on Oct 11, 2022

Keeping an eye on this thread to use GMP on a regular channel private AP cluster

mimizone on Aug 22, 2022

It’s not fully rolled out yet.

lyanco on Aug 12, 2022

It is a bug - that checkbox doesn’t do anything on AP clusters besides put you in a broken state. We’re going to disable it.

All AP clusters >=1.23 should have GMP on by default by end of this week.

lyanco on Aug 9, 2022

The checkbox to enable GMP in Autopilot isn’t functional. When the rollout is completed, GMP will be enabled by default.

lyanco on Aug 9, 2022