prometheus-engine: TLS Error configuring a PodMonitoring resource in GKE Autopilot cluster

Hi,

I am attempting to follow the setup documentation to configure managed collection on GKE Autopilot.

When attempting to apply any PodMonitoring resource, I get the following error:

Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc

In the logs for the gmp-operator in the gke-gmp-system namespace I see the following errors:

  • "validatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
    • "Setting CA bundle for ValidatingWebhookConfiguration failed"
  • "mutatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
    • "Setting CA bundle for MutatingWebhookConfiguration failed"

This seems in some ways similar to a few previously reported issues, but it is notably different: this is a certificate error, not a timeout.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 33

Most upvoted comments

I understand - fixing this on AP is our top priority.

Update here: we have 1.25+ support on Autopilot clusters.

Will keep this issue open as we work on 1.24…

Yes, still working on this. AP is tricky, as you’ve encountered. We’re working on enabling this by default so that all of this struggle goes away.

This is still broken in GKE autopilot 1.24. Not sure why this was closed?

Autopilot support is now released and working in production! I’ve tested it using the Rapid channel and confirmed it works. https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#gke-autopilot

It’s on by default in all clusters running 1.25 and greater. 1.25 is slated to enter the Regular channel next week, and clusters should be upgraded by end of March. We aren’t able to backport to 1.24, but since AP clusters are auto-updated, this will resolve itself in due time.
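
For anyone verifying this on a given cluster, a minimal check with gcloud (CLUSTER_NAME and REGION are placeholders):

# Confirm the control-plane version and whether managed collection is enabled
gcloud container clusters describe CLUSTER_NAME --region=REGION \
  --format="value(currentMasterVersion, monitoringConfig.managedPrometheusConfig.enabled)"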

Closing this as fixed.

AP is still not supported - the latest news is that we are almost done with 1.25 support, and then we will make it work on 1.24. Stay tuned.

It is especially painful in GKE Autopilot 1.24 because workload metrics have been deprecated.

Is this still being worked on? I have the same error.

I’m running v1.23.12-gke.100 on my cluster, with a few workloads that follow this template:

---
apiVersion: v1
kind: Service
metadata:
  name: accounts-api
  namespace: app
  labels:
    component: accounts-api
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  selector:
    component: accounts-api
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: accounts-api
  namespace: app
  annotations:
    kubernetes.io/ingress.class: "gce-internal"
spec:
  rules:
    - host: aaaa-accounts-api.clg.nos.internal
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: accounts-api
                port:
                  number: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: accounts-api
  namespace: app
  labels:
    app: accounts-api
spec:
  selector:
    matchLabels:
      component: accounts-api
  template:
    metadata:
      labels:
        component: accounts-api
        istio-injection: enabled
      annotations:
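        # These prometheus.io/* annotations are only honored by a self-deployed
        # Prometheus scrape config; managed collection (GMP) configures scraping
        # via the PodMonitoring resource at the end of this template.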
        prometheus.io/path: /metrics
        prometheus.io/port: "80"
        prometheus.io/scrape: "true"
        prometheus.io/alarmgroup: "users"
    spec:
      volumes:
        - name: accounts-configmap
          configMap:
            name: accounts-configmap
      containers:
        - name: accounts-api
          image: my-image
          ports:
            - name: tcp80
              containerPort: 80
              protocol: TCP
          volumeMounts:
            - name: accounts-configmap
              mountPath: /app/appsettings.json
              subPath: appsettings.json
              readOnly: true
---
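# autoscaling/v2beta2 is deprecated; autoscaling/v2 is available as of Kubernetes 1.23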
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: accounts-api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: accounts-api
  minReplicas: 1 
  maxReplicas: 10 
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: accounts-api
  namespace: app
spec:
  selector:
    matchLabels:
      component: accounts-api
  endpoints:
    - port: 80
      interval: 30s
      path: /metrics

Is there anything I can do to work around this and get Prometheus to read the metrics from /metrics on port 80?

What I already tried:

  • Recreating the cluster
  • Changing the services’ namespace to gke-gmp-system
  • Creating the component label in the deployments
  • Disabling managed Prometheus and running Prometheus manually as a self-managed service
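
One more check before re-applying the PodMonitoring is whether the managed components themselves are healthy (a minimal sketch; the namespace and deployment name are taken from the error output earlier in this thread):

kubectl get pods -n gke-gmp-system
kubectl logs -n gke-gmp-system deploy/gmp-operator --tail=50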

Keeping an eye on this thread so I can use GMP on a Regular channel private AP cluster.

It’s not fully rolled out yet.

It is a bug - that checkbox doesn’t do anything on AP clusters besides putting you in a broken state. We’re going to disable it.

All AP clusters >=1.23 should have GMP on by default by end of this week.

The checkbox to enable GMP in Autopilot isn’t functional. When the rollout is completed, GMP will be enabled by default.
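
Once the rollout reaches a given cluster, re-applying the example manifest from the top of the thread should confirm the fix; a minimal sketch:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml
kubectl get podmonitorings -A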