istio: Metrics on workload certificate expiry

Describe the feature request

Since Istio 1.7, workload certificates are no longer stored as Kubernetes secrets; instead, they are kept in the sidecar proxy container. They can be viewed with the istioctl pc secret <pod> command. They are also exposed through the admin API at localhost:15000/certs, and the istioctl command seems to use that endpoint via port-forwarding.
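
As a rough illustration of what reading that endpoint involves, the Python sketch below parses the JSON returned by /certs and collects the expiration time of each workload (cert_chain) certificate. The field names (certificates, ca_cert, cert_chain, expiration_time) are assumed from Envoy's admin certificate output; the sample payload and serial numbers are made up, and fetch_certs only works from inside the pod, where the admin port is reachable.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import List

# Illustrative, trimmed /certs response; field names assumed, values made up.
SAMPLE_CERTS_JSON = """
{
  "certificates": [
    {
      "ca_cert": [
        {"serial_number": "01", "expiration_time": "2032-02-07T00:56:35Z"}
      ],
      "cert_chain": [
        {"serial_number": "02", "days_until_expiration": "0",
         "expiration_time": "2021-01-02T12:00:00Z"}
      ]
    }
  ]
}
"""


def fetch_certs(url: str = "http://localhost:15000/certs") -> str:
    """Read the raw JSON from the admin endpoint (only reachable inside the pod)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2021-01-02T12:00:00Z into UTC."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on.
    return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)


def workload_cert_expiries(certs_json: str) -> List[datetime]:
    """Return the expiration time of every certificate in each cert_chain."""
    data = json.loads(certs_json)
    return [
        parse_time(cert["expiration_time"])
        for entry in data.get("certificates", [])
        for cert in entry.get("cert_chain", [])
    ]
```

Inside the pod, min(workload_cert_expiries(fetch_certs())) would give the soonest workload certificate expiry.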

We want to be able to monitor the workload certificates for possible expiry, e.g. by scraping that information from istiod.

Describe alternatives you’ve considered

One option we considered was defining values.sidecarInjectorWebhook.templates: certificate-monitor and possibly setting values.sidecarInjectorWebhook.defaultTemplates: [sidecar, certificate-monitor]. An app in the injected container would query localhost:15000/certs, extract the certificate expiry, and expose it for Prometheus to scrape on a regular basis. We have a working proof of concept, but the overhead, both on Prometheus and from running a second sidecar for every Istio-enabled workload, is disproportionate, especially with thousands of workloads.
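
A minimal sketch of the exporter half of such a certificate-monitor container, assuming it already has the expiry times (for example, parsed from /certs). It renders them in the Prometheus text exposition format and serves them over HTTP using only the standard library. The metric name workload_cert_expiry_timestamp_seconds and the serial_number label are hypothetical, not existing Istio or Envoy metrics.

```python
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict

# Hypothetical metric name; not an existing Istio or Envoy metric.
METRIC = "workload_cert_expiry_timestamp_seconds"


def render_metrics(expiries: Dict[str, datetime]) -> str:
    """Render certificate expiries in the Prometheus text exposition format."""
    lines = [
        f"# HELP {METRIC} Unix time, in seconds, when the workload certificate expires.",
        f"# TYPE {METRIC} gauge",
    ]
    for serial, expiry in sorted(expiries.items()):
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)  # treat naive times as UTC
        # Serial numbers are hex strings, so no label-value escaping is needed here.
        lines.append(f'{METRIC}{{serial_number="{serial}"}} {int(expiry.timestamp())}')
    return "\n".join(lines) + "\n"


def make_handler(get_expiries: Callable[[], Dict[str, datetime]]):
    """Build an HTTP handler that renders fresh metrics on every GET."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = render_metrics(get_expiries()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler
```

Serving it would look like http.server.HTTPServer(("", PORT), make_handler(source)).serve_forever(), with the port and the source callable being deployment choices. Even this small process still has to run once per workload, which is the overhead described above.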

Another option was opening the admin endpoint to a specific monitoring application that collects the information and exposes it to Prometheus. This was discarded because of the obvious security concerns.

Affected product area (please put an X in all that apply)

[ ] Docs [ ] Installation [x] Networking [ ] Performance and Scalability [x] Extensions and Telemetry [x] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (12 by maintainers)

Most upvoted comments

This is already in the master branch; nothing needs to be done.

Yeah, and I believe we have something similar at istiod for root cert expiry?

A metric (named citadel_server_root_cert_expiry_timestamp) exists in pilot-discovery right now, but its format is not easy to read.

# HELP citadel_server_root_cert_expiry_timestamp The unix timestamp, in seconds, when Citadel root cert will expire. A negative time indicates the cert is expired.
# TYPE citadel_server_root_cert_expiry_timestamp gauge
citadel_server_root_cert_expiry_timestamp 1.959728195e+09
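
To make that value readable, the Python sketch below pulls the sample out of the text output above and converts the Unix timestamp into a UTC date plus the seconds remaining. The helper names are hypothetical. In PromQL, citadel_server_root_cert_expiry_timestamp - time() gives the same seconds-remaining figure directly.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

SAMPLE = """\
# HELP citadel_server_root_cert_expiry_timestamp The unix timestamp, in seconds, when Citadel root cert will expire. A negative time indicates the cert is expired.
# TYPE citadel_server_root_cert_expiry_timestamp gauge
citadel_server_root_cert_expiry_timestamp 1.959728195e+09
"""


def sample_value(exposition: str, name: str) -> float:
    """Return the value of an unlabeled sample from Prometheus text output."""
    for line in exposition.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        parts = line.split()
        if parts[0] == name:
            return float(parts[1])
    raise KeyError(name)


def describe_expiry(timestamp: float, now: Optional[datetime] = None) -> Tuple[datetime, float]:
    """Convert a Unix timestamp into a UTC datetime and the seconds left until then."""
    expiry = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return expiry, (expiry - now).total_seconds()


ts = sample_value(SAMPLE, "citadel_server_root_cert_expiry_timestamp")
print(describe_expiry(ts)[0].isoformat())  # 2032-02-07T00:56:35+00:00
```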

@bianpengyuan IIRC, can we add a metric here?

Doesn't this only happen when the cert is generated? We’d want a periodic report of the current workload cert expiry.

This looks more like a metric for the Istio proxy than for istiod? Envoy exposes a metric, envoy_server_days_until_first_cert_expiring, although it will almost always be 0 since the default expiry is 24 hours. I think we can make istio-agent generate this stat.
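
The resolution problem can be shown with a short Python sketch: flooring the remaining validity to whole days reads 0 for a 24-hour certificate almost immediately after issuance, while a seconds-based gauge (the kind istio-agent could report periodically) still carries useful information. That Envoy's stat floors to whole days is an assumption here, and both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def whole_days_until(expiry: datetime, now: datetime) -> int:
    """Floor the remaining validity to whole days (assumed to mirror Envoy's stat)."""
    return (expiry - now) // timedelta(days=1)


def seconds_until(expiry: datetime, now: datetime) -> float:
    """Remaining validity in seconds, the resolution a periodic gauge could report."""
    return (expiry - now).total_seconds()


# A certificate with the default 24-hour lifetime, checked one minute after issuance.
issued = datetime(2021, 1, 1, tzinfo=timezone.utc)
expiry = issued + timedelta(hours=24)
now = issued + timedelta(minutes=1)

print(whole_days_until(expiry, now))  # 0
print(seconds_until(expiry, now))     # 86340.0
```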