tidb-operator: TiCDC metrics missing in Prometheus

Missing Prometheus metrics

We are currently trying to find the best way to monitor replication using TiCDC, and we noticed that the generated Prometheus config is probably missing something.

What version of Kubernetes are you using? v1.15.9

What version of TiDB Operator are you using? v1.1.6 (tidb version v4.0.8)

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods? StorageClasses provided by Portworx

What’s the status of the TiDB cluster pods?

kubectl get po -l app.kubernetes.io/instance=internaltools-tidb -o wide
NAME                                            READY   STATUS    RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
internaltools-tidb-discovery-7f4789cf45-9txwc   1/1     Running   0          3d11h   10.168.35.173    orscale-01-01.adm.dc3   <none>           <none>
internaltools-tidb-monitor-779d5569f6-mk667     3/3     Running   0          18h     10.168.234.5     orscale-01-07.adm.dc3   <none>           <none>
internaltools-tidb-pd-0                         1/1     Running   0          18h     10.168.231.128   orscale-01-31.adm.dc3   <none>           <none>
internaltools-tidb-pd-1                         1/1     Running   0          18h     10.168.61.231    orscale-01-30.adm.dc3   <none>           <none>
internaltools-tidb-pd-2                         1/1     Running   0          18h     10.168.59.96     orscale-01-29.adm.dc3   <none>           <none>
internaltools-tidb-ticdc-0                      1/1     Running   1          18h     10.168.59.93     orscale-01-29.adm.dc3   <none>           <none>
internaltools-tidb-ticdc-1                      1/1     Running   1          18h     10.168.129.98    orscale-01-33.adm.dc3   <none>           <none>
internaltools-tidb-ticdc-2                      1/1     Running   1          18h     10.168.3.247     orscale-01-32.adm.dc3   <none>           <none>
internaltools-tidb-tidb-0                       2/2     Running   0          18h     10.168.61.233    orscale-01-30.adm.dc3   <none>           <none>
internaltools-tidb-tidb-1                       2/2     Running   0          18h     10.168.3.245     orscale-01-32.adm.dc3   <none>           <none>
internaltools-tidb-tikv-0                       1/1     Running   0          18h     10.168.59.89     orscale-01-29.adm.dc3   <none>           <none>
internaltools-tidb-tikv-1                       1/1     Running   0          18h     10.168.231.129   orscale-01-31.adm.dc3   <none>           <none>
internaltools-tidb-tikv-2                       1/1     Running   0          18h     10.168.3.244     orscale-01-32.adm.dc3   <none>           <none>
internaltools-tidb-tikv-3                       1/1     Running   0          18h     10.168.61.232    orscale-01-30.adm.dc3   <none>           <none>
internaltools-tidb-tikv-4                       1/1     Running   0          18h     10.168.129.94    orscale-01-33.adm.dc3   <none>           <none>

What did you do?

Check that the metric is available through the pod exporter

kubectl port-forward internaltools-tidb-ticdc-0 8301 &

curl -sk https://localhost:8301/metrics | grep -i lag
Handling connection for 8301
# HELP ticdc_processor_checkpoint_ts_lag global checkpoint ts lag of processor
# TYPE ticdc_processor_checkpoint_ts_lag gauge
ticdc_processor_checkpoint_ts_lag{capture="internaltools-tidb-ticdc-0.internaltools-tidb-ticdc-peer.internaltools.svc:8301",changefeed="350ef73f-a472-419c-a46d-b89c1043d71b"} 1.224
# HELP ticdc_processor_resolved_ts_lag local resolved ts lag of processor
# TYPE ticdc_processor_resolved_ts_lag gauge
ticdc_processor_resolved_ts_lag{capture="internaltools-tidb-ticdc-0.internaltools-tidb-ticdc-peer.internaltools.svc:8301",changefeed="350ef73f-a472-419c-a46d-b89c1043d71b"} 0.879
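The -k flag above is needed because, with cluster TLS enabled, the TiCDC metrics endpoint is served over HTTPS with the cluster's own CA. As an additional sanity check (a sketch, assuming TiCDC runs as the first container in the pod and that the operator passes the TLS material as command-line flags), the container command can be inspected for the CA/cert/key paths:

kubectl get pod internaltools-tidb-ticdc-0 -o jsonpath='{.spec.containers[0].command}'  # assumes container index 0 is the cdc server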

Check for the same metrics in Prometheus

kubectl port-forward svc/internaltools-tidb-prometheus 9090 &

Then look for the metric ticdc_processor_checkpoint_ts_lag in the Prometheus UI.
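
Instead of browsing the UI, the Prometheus HTTP API can also be queried over the same port-forward. The two commands below are a sketch: the job label ticdc is taken from the generated scrape config shown further down, and jq is assumed to be installed.

# an empty result list means the metric was never scraped
curl -s 'http://localhost:9090/api/v1/query?query=ticdc_processor_checkpoint_ts_lag'

# show scrape URL, health, and last error for the ticdc targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "ticdc") | {scrapeUrl, health, lastError}'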

What did you expect to see?

We expected to find the metric ticdc_processor_checkpoint_ts_lag in Prometheus.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Thanks for the update, we'll test v1.1.8 as soon as it is available.

It seems the TLS config of TiCDC, Importer, and Lightning is not honored. (The following configs were provided by the OP in a Slack discussion.)

The OP's cluster has TLS enabled, and for TiDB, TiKV, and PD the generated Prometheus config looks like this:

- job_name: tikv
  honor_labels: true
  scrape_interval: 15s
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names:
      - frak
  tls_config:
    ca_file: /var/lib/cluster-client-tls/ca.crt
    cert_file: /var/lib/cluster-client-tls/tls.crt
    key_file: /var/lib/cluster-client-tls/tls.key
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: 
...

But for ticdc, importer, and lightning it is:

- job_name: ticdc
  honor_labels: true
  scrape_interval: 15s
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names:
      - frak
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
...
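
For comparison, here is a sketch of what the ticdc job would presumably need to look like once the operator honors the cluster's TLS setting, simply mirroring the tikv job above (same certificate paths; the relabel rules are elided as in the snippets above):

# sketch only: mirrors the tikv job above, not actual operator output
- job_name: ticdc
  honor_labels: true
  scrape_interval: 15s
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names:
      - frak
  tls_config:
    ca_file: /var/lib/cluster-client-tls/ca.crt
    cert_file: /var/lib/cluster-client-tls/tls.crt
    key_file: /var/lib/cluster-client-tls/tls.key
    insecure_skip_verify: false
  relabel_configs:
  ...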