pipeline: Too many Tekton metrics, which stops our monitoring system from working well

Expected Behavior

Collect only high-level metrics, so that an admin can get an overview of Tekton as a whole in the monitoring system. Alternatively, allow end users to choose which metrics to collect.

Actual Behavior

Currently, each metric is very fine-grained and carries multiple labels: https://github.com/tektoncd/pipeline/blob/master/docs/metrics.md

pipeline=<pipeline_name>
pipelinerun=<pipelinerun_name>
status=<status>
task=<task_name>
taskrun=<taskrun_name>
namespace=<pipelineruns-taskruns-namespace>

Tekton Pipelines also emits histogram-type metrics, which produce a lot of data, for example:

tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="10"} 0
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="30"} 0
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="60"} 0
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="300"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="900"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="1800"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="3600"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="5400"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="10800"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="21600"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="43200"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="86400"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg",le="+Inf"} 1
tekton_taskrun_duration_seconds_sum{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg"} 75
tekton_taskrun_duration_seconds_count{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="10"} 0
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="30"} 0
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="60"} 0
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="300"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="900"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="1800"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="3600"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="5400"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="10800"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="21600"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="43200"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="86400"} 1
tekton_taskrun_duration_seconds_bucket{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4",le="+Inf"} 1
tekton_taskrun_duration_seconds_sum{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4"} 96
tekton_taskrun_duration_seconds_count{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4"} 1

...

tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true",le="10"} 0
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true",le="100"} 0
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true",le="1000"} 1
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true",le="10000"} 1
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true",le="30000"} 1
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true",le="60000"} 1
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true",le="+Inf"} 1
tekton_reconcile_latency_sum{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true"} 399
tekton_reconcile_latency_count{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true"} 1
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true",le="10"} 0
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true",le="100"} 0
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true",le="1000"} 1
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true",le="10000"} 1
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true",le="30000"} 1
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true",le="60000"} 1
tekton_reconcile_latency_bucket{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true",le="+Inf"} 1
tekton_reconcile_latency_sum{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true"} 458
tekton_reconcile_latency_count{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true"} 1

This means that every taskrun gets its own set of metrics: tekton_pipelinerun_duration_seconds_[bucket, sum, count], tekton_pipelinerun_taskrun_duration_seconds_[bucket, sum, count], tekton_taskrun_duration_seconds_[bucket, sum, count], etc.

These series are not removed from the metrics after a taskrun completes or is deleted.

We are using Prometheus and Sysdig, and we can only forward 3000 metrics per collection (every 10 seconds). The details are here: https://docs.sysdig.com/en/limit-prometheus-metric-collection.html#UUID-0c740bd2-4666-061b-c224-9694e12e2276_section-idm231791561676295

max_metrics

The maximum number of Prometheus metrics that the agent can consume from the target. 
The default is 1000. 
The maximum limit is 10,000 on agent v10.0.0 and above. 
The maximum limit is 3000 on agent versions below 10.0.0.

As the number of Tekton taskrun metrics keeps growing, they use up the metric limit. Everything beyond 3000 is dropped, so we no longer receive our other metrics.
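To put rough numbers on this, here is a minimal sketch of the arithmetic. The bucket counts are read off the sample output above (13 buckets for tekton_taskrun_duration_seconds, 7 for tekton_reconcile_latency, plus a sum and a count each). It assumes that only these two histograms are emitted per taskrun and that the limit counts each labelled series separately. Real deployments emit more metrics, so the real ceiling is lower.

```python
# Series counts read off the sample output above.
TASKRUN_DURATION_BUCKETS = 13   # le="10" ... le="+Inf"
RECONCILE_LATENCY_BUCKETS = 7   # le="10" ... le="+Inf"
EXTRA_SERIES_PER_HISTOGRAM = 2  # the _sum and _count series


def series_per_taskrun() -> int:
    """Series added by one taskrun, assuming only the two histograms above."""
    taskrun_duration = TASKRUN_DURATION_BUCKETS + EXTRA_SERIES_PER_HISTOGRAM
    reconcile_latency = RECONCILE_LATENCY_BUCKETS + EXTRA_SERIES_PER_HISTOGRAM
    return taskrun_duration + reconcile_latency


def taskruns_until_limit(limit: int) -> int:
    """Number of taskruns whose series fit under the given metric limit."""
    return limit // series_per_taskrun()


print(series_per_taskrun())        # 24
print(taskruns_until_limit(3000))  # 125
```

Under these assumptions, about 125 taskruns are enough to fill a 3000-metric budget, even before any other application reports anything.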

Steps to Reproduce the Problem

Enable the Tekton pipeline metrics
Keep creating taskruns
View the metrics locally or in the Sysdig console

Tekton Pipeline version: v0.11.3

So I am wondering: do we really need such fine-grained metrics for each taskrun? Can we select which metrics we want to use?

As more and more taskruns are created, more and more metrics are generated, and the monitoring system fills up, slows down, or crashes. I think that in the end no one will want to use these metrics, because they are too fine-grained, carry too much information for a cluster or Tekton admin, and crowd out other metrics.

I suggest that each metric cover only high-level data, for example the duration in seconds across all taskruns, rather than one record per taskrun.
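As an illustration of what "high-level" could look like, here is a rough sketch that post-processes a few of the sample lines from above by dropping the per-run labels (taskrun and key) and summing the values. This is only a scrape-side illustration, not how Tekton implements or would implement it. The function name `aggregate` and the choice of labels to drop are assumptions made for this example.

```python
import re
from collections import defaultdict

# A few _sum/_count lines copied from the sample output above.
SAMPLE = '''\
tekton_taskrun_duration_seconds_sum{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg"} 75
tekton_taskrun_duration_seconds_count{namespace="2213a7c1-a282",status="failed",task="anonymous",taskrun="perf-br-test-10-fwmbg"} 1
tekton_taskrun_duration_seconds_sum{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4"} 96
tekton_taskrun_duration_seconds_count{namespace="2213a7c1-a282",status="success",task="anonymous",taskrun="perf-br-k-1-tw4d4"} 1
tekton_reconcile_latency_sum{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true"} 399
tekton_reconcile_latency_count{key="037c84c8-c864/perf-br-05-1bp-1-2sb4x",reconciler="TaskRun",success="true"} 1
tekton_reconcile_latency_sum{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true"} 458
tekton_reconcile_latency_count{key="037c84c8-c864/perf-br-05-2bp-1-qf9zc",reconciler="TaskRun",success="true"} 1
'''

LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def aggregate(text, drop=("taskrun", "key")):
    """Sum sample values after removing the per-run labels listed in `drop`."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a labelled sample
        labels = tuple(sorted(
            (k, v) for k, v in LABEL.findall(match["labels"]) if k not in drop
        ))
        totals[(match["name"], labels)] += float(match["value"])
    return dict(totals)


for (name, labels), value in sorted(aggregate(SAMPLE).items()):
    print(name, dict(labels), value)
```

With the per-run labels gone, the number of series depends on the number of namespaces, statuses, and tasks instead of growing with every new taskrun.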

Thanks!

About this issue

State: closed
Created 4 years ago
Reactions: 3
Comments: 23 (14 by maintainers)

Most upvoted comments

+1 to a ConfigMap option to only report namespace-level metrics.

I’d love to get more data about how users are using these metrics, since I suspect based on the relative lack of urgency on this issue that people don’t actually depend on the task and taskRun keys, and we can just remove them via feature flag or config-observability option. I’d prefer not to plan to have the option exist indefinitely, and instead use it to phase out support for fine-grained metrics.

In general I think metrics are most useful as a high-level view, and shouldn’t include information about the specific workflow (pipeline name, task name, etc.) – if you want to track the latency of a pipeline execution over time, you can do that by scraping kubectl get pipelinerun and eventually by querying Tekton Results – and maybe emitting your own metrics based on that aggregation, that’s up to you.

So I think my proposal would be to add a feature flag to only report at the namespace level, default it to false for 1+ release, mention this in release notes, and if there’s no significant user pushback default it to true for 1+ release, then remove it entirely. If there is user pushback, we can use their input to guide future decisions, but my belief is that nobody depends on this today.

wdyt?

imjasonh on Nov 17, 2020
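The namespace-level flag proposed in the comment above could behave roughly like the sketch below. Everything here is hypothetical: the flag name `namespace_only`, the function `metric_labels`, and the decision to keep `status` alongside `namespace` are assumptions for illustration, not Tekton's actual configuration or code.

```python
# Hypothetical: which labels survive when only namespace-level metrics are reported.
NAMESPACE_LEVEL_LABELS = ("namespace", "status")


def metric_labels(run_labels: dict, namespace_only: bool) -> dict:
    """Return the labels to attach to a metric for one run.

    With the (hypothetical) flag off, all labels are kept, as today.
    With it on, only the namespace-level labels are kept.
    """
    if not namespace_only:
        return dict(run_labels)
    return {k: v for k, v in run_labels.items() if k in NAMESPACE_LEVEL_LABELS}


run = {
    "namespace": "2213a7c1-a282",
    "status": "failed",
    "task": "anonymous",
    "taskrun": "perf-br-test-10-fwmbg",
}
print(metric_labels(run, namespace_only=False))
print(metric_labels(run, namespace_only=True))
```

Defaulting the flag to off first and flipping it in a later release, as the comment suggests, would leave existing dashboards unchanged until users have had a chance to object.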

Hi to everyone who might be interested: I wrote a TEP, https://github.com/tektoncd/community/pull/286, on this issue. It integrates some thoughts from your comments. I'd love to hear your feedback!

yaoxiaoqi on Dec 7, 2020

Hi all,

A colleague of ours already fixed the same problem in knative/pkg, and I heard that Tekton also uses knative/pkg for metrics.

Could you please check whether updating knative/pkg in Tekton helps? https://github.com/knative/pkg/pull/1494

Thanks!

zhangtbj on Jul 17, 2020