serving: queue-proxy 9091 /metrics returns 500 Internal Server Error: collected metric xxx was collected before with the same name and label values

In what area(s)?

/area monitoring

What version of Knative?

0.19.0

Expected Behavior

The queue-proxy /metrics endpoint on port 9091 should serve metrics with HTTP 200 OK.

Actual Behavior

The endpoint returns 500 Internal Server Error, and the response body contains errors of the form "collected metric xxx was collected before with the same name and label values".

The metrics request and response are shown below:

root@admin03.sjzcm:~# kubectl get po -n test7 -owide
NAME                                                        READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5      2/2     Running   0          3d22h   172.24.1.131   admin03.sjzcm   <none>           <none>

root@admin03.sjzcm:~# curl 172.24.1.131:9091/metrics -v
*   Trying 172.24.1.131...
* TCP_NODELAY set
* Connected to 172.24.1.131 (172.24.1.131) port 9091 (#0)
> GET /metrics HTTP/1.1
> Host: 172.24.1.131:9091
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 22 Dec 2020 06:13:50 GMT
< Transfer-Encoding: chunked
<
An error has occurred while serving metrics:

10 error(s) occurred:
* collected metric "revision_app_request_count" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > counter:<value:284 > } was collected before with the same name and label values
* collected metric "revision_app_request_latencies" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > histogram:<sample_count:284 sample_sum:345.99999999999966 bucket:<cumulative_count:284 upper_bound:5 > bucket:<cumulative_count:284 upper_bound:10 > bucket:<cumulative_count:284 upper_bound:20 > bucket:<cumulative_count:284 upper_bound:40 > bucket:<cumulative_count:284 upper_bound:60 > bucket:<cumulative_count:284 upper_bound:80 > bucket:<cumulative_count:284 upper_bound:100 > bucket:<cumulative_count:284 upper_bound:150 > bucket:<cumulative_count:284 upper_bound:200 > bucket:<cumulative_count:284 upper_bound:250 > bucket:<cumulative_count:284 upper_bound:300 > bucket:<cumulative_count:284 upper_bound:350 > bucket:<cumulative_count:284 upper_bound:400 > bucket:<cumulative_count:284 upper_bound:450 > bucket:<cumulative_count:284 upper_bound:500 > bucket:<cumulative_count:284 upper_bound:600 > bucket:<cumulative_count:284 upper_bound:700 > bucket:<cumulative_count:284 upper_bound:800 > bucket:<cumulative_count:284 upper_bound:900 > bucket:<cumulative_count:284 upper_bound:1000 > bucket:<cumulative_count:284 upper_bound:2000 > bucket:<cumulative_count:284 upper_bound:5000 > bucket:<cumulative_count:284 upper_bound:10000 > bucket:<cumulative_count:284 upper_bound:20000 > bucket:<cumulative_count:284 upper_bound:50000 > bucket:<cumulative_count:284 upper_bound:100000 > > } was collected before with the same name and label values
* collected metric "revision_queue_depth" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"" > label:<name:"response_code_class" value:"" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > gauge:<value:1 > } was collected before with the same name and label values
* collected metric "revision_request_count" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > counter:<value:284 > } was collected before with the same name and label values
* collected metric "revision_request_latencies" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > histogram:<sample_count:284 sample_sum:357.99999999999994 bucket:<cumulative_count:284 upper_bound:5 > bucket:<cumulative_count:284 upper_bound:10 > bucket:<cumulative_count:284 upper_bound:20 > bucket:<cumulative_count:284 upper_bound:40 > bucket:<cumulative_count:284 upper_bound:60 > bucket:<cumulative_count:284 upper_bound:80 > bucket:<cumulative_count:284 upper_bound:100 > bucket:<cumulative_count:284 upper_bound:150 > bucket:<cumulative_count:284 upper_bound:200 > bucket:<cumulative_count:284 upper_bound:250 > bucket:<cumulative_count:284 upper_bound:300 > bucket:<cumulative_count:284 upper_bound:350 > bucket:<cumulative_count:284 upper_bound:400 > bucket:<cumulative_count:284 upper_bound:450 > bucket:<cumulative_count:284 upper_bound:500 > bucket:<cumulative_count:284 upper_bound:600 > bucket:<cumulative_count:284 upper_bound:700 > bucket:<cumulative_count:284 upper_bound:800 > bucket:<cumulative_count:284 upper_bound:900 > bucket:<cumulative_count:284 upper_bound:1000 > bucket:<cumulative_count:284 upper_bound:2000 > bucket:<cumulative_count:284 upper_bound:5000 > bucket:<cumulative_count:284 upper_bound:10000 > bucket:<cumulative_count:284 upper_bound:20000 > bucket:<cumulative_count:284 upper_bound:50000 > bucket:<cumulative_count:284 upper_bound:100000 > > } was collected before with the same name and label values
* collected metric "revision_queue_depth" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"" > label:<name:"response_code_class" value:"" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > gauge:<value:1 > } was collected before with the same name and label values
* collected metric "revision_request_count" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > counter:<value:96 > } was collected before with the same name and label values
* collected metric "revision_request_latencies" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > histogram:<sample_count:96 sample_sum:114.00000000000004 bucket:<cumulative_count:96 upper_bound:5 > bucket:<cumulative_count:96 upper_bound:10 > bucket:<cumulative_count:96 upper_bound:20 > bucket:<cumulative_count:96 upper_bound:40 > bucket:<cumulative_count:96 upper_bound:60 > bucket:<cumulative_count:96 upper_bound:80 > bucket:<cumulative_count:96 upper_bound:100 > bucket:<cumulative_count:96 upper_bound:150 > bucket:<cumulative_count:96 upper_bound:200 > bucket:<cumulative_count:96 upper_bound:250 > bucket:<cumulative_count:96 upper_bound:300 > bucket:<cumulative_count:96 upper_bound:350 > bucket:<cumulative_count:96 upper_bound:400 > bucket:<cumulative_count:96 upper_bound:450 > bucket:<cumulative_count:96 upper_bound:500 > bucket:<cumulative_count:96 upper_bound:600 > bucket:<cumulative_count:96 upper_bound:700 > bucket:<cumulative_count:96 upper_bound:800 > bucket:<cumulative_count:96 upper_bound:900 > bucket:<cumulative_count:96 upper_bound:1000 > bucket:<cumulative_count:96 upper_bound:2000 > bucket:<cumulative_count:96 upper_bound:5000 > bucket:<cumulative_count:96 upper_bound:10000 > bucket:<cumulative_count:96 upper_bound:20000 > bucket:<cumulative_count:96 upper_bound:50000 > bucket:<cumulative_count:96 upper_bound:100000 > > } was collected before with the same name and label values
* collected metric "revision_app_request_count" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > counter:<value:96 > } was collected before with the same name and label values
* collected metric "revision_app_request_latencies" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > histogram:<sample_count:96 sample_sum:108.00000000000006 bucket:<cumulative_count:96 upper_bound:5 > bucket:<cumulative_count:96 upper_bound:10 > bucket:<cumulative_count:96 upper_bound:20 > bucket:<cumulative_count:96 upper_bound:40 > bucket:<cumulative_count:96 upper_bound:60 > bucket:<cumulative_count:96 upper_bound:80 > bucket:<cumulative_count:96 upper_bound:100 > bucket:<cumulative_count:96 upper_bound:150 > bucket:<cumulative_count:96 upper_bound:200 > bucket:<cumulative_count:96 upper_bound:250 > bucket:<cumulative_count:96 upper_bound:300 > bucket:<cumulative_count:96 upper_bound:350 > bucket:<cumulative_count:96 upper_bound:400 > bucket:<cumulative_count:96 upper_bound:450 > bucket:<cumulative_count:96 upper_bound:500 > bucket:<cumulative_count:96 upper_bound:600 > bucket:<cumulative_count:96 upper_bound:700 > bucket:<cumulative_count:96 upper_bound:800 > bucket:<cumulative_count:96 upper_bound:900 > bucket:<cumulative_count:96 upper_bound:1000 > bucket:<cumulative_count:96 upper_bound:2000 > bucket:<cumulative_count:96 upper_bound:5000 > bucket:<cumulative_count:96 upper_bound:10000 > bucket:<cumulative_count:96 upper_bound:20000 > bucket:<cumulative_count:96 upper_bound:50000 > bucket:<cumulative_count:96 upper_bound:100000 > > } was collected before with the same name and label values
* Curl_http_done: called premature == 0
* Connection #0 to host 172.24.1.131 left intact

Steps to Reproduce the Problem

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 20 (19 by maintainers)

Most upvoted comments

This issue has been fixed by this PR: https://github.com/knative/pkg/pull/2005

Root cause

Metrics get invalidated if they are not used for 10 minutes, but the Prometheus exporter does not implement a flush, so the old metric always stays registered. When the metric is created again after expiry, the exporter reports both the stale and the new time series under the same name and label values, which breaks the scrape.
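
For illustration, here is a minimal, hypothetical Go sketch (not Knative or knative/pkg code) of how a Prometheus registry ends up in this state: a collector that reports two samples with the same fully-qualified name and identical label values makes the client_golang handler answer every scrape with 500 and the error seen above.

package main

// Minimal sketch, NOT Knative code: a custom collector that emits the same
// metric name with identical label values twice, which is roughly what a
// stale (expired) time series plus a re-registered one looks like to the
// Prometheus exporter. promhttp then serves 500 with
// "... was collected before with the same name and label values".

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type duplicatingCollector struct {
	desc *prometheus.Desc
}

func (c *duplicatingCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

func (c *duplicatingCollector) Collect(ch chan<- prometheus.Metric) {
	// Two samples, same name, same label values: the registry's Gather()
	// rejects the duplicate and the whole scrape fails.
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, 284, "2xx")
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, 96, "2xx")
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&duplicatingCollector{
		desc: prometheus.NewDesc("revision_request_count", "demo counter",
			[]string{"response_code_class"}, nil),
	})

	// curl localhost:9091/metrics now returns 500 Internal Server Error.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9091", nil))
}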

Reproduce

To reproduce this issue in the activator: invoke a ksvc, leave it idle for 10 minutes or so, then invoke it again. Now check http://activator-service.knative-serving:9090/metrics and you will see the error. This has been fixed in the 0.20 release of knative/pkg, but the fix is only available at the head of knative/serving, so you can still reproduce it on knative/serving release 0.21 and earlier.
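
If it helps, here is a small, hypothetical Go checker (not part of Knative) that, run from inside the cluster, scrapes the activator endpoint mentioned above and reports whether the duplicate-metric error is present:

package main

// Hypothetical helper, not Knative code: scrape the activator metrics
// endpoint from inside the cluster and report whether it is currently
// serving the duplicate-metric error.

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://activator-service.knative-serving:9090/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	if resp.StatusCode != http.StatusOK ||
		strings.Contains(string(body), "was collected before with the same name and label values") {
		fmt.Println("metrics endpoint is broken: duplicate metrics detected")
		return
	}
	fmt.Println("metrics endpoint OK")
}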

cc @evankanderson @skonto

Thanks @itsmurugappan, I have not tried to reproduce this yet, but it is good to know it is solved. 😃 I think we can backport it if we need to.

This definitely looks like a bug; it also looks intermittent, so it may be tricky to track down. For the logs in https://github.com/knative/serving/issues/10421#issuecomment-749527745, how were they recorded (so I can try to reproduce)?

@zhaojizhuang, after the 24th request your counter was duplicated, and then only one of the two copies kept being incremented. What matters here is that, after a metric has been duplicated, the Prometheus endpoint will remain broken until you restart the entire process (queue-proxy).

After a few hours, I even sometimes see 3 or 4 occurrences of the same metric:

* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:350 > } was collected before with the same name and label values
* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:1 > } was collected before with the same name and label values
* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:1 > } was collected before with the same name and label values
* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:20 > } was collected before with the same name and label values

Notice the values: 350, 1, 1, and 20. The same metric appears four times, with diverging counter values.