serving: queue-proxy 9091 metrics returns Internal Server Error: collected metric xxx was collected before with the same name and label values
In what area(s)?
/area monitoring
What version of Knative?
0.19.0
Expected Behavior
The queue-proxy metrics endpoint on port 9091 should return 200 OK with valid Prometheus metrics.
Actual Behavior
The endpoint returns 500 Internal Server Error, and the body reports errors of the form collected metric xxx was collected before with the same name and label values.
The metrics request and response are as follows:
root@admin03.sjzcm:~# kubectl get po -n test7 -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5 2/2 Running 0 3d22h 172.24.1.131 admin03.sjzcm <none> <none>
root@admin03.sjzcm:~# curl 172.24.1.131:9091/metrics -v
* Trying 172.24.1.131...
* TCP_NODELAY set
* Connected to 172.24.1.131 (172.24.1.131) port 9091 (#0)
> GET /metrics HTTP/1.1
> Host: 172.24.1.131:9091
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 22 Dec 2020 06:13:50 GMT
< Transfer-Encoding: chunked
<
An error has occurred while serving metrics:
10 error(s) occurred:
* collected metric "revision_app_request_count" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > counter:<value:284 > } was collected before with the same name and label values
* collected metric "revision_app_request_latencies" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > histogram:<sample_count:284 sample_sum:345.99999999999966 bucket:<cumulative_count:284 upper_bound:5 > bucket:<cumulative_count:284 upper_bound:10 > bucket:<cumulative_count:284 upper_bound:20 > bucket:<cumulative_count:284 upper_bound:40 > bucket:<cumulative_count:284 upper_bound:60 > bucket:<cumulative_count:284 upper_bound:80 > bucket:<cumulative_count:284 upper_bound:100 > bucket:<cumulative_count:284 upper_bound:150 > bucket:<cumulative_count:284 upper_bound:200 > bucket:<cumulative_count:284 upper_bound:250 > bucket:<cumulative_count:284 upper_bound:300 > bucket:<cumulative_count:284 upper_bound:350 > bucket:<cumulative_count:284 upper_bound:400 > bucket:<cumulative_count:284 upper_bound:450 > bucket:<cumulative_count:284 upper_bound:500 > bucket:<cumulative_count:284 upper_bound:600 > bucket:<cumulative_count:284 upper_bound:700 > bucket:<cumulative_count:284 upper_bound:800 > bucket:<cumulative_count:284 upper_bound:900 > bucket:<cumulative_count:284 upper_bound:1000 > bucket:<cumulative_count:284 upper_bound:2000 > bucket:<cumulative_count:284 upper_bound:5000 > bucket:<cumulative_count:284 upper_bound:10000 > bucket:<cumulative_count:284 upper_bound:20000 > bucket:<cumulative_count:284 upper_bound:50000 > bucket:<cumulative_count:284 upper_bound:100000 > > } was collected before with the same name and label values
* collected metric "revision_queue_depth" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"" > label:<name:"response_code_class" value:"" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > gauge:<value:1 > } was collected before with the same name and label values
* collected metric "revision_request_count" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > counter:<value:284 > } was collected before with the same name and label values
* collected metric "revision_request_latencies" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > histogram:<sample_count:284 sample_sum:357.99999999999994 bucket:<cumulative_count:284 upper_bound:5 > bucket:<cumulative_count:284 upper_bound:10 > bucket:<cumulative_count:284 upper_bound:20 > bucket:<cumulative_count:284 upper_bound:40 > bucket:<cumulative_count:284 upper_bound:60 > bucket:<cumulative_count:284 upper_bound:80 > bucket:<cumulative_count:284 upper_bound:100 > bucket:<cumulative_count:284 upper_bound:150 > bucket:<cumulative_count:284 upper_bound:200 > bucket:<cumulative_count:284 upper_bound:250 > bucket:<cumulative_count:284 upper_bound:300 > bucket:<cumulative_count:284 upper_bound:350 > bucket:<cumulative_count:284 upper_bound:400 > bucket:<cumulative_count:284 upper_bound:450 > bucket:<cumulative_count:284 upper_bound:500 > bucket:<cumulative_count:284 upper_bound:600 > bucket:<cumulative_count:284 upper_bound:700 > bucket:<cumulative_count:284 upper_bound:800 > bucket:<cumulative_count:284 upper_bound:900 > bucket:<cumulative_count:284 upper_bound:1000 > bucket:<cumulative_count:284 upper_bound:2000 > bucket:<cumulative_count:284 upper_bound:5000 > bucket:<cumulative_count:284 upper_bound:10000 > bucket:<cumulative_count:284 upper_bound:20000 > bucket:<cumulative_count:284 upper_bound:50000 > bucket:<cumulative_count:284 upper_bound:100000 > > } was collected before with the same name and label values
* collected metric "revision_queue_depth" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"" > label:<name:"response_code_class" value:"" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > gauge:<value:1 > } was collected before with the same name and label values
* collected metric "revision_request_count" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > counter:<value:96 > } was collected before with the same name and label values
* collected metric "revision_request_latencies" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > histogram:<sample_count:96 sample_sum:114.00000000000004 bucket:<cumulative_count:96 upper_bound:5 > bucket:<cumulative_count:96 upper_bound:10 > bucket:<cumulative_count:96 upper_bound:20 > bucket:<cumulative_count:96 upper_bound:40 > bucket:<cumulative_count:96 upper_bound:60 > bucket:<cumulative_count:96 upper_bound:80 > bucket:<cumulative_count:96 upper_bound:100 > bucket:<cumulative_count:96 upper_bound:150 > bucket:<cumulative_count:96 upper_bound:200 > bucket:<cumulative_count:96 upper_bound:250 > bucket:<cumulative_count:96 upper_bound:300 > bucket:<cumulative_count:96 upper_bound:350 > bucket:<cumulative_count:96 upper_bound:400 > bucket:<cumulative_count:96 upper_bound:450 > bucket:<cumulative_count:96 upper_bound:500 > bucket:<cumulative_count:96 upper_bound:600 > bucket:<cumulative_count:96 upper_bound:700 > bucket:<cumulative_count:96 upper_bound:800 > bucket:<cumulative_count:96 upper_bound:900 > bucket:<cumulative_count:96 upper_bound:1000 > bucket:<cumulative_count:96 upper_bound:2000 > bucket:<cumulative_count:96 upper_bound:5000 > bucket:<cumulative_count:96 upper_bound:10000 > bucket:<cumulative_count:96 upper_bound:20000 > bucket:<cumulative_count:96 upper_bound:50000 > bucket:<cumulative_count:96 upper_bound:100000 > > } was collected before with the same name and label values
* collected metric "revision_app_request_count" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > counter:<value:96 > } was collected before with the same name and label values
* collected metric "revision_app_request_latencies" { label:<name:"configuration_name" value:"tosboetest" > label:<name:"container_name" value:"queue-proxy" > label:<name:"namespace_name" value:"test7" > label:<name:"pod_name" value:"tosboetest-version-1-0-0-1-deployment-768bdc6969-cbhg5" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"revision_name" value:"tosboetest-version-1-0-0-1" > label:<name:"service_name" value:"tosboetest" > histogram:<sample_count:96 sample_sum:108.00000000000006 bucket:<cumulative_count:96 upper_bound:5 > bucket:<cumulative_count:96 upper_bound:10 > bucket:<cumulative_count:96 upper_bound:20 > bucket:<cumulative_count:96 upper_bound:40 > bucket:<cumulative_count:96 upper_bound:60 > bucket:<cumulative_count:96 upper_bound:80 > bucket:<cumulative_count:96 upper_bound:100 > bucket:<cumulative_count:96 upper_bound:150 > bucket:<cumulative_count:96 upper_bound:200 > bucket:<cumulative_count:96 upper_bound:250 > bucket:<cumulative_count:96 upper_bound:300 > bucket:<cumulative_count:96 upper_bound:350 > bucket:<cumulative_count:96 upper_bound:400 > bucket:<cumulative_count:96 upper_bound:450 > bucket:<cumulative_count:96 upper_bound:500 > bucket:<cumulative_count:96 upper_bound:600 > bucket:<cumulative_count:96 upper_bound:700 > bucket:<cumulative_count:96 upper_bound:800 > bucket:<cumulative_count:96 upper_bound:900 > bucket:<cumulative_count:96 upper_bound:1000 > bucket:<cumulative_count:96 upper_bound:2000 > bucket:<cumulative_count:96 upper_bound:5000 > bucket:<cumulative_count:96 upper_bound:10000 > bucket:<cumulative_count:96 upper_bound:20000 > bucket:<cumulative_count:96 upper_bound:50000 > bucket:<cumulative_count:96 upper_bound:100000 > > } was collected before with the same name and label values
* Curl_http_done: called premature == 0
* Connection #0 to host 172.24.1.131 left intact
Steps to Reproduce the Problem
See the reproduction steps under "Reproduce" in the comments below.
This issue has been fixed by this PR: https://github.com/knative/pkg/pull/2005
Root cause
Metrics get invalidated if they are not used for 10 minutes, but the Prometheus exporter does not implement flush, so the old metric always stays around and ends up being reported alongside the fresh one.
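To make the failure mode concrete, here is a minimal, standalone sketch using prometheus/client_golang (this is not the knative/pkg exporter code, just an illustration): a collector that reports the same series twice, which is effectively what happens once a stale copy of a metric is kept next to its re-registered counterpart. Gather() then fails with the exact "was collected before with the same name and label values" error, which /metrics surfaces as a 500.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// dupCollector emits the same series twice, simulating a stale metric
// that was never flushed plus its freshly re-registered copy.
type dupCollector struct {
	desc *prometheus.Desc
}

func (c *dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *dupCollector) Collect(ch chan<- prometheus.Metric) {
	// Same name and same label values, different counts (compare the 284 vs 96 above).
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, 284, "tosboetest")
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, 96, "tosboetest")
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&dupCollector{
		desc: prometheus.NewDesc("revision_request_count", "example duplicate", []string{"service_name"}, nil),
	})
	// Gather fails with the duplicate-metric error; promhttp turns this into a 500.
	if _, err := reg.Gather(); err != nil {
		fmt.Println(err)
	}
}
```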
Reproduce
To reproduce this in the activator: invoke a ksvc, leave it idle for 10 minutes or so, then invoke it again. Now check http://activator-service.knative-serving:9090/metrics and you will see the error. This has been fixed in the 0.20 release of knative/pkg, but the fix is only available at the head of knative/serving, so you can still reproduce it on knative/serving release 0.21 and earlier.
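If it helps while waiting out the ~10 minute idle window, here is a small sketch that scrapes the endpoint and reports whether the duplicate-metric error is being served. The URL is the in-cluster one assumed from the steps above; the program itself is just an illustrative helper, not part of Knative.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Assumed in-cluster URL from the reproduction steps above.
	const url = "http://activator-service.knative-serving:9090/metrics"

	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode == http.StatusInternalServerError &&
		strings.Contains(string(body), "was collected before with the same name and label values") {
		fmt.Println("reproduced: duplicate metric error is being served")
		return
	}
	fmt.Println("metrics endpoint healthy, status:", resp.StatusCode)
}
```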
cc @evankanderson @skonto
Thanks @itsmurugappan, I have not tried to reproduce this yet, but it is good to know it is solved. 😃 I think we can backport it if we need it.
This definitely looks like a bug; it also looks intermittent, so it may be tricky to track down. For the logs in https://github.com/knative/serving/issues/10421#issuecomment-749527745, how were they recorded (so I can try to reproduce)?
@zhaojizhuang After the 24th request, your counter was duplicated, and then only one of the two copies kept being incremented. What matters here is that, once a metric has been duplicated, the Prometheus endpoint remains broken until you restart the entire process (queue-proxy).
After a few hours, I even sometimes see 3 or 4 occurrences of the same metric:
Notice the values: 350, 20, 1, 1