envoy: Inconsistent gauges after hot-restart on endpoints /clusters and /stats/prometheus
Description: After a hot-restart, gauges — especially upstream_cx_active on the /stats/prometheus endpoint — do not reset; values from the previous instance are added to the current one. Sending a POST to /reset_counters does not reset the gauges either.
Should Envoy behave like this?
It can easily be reproduced by checking the /stats/prometheus metrics before and after a hot-restart and comparing them with the cx_active value returned by the /clusters admin endpoint:

```
curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active'
```
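For scripted comparison of the two endpoints, their text formats can be parsed with a few lines of Python. This is a minimal sketch, not part of any Envoy tooling; the sample strings mirror the output formats shown in the repro steps below, and the function names are illustrative:

```python
import re

def parse_prometheus_gauge(text, metric, cluster):
    # Match a line like:
    # envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 6
    label = re.escape(f'{metric}{{envoy_cluster_name="{cluster}"}}')
    m = re.search(label + r'\s+(\d+)', text)
    return int(m.group(1)) if m else None

def parse_clusters_stat(text, cluster, stat):
    # Match a line like: service1::192.168.128.3:80::cx_active::6
    for line in text.splitlines():
        parts = line.split("::")
        if parts[0] == cluster and stat in parts:
            return int(parts[-1])
    return None

prom = 'envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 6'
clusters = 'service1::192.168.128.3:80::cx_active::6'
print(parse_prometheus_gauge(prom, "envoy_cluster_upstream_cx_active", "service1"))  # 6
print(parse_clusters_stat(clusters, "service1", "cx_active"))  # 6
```

Feeding both parsers the live output of `curl http://localhost:8001/...` makes the divergence after hot-restart easy to detect automatically.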
Repro steps:

- Check out branch `hot_restart_container` of the repository https://github.com/andrzejwaw/envoy/tree/hot_restart_container
- In `envoy/examples/front-proxy`, execute:

  ```
  docker-compose up
  ```

- Send some requests to the services:

  ```
  wrk -t12 -c400 -d30s http://localhost:8000/service/1
  ```

- Check metrics:

  ```
  $ curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active' | grep service1
  envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 6
  $ curl http://localhost:8001/clusters 2>/dev/null | grep 'cx_active' | grep service1
  service1::192.168.128.3:80::cx_active::6
  ```

- Perform a hot-restart (send a SIGHUP signal to the hot-restarter.py process) and wait for it to complete (up to one minute).
- Check metrics. After the hot-restart, the upstream_cx_active gauge from /stats/prometheus is inconsistent with cx_active from the /clusters endpoint:

  ```
  $ curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active' | grep service1
  envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 6
  $ curl http://localhost:8001/clusters 2>/dev/null | grep 'cx_active' | grep service1
  service1::192.168.128.3:80::cx_active::0
  ```

  This makes it difficult to analyze the metrics.

- Send more requests:

  ```
  wrk -t12 -c400 -d30s http://localhost:8000/service/1
  ```

- Check metrics:

  ```
  $ curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active' | grep service1
  envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 12
  $ curl http://localhost:8001/clusters 2>/dev/null | grep 'cx_active' | grep service1
  service1::192.168.128.3:80::cx_active::6
  ```
I was expecting that after a hot-restart both cx_active values — from the /stats/prometheus endpoint and from the /clusters endpoint — would be equal to 0.
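The numbers observed above (6 right after the restart, then 12 after more load, while /clusters reports 0 and then 6) are consistent with the new process importing the old process's gauge value and never clearing it. The following is a toy model of that accumulation and of what the eventual fix (#11301) does — an illustration of the observed behaviour, not Envoy's actual implementation:

```python
class HotRestartGauge:
    """Toy model: the exported gauge is the child's own value plus
    whatever contribution the terminated parent left behind."""

    def __init__(self):
        self.child_value = 0
        self.parent_residue = 0

    def hot_restart(self):
        # The old process's gauge value is carried over to the new one...
        self.parent_residue += self.child_value
        self.child_value = 0  # ...while the new process starts counting from zero.

    def exported(self):
        # What /stats/prometheus reports in this model.
        return self.parent_residue + self.child_value

    def clear_parent_contribution(self):
        # What the fix does once the parent process terminates.
        self.parent_residue = 0

g = HotRestartGauge()
g.child_value = 6           # 6 active connections before the restart
g.hot_restart()
print(g.exported())         # 6: stale parent value, though /clusters reports 0
g.child_value = 6           # 6 new connections after more load
print(g.exported())         # 12: the inconsistent reading from the repro
g.clear_parent_contribution()
print(g.exported())         # 6: consistent with /clusters again
```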
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 5
- Comments: 15 (15 by maintainers)
Commits related to this issue
- stats: clear hot-restart parent contributions from child gauges when parent terminates. (#11301) Commit Message: A typical gauge tracks some in-progress count within an Envoy process, which is expect... — committed to envoyproxy/envoy by jmarantz 4 years ago
- stats: clear hot-restart parent contributions from child gauges when parent terminates. (#11301) Commit Message: A typical gauge tracks some in-progress count within an Envoy process, which is expect... — committed to yashwant121/envoy by jmarantz 4 years ago
- stats: clear hot-restart parent contributions from child gauges when parent terminates. (#11301) Commit Message: A typical gauge tracks some in-progress count within an Envoy process, which is expect... — committed to songhu/envoy by jmarantz 4 years ago
@andrzejwaw thanks, I think we know what the issue is. Hopefully @jmarantz can work on a fix.