envoy: Inconsistent gauges after hot-restart on endpoints /clusters and /stats/prometheus

Description: After a hot restart, gauges (notably upstream_cx_active on the /stats/prometheus endpoint) are not reset. Values from the previous instance are added to those of the current one. Sending a POST to /reset_counters does not reset the gauges either.

Should envoy behave like this?

It can be easily reproduced by checking the /stats/prometheus metrics before and after a hot restart and comparing them to the cx_active value returned by the /clusters admin endpoint:

curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active'

Repro steps:

check out branch hot_restart_container of repository https://github.com/andrzejwaw/envoy/tree/hot_restart_container
in envoy/examples/front-proxy, run docker-compose up

send some requests to the services

wrk -t12 -c400 -d30s http://localhost:8000/service/1

check metrics:

$ curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active' | grep service1
envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 6
$ curl http://localhost:8001/clusters 2>/dev/null | grep 'cx_active' | grep service1
service1::192.168.128.3:80::cx_active::6

perform a hot restart (send SIGHUP to the hot-restarter.py process) and wait for it to complete (up to one minute)

check metrics:
After the hot restart, the upstream_cx_active gauge from /stats/prometheus is inconsistent with cx_active from the /clusters endpoint:

$ curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active' | grep service1
envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 6

$ curl http://localhost:8001/clusters 2>/dev/null | grep 'cx_active' | grep service1
service1::192.168.128.3:80::cx_active::0

This makes the metrics difficult to analyze.

send more requests

wrk -t12 -c400 -d30s http://localhost:8000/service/1

check metrics:

$ curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active' | grep service1
envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 12

$ curl http://localhost:8001/clusters 2>/dev/null | grep 'cx_active' | grep service1
service1::192.168.128.3:80::cx_active::6

I expected that after the hot restart, both cx_active values (from the /stats/prometheus endpoint and from the /clusters endpoint) would be equal to 0.
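The comparison done by hand in the steps above could be scripted. The following is only a minimal sketch, assuming the output formats shown in this report: a Prometheus line that ends with the gauge value, and /clusters lines of the form cluster::host::cx_active::N (IPv6 host addresses would need different parsing). The function names prom_cx_active, clusters_cx_active, compare_cx_active and check_cx_active are made up for illustration and are not part of Envoy.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Envoy): parse the two admin outputs
# and compare the cx_active values for one cluster.

# Read /stats/prometheus text on stdin; print upstream_cx_active for cluster $1.
prom_cx_active() {
  awk -v name="$1" '
    index($0, "envoy_cluster_upstream_cx_active{") == 1 &&
    index($0, "envoy_cluster_name=\"" name "\"") > 0 && !found { v = $NF; found = 1 }
    END { if (found) print v }'
}

# Read /clusters text on stdin; print cx_active summed over all hosts of cluster $1.
clusters_cx_active() {
  awk -F'::' -v name="$1" '
    $1 == name && $3 == "cx_active" { sum += $4 }
    END { print sum + 0 }'
}

# Compare two already-extracted values; returns non-zero on mismatch.
compare_cx_active() {  # $1 = cluster, $2 = prometheus value, $3 = /clusters value
  local cluster="$1" prom="${2:-missing}" clusters="${3:-missing}"
  if [ "$prom" = "$clusters" ]; then
    echo "OK: $cluster cx_active=$clusters"
  else
    echo "MISMATCH: $cluster prometheus=$prom clusters=$clusters"
    return 1
  fi
}

# Query a running admin endpoint (assumed default: http://localhost:8001).
check_cx_active() {  # $1 = cluster, $2 = optional admin base URL
  local cluster="$1" admin="${2:-http://localhost:8001}"
  local prom clusters
  prom=$(curl -s "$admin/stats/prometheus" | prom_cx_active "$cluster")
  clusters=$(curl -s "$admin/clusters" | clusters_cx_active "$cluster")
  compare_cx_active "$cluster" "$prom" "$clusters"
}
```

With the front-proxy example running, `check_cx_active service1` could be run before and after the hot restart; with the values reported above it would be expected to print a MISMATCH line after the restart.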

About this issue

State: closed
Created 4 years ago
Reactions: 5
Comments: 15 (15 by maintainers)

Commits related to this issue

stats: clear hot-restart parent contributions from child gauges when parent terminates. (#11301) Commit Message: A typical gauge tracks some in-progress count within an Envoy process, which is expect... — committed to envoyproxy/envoy by jmarantz 4 years ago
stats: clear hot-restart parent contributions from child gauges when parent terminates. (#11301) Commit Message: A typical gauge tracks some in-progress count within an Envoy process, which is expect... — committed to yashwant121/envoy by jmarantz 4 years ago
stats: clear hot-restart parent contributions from child gauges when parent terminates. (#11301) Commit Message: A typical gauge tracks some in-progress count within an Envoy process, which is expect... — committed to songhu/envoy by jmarantz 4 years ago
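The numbers in the report are consistent with the child process keeping the parent's last gauge value as a fixed offset, which is what the commit title says is cleared once the parent terminates. As a rough illustration only (a toy model using the values from the repro, not Envoy's actual stats code; reported_gauge and its arguments are hypothetical), the arithmetic looks like this:

```shell
#!/usr/bin/env bash
# Toy model, not Envoy code: the gauge reported by the child is its own
# value plus whatever contribution it still holds from the parent.
reported_gauge() {  # $1 = child's own value, $2 = retained parent contribution
  echo $(( $1 + $2 ))
}

# Behavior observed in the report: the parent's 6 connections stay in the gauge.
reported_gauge 0 6   # right after the hot restart -> 6
reported_gauge 6 6   # after the second wrk run    -> 12

# Behavior described by the commit title: the parent contribution is cleared
# when the parent terminates, so only the child's own value remains.
reported_gauge 0 0   # right after the hot restart -> 0
reported_gauge 6 0   # after the second wrk run    -> 6
```

In this model the second pair of values matches what /clusters reported in the repro (0, then 6).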

Most upvoted comments

@andrzejwaw thanks, I think we know what the issue is. Hopefully @jmarantz can work on a fix.

mattklein123 on Apr 21, 2020