envoy: Inconsistent gauges after hot-restart on endpoints /clusters and /stats/prometheus

Description: After a hot-restart, gauges (notably upstream_cx_active on the /stats/prometheus endpoint) are not reset: values from the previous instance are added to the current ones. Sending a POST to /reset_counters does not reset the gauges either.

Should envoy behave like this?

It can be easily reproduced by checking the /stats/prometheus metrics before and after a hot-restart and comparing them to the cx_active value returned by the /clusters admin endpoint:

curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active'

Repro steps:

  • check out branch hot_restart_container of repository https://github.com/andrzejwaw/envoy/tree/hot_restart_container

  • in envoy/examples/front-proxy, run docker-compose up

  • send some requests to the services

    wrk -t12 -c400 -d30s http://localhost:8000/service/1
    
  • check metrics:

    $ curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active' | grep service1
    envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 6
    $ curl http://localhost:8001/clusters 2>/dev/null | grep 'cx_active' | grep service1
    service1::192.168.128.3:80::cx_active::6
    
  • perform a hot-restart (send a SIGHUP signal to the hot-restarter.py process) and wait for it to complete (up to one minute)

  • check metrics:
    After the hot-restart, the upstream_cx_active gauge from /stats/prometheus is inconsistent with cx_active from the /clusters endpoint:

    $ curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active' | grep service1
    envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 6
    
    $ curl http://localhost:8001/clusters 2>/dev/null | grep 'cx_active' | grep service1
    service1::192.168.128.3:80::cx_active::0
    

    This makes the metrics difficult to analyze.

  • send more requests

    wrk -t12 -c400 -d30s http://localhost:8000/service/1
    
  • check metrics:

    $ curl http://localhost:8001/stats/prometheus 2>/dev/null | grep 'upstream_cx_active' | grep service1
    envoy_cluster_upstream_cx_active{envoy_cluster_name="service1"} 12
    
    $ curl http://localhost:8001/clusters 2>/dev/null | grep 'cx_active' | grep service1
    service1::192.168.128.3:80::cx_active::6
    

I expected that after the hot-restart, the cx_active values from both the /stats/prometheus endpoint and the /clusters endpoint would be 0.
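
The manual comparison in the repro steps can be scripted. The following is only a sketch: the function names are hypothetical, the line formats are copied from the output shown above, and it assumes the cluster-level gauge should equal the sum of the per-host cx_active values.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Envoy.

# Read /stats/prometheus output on stdin and print the upstream_cx_active
# gauge for cluster $1 (prints nothing if the metric is absent).
prom_cx_active() {
  awk -v want="envoy_cluster_upstream_cx_active{envoy_cluster_name=\"$1\"}" \
    'index($0, want) == 1 { print $NF }'
}

# Read /clusters output on stdin and print the sum of cx_active over all
# hosts of cluster $1 (prints 0 if no host is listed).
clusters_cx_active() {
  awk -F'::' -v cluster="$1" \
    '$1 == cluster && $3 == "cx_active" { sum += $4 } END { print sum + 0 }'
}

# Print both values; return non-zero when they disagree.
compare_cx_active() {
  echo "prometheus=$1 clusters=$2"
  [ "$1" = "$2" ]
}

# Usage against the admin port from the repro (assumes curl is installed):
#   p=$(curl -s http://localhost:8001/stats/prometheus | prom_cx_active service1)
#   c=$(curl -s http://localhost:8001/clusters | clusters_cx_active service1)
#   compare_cx_active "$p" "$c" || echo "gauge mismatch for service1"
```

Running the usage lines before and after the hot-restart should report a mismatch only in the second case if the behavior described here is present.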

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 15 (15 by maintainers)

Most upvoted comments

@andrzejwaw thanks, I think we know what the issue is. Hopefully @jmarantz can work on a fix.