cloud_controller_ng: Killing `cloud_controller_ng` can cause counter metrics like `requests.outstanding` to get stuck
Issue
Logging this as a GitHub issue for discoverability.
The `cc.requests.outstanding` counter metric can get stuck if Cloud Controller is stopped/killed in the middle of handling requests.
Context
Cloud Controller emits the `cc.requests.outstanding` metric to give operators insight into the number of active requests being handled by the `cloud_controller_ng` Ruby process. The metrics docs (https://docs.cloudfoundry.org/running/all_metrics.html) describe it as: "Number of requests that are currently being processed. Emitted for each Cloud Controller request."
Cloud Controller manages this metric in https://github.com/cloudfoundry/cloud_controller_ng/blob/90bbddabf983a353297a11aa8a2bcc1f98b81503/middleware/request_metrics.rb#L14.
If the Cloud Controller is stopped/killed before it has a chance to call `complete_request` there, the in-flight requests never decrement the counter in statsd. This is similar to #1294.
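To illustrate the failure mode, here is a minimal sketch of that middleware pattern (not the actual CC source; it assumes a statsd-ruby style client and collapses `start_request`/`complete_request` into a plain increment/decrement pair):

```ruby
require 'statsd'

# Minimal sketch of the request-metrics middleware pattern, assuming a
# statsd-ruby style client. The real middleware delegates to a
# RequestMetrics object, but the failure mode is the same.
class RequestMetricsMiddleware
  def initialize(app, statsd = Statsd.new('localhost', 8125))
    @app = app
    @statsd = statsd
  end

  def call(env)
    @statsd.increment('cc.requests.outstanding')   # roughly start_request
    @app.call(env)
  ensure
    # If the process is killed (e.g. SIGKILL) before this line runs, the
    # decrement is lost and the counter held by the statsd_injector stays
    # inflated until the injector itself is restarted.
    @statsd.decrement('cc.requests.outstanding')   # roughly complete_request
  end
end
```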
Steps to Reproduce
- Install the CF Loggregator Firehose Plugin.
- Connect to the firehose and grep out the metric: `cf nozzle -f ValueMetric | grep --line-buffered cc.*outstanding`
- Hammer the Cloud Controllers with requests: `hey -n 1000000 -c 50 -H "Authorization: $(cf oauth-token)" http://api.arya.capi.land/v2/apps`
- Observe something similar to the following from `cf nozzle`:
origin:"cc" eventType:ValueMetric timestamp:1553034399010006107 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399011308663 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034398981292501 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399098849677 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
- On an api VM, stop the `cloud_controller_ng` process: `monit stop cloud_controller_ng`. Feel free to substitute a `kill -9` for the `monit stop` if monit is being too gentle in stopping the Cloud Controller.
- Start it back up again: `monit start cloud_controller_ng`
Expected result
An accurate count for requests.outstanding continues to be emitted, capped at 20 (the default number of EventMachine threads).
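For reference, the 20 in the parenthetical above is EventMachine's default deferred-threadpool size, which bounds how many requests a single `cloud_controller_ng` process works on concurrently. A quick way to check it:

```ruby
require 'eventmachine'

# EventMachine's deferred threadpool defaults to 20 threads, which is why a
# single cloud_controller_ng process tops out around 20 outstanding requests.
puts EventMachine.threadpool_size  # => 20 unless overridden
```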
Current result
Since the Cloud Controller was stopped in the middle of handling the requests from `hey`, it never got to decrement the statsd counter. There is now a baseline of 20 under every `cc.requests.outstanding` value emitted for the `cloud_controller_ng` job on the VM that was stopped (in this test environment, `b55807af-8ebb-4a54-8bd7-1edc7b8264f9`): note that it reports 39-40 below while the untouched VM reports 2-3.
origin:"cc" eventType:ValueMetric timestamp:1553034630853046264 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630865535939 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:2 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630917230329 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630849778556 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630888060296 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630889439373 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630986481710 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >
Possible Fix
Current mitigation: restarting the statsd_injector with `monit restart statsd_injector` resets the counter.
Longer term: Cloud Controller could potentially clear these statsd counters on startup so it always starts with a clean slate.
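A rough sketch of what that could look like (hypothetical class name, assuming a statsd-ruby style client); the commits referenced below ultimately went a similar route by emitting the outstanding count as an absolute gauge instead of a counter:

```ruby
require 'statsd'

# Hypothetical sketch: track the outstanding count in-process and publish it
# as an absolute gauge rather than relying on increment/decrement deltas.
# On boot the gauge is reset to 0, so a crash mid-request can no longer
# leave a stale baseline behind in the statsd_injector.
class OutstandingRequestGauge
  def initialize(statsd = Statsd.new('localhost', 8125))
    @statsd = statsd
    @mutex = Mutex.new
    @outstanding = 0
    emit(0) # clean slate on startup
  end

  def start_request
    emit(@mutex.synchronize { @outstanding += 1 })
  end

  def complete_request
    emit(@mutex.synchronize { @outstanding -= 1 })
  end

  private

  def emit(value)
    # Gauges are absolute, so whatever the previous (killed) process last
    # reported is overwritten as soon as the new process emits a value.
    @statsd.gauge('cc.requests.outstanding.gauge', value)
  end
end
```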
Commits related to this issue
- Absolute metric for outstanding requests * add a new metric "cc.requests.outstanding.total" and emit as absolute gauge value to statsd * avoids inconsistent metrics when cloud_controller_ng is rest... — committed to jochenehret/cloud_controller_ng by jochenehret 3 years ago
- Deprecate cc.requests.outstanding in favor of cc.requests.outstanding.gauge This is needed after introducing the new metric as part of this issue: https://github.com/cloudfoundry/cloud_controller_ng/... — committed to MerricdeLauney/docs-running-cf by MerricdeLauney 3 years ago
We need to update these docs: https://docs.cloudfoundry.org/running/managing-cf/scaling-cloud-controller.html https://docs.cloudfoundry.org/running/all_metrics.html
Merged!
Not closing this issue yet until we complete documentation
We considered that, but we were worried that people might think it only applied to the v2 API.