cloud_controller_ng: Killing `cloud_controller_ng` can cause counter metrics like `requests.outstanding` to get stuck

Issue

Logging this as a GitHub issue for discoverability.

The cc.requests.outstanding counter metric can get stuck if Cloud Controller is stopped/killed in the middle of handling requests.

Context

Cloud Controller emits the cc.requests.outstanding metric to give operators insight into the number of active requests being handled by the cloud_controller_ng Ruby process.

The metrics documentation describes it as the "Number of requests that are currently being processed. Emitted for each Cloud Controller request." More information on emitted metrics: https://docs.cloudfoundry.org/running/all_metrics.html

Cloud Controller manages this metric in https://github.com/cloudfoundry/cloud_controller_ng/blob/90bbddabf983a353297a11aa8a2bcc1f98b81503/middleware/request_metrics.rb#L14.

So if Cloud Controller is stopped or killed before it has a chance to call complete_request there, the in-flight requests never decrement the counter in statsd. This is similar to #1294.
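To illustrate the failure mode, here is a minimal sketch of this kind of request-counting Rack middleware. It is not the actual cloud_controller_ng code; the class name and statsd client wiring are assumptions, using the statsd-ruby gem.

# Illustrative sketch only -- not the real request_metrics.rb middleware.
# The class name and statsd wiring here are assumptions (statsd-ruby gem).
require 'statsd' # provided by the statsd-ruby gem

class RequestCountingMiddleware
  OUTSTANDING = 'cc.requests.outstanding'.freeze

  def initialize(app, statsd = Statsd.new('127.0.0.1', 8125))
    @app    = app
    @statsd = statsd
  end

  def call(env)
    # The counter in the statsd aggregator goes up when the request starts...
    @statsd.increment(OUTSTANDING)
    @app.call(env)
  ensure
    # ...and only comes back down if the process survives long enough to send
    # the decrement. A kill between these two calls leaves the aggregated
    # counter permanently inflated.
    @statsd.decrement(OUTSTANDING)
  end
end

Because the counter state lives in the statsd_injector rather than in the Ruby process, nothing a later request does will ever correct the inflated value.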

Steps to Reproduce

  1. Install the CF Loggregator Firehose Plugin

  2. Connect to the firehose and grep out the metric:

cf nozzle -f ValueMetric | grep --line-buffered "cc.*outstanding"
  3. Hammer the Cloud Controllers with requests:
hey -n 1000000 -c 50 -H "Authorization: $(cf oauth-token)" http://api.arya.capi.land/v2/apps
  4. Observe something similar to the following from cf nozzle:
origin:"cc" eventType:ValueMetric timestamp:1553034399010006107 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399011308663 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034398981292501 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399098849677 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
  5. On an api VM, stop the cloud_controller_ng process:
monit stop cloud_controller_ng

Feel free to substitute a kill -9 for the monit stop if monit is being too gentle in stopping the Cloud Controller.

  6. Start it back up again:
monit start cloud_controller_ng

Expected result

An accurate count for requests.outstanding continues to be emitted, capped at 20 (the default number of EventMachine threads).
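For reference, the ceiling of 20 comes from EventMachine's default thread pool size, which can be checked directly (a quick sanity check, assuming the eventmachine gem is installed):

require 'eventmachine'

# EventMachine's reactor thread pool defaults to 20 worker threads, which is
# what bounds the number of requests a single Cloud Controller process
# handles concurrently.
puts EventMachine.threadpool_size # => 20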

Current result

Since the Cloud Controller was stopped in the middle of handling all of the requests from hey, it was not able to decrement the statsd counter. As a result, there is now a baseline of 20 in all of the cc.requests.outstanding metrics from the cloud_controller_ng job on the VM that was stopped (in this test environment, b55807af-8ebb-4a54-8bd7-1edc7b8264f9).

origin:"cc" eventType:ValueMetric timestamp:1553034630853046264 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630865535939 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:2 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630917230329 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630849778556 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630888060296 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630889439373 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630986481710 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >

Possible Fix

Current mitigation: restarting the statsd_injector on the affected VM (monit restart statsd_injector) resets the counter.

Longer term: Cloud Controller could potentially clear these statsd counters on startup so it always starts with a clean slate.
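As a very rough sketch of that "clean slate" idea (hypothetical, not an actual patch): because the statsd_injector keeps a running total for this counter and the statsd line protocol has no reset operation, one direction is to report the value as a gauge that the process zeroes on boot. The names below are made up for illustration, and the real fix would need to match how statsd_injector and Loggregator handle gauges.

# Hypothetical sketch of the "clean slate on startup" idea -- not a real patch.
# Swaps the counter for a gauge so the process can publish an absolute value.
require 'statsd' # provided by the statsd-ruby gem

class OutstandingRequestGauge
  METRIC = 'cc.requests.outstanding'.freeze

  def initialize(statsd)
    @statsd = statsd
    @count  = 0
    @lock   = Mutex.new
    # On boot, publish zero explicitly so stale state left behind by a
    # previously killed process does not linger in the aggregator.
    @statsd.gauge(METRIC, 0)
  end

  def start_request
    @lock.synchronize { @statsd.gauge(METRIC, @count += 1) }
  end

  def complete_request
    @lock.synchronize { @statsd.gauge(METRIC, @count -= 1) }
  end
end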

Most upvoted comments

Merged!

Not closing this issue yet until we complete documentation

We considered that, but we were worried that people might think it only applied to the v2 API.