cloud_controller_ng: Killing `cloud_controller_ng` can cause counter metrics like `requests.outstanding` to get stuck

Issue

Logging this as a GitHub issue for discoverability.

The cc.requests.outstanding counter metric can get stuck if Cloud Controller is stopped/killed in the middle of handling requests.

Context

Cloud Controller emits the cc.requests.outstanding metric to give operators insight into the number of active requests being handled by the cloud_controller_ng Ruby process.

The metrics documentation describes it as the "Number of requests that are currently being processed. Emitted for each Cloud Controller request." More information on emitted metrics: https://docs.cloudfoundry.org/running/all_metrics.html

Cloud Controller manages this metric in https://github.com/cloudfoundry/cloud_controller_ng/blob/90bbddabf983a353297a11aa8a2bcc1f98b81503/middleware/request_metrics.rb#L14.

So if Cloud Controller is stopped or killed before it has a chance to call complete_request there, the in-flight requests never decrement the counter in statsd. This is similar to #1294.
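To illustrate the failure mode, here is a minimal sketch of this kind of request-counting Rack middleware. It is not the actual cloud_controller_ng code; the class name and statsd client wiring are assumptions, using the statsd-ruby gem.

# Illustrative sketch only -- not the real request_metrics.rb middleware.
# The class name and statsd wiring here are assumptions (statsd-ruby gem).
require 'statsd' # provided by the statsd-ruby gem

class RequestCountingMiddleware
  OUTSTANDING = 'cc.requests.outstanding'.freeze

  def initialize(app, statsd = Statsd.new('127.0.0.1', 8125))
    @app    = app
    @statsd = statsd
  end

  def call(env)
    # The counter in the statsd aggregator goes up when the request starts...
    @statsd.increment(OUTSTANDING)
    @app.call(env)
  ensure
    # ...and only comes back down if the process survives long enough to send
    # the decrement. A kill between these two calls leaves the aggregated
    # counter permanently inflated.
    @statsd.decrement(OUTSTANDING)
  end
end

Because the counter state lives in the statsd_injector rather than in the Ruby process, nothing a later request does will ever correct the inflated value.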

Steps to Reproduce

  1. Install the CF Loggregator Firehose Plugin

  2. Connect to the firehose and grep out the metric:

cf nozzle -f ValueMetric | grep --line-buffered "cc.*outstanding"
  3. Hammer the Cloud Controllers with requests:
hey -n 1000000 -c 50 -H "Authorization: $(cf oauth-token)" http://api.arya.capi.land/v2/apps
  4. Observe something similar to the following from cf nozzle:
origin:"cc" eventType:ValueMetric timestamp:1553034399010006107 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399011308663 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034398981292501 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399098849677 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
  5. On an api VM, stop the cloud_controller_ng process:
monit stop cloud_controller_ng

Feel free to substitute a kill -9 for the monit stop if monit is being too gentle in stopping the Cloud Controller.

  6. Start it back up again:
monit start cloud_controller_ng

Expected result

An accurate count for requests.outstanding continues to be emitted, capped at 20 (the default number of EventMachine threads).
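For reference, the ceiling of 20 comes from EventMachine's default thread pool size, which can be checked directly (a quick sanity check, assuming the eventmachine gem is installed):

require 'eventmachine'

# EventMachine's reactor thread pool defaults to 20 worker threads, which is
# what bounds the number of requests a single Cloud Controller process
# handles concurrently.
puts EventMachine.threadpool_size # => 20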

Current result

Since the Cloud Controller was stopped in the middle of handling all of the requests from hey, it was not able to decrement the statsd counter. As a result, there is now a baseline of 20 in all of the cc.requests.outstanding metrics from the cloud_controller_ng job on the VM that was stopped (in this test environment, b55807af-8ebb-4a54-8bd7-1edc7b8264f9).

origin:"cc" eventType:ValueMetric timestamp:1553034630853046264 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630865535939 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:2 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630917230329 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630849778556 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630888060296 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630889439373 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630986481710 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >

Possible Fix

Current mitigation: restarting the statsd_injector on the affected VM (monit restart statsd_injector) resets the counter.

Longer term: Cloud Controller could potentially clear these statsd counters on startup so it always starts with a clean slate.
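As a very rough sketch of that "clean slate" idea (hypothetical, not an actual patch): because the statsd_injector keeps a running total for this counter and the statsd line protocol has no reset operation, one direction is to report the value as a gauge that the process zeroes on boot. The names below are made up for illustration, and the real fix would need to match how statsd_injector and Loggregator handle gauges.

# Hypothetical sketch of the "clean slate on startup" idea -- not a real patch.
# Swaps the counter for a gauge so the process can publish an absolute value.
require 'statsd' # provided by the statsd-ruby gem

class OutstandingRequestGauge
  METRIC = 'cc.requests.outstanding'.freeze

  def initialize(statsd)
    @statsd = statsd
    @count  = 0
    @lock   = Mutex.new
    # On boot, publish zero explicitly so stale state left behind by a
    # previously killed process does not linger in the aggregator.
    @statsd.gauge(METRIC, 0)
  end

  def start_request
    @lock.synchronize { @statsd.gauge(METRIC, @count += 1) }
  end

  def complete_request
    @lock.synchronize { @statsd.gauge(METRIC, @count -= 1) }
  end
end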

Most upvoted comments

Merged!

Not closing this issue yet until we complete documentation

We considered that, but we were worried that people might think it only applied to the v2 API.