k8s-config-connector: cnrm-resource-stats-recorder crash loop triggered by recorder container; port 8888 already in use


Bug Description

A new revision of cnrm-resource-stats-recorder that tried to start yesterday is failing in a crash loop in one of my clusters; the ‘recorder’ container complains that port 8888 is already in use.

The failing revision is trying to run recorder: gcr.io/gke-release/cnrm/recorder:d399cc9 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1. The previous revision is still running fine on recorder: gcr.io/gke-release/cnrm/recorder:2081072 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1.

Any insight into this issue would be appreciated! Should I kill this revision and try re-applying the config connector manifests?
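
For reference, a sketch of how the competing revisions and the rollout state can be inspected (deployment name and namespace are the add-on defaults visible in this cluster):

kubectl get deployment cnrm-resource-stats-recorder -n cnrm-system -o wide
kubectl get replicasets -n cnrm-system | grep cnrm-resource-stats-recorder   # one ReplicaSet per revision
kubectl rollout status deployment/cnrm-resource-stats-recorder -n cnrm-system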

Additional Diagnostic Information

Kubernetes Cluster Version

1.17.17-gke.2800

Config Connector Version

1.39.0

Config Connector Mode

cluster

Log Output

The logs from the “recorder” container are (repeated with each crash):

{ "msg": "Recording the stats of Config Connector resources" }
{ "error": "listen tcp :8888: bind: address already in use", "msg": "error registering the Prometheus HTTP handler" }

Steps to Reproduce

I’m not sure exactly what triggered this, but it seems to have happened while the recorder container was being upgraded to a new version and another revision was still running in the cluster.
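
One quick way to check whether the old and new recorder pods ended up on the same node during the rollout (a sketch, using the names visible in this namespace):

kubectl get pods -n cnrm-system -o wide | grep cnrm-resource-stats-recorder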

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 23 (8 by maintainers)

Most upvoted comments

Thanks for all the info! As an update: we discussed internally and have decided to change the deployment strategy to Recreate, as well as declare the exposed port in the Deployment spec to help the scheduler (a sketch of that change follows below).

I’ll post an update on a fix; the goal is to get it in by the next release.
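
A minimal sketch of the change described above, applied to the recorder Deployment (field values here are assumptions for illustration, not the manifest that will actually ship):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cnrm-resource-stats-recorder
  namespace: cnrm-system
spec:
  selector:
    matchLabels:
      cnrm.cloud.google.com/component: cnrm-resource-stats-recorder   # label assumed for illustration
  strategy:
    type: Recreate            # tear the old pod down before starting the new one, so revisions never overlap on a node
  template:
    metadata:
      labels:
        cnrm.cloud.google.com/component: cnrm-resource-stats-recorder
    spec:
      containers:
      - name: recorder
        image: gcr.io/gke-release/cnrm/recorder:d399cc9   # image tag taken from the report above
        ports:
        - name: metrics
          containerPort: 8888   # declaring the port makes the usage explicit; if the pod runs on the host network,
          hostPort: 8888        # a matching hostPort is what lets the scheduler keep conflicting pods off the same node
          protocol: TCP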

Unfortunately even after the kubectl delete pod [recorder_pod_name] -n cnrm-system I’m still having that issue.

@mathieu-benoit this should work. Could you try deleting all the recorder pods, rather than just a single one?
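
Something along these lines should catch all of them at once (the label selector is an assumption; matching on the name prefix works regardless):

kubectl delete pods -n cnrm-system -l cnrm.cloud.google.com/component=cnrm-resource-stats-recorder
# or, without relying on labels:
kubectl get pods -n cnrm-system -o name | grep cnrm-resource-stats-recorder | xargs kubectl delete -n cnrm-system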

@toumorokoshi I just realized that we’re also seeing the same issue @mathieu-benoit is seeing, but only on our Rapid-channel GKE cluster running version v1.19.8-gke.2000. On other clusters in the Regular channel running v1.18.16-gke.502 this is not a problem.

On the 1.19 nodes there’s software running which is bound to port 8888:

# netstat -tulpen |grep 8888
tcp        0      0 127.0.0.1:8888          0.0.0.0:*               LISTEN      1000       27836      2222/otelsvc 

# ps ax |grep 2222
   2222 ?        Ssl    0:11 /otelsvc --config=/conf/gke-metrics-agent-config.yaml --metrics-prefix=

In this case the cnrm-resource-stats-recorder is crash looping and can never recover.

On the 1.18 nodes otelsvc is still running, but it doesn’t seem to bind any port. Digging deeper, this is the gke-metrics-agent DaemonSet: on 1.18 it runs version 0.3.5-gke.0 and on 1.19 version 0.3.8-gke.0. In both cases it uses the host network, but there is a difference in the gke-metrics-agent-conf ConfigMap (in the kube-system namespace). There’s actually quite a bit of difference, but the critical part is this:

$ diff /tmp/cm-1.19.yaml /tmp/cm-1.18.yaml |grep -B1 8888 
<         static_configs:
<         - targets: ["127.0.0.1:8888"]

I can’t find anything about this in the GKE changelog, but it means that a GKE cluster running 1.19 cannot run both the metrics add-on and the Config Connector add-on on the same node… I would suggest solving this internally, possibly by changing the port on one of these workloads.
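
For reference, the 1.19 fragment that the diff points at looks roughly like the following when reconstructed as an OpenTelemetry collector scrape config (only the static_configs target comes from the diff; the surrounding keys are assumed). Port 8888 is the collector’s default self-metrics port, which is why otelsvc ends up binding it on the host network:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector        # job name assumed
        scrape_interval: 1m             # assumed
        static_configs:
        - targets: ["127.0.0.1:8888"]   # the collector scraping its own metrics endpoint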

Is there any ETA yet on when this fix will be available through the add-on?

Unfortunately we don’t have a lot of control over add-on availability; it can take up to 8 weeks.

We are currently working on a project to try to reduce that time. At this point, manual installation is your best option if you want to stay on the latest release.
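
For anyone going that route, the manual operator-based install is roughly the following (per the Config Connector documentation; exact bucket paths and file names may differ by version):

gsutil cp gs://configconnector-operator/latest/release-bundle.tar.gz release-bundle.tar.gz
tar zxvf release-bundle.tar.gz
kubectl apply -f operator-system/configconnector-operator.yaml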

I have the exact same issue:

k get po -n cnrm-system
NAME                                           READY   STATUS             RESTARTS   AGE
cnrm-controller-manager-0                      2/2     Running            0          9h
cnrm-deletiondefender-0                        1/1     Running            0          9h
cnrm-resource-stats-recorder-9f4c5ccfb-dznxz   1/2     CrashLoopBackOff   111        9h
cnrm-webhook-manager-5ccc747594-9clsv          1/1     Running            0          9h
cnrm-webhook-manager-5ccc747594-9lngc          1/1     Running            0          9h

Unfortunately even after the kubectl delete pod [recorder_pod_name] -n cnrm-system I’m still having that issue.

FYI:

  • GKE version 1.19.9-gke.100, Rapid channel
  • Config Connector version 1.45.0, installed via its operator