k8s-config-connector: cnrm-resource-stats-recorder crash loop triggered by recorder container; port 8888 already in use


Bug Description

A new revision of cnrm-resource-stats-recorder that tried to start yesterday is failing in a crash loop in one of my clusters; the ‘recorder’ container complains that port 8888 is already in use.

The failing revision is trying to run recorder: gcr.io/gke-release/cnrm/recorder:d399cc9 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1. The previous revision is still running fine on recorder: gcr.io/gke-release/cnrm/recorder:2081072 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1.

Any insight into this issue would be appreciated! Should I kill this revision and try re-applying the config connector manifests?
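
For reference, a sketch of how the competing revisions and the rollout state can be inspected (deployment name and namespace are the add-on defaults visible in this cluster):

kubectl get deployment cnrm-resource-stats-recorder -n cnrm-system -o wide
kubectl get replicasets -n cnrm-system | grep cnrm-resource-stats-recorder   # one ReplicaSet per revision
kubectl rollout status deployment/cnrm-resource-stats-recorder -n cnrm-system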

Additional Diagnostic Information

Kubernetes Cluster Version

1.17.17-gke.2800

Config Connector Version

1.39.0

Config Connector Mode

cluster

Log Output

The logs from the “recorder” container are (repeated with each crash):

{ "msg": "Recording the stats of Config Connector resources" }
{ "error": "listen tcp :8888: bind: address already in use", "msg": "error registering the Prometheus HTTP handler" }

Steps to Reproduce

I’m not sure exactly what triggered this, but it seems to have happened while the recorder container was being upgraded to a new version and another revision was still running in the cluster.
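
One quick way to check whether the old and new recorder pods ended up on the same node during the rollout (a sketch, using the names visible in this namespace):

kubectl get pods -n cnrm-system -o wide | grep cnrm-resource-stats-recorder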

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 23 (8 by maintainers)

Most upvoted comments

Thanks for all the info! As an update: we discussed internally and have decided to change the deployment strategy to Recreate, as well as declare the exposed port in the Deployment spec to help the scheduler (a sketch of that change follows below).

I’ll post an update on a fix; the goal is to get it in by the next release.
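
A minimal sketch of the change described above, applied to the recorder Deployment (field values here are assumptions for illustration, not the manifest that will actually ship):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cnrm-resource-stats-recorder
  namespace: cnrm-system
spec:
  selector:
    matchLabels:
      cnrm.cloud.google.com/component: cnrm-resource-stats-recorder   # label assumed for illustration
  strategy:
    type: Recreate            # tear the old pod down before starting the new one, so revisions never overlap on a node
  template:
    metadata:
      labels:
        cnrm.cloud.google.com/component: cnrm-resource-stats-recorder
    spec:
      containers:
      - name: recorder
        image: gcr.io/gke-release/cnrm/recorder:d399cc9   # image tag taken from the report above
        ports:
        - name: metrics
          containerPort: 8888   # declaring the port makes the usage explicit; if the pod runs on the host network,
          hostPort: 8888        # a matching hostPort is what lets the scheduler keep conflicting pods off the same node
          protocol: TCP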

Unfortunately even after the kubectl delete pod [recorder_pod_name] -n cnrm-system I’m still having that issue.

@mathieu-benoit this should work. Could you try deleting all the recorder pods, rather than just a single one?
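
Something along these lines should catch all of them at once (the label selector is an assumption; matching on the name prefix works regardless):

kubectl delete pods -n cnrm-system -l cnrm.cloud.google.com/component=cnrm-resource-stats-recorder
# or, without relying on labels:
kubectl get pods -n cnrm-system -o name | grep cnrm-resource-stats-recorder | xargs kubectl delete -n cnrm-system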

@toumorokoshi I just realized that we’re also seeing the same issue @mathieu-benoit is seeing, but only on our Rapid-channel GKE cluster running version v1.19.8-gke.2000. On other clusters in the Regular channel running v1.18.16-gke.502 this is not a problem.

On the 1.19 nodes there’s software running which is bound to port 8888:

# netstat -tulpen |grep 8888
tcp        0      0 127.0.0.1:8888          0.0.0.0:*               LISTEN      1000       27836      2222/otelsvc 

# ps ax |grep 2222
   2222 ?        Ssl    0:11 /otelsvc --config=/conf/gke-metrics-agent-config.yaml --metrics-prefix=

In this case the cnrm-resource-stats-recorder is crash looping and can never recover.

On the 1.18 nodes otelsvc is still running, but it doesn’t seem to bind any port. Digging deeper, this is the gke-metrics-agent DaemonSet: on 1.18 it runs version 0.3.5-gke.0 and on 1.19 version 0.3.8-gke.0. In both cases it uses the host network, but there is a difference in the gke-metrics-agent-conf ConfigMap (in the kube-system namespace). There’s actually quite a bit of difference, but the critical part is this:

$ diff /tmp/cm-1.19.yaml /tmp/cm-1.18.yaml |grep -B1 8888 
<         static_configs:
<         - targets: ["127.0.0.1:8888"]

I can’t find anything about this in the GKE changelog, but it means that a GKE cluster running 1.19 cannot run both the metrics add-on and the Config Connector add-on on the same node… I would suggest solving this internally, possibly by changing the port on one of these workloads.
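
For reference, the 1.19 fragment that the diff points at looks roughly like the following when reconstructed as an OpenTelemetry collector scrape config (only the static_configs target comes from the diff; the surrounding keys are assumed). Port 8888 is the collector’s default self-metrics port, which is why otelsvc ends up binding it on the host network:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector        # job name assumed
        scrape_interval: 1m             # assumed
        static_configs:
        - targets: ["127.0.0.1:8888"]   # the collector scraping its own metrics endpoint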

Is there any ETA yet on when this fix will be available through the add-on?

Unfortunately we don’t have a lot of control over add-on availability; it can take up to 8 weeks.

We are currently working on a project to try to reduce that time. At this point, manual installation is your best option if you want to stay on the latest release.
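
For anyone going that route, the manual operator-based install is roughly the following (per the Config Connector documentation; exact bucket paths and file names may differ by version):

gsutil cp gs://configconnector-operator/latest/release-bundle.tar.gz release-bundle.tar.gz
tar zxvf release-bundle.tar.gz
kubectl apply -f operator-system/configconnector-operator.yaml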

I have the exact same issue:

k get po -n cnrm-system
NAME                                           READY   STATUS             RESTARTS   AGE
cnrm-controller-manager-0                      2/2     Running            0          9h
cnrm-deletiondefender-0                        1/1     Running            0          9h
cnrm-resource-stats-recorder-9f4c5ccfb-dznxz   1/2     CrashLoopBackOff   111        9h
cnrm-webhook-manager-5ccc747594-9clsv          1/1     Running            0          9h
cnrm-webhook-manager-5ccc747594-9lngc          1/1     Running            0          9h

Unfortunately even after the kubectl delete pod [recorder_pod_name] -n cnrm-system I’m still having that issue.

FYI:

  • GKE version 1.19.9-gke.100, Rapid channel
  • Config Connector version 1.45.0, installed via its operator