k8s-config-connector: cnrm-resource-stats-recorder crash loop triggered by recorder container; port 8888 already in use
Checklist
- I did not find a related open issue.
- I did not find a solution in the troubleshooting guide: (https://cloud.google.com/config-connector/docs/troubleshooting)
- If this issue is time-sensitive, I have submitted a corresponding issue with GCP support.
Bug Description
A new revision of cnrm-resource-stats-recorder that tried to start yesterday is failing in a crash loop in one of my clusters, complaining (in the recorder container) that port 8888 is already in use.
It is trying to run recorder: gcr.io/gke-release/cnrm/recorder:d399cc9 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1.
The previous revision is still running fine, on recorder: gcr.io/gke-release/cnrm/recorder:2081072 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1.
Any insight into this issue would be appreciated! Should I kill this revision and try re-applying the Config Connector manifests?
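For context, a minimal sketch of how one could confirm that two revisions of the recorder are coexisting during the rollout (the label selector below is an assumption; verify it with --show-labels first):
```
# Sketch: list the recorder ReplicaSets and pods to see the old and new revisions side by side.
# The label selector is an assumption; check the real labels with:
#   kubectl -n cnrm-system get pods --show-labels
kubectl -n cnrm-system get replicasets,pods \
  -l cnrm.cloud.google.com/component=cnrm-resource-stats-recorder -o wide
```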
Additional Diagnostic Information
Kubernetes Cluster Version
1.17.17-gke.2800
Config Connector Version
1.39.0
Config Connector Mode
cluster
Log Output
The logs from the “recorder” container are (repeated with each crash):
{ "msg": "Recording the stats of Config Connector resources" }
{ "error": "listen tcp :8888: bind: address already in use", "msg": "error registering the Prometheus HTTP handler" }
Steps to Reproduce
I'm not sure exactly what triggered this issue, but it seems to have occurred when the recorder container was trying to upgrade to a new version while another revision was already running in the cluster.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 23 (8 by maintainers)
Thanks for all the info! As an update: we discussed internally and have decided to update the deployment strategy to Recreate, as well as clarify the exposed port in the Deployment spec to help the scheduler. I'll post an update once a fix is ready; the goal is to get it in by the next release.
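For illustration, a rough sketch of what that change could look like if applied by hand before the release ships (this is an assumption, not the maintainers' actual fix, and the operator or add-on manager may revert manual edits to its Deployments):
```
# Sketch (assumption, not the shipped fix): switch the recorder Deployment to the Recreate
# strategy so the old pod is torn down before the new one tries to bind port 8888.
# rollingUpdate must be nulled out when changing the strategy type via a merge patch.
kubectl -n cnrm-system patch deployment cnrm-resource-stats-recorder \
  --type merge \
  -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'
```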
@mathieu-benoit this should work. Could you try deleting all the recorder pods, rather than just a single one?
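Something along these lines should remove all of them at once (the label selector is an assumption; confirm it with --show-labels first):
```
# Delete every recorder pod; the Deployment will recreate them.
# The label selector is an assumption; verify the actual labels with:
#   kubectl -n cnrm-system get pods --show-labels
kubectl -n cnrm-system delete pods \
  -l cnrm.cloud.google.com/component=cnrm-resource-stats-recorder
```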
@toumorokoshi I just realized that we're also seeing the same issue that @mathieu-benoit is seeing, but only on our Rapid GKE cluster running version v1.19.8-gke.2000. On other clusters in the Regular channel with version v1.18.16-gke.502 this is not a problem.
On the 1.19 nodes there's software running which is already bound to port 8888; in this case the cnrm-resource-stats-recorder is crash looping and can never recover. On the 1.18 nodes otelsvc is still running, but it doesn't seem to be binding any port.
Digging deeper, this is the gke-metrics-agent daemonset: on 1.18 it's running version 0.3.5-gke.0 and on 1.19 version 0.3.8-gke.0. In both cases it's using the host network, but there is a difference in the gke-metrics-agent-conf configmap (in the kube-system namespace). There's actually quite a bit of difference, but the critical part is that on 1.19 the agent binds port 8888.
I can't find any info regarding that in the GKE changelog, but this means that a GKE cluster running 1.19 cannot work with both the metrics addon and the Config Connector addon… I would suggest solving this internally and possibly changing the port on one of these workloads.
Unfortunately we don't have a lot of control over the add-on's availability; it can take up to 8 weeks. We are currently working on a project to try to reduce that time. At this point, manual installation is your best option if you want to stay on the edge.
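For reference, manual installation is roughly the documented operator flow below; double-check the current install guide, as the bucket path and file names may change:
```
# Roughly the documented manual install of the Config Connector operator
# (verify against the current docs; paths and versions may differ):
gsutil cp gs://configconnector-operator/latest/release-bundle.tar.gz release-bundle.tar.gz
tar zxvf release-bundle.tar.gz
kubectl apply -f operator-system/configconnector-operator.yaml
```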
I have the exact same issue. Unfortunately, even after kubectl delete pod [recorder_pod_name] -n cnrm-system, I'm still having that issue.
FYI: GKE 1.19.9-gke.100 (Rapid channel), Config Connector 1.45.0, installed via its operator.