rancher: Sporadic driver error during K8s upgrade: listen tcp 127.0.0.1:0: bind: address already in use
What kind of request is this (question/bug/enhancement/feature request): bug
Steps to reproduce (least amount of steps as possible):
Very similar issue to https://github.com/rancher/rancher/issues/19742
On two occasions now I’ve encountered separate customer issues where during a K8s upgrade, through the Rancher UI, the rolling upgrade will stall with this error:
Error starting driver: failed retrieving port for driver: listen tcp 127.0.0.1:0: bind: address already in use
Getting stuck here: https://github.com/rancher/kontainer-engine/blob/65780a8839e471d3842580d5d9075ae9a3b70a02/service/service.go#L400
This issue causes the cluster controller inside Rancher to try to bind to a port that is already in use. Some customers report that it eventually self-heals and the upgrade completes. The quick fix is to cycle the Rancher pods.
It’s sporadic in that it doesn’t appear to happen with each downstream cluster. The commonality I’ve seen with customers is this:
- Custom cluster
- Ubuntu 18.04
- K8s version < 1.18.x
Result:
Other details that may be helpful:
Environment information
- Rancher version (rancher/rancher or rancher/server image tag or shown bottom left in the UI): v2.4.5
- Installation option (single install/HA): HA
Cluster information
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
- Machine type (cloud/VM/metal) and specifications (CPU/memory): cloud and VMware
- Kubernetes version (use kubectl version): v1.17.6 --> v1.18.6
- Docker version (use docker version): (paste the output here)
gz#11024
gz#12126
About this issue
- State: closed
- Created 4 years ago
- Comments: 21 (16 by maintainers)
PRs merged into 2.5 and master.
Confirmed @Oats87's diagnosis.
The telemetry client neither releases nor reuses its connections to the Rancher API on each report() execution, which runs every 6h by default. Depending on the size of the system, the telemetry client opens many connections: 1 to the Rancher API, 1 per cluster, 1 per project/cluster, … The number of connections keeps growing until they are exhausted, generating listen tcp 127.0.0.1:0: bind: address already in use errors.
As a workaround, the telemetry client may be killed inside the Rancher docker container/pod to release the connections. Rancher should restart the telemetry client automatically.
Submitted PR https://github.com/rancher/telemetry/pull/47 to address the issue
To test it:
- report() execution: curl -X POST http://localhost:8114/v1-telemetry/report
- Count telemetry connections: netstat -natup | grep telemetry | wc -l
- report() execution: curl -X POST http://localhost:8114/v1-telemetry/report
- Count telemetry connections: netstat -natup | grep telemetry | wc -l
Telemetry connections should grow on the current version, but shouldn’t grow on the patched version.